InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs

The InfiCoder Team

InfiBench is a comprehensive benchmark for code large language models (code LLMs) that evaluates their ability to answer freeform, real-world questions in the code domain.

Overview

Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With this rapid development, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs, with a particular focus on code generation tasks. However, they are insufficient to cover the full range of expected capabilities of code LLMs, which extend beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, to our knowledge the first large-scale freeform question-answering (QA) benchmark for code, comprising 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness, with domain experts carefully concretizing the criteria for each question. We conduct a systematic evaluation of more than 100 of the latest code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses showcase potential directions for further advancement of code LLMs. InfiBench is fully open source and continuously expanding to foster more scientific and systematic practices for code LLM evaluation.

Statistics and Examples

InfiBench comprises 234 carefully picked high-quality Stack Overflow questions, covering 15 programming languages, and largely following the natural question distribution of Stack Overflow.


We recruited five domain experts to create the benchmark and annotate the correctness evaluation criteria. Specifically, the InfiBench framework integrates four types of model-free metrics for evaluating correctness: keyword matching, blank filling, unit testing, and dialogue similarity.


Below are the question type, metric type, and length statistics.

Comparison

Existing benchmarks weigh heavily toward code generation, unit-test-based evaluation, and a limited set of programming languages. InfiBench exhibits much greater diversity, reflecting real-world code LLM usage scenarios, and is far from saturation.

Prompts and Evaluation Protocol

Each question contains a system prompt and a content prompt. For questions whose responses are mainly in natural language, the system prompt is
You are a professional assistant for programmers. By default, questions and answers are in Markdown format. You are chatting with programmers, so please answer as briefly as possible.
For other questions, the system prompt is
You are a professional assistant for programmers. By default, questions and answers are in Markdown format.
We then format the system prompt and content prompt following each model's default instruction template. If no instruction template is specified, we use the prompt format
{system prompt}\n{content prompt}
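For illustration, here is a minimal sketch in Python (not taken from the InfiBench codebase) of this prompt construction. The two system prompts are quoted verbatim from above; the `natural_language_answer` flag and the optional `instruction_template` callable are hypothetical names for the per-question and per-model information described in this section.

```python
# Sketch of the prompt construction described above (illustrative only).

BRIEF_SYSTEM_PROMPT = (
    "You are a professional assistant for programmers. "
    "By default, questions and answers are in Markdown format. "
    "You are chatting with programmers, so please answer as briefly as possible."
)
DEFAULT_SYSTEM_PROMPT = (
    "You are a professional assistant for programmers. "
    "By default, questions and answers are in Markdown format."
)

def build_prompt(content_prompt: str, natural_language_answer: bool,
                 instruction_template=None) -> str:
    """Format one question into the prompt fed to the model."""
    system_prompt = BRIEF_SYSTEM_PROMPT if natural_language_answer else DEFAULT_SYSTEM_PROMPT
    if instruction_template is not None:
        # Preferred path: use the model's own default instruction template.
        return instruction_template(system_prompt, content_prompt)
    # Fallback format when no instruction template is specified.
    return f"{system_prompt}\n{content_prompt}"
```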

We adopt best@10 as the main evaluation metric: 10 responses are sampled and evaluated for each question, and the best score per question is recorded and summed up. Throughout the evaluation, we set the sampling temperature T to 0.2 and the top-p cutoff threshold to 0.9. We leave the exploration of other hyperparameters as future work.
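As a concrete illustration of the aggregation, the sketch below (not the official scoring code) shows how best@10 turns per-response scores into a percentage benchmark score; the per-response scores themselves come from the question-specific metrics described next.

```python
# Sketch of best@k aggregation: for each question, keep the best score among
# the k sampled responses, sum over questions, and report the percentage.

def best_at_k(per_question_scores: list[list[float]]) -> float:
    """per_question_scores[i] holds the scores (each in [0, 1]) of the k
    responses sampled for question i."""
    total = sum(max(scores) for scores in per_question_scores)
    return 100.0 * total / len(per_question_scores)

# Toy example with two questions and three samples each:
print(best_at_k([[0.4, 1.0, 0.6], [0.0, 0.2, 0.2]]))  # (1.0 + 0.2) / 2 -> 60.0
```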

For score computation, we treat each question equally, with one point each. Since the question frequency largely follows the Stack Overflow distribution, this score can be interpreted as how well the model responds to Stack Overflow questions. Given the 234 questions in the benchmark, the full score is 234, and by default we report the percentage score (achieved score divided by the full score of 234). The one point for each question can be further decomposed into a few scoring points within the question. For example, a question may contain four keywords with weights 2, 1, 1, and 1. Matching each keyword then contributes 0.4, 0.2, 0.2, and 0.2 points, respectively, to the final score.
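The weighted keyword case can be illustrated with a short sketch; the keyword list below is a made-up placeholder, not an actual InfiBench criterion, and real criteria may use other matching rules.

```python
# Sketch of weighted keyword scoring: each matched keyword contributes its
# weight divided by the total weight to the question's single point.

def keyword_score(response: str, weighted_keywords: dict[str, float]) -> float:
    total_weight = sum(weighted_keywords.values())
    earned = sum(w for kw, w in weighted_keywords.items()
                 if kw.lower() in response.lower())
    return earned / total_weight

# Four keywords with weights 2, 1, 1, 1, as in the example above:
keywords = {"functools.reduce": 2, "lambda": 1, "iterable": 1, "initializer": 1}
print(keyword_score("Use functools.reduce with a lambda.", keywords))  # 0.4 + 0.2 = 0.6
```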

Leaderboard


Each point corresponds to an open-source model, with error bars for those smaller than 30B. Each dotted segment corresponds to an MoE model. Proprietary models are shown as lines with uncertainty ranges.

Note: we set the maximum number of tokens to generate to 1024 (without this constraint, GPT-4 generates 662 tokens, so 1024 provides some wiggle room).

For models with >30B parameters, we evaluate once due to resource limits; otherwise, we evaluate three times and report the mean and standard deviation.

Proprietary models are listed with an unknown ('?') parameter count.


Rank Model Name # Params. (total / active for MoE) Context Length Full Set Score Full Set Std
1 GPT-4/GPT-4-0613 ? 8192 70.64% 0.82%
2 GPT-4/GPT-4-turbo-1106 ? 8192 68.42% 0.38%
3 GPT-4/GPT-4o-2024-05-13 ? 8192 66.19%
4 Claude 3/Claude 3 Opus ? 200000 63.89%
5 Mistral Open/Codestral-22b 22B 32768 62.98% 0.56%
6 DeepSeek Coder/deepseek-coder-33b-instruct 33B 16384 62.96%
7 Phind/Phind-CodeLlama-34B-v2 34B 4096 59.00%
8 Phind/Phind-CodeLlama-34B-v1 34B 4096 58.47%
9 Mistral/mistral-large ? 32768 58.22%
10 Claude 3/Claude 3 Sonnet ? 200000 58.20%
11 Claude 3/Claude 3 Haiku ? 200000 57.57%
12 DeepSeek LLM/deepseek-llm-67b-chat 67B 4096 57.41%
13 GPT-3.5/GPT-3.5-turbo-0613 ? 4096 56.47% 1.34%
14 Mistral/mistral-small ? 32768 55.62% 0.46%
15 Mistral Open/mixtral-8x7B-Instruct 46.7B / 12.9B 32768 55.55%
16 Qwen/Qwen-72B 72B 32768 55.34%
17 DeepSeek Coder/deepseek-coder-6.7b-instruct 6.7B 16384 53.25% 0.40%
18 Qwen/Qwen-72B-Chat 72B 32768 52.97%
19 Magicoder/Magicoder-S-CL-7B 7B 16384 52.71% 0.72%
20 WizardLM/WizardCoder-Python-34B-V1.0 34B 16384 52.59%
21 Phind/Phind-CodeLlama-34B-Python-v1 34B 4096 52.17%
22 Magicoder/Magicoder-S-DS-6.7B 6.7B 16384 51.46% 1.09%
23 Code Llama/CodeLlama-34b-Instruct 34B 16384 50.45%
24 01.AI/Yi-34B-Chat 34B 4096 49.58%
25 WizardLM/WizardCoder-Python-7B-V1.0 7B 16384 49.10% 1.59%
26 WizardLM/WizardCoder-Python-13B-V1.0 13B 16384 48.99% 0.92%
27 Code Llama/CodeLlama-34b 34B 16384 47.36%
28 Code Llama/CodeLlama-13b-Instruct 13B 16384 46.37% 1.26%
29 Zephyr/Zephyr 7B beta 7B 32768 46.31% 1.11%
30 StarCoder2/15B-Instruct 15B 16384 45.89% 0.95%
31 DeepSeek MoE/deepseek-moe-16b-chat 16B / 2.8B 16384 45.18% 1.65%
32 OctoPack/OctoCoder 15.5B 8192 44.55% 0.79%
33 Qwen/Qwen-14B 14B 8192 43.69% 1.09%
34 Qwen/Qwen-14B-Chat 14B 8192 43.49% 0.63%
35 Magicoder/Magicoder-DS-6.7B 6.7B 16384 43.47% 0.21%
36 Code Llama/CodeLlama-34b-Python 34B 16384 43.13%
37 Code Llama/CodeLlama-70b-Instruct 70B 4096 42.82%
38 StarCoder2/15B 15B 16384 42.52% 1.24%
39 Magicoder/Magicoder-CL-7B 7B 16384 41.71% 0.76%
40 Code Llama/CodeLlama-13b 13B 16384 41.66% 0.84%
41 DeepSeek Coder/deepseek-coder-1.3b-instruct 1.3B 16384 41.32% 1.12%
42 Code Llama/CodeLlama-13b-Python 13B 16384 41.31% 0.90%
43 WizardLM/WizardCoder-15B-V1.0 15B 2048 41.01% 0.22%
44 Mistral/mistral-medium ? 32768 40.95% 0.41%
45 gemma/gemma-7b-it 7B 8192 40.68% 1.23%
46 Code Llama/CodeLlama-70b 70B 4096 40.60%
47 Code Llama/CodeLlama-70b-Python 70B 4096 40.29%
48 OctoPack/OctoGeeX 6B 8192 40.14% 1.55%
49 DeepSeek LLM/deepseek-llm-67b-base 67B 4096 39.87%
50 Llama 2/Llama2-70B-Chat 70B 4096 39.30%
51 DeepSeek Coder/deepseek-coder-33b-base 33B 16384 38.75%
52 01.AI/Yi-6B-Chat 6B 4096 38.14% 0.58%
53 Llama 2/Llama2-70B 70B 4096 37.69%
54 Code Llama/CodeLlama-7b 7B 16384 37.62% 1.28%
55 Mistral Open/Mistral-7B-Instruct-v0.1 7B 32768 37.55% 1.10%
56 InternLM/InternLM-Chat-20B 20B 16384 37.41% 0.75%
57 Qwen/Qwen-7B-Chat 7B 32768 37.36% 1.29%
58 DeepSeek LLM/deepseek-llm-7b-chat 7B 4096 36.75% 1.40%
59 Llama 2/Llama2-7B-Chat 7B 4096 36.14% 1.05%
60 WizardLM/WizardCoder-3B-V1.0 3B 2048 35.61% 0.42%
61 Code Llama/CodeLlama-7b-Instruct 7B 16384 35.15% 1.02%
62 StarCoder2/7B 7B 16384 34.90% 0.97%
63 InternLM/InternLM-Chat-7B 7B 8192 34.86% 0.90%
64 Baichuan2/Baichuan2-13B-Chat 13B 4096 34.40% 1.34%
65 DeepSeek Coder/deepseek-coder-6.7b-base 6.7B 16384 33.66% 1.24%
66 Code Llama/CodeLlama-7b-Python 7B 16384 32.89% 0.45%
67 Llama 2/Llama2-13B-Chat 13B 4096 32.29% 1.66%
68 WizardLM/WizardCoder-1B-V1.0 1B 2048 31.94% 0.70%
69 Qwen/Qwen-7B 7B 32768 31.69% 0.29%
70 StarCoder2/3B 3B 16384 31.44% 1.92%
71 StarCoder/StarCoderPlus 15.5B 8192 30.67% 1.57%
72 StarCoder/StarCoder 15.5B 8192 30.66% 0.69%
73 CodeGen2.5/CodeGen2.5-7B-Instruct 7B 2048 29.57% 1.53%
74 Mistral/mistral-tiny ? 32768 29.41% 0.26%
75 InternLM/InternLM-20B 20B 16384 29.41% 0.76%
76 DeepSeek Coder/deepseek-coder-5.7bmqa-base 5.7B 16384 28.92% 1.12%
77 ChatGLM/ChatGLM3-6B 6B 8192 28.23% 0.58%
78 Baichuan2/Baichuan2-7B-Chat 7B 4096 27.53% 1.07%
79 gemma/gemma-2b-it 2B 8192 27.49% 0.52%
80 Qwen/Qwen-1.8B-Chat 1.8B 32768 26.84% 1.08%
81 DeepSeek MoE/deepseek-moe-16b-base 16B / 2.8B 16384 26.65% 0.97%
82 01.AI/Yi-9B 9B 4096 26.39% 0.42%
83 Baichuan2/Baichuan2-13B-Base 13B 4096 26.32% 1.23%
84 DeepSeek LLM/deepseek-llm-7b-base 7B 4096 25.34% 1.08%
85 Llama 2/Llama2-13B 13B 4096 24.50% 0.73%
86 Baichuan2/Baichuan2-7B-Base 7B 4096 23.50% 1.56%
87 DeepSeek Coder/deepseek-coder-1.3b-base 1.3B 16384 23.17% 1.47%
88 Qwen/Qwen-1.8B 1.8B 32768 23.12% 1.13%
89 Mistral Open/Mistral-7B-v0.1 7B 32768 22.72% 1.51%
90 Llama 2/Llama2-7B 7B 4096 22.35% 1.70%
91 01.AI/Yi-34B 34B 4096 22.01%
92 davinci/davinci-002 ? 16384 21.25% 1.17%
93 Mistral Open/mixtral-8x7B 46.7B / 12.9B 32768 21.21%
94 Phi/Phi1.5 1.3B 2048 20.56% 0.09%
95 01.AI/Yi-6B 6B 4096 19.93% 1.24%
96 CodeGeeX/CodeGeeX2-6B 6B 8192 19.88% 0.36%
97 CodeGen2/CodeGen2-16B 16B 2048 16.97% 1.15%
98 Phi/Phi2 2.7B 2048 16.74% 0.64%
99 InternLM/InternLM-7B 7B 8192 16.26% 2.21%
100 gemma/gemma-7b 7B 8192 16.05% 0.80%
101 IEITYuan/Yuan2-51B-hf 51B 4096 15.25%
102 gemma/gemma-2b 2B 8192 14.62% 0.50%
103 Phi/Phi1 1.3B 2048 14.28% 0.99%
104 CodeGen/CodeGen-16B-multi 16B 2048 13.62% 1.18%
105 IEITYuan/Yuan2-102B-hf 102B 4096 10.48%
106 IEITYuan/Yuan2-2B-hf 2B 8192 7.28% 1.01%

Try the Benchmark!

Note: only Linux environments are supported at the moment.

  1. Convert or save your model weights in Hugging Face Transformers format (see the sketch after this list).
  2. Clone our code repository.
  3. Follow the short tutorial to generate responses and evaluate on InfiBench!
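For step 1, the following minimal sketch shows one way to save a model and its tokenizer in Hugging Face Transformers format; the model ID and output directory are placeholders, and the exact loading conventions are described in the repository tutorial.

```python
# Save a model and its tokenizer in Hugging Face Transformers format
# (placeholder model ID and output path; adjust to your own model).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("./my-model-hf")
tokenizer.save_pretrained("./my-model-hf")
```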

Feedback


You can also give us feedback through the discussion and issue posts of our repositories.

BibTeX

@misc{inficodereval,
  author = {InfiCoder Team},
  title = {InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs},
  year = {2024},
  publisher = {GitHub Pages},
  howpublished = {\url{https://infi-coder.github.io/infibench/}}
}