Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure their performance, with a particular focus on code generation tasks. However, these benchmarks are insufficient to cover the full range of capabilities expected of code LLMs, which go beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, to our knowledge the first large-scale freeform question-answering (QA) benchmark for code, comprising 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness, with domain experts carefully concretizing the criteria for each question. We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses showcase potential directions for further advancement of code LLMs. InfiBench is fully open source and continuously expanding to foster more scientific and systematic practices for code LLM evaluation.
InfiBench comprises 234 carefully selected, high-quality Stack Overflow questions covering 15 programming languages, largely following the natural question distribution of Stack Overflow.
We recruited five domain experts to create the benchmark and annotate the correctness evaluation criteria.
Specifically, the InfiBench framework integrates four types of model-free metrics for evaluating correctness: keyword matching, blank filling, unit testing, and dialogue similarity.
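As an illustration (not the benchmark's actual implementation), the sketch below shows how a weighted keyword-matching metric could be concretized; the `Keyword` class and `keyword_match_score` function are hypothetical names.

```python
# Minimal sketch of a weighted keyword-matching metric.
# Class and function names are hypothetical, not InfiBench's actual API.
import re
from dataclasses import dataclass

@dataclass
class Keyword:
    pattern: str   # regular expression the response is expected to contain
    weight: float  # relative weight of this scoring point

def keyword_match_score(response: str, keywords: list[Keyword]) -> float:
    """Return the weighted fraction of required keywords found in the response."""
    total = sum(k.weight for k in keywords)
    hit = sum(k.weight for k in keywords
              if re.search(k.pattern, response, flags=re.IGNORECASE))
    return hit / total if total > 0 else 0.0
```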
Below are statistics on question types, metric types, and question lengths.
Existing benchmarks focus heavily on code generation, unit-test-based evaluation, and a limited set of programming languages. InfiBench exhibits much higher diversity to reflect real-world usage scenarios of code LLMs and is far from saturation.
System prompt:
You are a professional assistant for programmers. By default, questions and answers are in Markdown format. You are chatting with programmers, so please answer as briefly as possible.

System prompt (variant without the brevity request):
You are a professional assistant for programmers. By default, questions and answers are in Markdown format.

Prompt format:
{system prompt}\n{content prompt}
We adopt best@10 as the main evaluation metric: 10 responses are sampled and evaluated for each question, and the best score per question is recorded and summed up. Throughout the evaluation, we set the sampling temperature to 0.2 and the top-p cutoff threshold to 0.9. We leave the exploration of other hyperparameters to future work.
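For concreteness, the following is a minimal sketch of the best@10 aggregation under these settings; `sample_response` and `score_response` are hypothetical placeholders for the decoding and scoring interfaces, not InfiBench's actual API.

```python
# Minimal sketch of best@10 aggregation (hypothetical helper names).
N_SAMPLES = 10      # responses sampled per question
TEMPERATURE = 0.2   # sampling temperature used throughout the evaluation
TOP_P = 0.9         # nucleus-sampling (top-p) cutoff

def best_at_10(model, questions, score_response):
    """Sum over questions of the best score among N_SAMPLES sampled responses."""
    total = 0.0
    for question in questions:
        scores = []
        for _ in range(N_SAMPLES):
            # sample_response is a placeholder for whatever decoding API is used
            response = model.sample_response(question.prompt,
                                             temperature=TEMPERATURE,
                                             top_p=TOP_P)
            scores.append(score_response(question, response))  # per-question score in [0, 1]
        total += max(scores)
    return total  # full score equals len(questions), i.e., 234 for InfiBench
```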
For score computation, we treat each question equally, with one point each. Since the question frequency largely follows the Stack Overflow distribution, this score can be interpreted as how well the model responds to Stack Overflow questions. Given the 234 questions in the benchmark, the full score is 234, and we by default report the percentage score (achieved score divided by the full score). The one point for each question can be further decomposed into a few scoring points within the question. For example, a question may contain four keywords with weights 2, 1, 1, and 1; matching these keywords then contributes 0.4, 0.2, 0.2, and 0.2 points, respectively, to the final score.
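The arithmetic above can be sketched as follows (hypothetical variable names, not the benchmark's code): per-keyword weights are normalized into fractional points within a question, and the summed per-question best scores are divided by the full score of 234.

```python
# Hypothetical sketch of the per-question point decomposition and the percentage score.
weights = [2, 1, 1, 1]                          # keyword weights within one question
points = [w / sum(weights) for w in weights]    # [0.4, 0.2, 0.2, 0.2], summing to 1 point

matched = [True, False, True, True]             # which keywords a response hit
question_score = sum(p for p, m in zip(points, matched) if m)   # 0.8 out of 1 point

def percentage_score(per_question_best_scores, full_score=234):
    """Reported score: summed per-question best scores over the full score, in percent."""
    return sum(per_question_best_scores) / full_score * 100
```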
Each point corresponds to an open-source model, with error bars shown for models smaller than 30B. Each dotted segment corresponds to an MoE model. Proprietary models are shown as lines with uncertainty ranges.
Rank | Model Name | # Params. | Context Length | Full Set Score | Full Set Std
---|---|---|---|---|---
1 | GPT-4/GPT-4-0613 | ? | 8192 | 70.64% | 0.82% | |
2 | GPT-4/GPT-4-turbo-1106 | ? | 8192 | 68.42% | 0.38% | |
3 | GPT-4/GPT-4o-2024-05-13 | ? | 8192 | 66.19% | ||
4 | Claude 3/Claude 3 Opus | ? | 200000 | 63.89% | ||
5 | Mistral Open/Codestral-22b | 22B | 32768 | 62.98% | 0.56% | |
6 | DeepSeek Coder/deepseek-coder-33b-instruct | 33B | 16384 | 62.96% | ||
7 | Phind/Phind-CodeLlama-34B-v2 | 34B | 4096 | 59.00% | ||
8 | Phind/Phind-CodeLlama-34B-v1 | 34B | 4096 | 58.47% | ||
9 | Mistral/mistral-large | ? | 32768 | 58.22% | ||
10 | Claude 3/Claude 3 Sonnet | ? | 200000 | 58.20% | ||
11 | Claude 3/Claude 3 Haiku | ? | 200000 | 57.57% | ||
12 | DeepSeek LLM/deepseek-llm-67b-chat | 67B | 4096 | 57.41% | ||
13 | GPT-3.5/GPT-3.5-turbo-0613 | ? | 4096 | 56.47% | 1.34% | |
14 | Mistral/mistral-small | ? | 32768 | 55.62% | 0.46% | |
15 | Mistral Open/mixtral-8x7B-Instruct | 46.7B / 12.9B | 32768 | 55.55% | ||
16 | Qwen/Qwen-72B | 72B | 32768 | 55.34% | ||
17 | DeepSeek Coder/deepseek-coder-6.7b-instruct | 6.7B | 16384 | 53.25% | 0.40% | |
18 | Qwen/Qwen-72B-Chat | 72B | 32768 | 52.97% | ||
19 | Magicoder/Magicoder-S-CL-7B | 7B | 16384 | 52.71% | 0.72% | |
20 | WizardLM/WizardCoder-Python-34B-V1.0 | 34B | 16384 | 52.59% | ||
21 | Phind/Phind-CodeLlama-34B-Python-v1 | 34B | 4096 | 52.17% | ||
22 | Magicoder/Magicoder-S-DS-6.7B | 6.7B | 16384 | 51.46% | 1.09% | |
23 | Code Llama/CodeLlama-34b-Instruct | 34B | 16384 | 50.45% | ||
24 | 01.AI/Yi-34B-Chat | 34B | 4096 | 49.58% | ||
25 | WizardLM/WizardCoder-Python-7B-V1.0 | 7B | 16384 | 49.10% | 1.59% | |
26 | WizardLM/WizardCoder-Python-13B-V1.0 | 13B | 16384 | 48.99% | 0.92% | |
27 | Code Llama/CodeLlama-34b | 34B | 16384 | 47.36% | ||
28 | Code Llama/CodeLlama-13b-Instruct | 13B | 16384 | 46.37% | 1.26% | |
29 | Zephyr/Zephyr 7B beta | 7B | 32768 | 46.31% | 1.11% | |
30 | StarCoder2/15B-Instruct | 15B | 16384 | 45.89% | 0.95% | |
31 | DeepSeek MoE/deepseek-moe-16b-chat | 16B / 2.8B | 16384 | 45.18% | 1.65% | |
32 | OctoPack/OctoCoder | 15.5B | 8192 | 44.55% | 0.79% | |
33 | Qwen/Qwen-14B | 14B | 8192 | 43.69% | 1.09% | |
34 | Qwen/Qwen-14B-Chat | 14B | 8192 | 43.49% | 0.63% | |
35 | Magicoder/Magicoder-DS-6.7B | 6.7B | 16384 | 43.47% | 0.21% | |
36 | Code Llama/CodeLlama-34b-Python | 34B | 16384 | 43.13% | ||
37 | Code Llama/CodeLlama-70b-Instruct | 70B | 4096 | 42.82% | ||
38 | StarCoder2/15B | 15B | 16384 | 42.52% | 1.24% | |
39 | Magicoder/Magicoder-CL-7B | 7B | 16384 | 41.71% | 0.76% | |
40 | Code Llama/CodeLlama-13b | 13B | 16384 | 41.66% | 0.84% | |
41 | DeepSeek Coder/deepseek-coder-1.3b-instruct | 1.3B | 16384 | 41.32% | 1.12% | |
42 | Code Llama/CodeLlama-13b-Python | 13B | 16384 | 41.31% | 0.90% | |
43 | WizardLM/WizardCoder-15B-V1.0 | 15B | 2048 | 41.01% | 0.22% | |
44 | Mistral/mistral-medium | ? | 32768 | 40.95% | 0.41% | |
45 | gemma/gemma-7b-it | 7B | 8192 | 40.68% | 1.23% | |
46 | Code Llama/CodeLlama-70b | 70B | 4096 | 40.60% | ||
47 | Code Llama/CodeLlama-70b-Python | 70B | 4096 | 40.29% | ||
48 | OctoPack/OctoGeeX | 6B | 8192 | 40.14% | 1.55% | |
49 | DeepSeek LLM/deepseek-llm-67b-base | 67B | 4096 | 39.87% | ||
50 | Llama 2/Llama2-70B-Chat | 70B | 4096 | 39.30% | ||
51 | DeepSeek Coder/deepseek-coder-33b-base | 33B | 16384 | 38.75% | ||
52 | 01.AI/Yi-6B-Chat | 6B | 4096 | 38.14% | 0.58% | |
53 | Llama 2/Llama2-70B | 70B | 4096 | 37.69% | ||
54 | Code Llama/CodeLlama-7b | 7B | 16384 | 37.62% | 1.28% | |
55 | Mistral Open/Mistral-7B-Instruct-v0.1 | 7B | 32768 | 37.55% | 1.10% | |
56 | InternLM/InternLM-Chat-20B | 20B | 16384 | 37.41% | 0.75% | |
57 | Qwen/Qwen-7B-Chat | 7B | 32768 | 37.36% | 1.29% | |
58 | DeepSeek LLM/deepseek-llm-7b-chat | 7B | 4096 | 36.75% | 1.40% | |
59 | Llama 2/Llama2-7B-Chat | 7B | 4096 | 36.14% | 1.05% | |
60 | WizardLM/WizardCoder-3B-V1.0 | 3B | 2048 | 35.61% | 0.42% | |
61 | Code Llama/CodeLlama-7b-Instruct | 7B | 16384 | 35.15% | 1.02% | |
62 | StarCoder2/7B | 7B | 16384 | 34.90% | 0.97% | |
63 | InternLM/InternLM-Chat-7B | 7B | 8192 | 34.86% | 0.90% | |
64 | Baichuan2/Baichuan2-13B-Chat | 13B | 4096 | 34.40% | 1.34% | |
65 | DeepSeek Coder/deepseek-coder-6.7b-base | 6.7B | 16384 | 33.66% | 1.24% | |
66 | Code Llama/CodeLlama-7b-Python | 7B | 16384 | 32.89% | 0.45% | |
67 | Llama 2/Llama2-13B-Chat | 13B | 4096 | 32.29% | 1.66% | |
68 | WizardLM/WizardCoder-1B-V1.0 | 1B | 2048 | 31.94% | 0.70% | |
69 | Qwen/Qwen-7B | 7B | 32768 | 31.69% | 0.29% | |
70 | StarCoder2/3B | 3B | 16384 | 31.44% | 1.92% | |
71 | StarCoder/StarCoder+ | 15.5B | 8192 | 30.67% | 1.57% | |
72 | StarCoder/StarCoder | 15.5B | 8192 | 30.66% | 0.69% | |
73 | CodeGen2.5/CodeGen2.5-7B-Instruct | 7B | 2048 | 29.57% | 1.53% | |
74 | Mistral/mistral-tiny | ? | 32768 | 29.41% | 0.26% | |
75 | InternLM/InternLM-20B | 20B | 16384 | 29.41% | 0.76% | |
76 | DeepSeek Coder/deepseek-coder-5.7bmqa-base | 5.7B | 16384 | 28.92% | 1.12% | |
77 | ChatGLM/ChatGLM3-6B | 6B | 8192 | 28.23% | 0.58% | |
78 | Baichuan2/Baichuan2-7B-Chat | 7B | 4096 | 27.53% | 1.07% | |
79 | gemma/gemma-2b-it | 2B | 8192 | 27.49% | 0.52% | |
80 | Qwen/Qwen-1.8B-Chat | 1.8B | 32768 | 26.84% | 1.08% | |
81 | DeepSeek MoE/deepseek-moe-16b-base | 16B / 2.8B | 16384 | 26.65% | 0.97% | |
82 | 01.AI/Yi-9B | 9B | 4096 | 26.39% | 0.42% | |
83 | Baichuan2/Baichuan2-13B-Base | 13B | 4096 | 26.32% | 1.23% | |
84 | DeepSeek LLM/deepseek-llm-7b-base | 7B | 4096 | 25.34% | 1.08% | |
85 | Llama 2/Llama2-13B | 13B | 4096 | 24.50% | 0.73% | |
86 | Baichuan2/Baichuan2-7B-Base | 7B | 4096 | 23.50% | 1.56% | |
87 | DeepSeek Coder/deepseek-coder-1.3b-base | 1.3B | 16384 | 23.17% | 1.47% | |
88 | Qwen/Qwen-1.8B | 1.8B | 32768 | 23.12% | 1.13% | |
89 | Mistral Open/Mistral-7B-v0.1 | 7B | 32768 | 22.72% | 1.51% | |
90 | Llama 2/Llama2-7B | 7B | 4096 | 22.35% | 1.70% | |
91 | 01.AI/Yi-34B | 34B | 4096 | 22.01% | ||
92 | davinci/davinci-002 | ? | 16384 | 21.25% | 1.17% | |
93 | Mistral Open/mixtral-8x7B | 46.7B / 12.9B | 32768 | 21.21% | ||
94 | Phi/Phi1.5 | 1.3B | 2048 | 20.56% | 0.09% | |
95 | 01.AI/Yi-6B | 6B | 4096 | 19.93% | 1.24% | |
96 | CodeGeeX/CodeGeeX2-6B | 6B | 8192 | 19.88% | 0.36% | |
97 | CodeGen2/CodeGen2-16B | 16B | 2048 | 16.97% | 1.15% | |
98 | Phi/Phi2 | 2.7B | 2048 | 16.74% | 0.64% | |
99 | InternLM/InternLM-7B | 7B | 8192 | 16.26% | 2.21% | |
100 | gemma/gemma-7b | 7B | 8192 | 16.05% | 0.80% | |
101 | IEITYuan/Yuan2-51B-hf | 51B | 4096 | 15.25% | ||
102 | gemma/gemma-2b | 2B | 8192 | 14.62% | 0.50% | |
103 | Phi/Phi1 | 1.3B | 2048 | 14.28% | 0.99% | |
104 | CodeGen/CodeGen-16B-multi | 16B | 2048 | 13.62% | 1.18% | |
105 | IEITYuan/Yuan2-102B-hf | 102B | 4096 | 10.48% | ||
106 | IEITYuan/Yuan2-2B-hf | 2B | 8192 | 7.28% | 1.01% |
Note: currently only the Linux environment is supported.
@misc{inficodereval,
author = {InfiCoderTeam},
title = {InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs},
year = {2024},
publisher = {GitHub Pages},
howpublished = {\url{https://infi-coder.github.io/infibench/}}
}