InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs

The InfiCoder Team

InfiBench is a comprehensive benchmark for code large language models (code LLMs) that evaluates their ability to answer freeform, real-world questions in the code domain.

Overview

Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With this rapid development, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs, with a particular focus on code generation tasks. However, they are insufficient to cover the full range of expected capabilities of code LLMs, which extend beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, to our knowledge the first large-scale freeform question-answering (QA) benchmark for code, comprising 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness, with domain experts carefully concretizing the criteria for each question. We conduct a systematic evaluation of more than 100 of the latest code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses showcase potential directions for further advancement of code LLMs. InfiBench is fully open source and continuously expanding to foster more scientific and systematic practices for code LLM evaluation.

Statistics and Examples

InfiBench comprises 234 carefully picked high-quality Stack Overflow questions, covering 15 programming languages, and largely following the natural question distribution of Stack Overflow.


We recruited five domain experts to create the benchmark and annotate the correctness evaluation criteria. Specifically, the InfiBench framework integrates four types of model-free metrics for evaluating correctness: keyword matching, blank filling, unit testing, and dialogue similarity.


Below are the question type, metric type, and length statistics.

Comparison

Existing benchmarks weigh heavily toward code generation, unit-test-based evaluation, and a limited set of programming languages. InfiBench exhibits much greater diversity, reflecting real-world code LLM usage scenarios, and is far from saturation.

Prompts and Evaluation Protocol

Each question contains a system prompt and a content prompt. For questions whose responses are mainly in natural language, the system prompt is
You are a professional assistant for programmers. By default, questions and answers are in Markdown format. You are chatting with programmers, so please answer as briefly as possible.
For other questions, the system prompt is
You are a professional assistant for programmers. By default, questions and answers are in Markdown format.
We then format the system prompt and content prompt following each model's default instruction template. If no instruction template is specified, we use the prompt format
{system prompt}\n{content prompt}
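For illustration, here is a minimal sketch in Python (not taken from the InfiBench codebase) of this prompt construction. The two system prompts are quoted verbatim from above; the `natural_language_answer` flag and the optional `instruction_template` callable are hypothetical names for the per-question and per-model information described in this section.

```python
# Sketch of the prompt construction described above (illustrative only).

BRIEF_SYSTEM_PROMPT = (
    "You are a professional assistant for programmers. "
    "By default, questions and answers are in Markdown format. "
    "You are chatting with programmers, so please answer as briefly as possible."
)
DEFAULT_SYSTEM_PROMPT = (
    "You are a professional assistant for programmers. "
    "By default, questions and answers are in Markdown format."
)

def build_prompt(content_prompt: str, natural_language_answer: bool,
                 instruction_template=None) -> str:
    """Format one question into the prompt fed to the model."""
    system_prompt = BRIEF_SYSTEM_PROMPT if natural_language_answer else DEFAULT_SYSTEM_PROMPT
    if instruction_template is not None:
        # Preferred path: use the model's own default instruction template.
        return instruction_template(system_prompt, content_prompt)
    # Fallback format when no instruction template is specified.
    return f"{system_prompt}\n{content_prompt}"
```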

We adopt best@10 as the main evaluation metric: 10 responses are sampled and evaluated for each question, and the best score per question is recorded and summed up. Throughout the evaluation, we set the sampling temperature T to 0.2 and the top-p cutoff threshold to 0.9. We leave the exploration of other hyperparameters as future work.
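As a concrete illustration of the aggregation, the sketch below (not the official scoring code) shows how best@10 turns per-response scores into a percentage benchmark score; the per-response scores themselves come from the question-specific metrics described next.

```python
# Sketch of best@k aggregation: for each question, keep the best score among
# the k sampled responses, sum over questions, and report the percentage.

def best_at_k(per_question_scores: list[list[float]]) -> float:
    """per_question_scores[i] holds the scores (each in [0, 1]) of the k
    responses sampled for question i."""
    total = sum(max(scores) for scores in per_question_scores)
    return 100.0 * total / len(per_question_scores)

# Toy example with two questions and three samples each:
print(best_at_k([[0.4, 1.0, 0.6], [0.0, 0.2, 0.2]]))  # (1.0 + 0.2) / 2 -> 60.0
```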

For score computation, we treat each question equally, with one point each. Since the question frequency largely follows the Stack Overflow distribution, this score can be interpreted as how well the model responds to Stack Overflow questions. Given the 234 questions in the benchmark, the full score is 234, and by default we report the percentage score (achieved score divided by the full score of 234). The one point for each question can be further decomposed into a few scoring points within the question. For example, a question may contain four keywords with weights 2, 1, 1, and 1. Matching each keyword then contributes 0.4, 0.2, 0.2, and 0.2 points, respectively, to the final score.
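The weighted keyword case can be illustrated with a short sketch; the keyword list below is a made-up placeholder, not an actual InfiBench criterion, and real criteria may use other matching rules.

```python
# Sketch of weighted keyword scoring: each matched keyword contributes its
# weight divided by the total weight to the question's single point.

def keyword_score(response: str, weighted_keywords: dict[str, float]) -> float:
    total_weight = sum(weighted_keywords.values())
    earned = sum(w for kw, w in weighted_keywords.items()
                 if kw.lower() in response.lower())
    return earned / total_weight

# Four keywords with weights 2, 1, 1, 1, as in the example above:
keywords = {"functools.reduce": 2, "lambda": 1, "iterable": 1, "initializer": 1}
print(keyword_score("Use functools.reduce with a lambda.", keywords))  # 0.4 + 0.2 = 0.6
```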

Leaderboard


Each point corresponds to an open-source model, with error bars for those smaller than 30B. Each dotted segment corresponds to an MoE model. Proprietary models are shown as lines with uncertainty ranges.

Note: we set the maximum number of tokens to generate to 1024 (without this constraint, GPT-4 generates 662 tokens, so 1024 provides some wiggle room).

For models with >30B parameters, we evaluate once due to resource limits; otherwise, we evaluate three times and report the mean and standard deviation.

Proprietary models are listed with an unknown ('?') parameter count.


Rank Model Name # Params. (total / active for MoE) Context Length Full Set Score Full Set Std
1 GPT-4/GPT-4-0613 ? 8192 70.64% 0.82%
2 GPT-4/GPT-4-turbo-1106 ? 8192 68.42% 0.38%
3 GPT-4/GPT-4o-2024-05-13 ? 8192 66.19%
4 Claude 3/Claude 3 Opus ? 200000 63.89%
5 Mistral Open/Codestral-22b 22B 32768 62.98% 0.56%
6 DeepSeek Coder/deepseek-coder-33b-instruct 33B 16384 62.96%
7 Phind/Phind-CodeLlama-34B-v2 34B 4096 59.00%
8 Phind/Phind-CodeLlama-34B-v1 34B 4096 58.47%
9 Mistral/mistral-large ? 32768 58.22%
10 Claude 3/Claude 3 Sonnet ? 200000 58.20%
11 Claude 3/Claude 3 Haiku ? 200000 57.57%
12 DeepSeek LLM/deepseek-llm-67b-chat 67B 4096 57.41%
13 GPT-3.5/GPT-3.5-turbo-0613 ? 4096 56.47% 1.34%
14 Mistral/mistral-small ? 32768 55.62% 0.46%
15 Mistral Open/mixtral-8x7B-Instruct 46.7B / 12.9B 32768 55.55%
16 Qwen/Qwen-72B 72B 32768 55.34%
17 DeepSeek Coder/deepseek-coder-6.7b-instruct 6.7B 16384 53.25% 0.40%
18 Qwen/Qwen-72B-Chat 72B 32768 52.97%
19 Magicoder/Magicoder-S-CL-7B 7B 16384 52.71% 0.72%
20 WizardLM/WizardCoder-Python-34B-V1.0 34B 16384 52.59%
21 Phind/Phind-CodeLlama-34B-Python-v1 34B 4096 52.17%
22 Magicoder/Magicoder-S-DS-6.7B 6.7B 16384 51.46% 1.09%
23 Code Llama/CodeLlama-34b-Instruct 34B 16384 50.45%
24 01.AI/Yi-34B-Chat 34B 4096 49.58%
25 WizardLM/WizardCoder-Python-7B-V1.0 7B 16384 49.10% 1.59%
26 WizardLM/WizardCoder-Python-13B-V1.0 13B 16384 48.99% 0.92%
27 Code Llama/CodeLlama-34b 34B 16384 47.36%
28 Code Llama/CodeLlama-13b-Instruct 13B 16384 46.37% 1.26%
29 Zephyr/Zephyr 7B beta 7B 32768 46.31% 1.11%
30 StarCoder2/15B-Instruct 15B 16384 45.89% 0.95%
31 DeepSeek MoE/deepseek-moe-16b-chat 16B / 2.8B 16384 45.18% 1.65%
32 OctoPack/OctoCoder 15.5B 8192 44.55% 0.79%
33 Qwen/Qwen-14B 14B 8192 43.69% 1.09%
34 Qwen/Qwen-14B-Chat 14B 8192 43.49% 0.63%
35 Magicoder/Magicoder-DS-6.7B 6.7B 16384 43.47% 0.21%
36 Code Llama/CodeLlama-34b-Python 34B 16384 43.13%
37 Code Llama/CodeLlama-70b-Instruct 70B 4096 42.82%
38 StarCoder2/15B 15B 16384 42.52% 1.24%
39 Magicoder/Magicoder-CL-7B 7B 16384 41.71% 0.76%
40 Code Llama/CodeLlama-13b 13B 16384 41.66% 0.84%
41 DeepSeek Coder/deepseek-coder-1.3b-instruct 1.3B 16384 41.32% 1.12%
42 Code Llama/CodeLlama-13b-Python 13B 16384 41.31% 0.90%
43 WizardLM/WizardCoder-15B-V1.0 15B 2048 41.01% 0.22%
44 Mistral/mistral-medium ? 32768 40.95% 0.41%
45 gemma/gemma-7b-it 7B 8192 40.68% 1.23%
46 Code Llama/CodeLlama-70b 70B 4096 40.60%
47 Code Llama/CodeLlama-70b-Python 70B 4096 40.29%
48 OctoPack/OctoGeeX 6B 8192 40.14% 1.55%
49 DeepSeek LLM/deepseek-llm-67b-base 67B 4096 39.87%
50 Llama 2/Llama2-70B-Chat 70B 4096 39.30%
51 DeepSeek Coder/deepseek-coder-33b-base 33B 16384 38.75%
52 01.AI/Yi-6B-Chat 6B 4096 38.14% 0.58%
53 Llama 2/Llama2-70B 70B 4096 37.69%
54 Code Llama/CodeLlama-7b 7B 16384 37.62% 1.28%
55 Mistral Open/Mistral-7B-Instruct-v0.1 7B 32768 37.55% 1.10%
56 InternLM/InternLM-Chat-20B 20B 16384 37.41% 0.75%
57 Qwen/Qwen-7B-Chat 7B 32768 37.36% 1.29%
58 DeepSeek LLM/deepseek-llm-7b-chat 7B 4096 36.75% 1.40%
59 Llama 2/Llama2-7B-Chat 7B 4096 36.14% 1.05%
60 WizardLM/WizardCoder-3B-V1.0 3B 2048 35.61% 0.42%
61 Code Llama/CodeLlama-7b-Instruct 7B 16384 35.15% 1.02%
62 StarCoder2/7B 7B 16384 34.90% 0.97%
63 InternLM/InternLM-Chat-7B 7B 8192 34.86% 0.90%
64 Baichuan2/Baichuan2-13B-Chat 13B 4096 34.40% 1.34%
65 DeepSeek Coder/deepseek-coder-6.7b-base 6.7B 16384 33.66% 1.24%
66 Code Llama/CodeLlama-7b-Python 7B 16384 32.89% 0.45%
67 Llama 2/Llama2-13B-Chat 13B 4096 32.29% 1.66%
68 WizardLM/WizardCoder-1B-V1.0 1B 2048 31.94% 0.70%
69 Qwen/Qwen-7B 7B 32768 31.69% 0.29%
70 StarCoder2/3B 3B 16384 31.44% 1.92%
71 StarCoder/StarCoderPlus 15.5B 8192 30.67% 1.57%
72 StarCoder/StarCoder 15.5B 8192 30.66% 0.69%
73 CodeGen2.5/CodeGen2.5-7B-Instruct 7B 2048 29.57% 1.53%
74 Mistral/mistral-tiny ? 32768 29.41% 0.26%
75 InternLM/InternLM-20B 20B 16384 29.41% 0.76%
76 DeepSeek Coder/deepseek-coder-5.7bmqa-base 5.7B 16384 28.92% 1.12%
77 ChatGLM/ChatGLM3-6B 6B 8192 28.23% 0.58%
78 Baichuan2/Baichuan2-7B-Chat 7B 4096 27.53% 1.07%
79 gemma/gemma-2b-it 2B 8192 27.49% 0.52%
80 Qwen/Qwen-1.8B-Chat 1.8B 32768 26.84% 1.08%
81 DeepSeek MoE/deepseek-moe-16b-base 16B / 2.8B 16384 26.65% 0.97%
82 01.AI/Yi-9B 9B 4096 26.39% 0.42%
83 Baichuan2/Baichuan2-13B-Base 13B 4096 26.32% 1.23%
84 DeepSeek LLM/deepseek-llm-7b-base 7B 4096 25.34% 1.08%
85 Llama 2/Llama2-13B 13B 4096 24.50% 0.73%
86 Baichuan2/Baichuan2-7B-Base 7B 4096 23.50% 1.56%
87 DeepSeek Coder/deepseek-coder-1.3b-base 1.3B 16384 23.17% 1.47%
88 Qwen/Qwen-1.8B 1.8B 32768 23.12% 1.13%
89 Mistral Open/Mistral-7B-v0.1 7B 32768 22.72% 1.51%
90 Llama 2/Llama2-7B 7B 4096 22.35% 1.70%
91 01.AI/Yi-34B 34B 4096 22.01%
92 davinci/davinci-002 ? 16384 21.25% 1.17%
93 Mistral Open/mixtral-8x7B 46.7B / 12.9B 32768 21.21%
94 Phi/Phi1.5 1.3B 2048 20.56% 0.09%
95 01.AI/Yi-6B 6B 4096 19.93% 1.24%
96 CodeGeeX/CodeGeeX2-6B 6B 8192 19.88% 0.36%
97 CodeGen2/CodeGen2-16B 16B 2048 16.97% 1.15%
98 Phi/Phi2 2.7B 2048 16.74% 0.64%
99 InternLM/InternLM-7B 7B 8192 16.26% 2.21%
100 gemma/gemma-7b 7B 8192 16.05% 0.80%
101 IEITYuan/Yuan2-51B-hf 51B 4096 15.25%
102 gemma/gemma-2b 2B 8192 14.62% 0.50%
103 Phi/Phi1 1.3B 2048 14.28% 0.99%
104 CodeGen/CodeGen-16B-multi 16B 2048 13.62% 1.18%
105 IEITYuan/Yuan2-102B-hf 102B 4096 10.48%
106 IEITYuan/Yuan2-2B-hf 2B 8192 7.28% 1.01%

Try the Benchmark!

Note: only Linux environments are supported at the moment.

  1. Convert or save your model weights in Hugging Face Transformers format (see the sketch after this list).
  2. Clone our code repository.
  3. Follow the short tutorial to generate responses and evaluate on InfiBench!
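For step 1, the following minimal sketch shows one way to save a model and its tokenizer in Hugging Face Transformers format; the model ID and output directory are placeholders, and the exact loading conventions are described in the repository tutorial.

```python
# Save a model and its tokenizer in Hugging Face Transformers format
# (placeholder model ID and output path; adjust to your own model).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("./my-model-hf")
tokenizer.save_pretrained("./my-model-hf")
```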

Feedback


You can also give us feedback through the discussion and issue posts of our repositories.

BibTeX

@misc{inficodereval,
  author = {InfiCoder Team},
  title = {InfiBench: Evaluating the Question-Answering Capabilities of Code LLMs},
  year = {2024},
  publisher = {GitHub Pages},
  howpublished = {\url{https://infi-coder.github.io/infibench/}}
}