InfiCoder-Eval: Systematically Evaluating Question-Answering for Code Large Language Models

InfiCoder Team @ ByteDance Ltd. and Peking University

InfiCoder-Eval is a systematic benchmark and evaluation framework for the free-form question-answering ability of code language models.

Overview

Large language models for code (code LLMs) have made tremendous progress. Existing evaluation benchmarks for code LLMs, such as HumanEval, DS-1000, and MBPP, predominantly focus on code generation and are therefore insufficient for evaluating the multifaceted abilities of code LLMs. To fill this gap, we propose InfiCoder-Eval, a large-scale free-form question-answering (QA) benchmark for code. InfiCoder-Eval comprises 270 carefully selected high-quality Stack Overflow questions covering 18 programming languages. To tackle the evaluation challenge, InfiCoder-Eval includes an evaluation framework that integrates four types of model-free metrics, with the concrete criteria for each question designed by domain experts. As confirmed by human studies, InfiCoder-Eval evaluation aligns with human judgment better than model-based evaluation while running much faster. We conduct a systematic evaluation of more than 30 code LLMs with InfiCoder-Eval, leading to several interesting findings. For example, although open-source code LLMs show performance competitive with proprietary models on code generation (e.g., HumanEval), they still lag considerably behind proprietary ones on InfiCoder-Eval, and even the best proprietary LLM (GPT-4) remains far from perfect: the best open-source model, Deepseek-Coder 33B Instruct, achieves 50.34%, while GPT-4 achieves 59.13%. Furthermore, our detailed analysis reveals several weaknesses of current code LLMs. The benchmark, evaluation tools, and detailed results are all publicly available.

Statistics and Examples

InfiCoder-Eval comprises 270 carefully selected high-quality Stack Overflow questions, covering 18 programming languages and largely following the natural question distribution of Stack Overflow.


We recruited five domain experts to create the benchmark and annotate the correctness evaluation criteria. Specifically, the InfiCoder-Eval framework integrates four types of model-free metrics for evaluating correctness: keyword matching, blank filling, unit testing, and dialogue similarity.
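
To give a flavor of the simplest metric, keyword matching, the sketch below checks which annotated keywords appear in a response and sums their normalized weights. This is a minimal illustration only; the function name and data layout are ours, not the actual grader API in the evaluation repo.

    # Minimal sketch of a keyword-matching scorer; names and data layout are illustrative,
    # not the actual grader API from the evaluation repo.
    def keyword_match_score(response: str, keyword_weights: dict) -> float:
        # Weighted fraction of annotated keywords that appear in the response, in [0, 1].
        total = sum(keyword_weights.values())
        matched = sum(w for kw, w in keyword_weights.items() if kw.lower() in response.lower())
        return matched / total

    # Example: three of the four annotated keywords match, so the score is (2 + 1 + 1) / 5 = 0.8.
    print(keyword_match_score("Use functools.lru_cache as a decorator.",
                              {"lru_cache": 2, "functools": 1, "memoize": 1, "decorator": 1}))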


Below are statistics on question types, metric types, and lengths.

Comparison

Existing benchmarks weigh heavily toward code generation, unit-test-based evaluation, and a limited set of programming languages. InfiCoder-Eval offers much higher diversity to reflect real-world usage scenarios of code LLMs and is far from saturation.

Prompts and Evaluation Protocol

Each question contains a system prompt and a content prompt. For questions whose responses are mainly in natural language, the system prompt is
You are a professional assistant for programmers. By default, questions and answers are in Markdown format. You are chatting with programmers, so please answer as briefly as possible.
For other questions, the system prompt is
You are a professional assistant for programmers. By default, questions and answers are in Markdown format.
We then format the system prompt and content prompt following each model's default instruction template. If no instruction template is specified, we use the prompt format
{system prompt}\n{content prompt}
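
For models without their own instruction template, the default assembly above amounts to a plain concatenation, roughly as in this sketch (variable and function names are ours for illustration):

    # Default prompt assembly when no instruction template is specified (illustrative sketch).
    SYSTEM_PROMPT = ("You are a professional assistant for programmers. "
                     "By default, questions and answers are in Markdown format.")

    def build_prompt(system_prompt: str, content_prompt: str) -> str:
        # {system prompt}\n{content prompt}
        return f"{system_prompt}\n{content_prompt}"

    prompt = build_prompt(SYSTEM_PROMPT, "How do I remove duplicates from a list in Python?")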

We adopt best@10 as the main evaluation metric: 10 responses are sampled and evaluated for each question, and the best score per question is recorded and summed up. Throughout the evaluation, we set the sampling temperature T to 0.2 and the top-p cutoff threshold to 0.9. We leave the exploration of other hyperparameters as future work.
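
Concretely, best@10 keeps the highest score among the 10 sampled responses for each question and sums these maxima over all questions. The sketch below illustrates the aggregation (names are ours, not the framework's API):

    # best@k aggregation sketch: per question, keep the highest score among the k sampled responses.
    def best_at_k_percentage(per_question_scores):
        # per_question_scores[i] holds the scores (each in [0, 1]) of the sampled responses for question i.
        total = sum(max(scores) for scores in per_question_scores)
        full_score = len(per_question_scores)   # one point per question
        return 100.0 * total / full_score       # reported as a percentage

    # Two questions, three samples each (k shrunk to 3 for brevity; the benchmark uses k=10).
    print(best_at_k_percentage([[0.2, 0.6, 0.4], [1.0, 0.0, 0.8]]))  # -> 80.0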

For score computation, we treat each question equally, with one point each. Since the question frequency largely follows the Stack Overflow distribution, this score can be interpreted as how well the model answers Stack Overflow questions. Given the 270 questions in the benchmark, the full score is 270, and by default we report the percentage score (achieved score divided by the full score of 270). The one point for each question can be further decomposed into several scoring points within the question. For example, a question may contain four keywords with weights 2, 1, 1, and 1; matching these keywords then contributes 0.4, 0.2, 0.2, and 0.2 points, respectively, toward the question's score.
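
As a quick arithmetic check of the example above (purely illustrative, not the grader's code), the keyword weights are normalized by their sum, so weights 2, 1, 1, 1 translate into 0.4, 0.2, 0.2, 0.2 of the question's single point:

    # Scoring points within one question are worth weight / total_weight of the question's 1 point.
    weights = [2, 1, 1, 1]
    points = [w / sum(weights) for w in weights]
    print(points)  # [0.4, 0.2, 0.2, 0.2]
    # If only the first and third keywords match, the question scores 0.4 + 0.2 = 0.6 of its point.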

Leaderboard


Each blue point corresponds to one open-source model, with error bars for models with fewer than 30B parameters. Proprietary models are plotted as lines with uncertainty ranges.

Note: we set the maximum number of generated tokens to 1024 (GPT-4 generates 662 tokens without this constraint, so 1024 provides some headroom).

We evenly split the 270 benchmark questions into a 135-question dev set and a 135-question test set. The dev set is publicly available, while the test set is held out; evaluation on the test set is available upon request (see below for instructions). Models are ranked according to their full set scores.

For models with more than 30B parameters, we evaluate once due to resource limits; otherwise, we evaluate three times and report the mean and standard deviation.


| Rank | Model Name | # Params. (B) | Full Set Score | Full Set Std | Dev Set Score | Dev Set Std | Test Set Score | Test Set Std |
|------|------------|---------------|----------------|--------------|---------------|-------------|----------------|--------------|
| 1 | GPT-4 (0613) | / | 59.13% | 0.58% | 61.03% | 1.14% | 57.23% | 0.19% |
| 2 | deepseek-coder-33b-instruct | 33 | 50.34% | - | 50.13% | - | 50.55% | - |
| 3 | GPT-3.5-turbo (0613) | / | 46.84% | 0.70% | 46.45% | 0.72% | 47.24% | 1.75% |
| 4 | WizardCoder-Python-34B-V1.0 | 34 | 45.16% | - | 46.52% | - | 43.80% | - |
| 5 | CodeLlama-34B-Instruct | 34 | 43.71% | - | 44.03% | - | 43.39% | - |
| 6 | deepseek-coder-6.7b-instruct | 6.7 | 42.97% | 0.22% | 42.28% | 0.24% | 43.66% | 0.26% |
| 7 | WizardCoder-Python-13B-V1.0 | 13 | 41.22% | 0.75% | 41.46% | 1.13% | 40.97% | 0.42% |
| 8 | WizardCoder-Python-7B-V1.0 | 7 | 40.30% | 1.15% | 43.70% | 2.09% | 36.90% | 0.22% |
| 9 | CodeLlama-34B | 34 | 39.75% | - | 39.71% | - | 39.79% | - |
| 10 | Zypher-7b-beta | 7 | 39.59% | 0.68% | 41.20% | 0.25% | 37.97% | 1.29% |
| 11 | OctoCoder | 15.5 | 37.72% | 0.58% | 35.45% | 0.56% | 40.00% | 1.69% |
| 12 | CodeLlama-13B-Instruct | 13 | 37.18% | 0.51% | 36.03% | 1.37% | 38.34% | 0.36% |
| 13 | Qwen-14B-Chat | 14 | 36.93% | 0.12% | 36.19% | 0.75% | 37.68% | 0.91% |
| 14 | CodeLlama-34B-Python | 34 | 36.36% | - | 34.75% | - | 37.98% | - |
| 15 | CodeLlama-13B | 13 | 33.92% | 0.64% | 31.78% | 1.01% | 36.06% | 0.49% |
| 16 | WizardCoder-15B-V1.0 | 15 | 33.34% | 0.74% | 31.85% | 0.60% | 34.83% | 1.86% |
| 17 | OctoGeeX | 6 | 32.60% | 1.02% | 31.55% | 2.01% | 33.65% | 0.91% |
| 18 | Qwen-7B-Chat | 7 | 32.48% | 0.71% | 31.87% | 0.66% | 33.10% | 0.77% |
| 19 | CodeLlama-13B-Python | 13 | 32.43% | 0.42% | 31.09% | 1.85% | 33.78% | 1.02% |
| 20 | CodeLlama-7B | 7 | 31.07% | 0.87% | 30.05% | 1.79% | 32.09% | 0.05% |
| 21 | WizardCoder-3B-V1.0 | 3 | 30.94% | 0.60% | 31.33% | 0.80% | 30.56% | 0.58% |
| 22 | Baichuan2-13B-Chat | 13 | 30.34% | 0.76% | 30.15% | 0.73% | 30.52% | 1.32% |
| 23 | CodeLlama-7B-Instruct | 7 | 29.51% | 0.97% | 27.96% | 0.81% | 31.06% | 1.18% |
| 24 | CodeLlama-7B-Python | 7 | 28.88% | 0.45% | 27.51% | 1.09% | 30.24% | 1.97% |
| 25 | WizardCoder-1B-V1.0 | 1 | 27.11% | 0.85% | 26.53% | 1.12% | 27.70% | 0.72% |
| 26 | StarCoder | 15.5 | 26.79% | 0.18% | 28.74% | 0.78% | 24.84% | 0.96% |
| 27 | StarCoderPlus | 15.5 | 26.07% | 1.25% | 24.45% | 1.55% | 27.69% | 2.13% |
| 28 | CodeGen2.5-7B-instruct | 7 | 25.67% | 1.57% | 23.36% | 1.77% | 27.98% | 1.46% |
| 29 | Baichuan2-7B-Chat | 7 | 24.39% | 0.25% | 25.44% | 0.02% | 23.34% | 0.50% |
| 30 | davinci-002 | / | 19.08% | 1.00% | 17.47% | 1.56% | 20.70% | 0.70% |
| 31 | phi-1.5 | 1.5 | 16.63% | 0.03% | 14.10% | 0.84% | 19.16% | 0.87% |
| 32 | CodeGeeX2 | 6 | 16.50% | 0.39% | 16.20% | 0.65% | 16.80% | 0.33% |
| 33 | phi-1 | 1.3 | 12.84% | 0.73% | 11.45% | 0.19% | 14.23% | 1.28% |

("/" indicates an undisclosed parameter count; "-" indicates a single evaluation run, so no standard deviation is reported.)

Try the Benchmark!

Step 0: Setup

  1. Convert or save your model weights in Hugging Face Transformers format.

  2. Clone the two repositories: Inference Repo and Evaluation Repo. There is no requirement on the local directory paths.

  3. Set the global environment variable:

    export INFERENCE_REPO_PATH=[evaluation repo]/batched_prompts/suite_v2.0.0_dev.csv

Step 1: Generate Responses for Your Model

  1. Set the working directory to the Inference Repo. The inference repo is forked and slightly modified from the bigcode-evaluation-harness framework; we leverage its inference functionality.

  2. Determine the prompt format to use, which corresponds to a task name. We currently support the following formats: code-ffqa-v2 (the default one, system + '\n' + content), code-ffqa-v2-endn (system + '\n' + content + '\n'), code-ffqa-v2-deepseek-chat (deepseek-coder-instruct format), code-ffqa-v2-baichuan2 (baichuan2 model format), code-ffqa-v2-zypher (zypher-7b-beta format), code-ffqa-v2-octo (octopack model format), code-ffqa-v2-wizard (wizard-python model format), code-ffqa-v2-phi (phi-1.5 model format), and code-ffqa-v2-inficoder (our InfiCoder model format).

    Feel free to contribute by adding your model's format, which is easy - just slightly modify bigcode_eval/tasks/code_ffqa_v200.py.

  3. Run batch inference to generate responses for the question prompts:

    accelerate launch [inference repo dir]/main.py --model [your model path / hugging face hub path] --tasks [determined task name above] --batch_size [batch_size] --precision bf16 --n_samples 30 --do_sample True --temperature 0.2 --top_p 0.9 --save_generations --save_references --trust_remote_code --generation_only --max_new_tokens 1024 --save_generations_path [output raw response file path].json --eos='[EOS string]'

    This command outputs two files in your working directory: [output raw response file path].json, which stores the responses, and references.json, which stores the case names used as the index.

  4. Export the responses and case names to an evaluation-compatible csv file:

    python3 [inference repo dir]/ffqa_processor.py [output raw response file path].json references.json [response csv file].csv

    This command joins the two output files above into a single csv file, [response csv file].csv, which can be processed by the evaluation framework below; a conceptual sketch of this join follows this list.
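
For intuition, the join performed by ffqa_processor.py conceptually pairs each case name with its sampled responses and writes them into one CSV. The sketch below is hypothetical: the actual script and the JSON schema it expects are defined in the inference repo and may differ.

    # Hypothetical sketch of joining generations with case names into a CSV.
    # Assumed (not verified) schema: the generations JSON is a list with one entry per question,
    # each entry being a list of sampled responses; references.json lists case names in the same order.
    import csv, json, sys

    def join_to_csv(generations_path, references_path, out_csv_path):
        with open(generations_path) as f:
            generations = json.load(f)
        with open(references_path) as f:
            case_names = json.load(f)
        with open(out_csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["case", "response"])
            for case, responses in zip(case_names, generations):
                for response in responses:
                    writer.writerow([case, response])

    if __name__ == "__main__":
        join_to_csv(sys.argv[1], sys.argv[2], sys.argv[3])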


Step 2: Evaluation (Dev Set)

  1. Set up the evaluation framework (Evaluation Repo). At this point, we only support the Linux environment.

    Run pip3 install -r requirements.txt, then ./setup.sh (time-consuming, usually 1-2 hours), which installs the necessary compilers and packages for the multilingual execution environment.

  2. Check the evaluation environment.

    Run python3 env_check.py to check and fix any environment incompatibilities according to the console output. If the console output is "You're good to go.", you can proceed.

  3. Unpack the csv output file from the previous inference step into a directory where each response is stored as a separate txt file:

    python3 adaptors/csv_response_unpacker.py [response csv file].csv [response save dir]

    We recommend saving the responses in a directory under responses/, i.e., [response save dir]=responses/.... The above script creates the [response save dir] directory if it does not exist.

  4. Run the main evaluation:

    python3 grader_main.py suite_v2.0.0_dev.yaml [response save dir]

    The evaluation takes around 15-45 minutes.

    When it finishes, there are two output files: results/suite_v2.0.0_dev_[response save dir base name].txt (short summary) and results/suite_v2.0.0_dev_[response save dir base name].yaml (full details).

    You can also customize the output paths with the --result_summary_path and --result_detail_path arguments, respectively.

  5. Get statistics and print the results:

    python3 print_result_stat.py [result detail path] [summary txt path]

    The console output and [summary txt path] will contain a formatted table with the overall score and percentage, as well as sub-scores for each question type, metric type, and programming language.


Step 3: Evaluation (Test Set)

Available upon request (email us).

Feedback


You can also give us feedback via the discussion and issue posts of our repositories.


BibTeX

@misc{li2023inficodereval,
  author = {InfiCoderTeam},
  title = {InfiCoder-Eval: Systematically Evaluating Question-Answering for Code Large Language Models},
  year = {2023},
  publisher = {GitHub Pages},
  howpublished = {\url{https://infi-coder.github.io/inficoder-eval/}}
}