No new submissions to the benchmark will be accepted. However, we would like to encourage practitioners and researchers to continue using the dataset and the human relevance annotations. Please see the main README for more information.
The Weights & Biases (W&B) benchmark tracks and compares models trained on the CodeSearchNet dataset by the global machine learning research community. Anyone is welcome to submit their results for review.
The leaderboard is available at https://app.wandb.ai/github/codesearchnet/benchmark/leaderboard.
There are a few requirements for submitting a model to the benchmark.
- You must have a run logged to W&B.
- Your run must have attached inference results in a file named `model_predictions.csv`. You can view all the files attached to a given run in the browser by clicking the "Files" icon from that run's main page.
- The schema outlined in the submission format section below must be strictly followed.
To submit from our baseline model, skip to the section on training the baseline model below.
A valid submission to the CodeSearchNet Challenge requires a file named `model_predictions.csv` with the following fields: `query`, `language`, `identifier`, and `url`:
- `query`: the textual representation of the query, e.g. "int to string".
- `language`: the programming language for the given query, e.g. "python". This information is available as a field in the data to be scored.
- `identifier`: an optional field that can help you track your data.
- `url`: the unique GitHub URL to the returned results, e.g. "https://github.com/JamesClonk/vultr/blob/fed59ad207c9bda0a5dfe4d18de53ccbb3d80c91/cmd/commands.go#L12-L190". This information is available as a field in the data to be scored.
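Because the schema must be followed strictly, it can help to check a predictions file locally before attaching it to a run. The following is a minimal sketch using only the Python standard library; the function name `check_predictions` and the specific checks are assumptions made for illustration, not the validation that the benchmark itself performs.

```python
import csv
import io

# `identifier` is optional according to the field list, so it is not required here.
REQUIRED_FIELDS = ("query", "language", "url")


def check_predictions(csv_text):
    """Return a list of problems found in the CSV text (empty if none were found)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in REQUIRED_FIELDS if name not in header]
    if problems:
        return problems
    # Data rows are counted from 1; the header row is not counted.
    for row_no, row in enumerate(reader, start=1):
        for name in REQUIRED_FIELDS:
            if not (row[name] or "").strip():
                problems.append(f"row {row_no}: empty {name}")
    return problems


# Illustrative input only: one row built from the example values in the field list.
sample = (
    "query,language,identifier,url\n"
    "int to string,go,,https://github.com/JamesClonk/vultr/blob/"
    "fed59ad207c9bda0a5dfe4d18de53ccbb3d80c91/cmd/commands.go#L12-L190\n"
)
print(check_predictions(sample))
```

To check a file on disk, read its contents first, for example with `open("model_predictions.csv", encoding="utf-8").read()`.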
For further background and instructions on the submission process, see the root README.
The row order corresponds to the result ranking in the search task. For example, if in row 5 there is an entry for the Python query "read properties file", and in row 60 another result for the Python query "read properties file", then the URL in row 5 is considered to be ranked higher than the URL in row 60 for that query and language.
Here is an example:
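As a minimal sketch of producing a file in this format, the code below writes ranked results with Python's `csv` module. The input shape (a dict from `(query, language)` to URLs ordered best-first), the function name `write_predictions`, and the `OWNER/REPO/SHA` URLs are placeholders invented for illustration; they are not real results.

```python
import csv
import io

FIELDS = ["query", "language", "identifier", "url"]


def write_predictions(ranked_results, handle):
    """Write one CSV row per result to an open text handle.

    `ranked_results` maps (query, language) to a list of URLs ordered from
    most to least relevant, so earlier rows rank higher for that query.
    """
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for (query, language), urls in ranked_results.items():
        for url in urls:
            # `identifier` is optional and left empty in this sketch.
            writer.writerow({"query": query, "language": language, "identifier": "", "url": url})


# Placeholder data: the URLs below are made up and only show the expected shape.
example = {
    ("read properties file", "python"): [
        "https://github.com/OWNER/REPO/blob/SHA/config.py#L10-L42",
        "https://github.com/OWNER/REPO/blob/SHA/utils.py#L5-L30",
    ],
    ("int to string", "python"): [
        "https://github.com/OWNER/REPO/blob/SHA/convert.py#L1-L8",
    ],
}

buffer = io.StringIO()
write_predictions(example, buffer)
print(buffer.getvalue())
```

To write the actual submission file, pass a handle opened with `open("model_predictions.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer.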
You can submit your results to the benchmark as follows:
1. Run a training job with any script (your own or the baseline example provided, with or without W&B logging).
2. Generate your own file of model predictions following the format above and name it `model_predictions.csv`.
3. Upload a run to wandb with this `model_predictions.csv` file attached.
Our example script `src/predict.py` takes care of steps 2 and 3 for a model training run that has already been logged to W&B. It needs the corresponding W&B run id, which you can find on the /overview page in the browser or by clicking the 'info' icon on a given run.
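If you are not using `src/predict.py`, attaching the file to a run can be sketched with the `wandb` Python client as below. This is not how `src/predict.py` works: the sketch starts a new run rather than resuming an existing one by id, the helper names are hypothetical, and the project name passed in is whatever project you log to. It assumes `wandb` is installed, you are logged in, and the file sits in the current working directory.

```python
from pathlib import Path

EXPECTED_NAME = "model_predictions.csv"


def check_filename(path):
    """Return the path as a Path, or raise if the file name is not the required one."""
    path = Path(path)
    if path.name != EXPECTED_NAME:
        raise ValueError(f"expected a file named {EXPECTED_NAME}, got {path.name}")
    return path


def upload_predictions(path, project):
    """Start a new W&B run in `project` and attach the predictions file to it."""
    import wandb  # imported lazily so the filename check works without wandb installed

    path = check_filename(path)
    run = wandb.init(project=project)
    wandb.save(str(path))  # the file then appears under the run's "Files" view
    run.finish()
```

A call would look like `upload_predictions("model_predictions.csv", project="my-codesearchnet-project")`, where the project name is a placeholder.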
You've now generated all the content required to submit a run to the CodeSearchNet benchmark. Using the W&B GitHub integration you can now submit your model for review via the web app.
You can submit your runs by visiting the run page and clicking on the overview tab, or by visiting the project page and selecting a run from the runs table.
Once you upload your `model_predictions.csv` file, W&B will compute the normalized discounted cumulative gain (NDCG) of your model's predictions against the human-annotated relevance scores. Further details on the evaluation process and metrics are in the root README. For transparency, we include the script used to evaluate submissions: src/relevanceeval.py
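For intuition about the metric, here is a minimal sketch of NDCG for a single query. It assumes the common formulation with gain `2**rel - 1` and a `log2(rank + 1)` discount; `src/relevanceeval.py` may differ in details such as how unannotated results are handled, so treat this as an illustration rather than the official evaluation.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance scores listed in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance scores for one query, in the order the model ranked the results.
print(ndcg([3, 2, 1, 0]))  # already in ideal order
print(ndcg([0, 1]))        # the relevant result is ranked second
```

A ranking that places the most relevant results first scores 1.0, and pushing relevant results down the list lowers the score.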
Replicating our results for the CodeSearchNet baseline is optional, as we encourage the community to create their own models and methods for ranking search results. To replicate our baseline submission, you can start with the "Quickstart" instructions in the CodeSearchNet GitHub repository. This baseline model uses src/predict.py to generate the submission file.
Your run will be logged to W&B, within a project that will be automatically linked to this benchmark.
Only one submission to the benchmark leaderboard is allowed every two weeks. We do not intend for participants to make many submissions with different parameters, as this kind of overfitting is counterproductive. There are no cash prizes; the idea is to learn from this dataset, for example by applying the learned representations or trying new techniques.