Description
Currently, datasets are given as markdown files with lots of unused columns:
Project name | Domain | Source code available (yes/no)? | Is it a git repository (yes/no)? | Repository URL | Clone URL | Estimated number of commits |
---|---|---|---|---|---|---|
apache-httpd | web server | y | y | https://github.com/apache/httpd | https://github.com/DiffDetective/httpd.git | 32,927 |
berkeley-db-libdb | database system | y | y | https://github.com/berkeleydb/libdb | https://github.com/DiffDetective/libdb.git | 7 |
Our dataset loader in fact only uses the project name and clone URL. Hence, dataset files and the loading should be simplified. The columns for `Domain` and `Repository URL` are interesting but not essential. So maybe these could stay in the files but be the last two columns.
Also, except for line 2 of the file, markdown files with just a single table like this are actually CSV files with `|` as separator instead of `,` or `;`. So maybe we could reuse our CSV IO classes here.
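To illustrate the idea, here is a minimal sketch in Python (chosen only for brevity; it is not the project's loader and does not use our CSV IO classes). It treats the file as `|`-separated CSV, skips the `---|---` line, and looks up the two needed columns by header name, so column order would no longer matter. The function name `load_datasets` and the `SAMPLE` constant are hypothetical, and the sketch assumes rows have no leading `|`, as in the table above.

```python
import csv
import io

# Two rows copied from the table above, used only as sample input.
SAMPLE = """\
Project name | Domain | Source code available (yes/no)? | Is it a git repository (yes/no)? | Repository URL | Clone URL | Estimated number of commits |
---|---|---|---|---|---|---|
apache-httpd | web server | y | y | https://github.com/apache/httpd | https://github.com/DiffDetective/httpd.git | 32,927 |
berkeley-db-libdb | database system | y | y | https://github.com/berkeleydb/libdb | https://github.com/DiffDetective/libdb.git | 7 |
"""


def load_datasets(text):
    """Read a single-table markdown file as '|'-separated CSV.

    Returns a list of (project name, clone URL) pairs.
    Assumes rows do not start with a leading '|'.
    """
    table = []
    for row in csv.reader(io.StringIO(text), delimiter="|"):
        cells = [cell.strip() for cell in row]
        # A trailing '|' leaves one empty cell at the end of the row.
        if cells and cells[-1] == "":
            cells.pop()
        if cells:  # skip blank lines
            table.append(cells)

    header = table[0]
    body = table[2:]  # table[1] is the markdown '---|---' line
    name_col = header.index("Project name")
    url_col = header.index("Clone URL")
    return [(row[name_col], row[url_col]) for row in body]


if __name__ == "__main__":
    for name, clone_url in load_datasets(SAMPLE):
        print(name, clone_url)
```

Looking columns up by name rather than by position means `Domain` and `Repository URL` could be moved to the end, or dropped, without touching the loader.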