Simpler Dataset Files #120

Open
@pmbittner

Currently, datasets are given as markdown files with lots of unused columns:

| Project name | Domain | Source code available (yes/no)? | Is it a git repository (yes/no)? | Repository URL | Clone URL | Estimated number of commits |
|---|---|---|---|---|---|---|
| apache-httpd | web server | y | y | https://github.com/apache/httpd | https://github.com/DiffDetective/httpd.git | 32,927 |
| berkeley-db-libdb | database system | y | y | https://github.com/berkeleydb/libdb | https://github.com/DiffDetective/libdb.git | 7 |

In fact, our dataset loader only uses the project name and the clone URL. Hence, the dataset files and their loading should be simplified. The Domain and Repository URL columns are interesting but not essential, so maybe they could stay in the files but be the last two columns.

Also, except for line 2 of the file, a markdown file with just a single table like this is actually a CSV file with | as the separator instead of , or ;. So maybe we could reuse our CSV IO classes here.
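To illustrate the idea, here is a minimal sketch of reading such a file as |-separated values while skipping line 2. It does not use our existing CSV IO classes; the class `PipeTableDatasetSketch`, the record `DatasetEntry`, and the methods `cells` and `parse` are hypothetical names made up for this example. It assumes Java 16+ (records, `Stream.toList()`) and that the header cells are literally `Project name` and `Clone URL`.

```java
import java.util.ArrayList;
import java.util.List;

public class PipeTableDatasetSketch {
    /** Hypothetical value type holding the two fields the loader needs. */
    public record DatasetEntry(String name, String cloneUrl) {}

    /** Splits one table row on '|' and trims every cell. */
    public static List<String> cells(String row) {
        String trimmed = row.trim();
        // Drop the optional leading and trailing pipe of a markdown row.
        if (trimmed.startsWith("|")) trimmed = trimmed.substring(1);
        if (trimmed.endsWith("|")) trimmed = trimmed.substring(0, trimmed.length() - 1);
        List<String> result = new ArrayList<>();
        // Limit -1 keeps empty trailing cells instead of discarding them.
        for (String cell : trimmed.split("\\|", -1)) {
            result.add(cell.trim());
        }
        return result;
    }

    /** Parses a markdown file that consists of a single table. */
    public static List<DatasetEntry> parse(String markdown) {
        List<String> lines = markdown.lines().filter(l -> !l.isBlank()).toList();
        if (lines.size() < 2) {
            throw new IllegalArgumentException("Expected a header row and a separator row");
        }
        // Look columns up by header name, so their order in the file does not matter.
        List<String> header = cells(lines.get(0));
        int nameCol = header.indexOf("Project name");
        int urlCol = header.indexOf("Clone URL");
        if (nameCol < 0 || urlCol < 0) {
            throw new IllegalArgumentException("Missing 'Project name' or 'Clone URL' column");
        }
        List<DatasetEntry> entries = new ArrayList<>();
        // Index 1 is the markdown separator row (line 2 of the file), so start at index 2.
        for (int i = 2; i < lines.size(); i++) {
            List<String> row = cells(lines.get(i));
            if (row.size() <= Math.max(nameCol, urlCol)) {
                throw new IllegalArgumentException("Too few columns in row: " + lines.get(i));
            }
            entries.add(new DatasetEntry(row.get(nameCol), row.get(urlCol)));
        }
        return entries;
    }

    public static void main(String[] args) {
        String table = String.join("\n",
            "| Project name | Clone URL | Domain |",
            "|---|---|---|",
            "| apache-httpd | https://github.com/DiffDetective/httpd.git | web server |",
            "| berkeley-db-libdb | https://github.com/DiffDetective/libdb.git | database system |");
        parse(table).forEach(System.out::println);
    }
}
```

Because columns are looked up by header name, Domain and Repository URL could be moved to the end or dropped without touching the loader. The sketch does not handle escaped pipes (`\|`) inside cells.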
