Normalize software mentions (a.k.a data cleaning 🧹) #1
Comments
In a hurry, we may do it once in a way that includes manual edits, but ideally this should be automated so that we can run the same procedure for the next version of the dataset. For example, it could be a Python script that you can re-run multiple times: it reports the entries it can't handle, you adapt it, and then you re-run it until it passes with no anomalies left.
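A minimal sketch of such a re-runnable script, under stated assumptions: the alias table `ALIASES` and the function name `normalize` are hypothetical, and the only entries shown are the SPSS variants quoted in this issue. Anything the table does not cover is collected and reported instead of being silently dropped, so the table can be extended and the script run again.

```python
from collections import Counter

# Hypothetical alias table: lower-cased variant -> canonical name.
# Extend it after each run until nothing is reported as unhandled.
ALIASES = {
    "spss": "SPSS",
    "spss statistics": "SPSS",
    "statistical package for social sciences (spss)": "SPSS",
}


def normalize(mentions, aliases=ALIASES):
    """Return (canonical names, Counter of mentions with no alias entry)."""
    normalized = []
    unhandled = Counter()
    for raw in mentions:
        # Collapse whitespace and ignore case before the lookup.
        key = " ".join(raw.split()).casefold()
        if key in aliases:
            normalized.append(aliases[key])
        else:
            unhandled[raw] += 1
    return normalized, unhandled


if __name__ == "__main__":
    sample = [
        "Statistical Package for Social Sciences (SPSS)",
        "SPSS",
        "SPSS Statistics",
        "Stata",  # not in the table yet, so it is reported
    ]
    done, todo = normalize(sample)
    print(done)
    for name, count in todo.most_common():
        print(f"unhandled: {name!r} ({count})")
```

A run that reports no unhandled entries corresponds to the "passes with no anomalies left" state described above.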
This should also take into account different capitalisations (e.g. Matlab/MATLAB).
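One way to handle capitalisation variants could look like the sketch below. It assumes the input is a mapping from mention to occurrence count, and it keeps the most frequent spelling as the canonical one; the function name `merge_case_variants` and that tie-breaking choice are assumptions, not something decided in this issue.

```python
from collections import Counter, defaultdict


def merge_case_variants(counts):
    """Merge names that differ only in case or whitespace.

    The most frequent spelling in each group is kept as the canonical
    name (ties are broken by spelling, so the result is deterministic).
    """
    groups = defaultdict(Counter)
    for name, n in counts.items():
        key = " ".join(name.split()).casefold()
        groups[key][name] += n

    merged = {}
    for variants in groups.values():
        canonical = max(variants, key=lambda v: (variants[v], v))
        merged[canonical] = sum(variants.values())
    return merged


if __name__ == "__main__":
    # Counts are made up for illustration only.
    print(merge_case_variants({"MATLAB": 10, "Matlab": 4, "R": 7}))
```

Picking the most frequent spelling avoids hard-coding a canonical form, but an explicit alias table would still be needed for cases where the popular spelling is not the official one.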
Hello! @alex-konovalov @sdruskat I did a fair amount of cleaning, but not all of it (there are so many variants and typos) 🤯. I started with this file: CORD19_software_popularity.csv. I did this in R, so I have a script to share. Where do you want it? There are still minor rows, with 1 to 6 occurrences each, that I haven't merged yet. I am aware, but writing all the
Looking at this again, I noticed that GraphPad Prism, Graphpad and Prism are three distinct items, but maybe they should also be merged.
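Cases like this are probably safer to flag for review than to merge automatically, since a short name such as Prism could in principle refer to something else. As a rough sketch (the function name `candidate_merges` and the token-subset heuristic are assumptions), the following lists pairs where one name's words are a proper subset of another's:

```python
def candidate_merges(names):
    """Return sorted (shorter, longer) pairs for manual review.

    A pair is reported when every word of the shorter name also occurs
    in the longer name, ignoring case.
    """
    tokens = {name: frozenset(name.casefold().split()) for name in names}
    pairs = []
    for a in tokens:
        for b in tokens:
            # "<" on frozensets tests for a proper subset.
            if a != b and tokens[a] and tokens[a] < tokens[b]:
                pairs.append((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    for short, long_ in candidate_merges(["GraphPad Prism", "Graphpad", "Prism", "SPSS"]):
        print(f"review: {short!r} -> {long_!r}?")
```

The output is only a list of suggestions; whether each pair really is the same software still needs a human decision.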
What do we have?
The software packages that have been used for COVID-19 research are contained in the CORD-19 dataset.
The issue
A lot of the mentions are different names for the same thing (e.g., ['Statistical Package for Social Sciences (SPSS)', 'SPSS', 'SPSS Statistics'], which appear in different cells in column I).
What do we really need?
How can we achieve this?
['Statistical Package for Social Sciences (SPSS)', 'SPSS', 'SPSS Statistics'], e.g., SPSS.
Ideas for how exactly we can achieve this