Normalize software mentions (a.k.a data cleaning 🧹) #1

Open
1 task
sdruskat opened this issue Mar 31, 2021 · 4 comments
Labels
required Something that needs to be done to make the hack successful

Comments

@sdruskat
Collaborator

sdruskat commented Mar 31, 2021

What do we have?

The software packages that have been used for COVID-19 research are contained in the CORD-19 dataset.

The issue

A lot of the mentions are different names for the same thing (e.g., ['Statistical Package for Social Sciences (SPSS)', 'SPSS', 'SPSS Statistics'] (different cells in column I)).

What do we really need?

  • All mentions of the same software need to be identifiable as referring to the same thing.

How can we achieve this?

  • We need to normalize (if that's the right term) the data, so that there is only one term we use for ['Statistical Package for Social Sciences (SPSS)', 'SPSS', 'SPSS Statistics'], e.g., SPSS.
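
A minimal sketch of what "one term per software" could look like in practice, assuming a hand-maintained alias table. The names `ALIASES` and `normalize` are hypothetical and only cover the SPSS example from above:

```python
# Hypothetical alias table: each known variant maps to one canonical name.
ALIASES = {
    "Statistical Package for Social Sciences (SPSS)": "SPSS",
    "SPSS": "SPSS",
    "SPSS Statistics": "SPSS",
}


def normalize(mention: str) -> str:
    """Return the canonical name for a mention, or the trimmed mention if unknown."""
    trimmed = mention.strip()
    return ALIASES.get(trimmed, trimmed)


mentions = [
    "Statistical Package for Social Sciences (SPSS)",
    "SPSS",
    "SPSS Statistics",
]
# All three variants collapse to the same canonical term.
print({normalize(m) for m in mentions})
```

Unknown mentions pass through unchanged, so the table can grow incrementally without losing data.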

Ideas for how exactly we can achieve this

@sdruskat sdruskat added this to the Habeas useful corpus milestone Mar 31, 2021
@sdruskat sdruskat changed the title from "Normalize & deduplicate software mentions (a.k.a data cleaning 🧹)" to "Normalize software mentions (a.k.a data cleaning 🧹)" Mar 31, 2021
@sdruskat sdruskat added the required Something that needs to be done to make the hack successful label Mar 31, 2021
@olexandr-konovalov
Collaborator

In a hurry, we may do it once in a way that includes manual edits, but ideally this should be automated so that we can run the same procedure on the next version of the dataset.

For example, a Python script that you can re-run multiple times: it should report the entries it can't handle, you adapt it, and then re-run it until it passes with no anomalies left.
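
One possible shape for such a script, as a rough sketch only. The rule table `CANONICAL`, the function `run_pass`, and the sample mentions are all hypothetical; the idea is that each run lists whatever no rule covers, and the table is extended until that list is empty:

```python
from collections import Counter

# Hypothetical rule table; extend it after each run until nothing is reported.
CANONICAL = {
    "spss": "SPSS",
    "spss statistics": "SPSS",
    "matlab": "MATLAB",
}


def run_pass(mentions):
    """Split mentions into normalized names and a tally of entries no rule covers."""
    normalized = []
    unhandled = Counter()
    for mention in mentions:
        # Lower-case and collapse whitespace before looking up a rule.
        key = " ".join(mention.lower().split())
        if key in CANONICAL:
            normalized.append(CANONICAL[key])
        else:
            unhandled[mention] += 1
    return normalized, unhandled


if __name__ == "__main__":
    sample = ["SPSS", "SPSS  Statistics", "Matlab", "GraphPad Prism"]
    normalized, unhandled = run_pass(sample)
    # Report the anomalies, most frequent first, so the rules can be adapted.
    for mention, count in unhandled.most_common():
        print(f"unhandled: {mention!r} ({count})")
```

Sorting the report by frequency means the rules added first are the ones that clean up the most rows.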

@olexandr-konovalov
Collaborator

This should also take into account different capitalisations (e.g. Matlab/MATLAB)
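
A small sketch of one way to handle this, assuming we group mentions by their case-folded form and keep whichever spelling occurs most often. The function name `merge_capitalisations` is made up for illustration:

```python
from collections import Counter, defaultdict


def merge_capitalisations(mentions):
    """Map every mention to the most frequent spelling among its case variants."""
    variants = defaultdict(Counter)
    for mention in mentions:
        # casefold() gives a case-insensitive key, e.g. "matlab" for "MATLAB".
        variants[mention.casefold()][mention] += 1
    preferred = {
        key: counts.most_common(1)[0][0] for key, counts in variants.items()
    }
    return [preferred[mention.casefold()] for mention in mentions]


print(merge_capitalisations(["MATLAB", "Matlab", "MATLAB", "matlab"]))
```

Picking the most frequent spelling avoids maintaining a list of official capitalisations by hand, though it will not always match the vendor's spelling.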

@orchid00

orchid00 commented Apr 7, 2021

Hello! @alex-konovalov @sdruskat I did a fair amount of cleaning, but not all of it (there are so many variants and typos) 🤯.
Top 10:
1 Statistical Package for the Social Sciences (SPSS) 10308
2 R Programming Language (R) 7521
3 GraphPad Prism 4089
4 Excel 3800
5 stata 3271
6 sas 2448
7 blast 2233
8 graphpad 2143
9 matlab 1780
10 googlescholar 1688

I started with this file: CORD19_software_popularity.csv
(I should probably have asked which file to start with. Is this file one entry per paper? Probably not.) It should be easy to apply the same steps to another file that has the names in a column.
It had 102644 rows to start with and the clean file has 84661 🎉

I did this in R, so I have a script to share. Where do you want it?
Examples of where the cleaning could continue:
There are many mentions of R packages, R studio, and Bioconductor that I did not merge into R Programming Language. Should these be merged?
I did not merge python with biopython and other python libraries, because of the same question (should these be merged?).
At this point I need someone to look at the result and say which further entries might be good to merge.

There are still minor rows, each with between 1 and 6 occurrences, that I haven't merged yet. I am aware of them, but writing out all the cases takes a while.
There might be similar cases with languages I do not know.
I did make everything lower case and remove extra spaces and special characters.
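
For anyone who wants to reproduce that last step outside R, here is a rough Python equivalent. It is not the R script itself, and it assumes that "special characters" means anything other than ASCII letters, digits, and spaces; the function name `clean` is hypothetical:

```python
import re


def clean(mention: str) -> str:
    """Lower-case a mention, drop special characters, and collapse whitespace."""
    lowered = mention.lower()
    # Assumption: anything outside a-z, 0-9, and space counts as "special".
    # This also drops accented and other non-ASCII letters.
    no_specials = re.sub(r"[^a-z0-9 ]+", " ", lowered)
    # split() + join() trims the ends and collapses runs of spaces.
    return " ".join(no_specials.split())


print(clean("  SPSS®  Statistics "))
```

Replacing special characters with a space (rather than deleting them) keeps words such as "graph-pad" from being fused together; whether that is the desired behaviour depends on the data.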

[Attached plot: Rplot01]

@orchid00

orchid00 commented Apr 8, 2021

Looking at this again, I noticed that GraphPad Prism, Graphpad, and Prism are three distinct items, but maybe they should be merged too.
