Normalize software mentions (a.k.a data cleaning 🧹) #1

sdruskat · 2021-03-31T17:28:37Z

What do we have?

The software packages that have been used for COVID-19 research are contained in the CORD-19 dataset.

The issue

A lot of the mentions are different names for the same thing (e.g., ['Statistical Package for Social Sciences (SPSS)', 'SPSS', 'SPSS Statistics'] (different cells in column I)).

What do we really need?

All mentions of the same software need to be identifiable as referring to the same thing.

How can we achieve this?

We need to normalize (if that's the right term) the data, so that there is only one term we use for ['Statistical Package for Social Sciences (SPSS)', 'SPSS', 'SPSS Statistics'], e.g., SPSS.
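One way to picture this is a lookup table from each known variant to a single canonical term. The following is only a minimal sketch: the `ALIASES` table and the `normalize` function are hypothetical names, and the entries are just the examples from this issue.

```python
# Hypothetical alias table: each known variant maps to one canonical name.
# The entries are the examples from this issue; a real table would be much larger.
ALIASES = {
    "Statistical Package for Social Sciences (SPSS)": "SPSS",
    "SPSS": "SPSS",
    "SPSS Statistics": "SPSS",
}


def normalize(mention: str) -> str:
    """Return the canonical name for a mention, or the trimmed mention itself if unknown."""
    stripped = mention.strip()
    return ALIASES.get(stripped, stripped)


mentions = ["SPSS Statistics", "SPSS", "Statistical Package for Social Sciences (SPSS)"]
print({normalize(m) for m in mentions})
```

Unknown mentions pass through unchanged, so nothing is lost while the table is still incomplete.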

Ideas for how exactly we can achieve this

By using OpenRefine (which is partially manual I believe), for which there are Carpentry lessons for at least librarians, the social sciences, ecologists and the humanities.
By using a yet-to-be-found or newly created tool that makes this easier
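If a small custom tool is the route taken, one possible starting point is key-collision clustering, loosely modelled on the "fingerprint" method that OpenRefine offers for clustering. The sketch below is a simplified approximation, not OpenRefine's implementation, and the function names `fingerprint` and `cluster` are made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Build a key that ignores case, accents, punctuation, word order and repeated words."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^\w\s]", "", ascii_name.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(names):
    """Group names sharing a fingerprint; only groups with more than one member are returned."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [members for members in groups.values() if len(members) > 1]


print(cluster(["GraphPad Prism", "Prism, GraphPad", "graphpad  prism", "Stata"]))
```

This only catches variants that differ in case, punctuation, spacing or word order. It would not link 'Statistical Package for Social Sciences (SPSS)' to 'SPSS'; cases like that still need an explicit mapping or a manual decision.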


olexandr-konovalov · 2021-03-31T20:36:01Z

In a hurry, we may do it once in a way that includes manual edits, but ideally this should be automated, so we can run the same procedure for the next version of the dataset.

For example, a Python script that you can re-run multiple times: it should report the entries it can't handle, you adapt it, and then you re-run it until it passes with no anomalies left.
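A rough sketch of that loop, assuming a hand-maintained table that grows between runs. The `KNOWN` table and the `run` function are hypothetical, and the sample mentions are invented for the demonstration.

```python
from collections import Counter

# Hypothetical, hand-maintained table; extend it after each run.
KNOWN = {
    "spss": "SPSS",
    "spss statistics": "SPSS",
    "matlab": "MATLAB",
}


def run(mentions):
    """Return (counts per canonical name, counts of mentions the table cannot handle)."""
    resolved, unhandled = Counter(), Counter()
    for mention in mentions:
        key = " ".join(mention.lower().split())  # lowercase, collapse whitespace
        if key in KNOWN:
            resolved[KNOWN[key]] += 1
        else:
            unhandled[mention] += 1
    return resolved, unhandled


resolved, unhandled = run(["SPSS", "Matlab", "SPSS  Statistics", "Stata"])
for mention, count in unhandled.most_common():
    print(f"unhandled: {mention!r} ({count})")
```

The run counts as passing once `unhandled` comes back empty. Until then, each reported entry is either added to the table or deliberately left as its own item.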

olexandr-konovalov · 2021-04-01T16:17:27Z

This should also take into account different capitalisations (e.g. Matlab/MATLAB)
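A possible way to handle this without losing the preferred spelling is to group names case-insensitively and label each group with its most frequent variant. This is only a sketch: `merge_case_variants` is a hypothetical helper and the counts in the example are invented.

```python
from collections import Counter, defaultdict


def merge_case_variants(counts):
    """Merge names differing only in case; label each group with its most frequent spelling."""
    groups = defaultdict(Counter)
    for name, count in counts.items():
        groups[name.casefold()][name] += count
    return {
        variants.most_common(1)[0][0]: sum(variants.values())
        for variants in groups.values()
    }


# Invented counts, for illustration only.
print(merge_case_variants({"MATLAB": 5, "Matlab": 3, "matlab": 1, "Stata": 2}))
```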

orchid00 · 2021-04-07T13:27:49Z

Hello! @alex-konovalov @sdruskat I did a fair amount of cleaning, but not all of it (there are so many variants + typos) 🤯.
Top 10 (rank, name, count):
1 Statistical Package for the Social Sciences (SPSS) 10308
2 R Programming Language (R) 7521
3 GraphPad Prism 4089
4 Excel 3800
5 stata 3271
6 sas 2448
7 blast 2233
8 graphpad 2143
9 matlab 1780
10 googlescholar 1688

I started with this file: CORD19_software_popularity.csv
(I should probably have asked which file to start with. Is this file one entry per paper? Probably not.) It should be easy to use another file that has the names in a column.
It had 102644 rows to start with and the clean file has 84661 🎉

I did this in R, so I have a script to share. Where do you want it?
Examples of where the cleaning could continue:
There are many mentions of R packages, RStudio, and Bioconductor that I did not merge into R Programming Language. Should these be merged?
I did not merge python with biopython and other Python libraries, because of the same question (should these be merged?).
At this point I need someone to look at the result and say which other entries might be good to merge too.

There are still minor rows, with 1 to 6 occurrences each, that I haven't merged yet. I am aware of them, but writing out all the cases takes a while.
There might be similar cases with languages I do not know.
I made everything lowercase and removed extra spaces and special characters.
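For anyone who wants to reproduce that last step in Python rather than R, here is a rough equivalent. It is not the R script mentioned above: `clean_name` and `collapse` are hypothetical helpers, the sample rows are invented, and replacing special characters with a space (rather than deleting them) is an assumption.

```python
import re
from collections import Counter


def clean_name(name: str) -> str:
    """Lowercase, replace special characters with spaces, and collapse extra whitespace."""
    no_special = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(no_special.split())


def collapse(rows):
    """Sum the counts of (name, count) rows whose cleaned names coincide."""
    totals = Counter()
    for name, count in rows:
        totals[clean_name(name)] += count
    return totals


# Invented rows, for illustration only.
print(collapse([("Stata", 3), (" stata ", 2), ("STATA®", 1)]))
```

Rows that end up with the same cleaned name are summed into one, which is why the cleaned file has fewer rows than the input.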

orchid00 · 2021-04-08T04:13:46Z

Looking at this again, I noticed that GraphPad Prism, Graphpad and Prism are three distinct items, but maybe they should also be merged.
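Cases like this could be surfaced for human review automatically. One simple heuristic, sketched below, flags pairs where every word of one name also appears in the other. The `merge_candidates` function is hypothetical, and the pairs it yields are suggestions to look at, not merges to apply blindly.

```python
from itertools import combinations


def merge_candidates(names):
    """Yield (shorter, longer) pairs where every word of one name appears in the other."""
    tokenized = {name: set(name.lower().split()) for name in names}
    for a, b in combinations(names, 2):
        tokens_a, tokens_b = tokenized[a], tokenized[b]
        if tokens_a < tokens_b:  # strict subset
            yield a, b
        elif tokens_b < tokens_a:
            yield b, a


for shorter, longer in merge_candidates(["GraphPad Prism", "Graphpad", "Prism", "Stata"]):
    print(f"maybe merge {shorter!r} into {longer!r}?")
```

A heuristic like this will produce false positives for short, generic names, so someone who knows the software still has to confirm each pair.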

sdruskat added this to the Habeas useful corpus milestone Mar 31, 2021

sdruskat changed the title ~~Normalize & deduplicate software mentions (a.k.a data cleaning 🧹)~~ Normalize software mentions (a.k.a data cleaning 🧹) Mar 31, 2021

sdruskat added the required Something that needs to be done to make the hack successful label Mar 31, 2021

This was referenced Mar 31, 2021

Decide on how to count software mentions 🧮 #2

Closed

Pull together the normalization information in a dataset ⛙ #3

Open
