OpenAlex.qmd
Lines changed: 54 additions & 54 deletions
@@ -25,7 +25,7 @@ execute:
   warning: false
   error: false
 webr:
-  packages: ['tidyverse']
+  packages: ['dplyr','tidyr','ggplot2']
 lightbox: true
 ---
@@ -154,7 +154,7 @@ The first argument within the `head` function is our data frame *df* and the sec

 Before analysing the OpenAlex data, we will have a closer look at the publisher and apc columns and perform some data wrangling tasks.

-To get an overview of the publishers present in our dataframe and the number of articles per publisher, we will use three tidyverse functions and the pipe operator `%>%` ([magrittr pipe](https://magrittr.tidyverse.org/reference/pipe.html)) or `|>` (base R pipe).
+To get an overview of the publishers present in our data frame and the number of articles per publisher, we will use three tidyverse functions and the base R pipe operator `|>`. The [magrittr package](https://magrittr.tidyverse.org/reference/pipe.html), which is part of the tidyverse, provides another pipe operator (`%>%`). However, we will use the base R pipe, since it does not require installing an additional package.

 The pipe operator allows us to take e.g. a data frame or the result of a function and pass it to another function as its first argument. If we type the name of our data frame followed by the pipe operator, we don't have to specify which data a function should operate on when we call it after the pipe operator.

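As a minimal sketch of what the pipe does, the example below runs the same aggregation once as a nested call and once as a pipeline. The `toy` data frame and its values are made up for illustration, and the choice of `group_by`, `summarise` and `arrange` as the three functions is an assumption based on the surrounding text.

```r
library(dplyr)

# Hypothetical toy data standing in for the OpenAlex data frame
toy <- tibble(
  id = paste0("W", 1:5),
  publisher = c("Publisher A", "Publisher B", "Publisher A", NA, "Publisher A")
)

# Nested call: has to be read from the inside out
nested <- arrange(summarise(group_by(toy, publisher), n = n()), desc(n))

# Piped call: the result of each step becomes the first argument of the next
piped <- toy |>
  group_by(publisher) |>
  summarise(n = n()) |>
  arrange(desc(n))

# Both spellings should produce the same table
all.equal(nested, piped)
```

The piped version reads top to bottom in the order the steps are carried out, which is the main reason to prefer it for longer chains.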
@@ -166,9 +166,9 @@ The `arrange` function is used to sort the data based on the n variable and the
@@ -180,32 +180,32 @@ The output shows us some important things about the data:
 As an example, we will look at the publisher names in our data frame that contain *Springer* or *Nature*. We will use the `filter` function from the tidyverse, which keeps the rows that fulfil a condition we specify, in combination with the base R `grepl` function, which lets us search for patterns. In this case, we will use a regular expression as the pattern to search for.
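The filtering step could look roughly like the sketch below. The column name `publisher`, the toy values and the pattern `"Springer|Nature"` are assumptions for illustration, not necessarily what the tutorial's data frame or code uses.

```r
library(dplyr)

# Hypothetical publisher names; the real column may be named differently
toy <- tibble(
  publisher = c("Springer Science+Business Media", "Nature Portfolio",
                "Elsevier BV", "Springer Nature", NA)
)

# "Springer|Nature" is a regular expression: "|" means "or".
# grepl() returns TRUE/FALSE per row; for NA input it returns FALSE,
# so rows without a publisher name are dropped by filter().
springer_like <- toy |>
  filter(grepl("Springer|Nature", publisher)) |>
  distinct(publisher)

springer_like
```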
 To replace the Springer name variants with *Springer Nature* as the publisher name, we can use the `mutate` function, which lets us transform a column in our data frame, in combination with the `str_replace_all` function, which replaces all parts of a string that match a pattern we specify.
 Be aware that this data transformation only applies to the publisher name column and not to the publisher id column provided by OpenAlex. We would need to perform a separate transformation step to align both. Additionally, we did not permanently rename the publisher name values in our data frame. To do this, we can either overwrite our data frame *df* or create a new data frame by assigning the output with the `<-` operator.
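A minimal sketch of the renaming and of keeping the result is shown below. It assumes the stringr package (which provides `str_replace_all`) is available, and the column name, toy values and regular expression are illustrative guesses rather than the tutorial's own.

```r
library(dplyr)
library(stringr)

# Hypothetical publisher names
toy <- tibble(
  publisher = c("Springer Science+Business Media", "Nature Portfolio", "Elsevier BV")
)

# The pattern matches the whole name whenever it contains "Springer" or
# "Nature", so the complete value is replaced, not just the matching word.
df_clean <- toy |>
  mutate(publisher = str_replace_all(
    publisher, "^.*(Springer|Nature).*$", "Springer Nature"
  ))

# The change only persists because the result was assigned with `<-`;
# `toy` itself still holds the original names.
df_clean$publisher
toy$publisher
```

Assigning to a new name such as `df_clean` keeps the original data available for comparison; assigning back to the same name would overwrite it.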
@@ -215,26 +215,26 @@ To access the apc values in our data frame we will use the `unnest_wider`, `unne
 The `unnest_wider` function allows us to turn each element of a list-column into a column. We will further use the `select` function to print only the resulting apc columns.

 ```{r}
-df %>%
-  unnest_wider(apc, names_sep = ".") %>%
+df |>
+  unnest_wider(apc, names_sep = ".") |>
   select(starts_with("apc"))
 ```

 The output shows that the values in the apc columns are lists. This is because OpenAlex provides information for APC list prices and prices of APCs that were actually paid, when available. To transform the lists into single-value cells, we will use the `unnest_longer` function, which turns each element of a list-column into its own row.
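To illustrate the two unnesting steps on something small, the sketch below builds a toy list-column by hand. The field names `type` and `value_usd` and all values are assumptions for illustration; the real apc column may contain more fields and a different structure.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: each apc element holds a list price and a paid price
toy <- tibble(
  id = c("W1", "W2"),
  apc = list(
    list(type = c("list", "paid"), value_usd = c(2000, 1800)),
    list(type = c("list", "paid"), value_usd = c(3000, 3000))
  )
)

# Step 1: one column per apc field; each cell still holds two values,
# so the new columns are list-columns
wide <- toy |>
  unnest_wider(apc, names_sep = ".")

# Step 2: one row per value; the columns are unnested together so that
# each type stays aligned with its price
long <- wide |>
  unnest_longer(c(apc.type, apc.value_usd))

long
```

Unnesting several columns in one `unnest_longer` call requires a reasonably recent tidyr version and elements of matching length in each row.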
 The output shows that we now have multiple rows for the same id. To get back to a single row per id, we will use the `pivot_wider` function, which increases the number of columns and decreases the number of rows.
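The reshaping can be sketched as follows, starting from a small long-format table with two rows per id. The column names and values are hypothetical, and `names_prefix` is only used here to make the new column names easier to read.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: two rows (list and paid price) per id
long <- tibble(
  id = c("W1", "W1", "W2", "W2"),
  apc.type = c("list", "paid", "list", "paid"),
  apc.value_usd = c(2000, 1800, 3000, 3000)
)

# Spread the two price types into their own columns: one row per id
per_id <- long |>
  pivot_wider(
    names_from = apc.type,
    values_from = apc.value_usd,
    names_prefix = "apc_"
  )

per_id
```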
 To analyse the distribution of open access articles across journals we will calculate the total number of articles (n_articles), the total number of open access articles (n_oa_articles) and the total number of closed articles (n_closed_articles) per journal. We will again use the `group_by`, `summarise` and `arrange` functions. However, since we noticed that `r sum(is.na(df$is_oa))` articles have an undetermined open access status, we will first filter out all rows with *NA* values in the *is_oa* column. We will also group the data by the *source_display_name* column to generate aggregate statistics for the journals.

 ```{r}
-df %>%
-  filter(!is.na(is_oa)) %>%
-  group_by(source_display_name) %>%
+df |>
+  filter(!is.na(is_oa)) |>
+  group_by(source_display_name) |>
   summarise(
     n_articles = n(),
     n_oa_articles = sum(is_oa),
     n_closed_articles = n_articles - n_oa_articles
-  ) %>%
+  ) |>
   arrange(desc(n_oa_articles))
 ```

@@ -296,9 +296,9 @@ The results show that 414 articles have no journal assigned within our data fram
 We can further explore the open access status distribution for the articles. We will do this using the example of the top three journals in terms of open access publication volume.

 ```{r}
-df %>%
-  filter(source_display_name %in% c("Scientific Reports", "Astronomy and Astrophysics", "Journal of High Energy Physics")) %>%
@@ -311,9 +311,9 @@ We can combine functions from the tidyverse and ggplot2 to visualise the develop
 First, we group the data by publication year and open access status. We will then compute the number of articles in each group. In the ggplot function, we assign the publication year column to the x axis and the number of articles to the y axis. We choose point (geom_point) and line (geom_line) graph types to mark the distinct values of n and have them connected by lines. Both are provided with a colour aesthetic which is set to our open access status column. This will result in different colours being assigned to the points and lines for the different open access status values. With the *theme_minimal* option we are applying a minimal theme for the plot appearance.

 ```{r}
-df %>%
-  group_by(publication_year, oa_status) %>%
-  summarise(n = n(), .groups = "keep") %>%
+df |>
+  group_by(publication_year, oa_status) |>
+  summarise(n = n(), .groups = "keep") |>
   ggplot(aes(x = publication_year, y = n)) +
   geom_line(aes(colour = oa_status)) +
   geom_point(aes(colour = oa_status)) +
@@ -325,10 +325,10 @@ The plot shows us the total number of publications for each publication year an
 We can further choose to visualise the open access distribution over time in terms of percentages. For this we first calculate the share of each open access status per publication year and create a new column using the `mutate` function that stores these values. We pipe the result through to the `ggplot` function assigning the publication year column to the x axis and the share to the y axis. We assign the *oa_status* column to the fill argument which will plot a different colour for each open access status. For this plot we choose a bar chart (geom_bar) graph type and again apply the minimal theme.

 ```{r}
-df %>%
-  group_by(publication_year, oa_status) %>%
-  summarise(count = n()) %>%
-  mutate(perc = count / sum(count) * 100) %>%
+df |>
+  group_by(publication_year, oa_status) |>
+  summarise(count = n()) |>
+  mutate(perc = count / sum(count) * 100) |>
   ggplot(aes(x = publication_year, y = perc, fill = oa_status)) +
   geom_bar(stat = "identity") +
   theme_minimal()
@@ -363,28 +363,28 @@ Before, we analysed the distribution of open access articles across journals. Be