OpenAlex.en.qmd
+57 -55 (57 additions & 55 deletions)
@@ -25,7 +25,7 @@ execute:
   warning: false
   error: false
 webr:
-  packages: ['tidyverse']
+  packages: ['dplyr','tidyr','ggplot2']
 lightbox: true
 ---
@@ -81,14 +81,16 @@ OpenAlex offers a lot more information than what we are exploring in this notebo
## Loading packages

-We will load the *openalexR* package [@openalexR] that allows us to query the OpenAlex API from within our notebook and the *tidyverse* package [@tidyverse]that provides a lot of additional functionalities for data wrangling and visualization.
+We will load the *openalexR* package [@openalexR] that allows us to query the OpenAlex API from within our notebook. We will also load the packages *dplyr*, *tidyr* and *ggplot2*, which are part of the *tidyverse* [@tidyverse] and provide additional functionality for data wrangling and visualization.

```{r}
 # Installation of packages if not already installed with
 # install.packages(c("openalexR","tidyverse"))
 library(openalexR)
-library(tidyverse)
+library(dplyr)
+library(tidyr)
+library(ggplot2)
```

## Loading data
@@ -162,9 +164,9 @@ The `arrange` function is used to sort the data based on the n variable and the
@@ -176,32 +178,32 @@ The output shows us some important things about the data:
As an example, we will look at the publisher names in our data frame that contain *Springer* or *Nature*. We will use the `filter` function from the tidyverse, which keeps the rows that fulfil the condition we specify, in combination with the `grepl` function, which lets us search for patterns. In this case the pattern is a regular expression.
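The combination of `filter` and `grepl` can be sketched on a small toy data frame. This is only an illustration: the column name `publisher` and the example values are assumptions and not taken from the OpenAlex data used in this notebook.

```r
library(dplyr)

# Toy data; the column name `publisher` and its values are hypothetical
toy <- tibble(
  publisher = c("Springer Nature", "Elsevier BV", "Nature Portfolio",
                "Springer Science+Business Media", NA)
)

# Keep rows whose publisher name matches "Springer" or "Nature".
# grepl() returns FALSE for NA, so rows without a publisher are dropped.
springer_like <- toy |>
  filter(grepl("Springer|Nature", publisher))

springer_like
```

The `|` inside the pattern is the regular-expression "or", so a single `grepl` call covers both name fragments.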
If we want to replace the Springer name variants with *Springer Nature* as the publisher name, we can use the `mutate` function, which lets us transform a column in our data frame, in combination with the `str_replace_all` function, which replaces all string values that match a pattern we specify.
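A minimal sketch of this replacement on toy data follows. The column name `publisher`, the example values and the regular expression are assumptions chosen for illustration. `str_replace_all` comes from the *stringr* package, which is assumed to be installed; it is called with an explicit `stringr::` prefix so the sketch does not depend on which packages are attached.

```r
library(dplyr)

# Toy data; the column name `publisher` and its values are hypothetical
toy <- tibble(
  publisher = c("Springer Science+Business Media", "Elsevier BV",
                "Nature Portfolio")
)

# Replace whole values that start with "Springer" or equal "Nature Portfolio".
# Assigning the result to a new object keeps the transformation.
toy_clean <- toy |>
  mutate(
    publisher = stringr::str_replace_all(
      publisher,
      "^(Springer.*|Nature Portfolio)$",
      "Springer Nature"
    )
  )

toy_clean
```

Anchoring the pattern with `^` and `$` replaces the whole value rather than only the matching fragment, which avoids results such as "Springer Nature Science+Business Media".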
Be aware that this data transformation only applies to the publisher name column and not to the publisher ID column provided by OpenAlex. We would need to perform a separate transformation step to align both. Additionally, we did not permanently rename the publisher name values in our data frame. To do this, we can either overwrite our data frame *df* or create a new data frame by assigning the output with the `<-` operator.
@@ -211,26 +213,26 @@ To access the apc values in our data frame we will use the `unnest_wider`, `unne
The `unnest_wider` function allows us to turn each element of a list-column into a column. We will further use the `select` function to print only the resulting apc columns.

```{r}
-df %>%
-  unnest_wider(apc, names_sep = ".") %>%
+df |>
+  unnest_wider(apc, names_sep = ".") |>
   select(starts_with("apc"))
```

The output shows that the values in the apc columns are lists. This is because OpenAlex provides information for APC list prices and prices of APCs that were actually paid, when available. To transform the lists into single value cells we will use the `unnest_longer` function that does precisely that.
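What `unnest_longer` does to such list-columns can be sketched on toy data. The column names `apc.type` and `apc.value_usd` and all values are assumptions made for illustration; the real columns produced by `unnest_wider` may differ. Passing several columns at once assumes tidyr 1.2.0 or later.

```r
library(tidyr)

# Toy data mimicking list-valued apc columns; names and values are hypothetical
toy <- tibble(
  id = c("W1", "W2"),
  apc.type = list(c("list", "paid"), c("list", "paid")),
  apc.value_usd = list(c(3000, 2500), c(1800, 1800))
)

# Unnest both list-columns together so that each list element becomes a row
toy_long <- toy |>
  unnest_longer(c(apc.type, apc.value_usd))

toy_long
```

Each id now appears once per APC type, because every list element has been moved into its own row.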
The output shows that we now have multiple rows for the same id. To transform the data frame to single rows for each id we will use the `pivot_wider` function that allows us to increase the number of columns and decrease the number of rows.
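The reshaping step can be sketched as follows, again on toy data. The column names, the values and the `apc_` prefix are assumptions for illustration only.

```r
library(tidyr)

# Toy long-format data with one row per id and APC type; values are hypothetical
toy_long <- tibble(
  id = c("W1", "W1", "W2", "W2"),
  apc.type = c("list", "paid", "list", "paid"),
  apc.value_usd = c(3000, 2500, 1800, 1800)
)

# Spread the APC types into separate columns so that each id has one row
toy_wide <- toy_long |>
  pivot_wider(
    id_cols = id,
    names_from = apc.type,
    values_from = apc.value_usd,
    names_prefix = "apc_"
  )

toy_wide
```

The result has one row per id, with the list price and the paid price in the columns `apc_list` and `apc_paid`.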
To analyse the distribution of open access articles across journals we will calculate the total number of articles (n_articles), the total number of open access articles (n_oa_articles) and the total number of closed articles (n_closed_articles) per journal. We will again use the `group_by`, `summarise` and `arrange` functions. However, since we noticed that `r sum(is.na(df$is_oa))` articles have an undetermined open access status, we will first filter out all rows with *NA* values in the *is_oa* column. We will also group the data by the *source_display_name* column to generate aggregate statistics for the journals.

```{r}
-df %>%
-  filter(!is.na(is_oa)) %>%
-  group_by(source_display_name) %>%
+df |>
+  filter(!is.na(is_oa)) |>
+  group_by(source_display_name) |>
   summarise(
     n_articles = n(),
     n_oa_articles = sum(is_oa),
     n_closed_articles = n_articles - n_oa_articles
-  ) %>%
+  ) |>
   arrange(desc(n_oa_articles))
```

@@ -292,9 +294,9 @@ The results show that 414 articles have no journal assigned within our data fram
We can further explore the open access status distribution for the articles, using the top three journals in terms of open access publication volume as an example.

```{r}
-df %>%
-  filter(source_display_name %in% c("Scientific Reports", "Astronomy and Astrophysics", "Journal of High Energy Physics")) %>%
@@ -307,9 +309,9 @@ We can combine functions from the tidyverse and ggplot2 to visualise the develop
First, we group the data by publication year and open access status. We will then compute the number of articles in each group. In the ggplot function, we assign the publication year column to the x axis and the number of articles to the y axis. We choose point (geom_point) and line (geom_line) graph types to mark the distinct values of n and have them connected by lines. Both are provided with a colour aesthetic which is set to our open access status column. This will result in different colours being assigned to the points and lines for the different open access status values. With the *theme_minimal* option we are applying a minimal theme for the plot appearance.

```{r}
-df %>%
-  group_by(publication_year, oa_status) %>%
-  summarise(n = n(), .groups = "keep") %>%
+df |>
+  group_by(publication_year, oa_status) |>
+  summarise(n = n(), .groups = "keep") |>
   ggplot(aes(x = publication_year, y = n)) +
   geom_line(aes(colour = oa_status)) +
   geom_point(aes(colour = oa_status)) +
@@ -321,10 +323,10 @@ The plot shows us the total number of publications for each publication year an
We can further choose to visualise the open access distribution over time in terms of percentages. For this we first calculate the share of each open access status per publication year and create a new column using the `mutate` function that stores these values. We pipe the result through to the `ggplot` function assigning the publication year column to the x axis and the share to the y axis. We assign the *oa_status* column to the fill argument which will plot a different colour for each open access status. For this plot we choose a bar chart (geom_bar) graph type and again apply the minimal theme.

```{r}
-df %>%
-  group_by(publication_year, oa_status) %>%
-  summarise(count = n()) %>%
-  mutate(perc = count / sum(count) * 100) %>%
+df |>
+  group_by(publication_year, oa_status) |>
+  summarise(count = n()) |>
+  mutate(perc = count / sum(count) * 100) |>
   ggplot(aes(x = publication_year, y = perc, fill = oa_status)) +
   geom_bar(stat = "identity") +
   theme_minimal()
@@ -359,28 +361,28 @@ Before, we analysed the distribution of open access articles across journals. Be