Commit 4af2a7e

Update OpenAlex.en.qmd
1 parent 4401f4c commit 4af2a7e

1 file changed: +57 -55 lines changed

OpenAlex.en.qmd

Lines changed: 57 additions & 55 deletions
@@ -25,7 +25,7 @@ execute:
   warning: false
   error: false
 webr:
-  packages: ['tidyverse']
+  packages: ['dplyr','tidyr','ggplot2']
 lightbox: true
 ---

@@ -81,14 +81,16 @@ OpenAlex offers a lot more information than what we are exploring in this notebo

 ## Loading packages

-We will load the *openalexR* package [@openalexR] that allows us to query the OpenAlex API from within our notebook and the *tidyverse* package [@tidyverse] that provides a lot of additional functionalities for data wrangling and visualization.
+We will load the *openalexR* package [@openalexR] that allows us to query the OpenAlex API from within our notebook. We will also load the packages *dplyr*, *tidyr* and *ggplot2* that provide a lot of additional functionalities for data wrangling and visualization and are part of the *tidyverse* package [@tidyverse].


 ```{r}
 # Installation of packages if not already installed with
 # install.packages(c("openalexR","tidyverse"))
 library(openalexR)
-library(tidyverse)
+library(dplyr)
+library(tidyr)
+library(ggplot2)
 ```

 ## Loading data
@@ -162,9 +164,9 @@ The `arrange` function is used to sort the data based on the n variable and the


 ```{r}
-df %>%
-  group_by(host_organization, host_organization_name) %>%
-  summarise(n = n()) %>%
+df |>
+  group_by(host_organization, host_organization_name) |>
+  summarise(n = n()) |>
   arrange(desc(n))
 ```
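The `|>` operator introduced by this change is R's native pipe, available from R 4.1.0 onwards, so the chain no longer relies on the `%>%` operator from magrittr. As a minimal sketch, the same count-and-sort pattern can be tried on a small made-up data frame; `toy` and its values are illustrative assumptions, not OpenAlex data.

```r
library(dplyr)

# Made-up publisher data; ids and names are placeholders for illustration only
toy <- tibble(
  host_organization = c("P1", "P1", "P2", "P1", "P3", "P2"),
  host_organization_name = c("Pub A", "Pub A", "Pub B", "Pub A", "Pub C", "Pub B")
)

# Count rows per publisher and sort in descending order of n
counts <- toy |>
  group_by(host_organization, host_organization_name) |>
  summarise(n = n(), .groups = "drop") |>
  arrange(desc(n))

# count() with sort = TRUE is a shorthand for the same three steps
counts_short <- toy |>
  count(host_organization, host_organization_name, sort = TRUE)

print(counts)
```

Setting `.groups = "drop"` returns an ungrouped result and avoids the grouping message that `summarise` otherwise prints.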

@@ -176,32 +178,32 @@ The output shows us some important things about the data:
 As an example we will look at the publisher names in our data frame that contain *Springer* or *Nature*. We will use the `filter` function from the tidyverse that lets us filter articles that fulfil the condition we specify in combination with the `grepl` function that allows to search for patterns. In this case we will use a regular expression as the pattern to search for.

 ```{r}
-df %>%
-  group_by(host_organization, host_organization_name) %>%
-  summarise(n = n()) %>%
-  arrange(desc(n)) %>%
+df |>
+  group_by(host_organization, host_organization_name) |>
+  summarise(n = n()) |>
+  arrange(desc(n)) |>
   filter(grepl("^Springer|Nature", host_organization_name))
 ```

 If we want to replace the Springer name variants with *Springer Nature* as publisher name, we can use the `mutate` function that lets us transform a column in our data frame in combination with the `str_replace_all` function that lets us replace string values that follow a pattern we specify to do so.

 ```{r}
-df %>%
-  mutate(host_organization_name = str_replace_all(host_organization_name, "^Springer.*$|Nature.*$", "Springer Nature")) %>%
-  group_by(host_organization_name) %>%
-  summarise(n = n()) %>%
+df |>
+  mutate(host_organization_name = str_replace_all(host_organization_name, "^Springer.*$|Nature.*$", "Springer Nature")) |>
+  group_by(host_organization_name) |>
+  summarise(n = n()) |>
   arrange(desc(n))
 ```

 Be aware that this data transformation only applies to the publisher name column and not the OpenAlex provided publisher id column. We would need to perform a separate transformation step to align both. Additionally, we did not permanently rename the publisher name values in our data frame. To do this we can either override our data frame *df* or create a new data frame by assigning the output with the `<-` operator.

 ```{r, eval=FALSE}
 # overriding our data frame
-df <- df %>%
+df <- df |>
   mutate(host_organization_name = str_replace_all(host_organization_name, "^Springer.*$|Nature.*$", "Springer Nature"))

 # creating a new data frame df2
-df2 <- df %>%
+df2 <- df |>
   mutate(host_organization_name = str_replace_all(host_organization_name, "^Springer.*$|Nature.*$", "Springer Nature"))
 ```
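`str_replace_all` is exported by the *stringr* package, which `library(dplyr)`, `library(tidyr)` and `library(ggplot2)` do not attach. Unless stringr is loaded elsewhere in the notebook, these chunks therefore assume an additional `library(stringr)`. As a hedged sketch of what the two patterns do, the following uses only base R (`grepl` and `gsub`) on a made-up vector of publisher names; the names are illustrative assumptions.

```r
# Made-up publisher names for illustration only
publishers <- c(
  "Springer Science+Business Media",
  "Nature Portfolio",
  "Springer Nature",
  "Elsevier BV",
  "BioMed Central (Springer)"
)

# "^Springer" matches only at the start of a name; "Nature" matches anywhere
is_match <- grepl("^Springer|Nature", publishers)

# Base R counterpart of the str_replace_all() call, needing no extra package
harmonised <- gsub("^Springer.*$|Nature.*$", "Springer Nature", publishers)

print(data.frame(publishers, is_match, harmonised))
```

The last name is left unchanged because `^Springer` is anchored to the start of the string, which shows that name variants may need to be checked by eye before harmonising them.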

@@ -211,26 +213,26 @@ To access the apc values in our data frame we will use the `unnest_wider`, `unne
 The `unnest_wider` function allows us to turn each element of a list-column into a column. We will further use the `select` function to print only the resulting apc columns.

 ```{r}
-df %>%
-  unnest_wider(apc, names_sep = ".") %>%
+df |>
+  unnest_wider(apc, names_sep = ".") |>
   select(starts_with("apc"))
 ```

 The output shows that the values in the apc columns are lists. This is because OpenAlex provides information for APC list prices and prices of APCs that were actually paid, when available. To transform the lists into single value cells we will use the `unnest_longer` function that does precisely that.

 ```{r}
-df %>%
-  unnest_wider(apc, names_sep = ".") %>%
-  unnest_longer(c(apc.type, apc.value, apc.currency, apc.value_usd, apc.provenance)) %>%
+df |>
+  unnest_wider(apc, names_sep = ".") |>
+  unnest_longer(c(apc.type, apc.value, apc.currency, apc.value_usd, apc.provenance)) |>
   select(id, starts_with("apc"))
 ```

 The output shows that we now have multiple rows for the same id. To transform the data frame to single rows for each id we will use the `pivot_wider` function that allows us to increase the number of columns and decrease the number of rows.

 ```{r}
-df %>%
-  unnest_wider(apc, names_sep = ".") %>%
-  unnest_longer(c(apc.type, apc.value, apc.currency, apc.value_usd, apc.provenance)) %>%
+df |>
+  unnest_wider(apc, names_sep = ".") |>
+  unnest_longer(c(apc.type, apc.value, apc.currency, apc.value_usd, apc.provenance)) |>
   pivot_wider(id_cols = id, names_from = apc.type, values_from = c(apc.value, apc.currency, apc.value_usd, apc.provenance))
 ```
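The three reshaping steps can be followed more easily on a tiny example. This is a minimal sketch under stated assumptions: the `toy` data frame, its ids, and the simplified two-field `apc` structure are made up for illustration and only approximate the nested column that openalexR returns.

```r
library(dplyr)
library(tidyr)

# Made-up works with a nested apc list-column (structure assumed for illustration)
toy <- tibble(
  id = c("W1", "W2"),
  apc = list(
    list(type = c("list", "paid"), value_usd = c(1000, 900)),
    list(type = c("list", "paid"), value_usd = c(2000, 1500))
  )
)

# Step 1: one column per element of apc; each cell is still a list
wide_lists <- toy |>
  unnest_wider(apc, names_sep = ".")

# Step 2: one row per list element, so each id now appears twice
long <- wide_lists |>
  unnest_longer(c(apc.type, apc.value_usd))

# Step 3: back to one row per id, with one column per apc type
per_id <- long |>
  pivot_wider(id_cols = id, names_from = apc.type, values_from = apc.value_usd)

print(per_id)
```

With a single `values_from` column the new columns are named after the apc types only; passing several columns, as in the notebook, prefixes each type with the source column name.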

@@ -276,14 +278,14 @@ round(sum(df$is_oa, na.rm = T) / n_distinct(df$id) * 100, 2)
 To analyse the distribution of open access articles across journals we will calculate the total number of articles (n_articles), the total number of open access articles (n_oa_articles) and the total number of closed articles (n_closed_articles) per journal. We will again use the `group_by`, `summarise` and `arrange` functions. However, since we noticed that `r sum(is.na(df$is_oa))` articles have an undetermined open access status, we will first filter out all rows with *NA* values in the *is_oa* column. We will also group the data by the *source_display_name* column to generate aggregate statistics for the journals.

 ```{r}
-df %>%
-  filter(!is.na(is_oa)) %>%
-  group_by(source_display_name) %>%
+df |>
+  filter(!is.na(is_oa)) |>
+  group_by(source_display_name) |>
   summarise(
     n_articles = n(),
     n_oa_articles = sum(is_oa),
     n_closed_articles = n_articles - n_oa_articles
-  ) %>%
+  ) |>
   arrange(desc(n_oa_articles))
 ```
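The `filter(!is.na(is_oa))` step matters because `sum()` returns `NA` as soon as one value in a group is missing. A minimal sketch on made-up data shows the difference; the journal names and values are illustrative assumptions.

```r
library(dplyr)

# Made-up articles; one has an undetermined open access status
toy <- tibble(
  source_display_name = c("Journal A", "Journal A", "Journal A", "Journal B"),
  is_oa = c(TRUE, FALSE, NA, TRUE)
)

# Without the filter, the NA propagates into the Journal A sum
unfiltered <- toy |>
  group_by(source_display_name) |>
  summarise(n_articles = n(), n_oa_articles = sum(is_oa))

# With the filter, the NA row is dropped before counting
filtered <- toy |>
  filter(!is.na(is_oa)) |>
  group_by(source_display_name) |>
  summarise(
    n_articles = n(),
    n_oa_articles = sum(is_oa),
    n_closed_articles = n_articles - n_oa_articles
  )

print(unfiltered)
print(filtered)
```

Using `sum(is_oa, na.rm = TRUE)` instead of filtering would keep the undetermined articles in `n_articles`, so they would be counted as closed in `n_closed_articles`.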

@@ -292,9 +294,9 @@ The results show that 414 articles have no journal assigned within our data fram
 We can further explore the open access status distribution for the articles. We will do this on the example of the top three journals in terms of open access publication volume.

 ```{r}
-df %>%
-  filter(source_display_name %in% c("Scientific Reports", "Astronomy and Astrophysics", "Journal of High Energy Physics")) %>%
-  group_by(source_display_name, is_oa, oa_status) %>%
+df |>
+  filter(source_display_name %in% c("Scientific Reports", "Astronomy and Astrophysics", "Journal of High Energy Physics")) |>
+  group_by(source_display_name, is_oa, oa_status) |>
   summarise(n = n())
 ```
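With more than one grouping variable, `summarise` drops only the last one by default, so the result of this chunk is still grouped by *source_display_name* and *is_oa*. A following `mutate` therefore computes within the remaining groups, which is what makes per-group shares possible. A minimal sketch on made-up data (years and statuses are illustrative assumptions):

```r
library(dplyr)

# Made-up articles with a publication year and an open access status
toy <- tibble(
  publication_year = c(2020, 2020, 2020, 2021, 2021),
  oa_status = c("gold", "gold", "closed", "gold", "closed")
)

shares <- toy |>
  group_by(publication_year, oa_status) |>
  # "drop_last" is the default: the result stays grouped by publication_year
  summarise(count = n(), .groups = "drop_last") |>
  # sum(count) is therefore the yearly total, not the overall total
  mutate(perc = count / sum(count) * 100) |>
  ungroup()

print(shares)
```

Stating `.groups = "drop_last"` explicitly documents the behaviour and silences the message that `summarise` prints when the argument is omitted.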

@@ -307,9 +309,9 @@ We can combine functions from the tidyverse and ggplot2 to visualise the develop
 First, we group the data by publication year and open access status. We will then compute the number of articles in each group. In the ggplot function, we assign the publication year column to the x axis and the number of articles to the y axis. We choose point (geom_point) and line (geom_line) graph types to mark the distinct values of n and have them connected by lines. Both are provided with a colour aesthetic which is set to our open access status column. This will result in different colours being assigned to the points and lines for the different open access status values. With the *theme_minimal* option we are applying a minimal theme for the plot appearance.

 ```{r}
-df %>%
-  group_by(publication_year, oa_status) %>%
-  summarise(n = n(), .groups = "keep") %>%
+df |>
+  group_by(publication_year, oa_status) |>
+  summarise(n = n(), .groups = "keep") |>
   ggplot(aes(x = publication_year, y = n)) +
   geom_line(aes(colour = oa_status)) +
   geom_point(aes(colour = oa_status)) +
@@ -321,10 +323,10 @@ The plot shows use the total number of publications for each publication year an
 We can further choose to visualise the open access distribution over time in terms of percentages. For this we first calculate the share of each open access status per publication year and create a new column using the `mutate` function that stores these values. We pipe the result through to the `ggplot` function assigning the publication year column to the x axis and the share to the y axis. We assign the *oa_status* column to the fill argument which will plot a different colour for each open access status. For this plot we choose a bar chart (geom_bar) graph type and again apply the minimal theme.

 ```{r}
-df %>%
-  group_by(publication_year, oa_status) %>%
-  summarise(count = n()) %>%
-  mutate(perc = count / sum(count) * 100) %>%
+df |>
+  group_by(publication_year, oa_status) |>
+  summarise(count = n()) |>
+  mutate(perc = count / sum(count) * 100) |>
   ggplot(aes(x = publication_year, y = perc, fill = oa_status)) +
   geom_bar(stat = "identity") +
   theme_minimal()
@@ -359,28 +361,28 @@ Before, we analysed the distribution of open access articles across journals. Be
 ## {{< iconify proicons:code >}}&ensp;Interactive editor

 ```{webr-r}
-df %>%
-  filter(!is.na(is_oa)) %>%
-  group_by(source_display_name) %>%
+df |>
+  filter(!is.na(is_oa)) |>
+  group_by(source_display_name) |>
   summarise(
     n_articles = n(),
     n_oa_articles = sum(is_oa),
     n_closed_articles = n_articles - n_oa_articles
-  ) %>%
+  ) |>
   arrange(desc(n_oa_articles))
 ```

 ## {{< iconify proicons:checkmark-circle >}}&ensp;Solution

 ```{webr-r}
-df %>%
-  filter(!is.na(is_oa)) %>%
-  group_by(host_organization_name) %>%
+df |>
+  filter(!is.na(is_oa)) |>
+  group_by(host_organization_name) |>
   summarise(
     n_articles = n(),
     n_oa_articles = sum(is_oa),
     n_closed_articles = n_articles - n_oa_articles
-  ) %>%
+  ) |>
   arrange(desc(n_oa_articles))
 ```

@@ -398,10 +400,10 @@ Now, see if you can adapt the code we used to visualise the open access share ov
 ## {{< iconify proicons:code >}}&ensp;Interactive editor

 ```{webr-r}
-df %>%
-  group_by(publication_year, oa_status) %>%
-  summarise(count = n()) %>%
-  mutate(perc = count / sum(count) * 100) %>%
+df |>
+  group_by(publication_year, oa_status) |>
+  summarise(count = n()) |>
+  mutate(perc = count / sum(count) * 100) |>
   ggplot(aes(x = publication_year, y = perc, fill = oa_status)) +
   geom_bar(stat = "identity") +
   theme_minimal()
@@ -414,11 +416,11 @@ df %>%
 ## {{< iconify proicons:checkmark-circle >}}&ensp;Solution

 ```{webr-r}
-df %>%
-  filter(host_organization_name %in% c("Wiley", "Elsevier BV", "Springer Science+Business Media")) %>%
-  group_by(host_organization_name, oa_status) %>%
-  summarise(count = n()) %>%
-  mutate(perc = count / sum(count) * 100) %>%
+df |>
+  filter(host_organization_name %in% c("Wiley", "Elsevier BV", "Springer Science+Business Media")) |>
+  group_by(host_organization_name, oa_status) |>
+  summarise(count = n()) |>
+  mutate(perc = count / sum(count) * 100) |>
   ggplot(aes(x = host_organization_name, y = perc, fill = oa_status)) +
   geom_bar(stat = "identity") +
   theme_minimal()
