Commit e25bbb3

Update OpenAPC.en.qmd
1 parent 5fab1cd commit e25bbb3

1 file changed: +26 -21 lines changed

OpenAPC.en.qmd

Lines changed: 26 additions & 21 deletions
@@ -24,7 +24,7 @@ execute:
   warning: false
   error: false
 webr:
-  packages: ['tidyverse']
+  packages: ['dplyr','tidyr','ggplot2']
 lightbox: true
 ---
 
@@ -72,10 +72,15 @@ OpenAPC offers comprehensive information on APC payments for participating resea
 
 We will load the *dplyr* package [@dplyr] that provides a lot of additional functionality for data wrangling, and *ggplot2* [@ggplot2] for data visualization.
 
+We will load the *openalexR* package [@openalexR] that allows us to query the OpenAlex API from within our notebook. We will also load the packages *dplyr*, *tidyr* and *ggplot2* that provide a lot of additional functionalities for data wrangling and visualization and are part of the *tidyverse* package [@tidyverse].
+
+
 ```{r}
 # Installation of packages if not already installed with
-# install.packages(c("dplyr","ggplot2"))
+# install.packages(c("openalexR","tidyverse"))
+library(openalexR)
 library(dplyr)
+library(tidyr)
 library(ggplot2)
 ```
 
@@ -136,17 +141,17 @@ Because it takes research organizations time to collect and process data on APC
 
 The R package *dplyr* provides a lot of functionality for data wrangling. We can't go into much detail here, but we will demonstrate how to use some of the most useful functions.
 
-One of them is the `group_by` function. Here, we use it to group the dataset by the column *journal_full_title*. This will allow us to generate aggregate statistics for articles published within a journal. We then use the pipe operator `%>%`, which takes output from one function as input for the next function. The input is passed to the `summarise` function, where we calculate aggregate statistics. For each journal in the dataset, we calculate the total number of articles published (*n_articles*), the total of APCs paid (*sum_apc*), the average APC paid per article (*avg_apc*), and the standard deviation (*sd_apc*). The last two statistics are rounded to two decimal places. To get a better overview of the data, we then use `arrange` to sort the result by the total number of articles.
+One of them is the `group_by` function. Here, we use it to group the dataset by the column *journal_full_title*. This will allow us to generate aggregate statistics for articles published within a journal. We then use the pipe operator `|>`, which takes output from one function as input for the next function. The input is passed to the `summarise` function, where we calculate aggregate statistics. For each journal in the dataset, we calculate the total number of articles published (*n_articles*), the total of APCs paid (*sum_apc*), the average APC paid per article (*avg_apc*), and the standard deviation (*sd_apc*). The last two statistics are rounded to two decimal places. To get a better overview of the data, we then use `arrange` to sort the result by the total number of articles.
 
 ```{r}
-df %>%
-  group_by(journal_full_title) %>%
+df |>
+  group_by(journal_full_title) |>
   summarise(
     n_articles = n(),
     sum_apc = sum(euro),
     avg_apc = round(sum_apc / n_articles, 2),
     sd_apc = round(sd(euro), 2)
-  ) %>%
+  ) |>
   arrange(desc(n_articles))
 ```
 
@@ -156,8 +161,8 @@ We can explore this variance further by visualizing the distribution of APC paym
 First, we `filter` the data to focus on the three journals with the most APC payments. We then pass this output to the `ggplot` function. We assign the column *euro* to the x axis and the column *journal_full_title* to the y axis. Next, we define the type of plot we want - here, we choose a box plot to visualize a distribution. Finally, we choose the theme *theme_minimal* for a simple layout.
 
 ```{r}
-df %>%
-  filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) %>%
+df |>
+  filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) |>
   ggplot(aes(x = euro, y = journal_full_title)) +
   geom_boxplot() +
   theme_minimal()
@@ -172,9 +177,9 @@ We can combine *dplyr* and *ggplot2* to visualize the development of APC payment
 First, we group the data by *period* ("Year of APC payment (YYYY)") and *is_hybrid* ("Determines if the article has been published in a hybrid journal (TRUE) or in fully/Gold OA journal (FALSE)"). Just like above, we will generate the aggregate statistic *n_articles*. In the `ggplot` function, we assign the column *period* to the x axis, *n_articles* to the y axis, and *is_hybrid* to fill - this means that different colours will be assigned to the open access types. We choose the graph type *geom_col*, a simple bar chart.
 
 ```{r}
-df %>%
-  group_by(period, is_hybrid) %>%
-  summarise(n_articles = n()) %>%
+df |>
+  group_by(period, is_hybrid) |>
+  summarise(n_articles = n()) |>
   ggplot(aes(x = period, y = n_articles, fill = is_hybrid)) +
   geom_col() +
   theme_minimal()
@@ -210,28 +215,28 @@ Above, we analyzed APC payments by journal. Here is a copy of that code block. S
 ## {{< iconify proicons:code >}}&ensp;Interactive editor
 
 ```{webr-r}
-df %>%
-  group_by(journal_full_title) %>%
+df |>
+  group_by(journal_full_title) |>
   summarise(
     n_articles = n(),
     sum_apc = sum(euro),
     avg_apc = round(sum_apc / n_articles, 2),
     sd_apc = round(sd(euro), 2)
-  ) %>%
+  ) |>
   arrange(desc(n_articles))
 ```
 
 ## {{< iconify proicons:checkmark-circle >}}&ensp;Solution
 
 ```{webr-r}
-df %>%
-  group_by(publisher) %>%
+df |>
+  group_by(publisher) |>
   summarise(
     n_articles = n(),
     sum_apc = sum(euro),
     avg_apc = round(sum_apc / n_articles, 2),
     sd_apc = round(sd(euro), 2)
-  ) %>%
+  ) |>
   arrange(desc(n_articles))
 ```
 
@@ -248,8 +253,8 @@ Now, see if you can adapt the code above to show you the distribution of APC pay
 ## {{< iconify proicons:code >}}&ensp;Interactive editor
 
 ```{webr-r}
-df %>%
-  filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) %>%
+df |>
+  filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) |>
   ggplot(aes(x = euro, y = journal_full_title)) +
   geom_boxplot() +
   theme_minimal()
@@ -258,8 +263,8 @@ df %>%
 ## {{< iconify proicons:checkmark-circle >}}&ensp;Solution
 
 ```{webr-r}
-df %>%
-  filter(publisher %in% c("Springer Nature", "Wiley-Blackwell", "Frontiers Media SA")) %>%
+df |>
+  filter(publisher %in% c("Springer Nature", "Wiley-Blackwell", "Frontiers Media SA")) |>
   ggplot(aes(x = euro, y = publisher)) +
   geom_boxplot() +
   theme_minimal()
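
Note on the pipe change: every code hunk in this commit replaces the magrittr pipe `%>%` (re-exported by *dplyr*) with the native pipe `|>`, which requires R 4.1 or later. For the plain left-to-right chains used in this file the two should behave the same. The sketch below checks this on a small made-up data frame; `toy` and its values are assumptions for illustration only and are not the OpenAPC data frame `df`.

```r
library(dplyr)

# Hypothetical toy data standing in for the OpenAPC data frame `df`;
# the column names mirror the ones used in the diff, the values are invented.
toy <- data.frame(
  journal_full_title = c("A", "A", "B"),
  euro = c(1000, 2000, 1500)
)

# Chain written with the magrittr pipe (the removed lines)
with_magrittr <- toy %>%
  group_by(journal_full_title) %>%
  summarise(n_articles = n(), sum_apc = sum(euro)) %>%
  arrange(desc(n_articles))

# Same chain written with the native pipe (the added lines)
with_native <- toy |>
  group_by(journal_full_title) |>
  summarise(n_articles = n(), sum_apc = sum(euro)) |>
  arrange(desc(n_articles))

# Compare the two results
same_result <- isTRUE(all.equal(with_magrittr, with_native))
print(same_result)
```

The two pipes are not interchangeable in every case: the magrittr `.` placeholder has no direct equivalent in `|>` on R 4.1, so a chain that relies on it would need rewriting rather than a plain search-and-replace. None of the chains touched by this commit use it.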
