Commit 5fab1cd

Update OpenAPC.qmd
1 parent 4355511 commit 5fab1cd

1 file changed: +26 -21 lines changed


OpenAPC.qmd

Lines changed: 26 additions & 21 deletions
@@ -24,7 +24,7 @@ execute:
 warning: false
 error: false
 webr:
-packages: ['tidyverse']
+packages: ['dplyr','tidyr','ggplot2']
 lightbox: true
 ---
 
@@ -76,10 +76,15 @@ OpenAPC offers comprehensive information on APC payments for participating resea
 
 We will load the *dplyr* package [@dplyr] that provides a lot of additional functionality for data wrangling, and *ggplot2* [@ggplot2] for data visualization.
 
+We will load the *openalexR* package [@openalexR] that allows us to query the OpenAlex API from within our notebook. We will also load the packages *dplyr*, *tidyr* and *ggplot2* that provide a lot of additional functionalities for data wrangling and visualization and are part of the *tidyverse* package [@tidyverse].
+
+
 ```{r}
 # Installation of packages if not already installed with
-# install.packages(c("dplyr","ggplot2"))
+# install.packages(c("openalexR","tidyverse"))
+library(openalexR)
 library(dplyr)
+library(tidyr)
 library(ggplot2)
 ```
 
@@ -140,17 +145,17 @@ Because it takes research organizations time to collect and process data on APC
 
 The R package *dplyr* provides a lot of functionality for data wrangling. We can't go into much detail here, but we will demonstrate how to use some of the most useful functions.
 
-One of them is the `group_by` function. Here, we use it to group the dataset by the column *journal_full_title*. This will allow us to generate aggregate statistics for articles published within a journal. We then use the pipe operator `%>%`, which takes output from one function as input for the next function. The input is passed to the `summarise` function, where we calculate aggregate statistics. For each journal in the dataset, we calculate the total number of articles published (*n_articles*), the total of APCs paid (*sum_apc*), the average APC paid per article (*avg_apc*), and the standard deviation (*sd_apc*). The last two statistics are rounded to two decimal places. To get a better overview of the data, we then use `arrange` to sort the result by the total number of articles.
+One of them is the `group_by` function. Here, we use it to group the dataset by the column *journal_full_title*. This will allow us to generate aggregate statistics for articles published within a journal. We then use the pipe operator `|>`, which takes output from one function as input for the next function. The input is passed to the `summarise` function, where we calculate aggregate statistics. For each journal in the dataset, we calculate the total number of articles published (*n_articles*), the total of APCs paid (*sum_apc*), the average APC paid per article (*avg_apc*), and the standard deviation (*sd_apc*). The last two statistics are rounded to two decimal places. To get a better overview of the data, we then use `arrange` to sort the result by the total number of articles.
 
 ```{r}
-df %>%
-group_by(journal_full_title) %>%
+df |>
+group_by(journal_full_title) |>
 summarise(
 n_articles = n(),
 sum_apc = sum(euro),
 avg_apc = round(sum_apc / n_articles, 2),
 sd_apc = round(sd(euro), 2)
-) %>%
+) |>
 arrange(desc(n_articles))
 ```
 
@@ -160,8 +165,8 @@ We can explore this variance further by visualizing the distribution of APC paym
 First, we `filter` the data to focus on the three journals with the most APC payments. We then pass this output to the `ggplot` function. We assign the column *euro* to the x axis and the column *journal_full_title* to the y axis. Next, we define the type of plot we want - here, we choose a box plot to visualize a distribution. Finally, we choose the theme *theme_minimal* for a simple layout.
 
 ```{r}
-df %>%
-filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) %>%
+df |>
+filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) |>
 ggplot(aes(x = euro, y = journal_full_title)) +
 geom_boxplot() +
 theme_minimal()
@@ -176,9 +181,9 @@ We can combine *dplyr* and *ggplot2* to visualize the development of APC payment
 First, we group the data by *period* ("Year of APC payment (YYYY)") and *is_hybrid* ("Determines if the article has been published in a hybrid journal (TRUE) or in fully/Gold OA journal (FALSE)"). Just like above, we will generate the aggregate statistic *n_articles*. In the `ggplot` function, we assign the column *period* to the x axis, *n_articles* to the y axis, and *is_hybrid* to fill - this means that different colours will be assigned to the open access types. We choose the graph type *geom_col*, a simple bar chart.
 
 ```{r}
-df %>%
-group_by(period, is_hybrid) %>%
-summarise(n_articles = n()) %>%
+df |>
+group_by(period, is_hybrid) |>
+summarise(n_articles = n()) |>
 ggplot(aes(x = period, y = n_articles, fill = is_hybrid)) +
 geom_col() +
 theme_minimal()
@@ -214,28 +219,28 @@ Above, we analyzed APC payments by journal. Here is a copy of that code block. S
 ## {{< iconify proicons:code >}}&ensp;Interactive editor
 
 ```{webr-r}
-df %>%
-group_by(journal_full_title) %>%
+df |>
+group_by(journal_full_title) |>
 summarise(
 n_articles = n(),
 sum_apc = sum(euro),
 avg_apc = round(sum_apc / n_articles, 2),
 sd_apc = round(sd(euro), 2)
-) %>%
+) |>
 arrange(desc(n_articles))
 ```
 
 ## {{< iconify proicons:checkmark-circle >}}&ensp;Solution
 
 ```{webr-r}
-df %>%
-group_by(publisher) %>%
+df |>
+group_by(publisher) |>
 summarise(
 n_articles = n(),
 sum_apc = sum(euro),
 avg_apc = round(sum_apc / n_articles, 2),
 sd_apc = round(sd(euro), 2)
-) %>%
+) |>
 arrange(desc(n_articles))
 ```
 
@@ -252,8 +257,8 @@ Now, see if you can adapt the code above to show you the distribution of APC pay
 ## {{< iconify proicons:code >}}&ensp;Interactive editor
 
 ```{webr-r}
-df %>%
-filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) %>%
+df |>
+filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) |>
 ggplot(aes(x = euro, y = journal_full_title)) +
 geom_boxplot() +
 theme_minimal()
@@ -262,8 +267,8 @@ df %>%
 ## {{< iconify proicons:checkmark-circle >}}&ensp;Solution
 
 ```{webr-r}
-df %>%
-filter(publisher %in% c("Springer Nature", "Wiley-Blackwell", "Frontiers Media SA")) %>%
+df |>
+filter(publisher %in% c("Springer Nature", "Wiley-Blackwell", "Frontiers Media SA")) |>
 ggplot(aes(x = euro, y = publisher)) +
 geom_boxplot() +
 theme_minimal()
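
The pipe replacement made throughout this commit (`%>%` to `|>`) can be sanity-checked with a minimal sketch. It assumes R 4.1 or later (where the native pipe is available) and uses a small made-up data frame, `toy`, whose column names mirror the OpenAPC columns used in the notebook; the values are hypothetical and not taken from OpenAPC.

```r
library(dplyr)

# Hypothetical toy data (not OpenAPC data); column names mirror the notebook's
toy <- data.frame(
  journal_full_title = c("Journal A", "Journal A", "Journal B"),
  euro = c(1000, 2000, 1500)
)

# Native pipe, as introduced by this commit
native <- toy |>
  group_by(journal_full_title) |>
  summarise(n_articles = n(), sum_apc = sum(euro)) |>
  arrange(desc(n_articles))

# magrittr pipe, as used before this commit (re-exported by dplyr)
magrittr <- toy %>%
  group_by(journal_full_title) %>%
  summarise(n_articles = n(), sum_apc = sum(euro)) %>%
  arrange(desc(n_articles))

# Compare the two results
all.equal(native, magrittr)
```

For simple left-to-right chains like the ones in this notebook, where each step receives the previous result as its first argument, the two pipes are expected to behave the same. They differ in less common cases, for example the magrittr `.` placeholder, which the native pipe does not support.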
