OpenAPC.en.qmd
Lines changed: 26 additions & 21 deletions
@@ -24,7 +24,7 @@ execute:
   warning: false
   error: false
 webr:
-  packages: ['tidyverse']
+  packages: ['dplyr','tidyr','ggplot2']
 lightbox: true
 ---
 
@@ -72,10 +72,15 @@ OpenAPC offers comprehensive information on APC payments for participating resea
 
 We will load the *dplyr* package [@dplyr] that provides a lot of additional functionality for data wrangling, and *ggplot2*[@ggplot2] for data visualization.
 
+We will load the *openalexR* package [@openalexR] that allows us to query the OpenAlex API from within our notebook. We will also load the packages *dplyr*, *tidyr* and *ggplot2* that provide a lot of additional functionalities for data wrangling and visualization and are part of the *tidyverse* package [@tidyverse].
+
+
 ```{r}
 # Installation of packages if not already installed with
-# install.packages(c("dplyr","ggplot2"))
+# install.packages(c("openalexR","tidyverse"))
+library(openalexR)
 library(dplyr)
+library(tidyr)
 library(ggplot2)
 ```
 
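Note on the install comment in this hunk: it suggests installing the whole *tidyverse*, while the code only attaches *openalexR*, *dplyr*, *tidyr* and *ggplot2*. A minimal sketch of how a reader could check which of those four are missing before running the chunk. The variable names `pkgs` and `not_installed` are made up for illustration and do not appear in the notebook:

```r
# Packages attached by the chunk above
pkgs <- c("openalexR", "dplyr", "tidyr", "ggplot2")

# requireNamespace() returns TRUE or FALSE without attaching the package
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!is_installed]

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
  # Uncomment to install only what is missing:
  # install.packages(not_installed)
}
```

This keeps the install step limited to the packages the notebook actually loads.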
@@ -136,17 +141,17 @@ Because it takes research organizations time to collect and process data on APC
 
 The R package *dplyr* provides a lot of functionality for data wrangling. We can't go into much detail here, but we will demonstrate how to use some of the most useful functions.
 
-One of them is the `group_by` function. Here, we use it to group the dataset by the column *journal_full_title*. This will allow us to generate aggregate statistics for articles published within a journal. We then use the pipe operator `%>%`, which takes output from one function as input for the next function. The input is passed to the `summarise` function, where we calculate aggregate statistics. For each journal in the dataset, we calculate the total number of articles published (*n_articles*), the total of APCs paid (*sum_apc*), the average APC paid per article (*avg_apc*), and the standard deviation (*sd_apc*). The last two statistics are rounded to two decimal places. To get a better overview of the data, we then use `arrange` to sort the result by the total number of articles.
+One of them is the `group_by` function. Here, we use it to group the dataset by the column *journal_full_title*. This will allow us to generate aggregate statistics for articles published within a journal. We then use the pipe operator `|>`, which takes output from one function as input for the next function. The input is passed to the `summarise` function, where we calculate aggregate statistics. For each journal in the dataset, we calculate the total number of articles published (*n_articles*), the total of APCs paid (*sum_apc*), the average APC paid per article (*avg_apc*), and the standard deviation (*sd_apc*). The last two statistics are rounded to two decimal places. To get a better overview of the data, we then use `arrange` to sort the result by the total number of articles.
 
 ```{r}
-df %>%
-  group_by(journal_full_title) %>%
+df |>
+  group_by(journal_full_title) |>
   summarise(
     n_articles = n(),
     sum_apc = sum(euro),
     avg_apc = round(sum_apc / n_articles, 2),
     sd_apc = round(sd(euro), 2)
-  ) %>%
+  ) |>
   arrange(desc(n_articles))
 ```
 
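Note on the pipe change in this hunk: the native pipe `|>` needs R 4.1 or later, whereas `%>%` comes from *magrittr* and is re-exported by *dplyr*. A minimal sketch of the same pipeline on a small made-up data frame, assuming R 4.1+. The data frame `toy`, its journal names and its amounts are invented stand-ins for `df`:

```r
library(dplyr)

# Hypothetical stand-in for the OpenAPC data frame `df` (values are invented)
toy <- data.frame(
  journal_full_title = c("Journal A", "Journal A", "Journal B"),
  euro = c(1000, 2000, 1500)
)

# Same steps as the notebook chunk, written with the native pipe
journal_stats <- toy |>
  group_by(journal_full_title) |>
  summarise(
    n_articles = n(),                          # articles per journal
    sum_apc = sum(euro),                       # total APCs paid
    avg_apc = round(sum_apc / n_articles, 2),  # mean APC per article
    sd_apc = round(sd(euro), 2)                # standard deviation of APCs
  ) |>
  arrange(desc(n_articles))

print(journal_stats)
```

A journal with a single article gets `NA` for `sd_apc`, because the standard deviation is undefined for one observation.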
@@ -156,8 +161,8 @@ We can explore this variance further by visualizing the distribution of APC paym
 First, we `filter` the data to focus on the three journals with the most APC payments. We then pass this output to the `ggplot` function. We assign the column *euro* to the x axis and the column *journal_full_title* to the y axis. Next, we define the type of plot we want - here, we choose a box plot to visualize a distribution. Finally, we choose the theme *theme_minimal* for a simple layout.
 
 ```{r}
-df %>%
-  filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) %>%
+df |>
+  filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) |>
   ggplot(aes(x = euro, y = journal_full_title)) +
   geom_boxplot() +
   theme_minimal()
@@ -172,9 +177,9 @@ We can combine *dplyr* and *ggplot2* to visualize the development of APC payment
 First, we group the data by *period* ("Year of APC payment (YYYY)") and *is_hybrid* ("Determines if the article has been published in a hybrid journal (TRUE) or in fully/Gold OA journal (FALSE)"). Just like above, we will generate the aggregate statistic *n_articles*. In the `ggplot` function, we assign the column *period* to the x axis, *n_articles* to the y axis, and *is_hybrid* to fill - this means that different colours will be assigned to the open access types. We choose the graph type *geom_col*, a simple bar chart.
 
 ```{r}
-df %>%
-  group_by(period, is_hybrid) %>%
-  summarise(n_articles = n()) %>%
+df |>
+  group_by(period, is_hybrid) |>
+  summarise(n_articles = n()) |>
   ggplot(aes(x = period, y = n_articles, fill = is_hybrid)) +
   geom_col() +
   theme_minimal()
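Note on the two-column grouping in this hunk: after `group_by(period, is_hybrid)`, `summarise()` keeps the result grouped by *period* and reports this in a message unless `.groups` is set. A minimal sketch of the counting step on made-up data, assuming R 4.1+ for `|>`. The data frame `toy` and its values are invented, and `.groups = "drop"` and `count()` are possible alternatives, not what the commit uses:

```r
library(dplyr)

# Hypothetical stand-in for `df` with only the two grouping columns
toy <- data.frame(
  period = c(2020, 2020, 2020, 2021),
  is_hybrid = c(TRUE, FALSE, FALSE, TRUE)
)

# Count articles per year and journal type; drop the grouping afterwards
counts <- toy |>
  group_by(period, is_hybrid) |>
  summarise(n_articles = n(), .groups = "drop")

# count() is a shorthand for the group_by() + summarise(n()) pattern
counts_short <- toy |>
  count(period, is_hybrid, name = "n_articles")

print(counts)
```

Either result can be passed on to `ggplot()` in the same way as in the chunk above.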
@@ -210,28 +215,28 @@ Above, we analyzed APC payments by journal. Here is a copy of that code block. S