OpenAPC.qmd
Lines changed: 26 additions & 21 deletions
@@ -24,7 +24,7 @@ execute:
   warning: false
   error: false
 webr:
-  packages: ['tidyverse']
+  packages: ['dplyr','tidyr','ggplot2']
 lightbox: true
 ---
 
@@ -76,10 +76,15 @@ OpenAPC offers comprehensive information on APC payments for participating resea
 
 We will load the *dplyr* package [@dplyr] that provides a lot of additional functionality for data wrangling, and *ggplot2*[@ggplot2] for data visualization.
 
+We will load the *openalexR* package [@openalexR] that allows us to query the OpenAlex API from within our notebook. We will also load the packages *dplyr*, *tidyr* and *ggplot2* that provide a lot of additional functionalities for data wrangling and visualization and are part of the *tidyverse* package [@tidyverse].
+
+
 ```{r}
 # Installation of packages if not already installed with
-# install.packages(c("dplyr","ggplot2"))
+# install.packages(c("openalexR","tidyverse"))
+library(openalexR)
 library(dplyr)
+library(tidyr)
 library(ggplot2)
 ```
 
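A minimal sketch of the "installation if not already installed" comment in the chunk above, assuming the four packages loaded there. `missing_packages` is a hypothetical helper and not part of the notebook; the install call is left commented out, as in the original chunk.

```r
# Hypothetical helper (not in the notebook): return those entries of `pkgs`
# whose namespace cannot be loaded, i.e. packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("openalexR", "dplyr", "tidyr", "ggplot2"))

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Listing the packages individually mirrors the `library()` calls in the chunk, so only what the notebook actually loads would be installed.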
@@ -140,17 +145,17 @@ Because it takes research organizations time to collect and process data on APC
 
 The R package *dplyr* provides a lot of functionality for data wrangling. We can't go into much detail here, but we will demonstrate how to use some of the most useful functions.
 
-One of them is the `group_by` function. Here, we use it to group the dataset by the column *journal_full_title*. This will allow us to generate aggregate statistics for articles published within a journal. We then use the pipe operator `%>%`, which takes output from one function as input for the next function. The input is passed to the `summarise` function, where we calculate aggregate statistics. For each journal in the dataset, we calculate the total number of articles published (*n_articles*), the total of APCs paid (*sum_apc*), the average APC paid per article (*avg_apc*), and the standard deviation (*sd_apc*). The last two statistics are rounded to two decimal places. To get a better overview of the data, we then use `arrange` to sort the result by the total number of articles.
+One of them is the `group_by` function. Here, we use it to group the dataset by the column *journal_full_title*. This will allow us to generate aggregate statistics for articles published within a journal. We then use the pipe operator `|>`, which takes output from one function as input for the next function. The input is passed to the `summarise` function, where we calculate aggregate statistics. For each journal in the dataset, we calculate the total number of articles published (*n_articles*), the total of APCs paid (*sum_apc*), the average APC paid per article (*avg_apc*), and the standard deviation (*sd_apc*). The last two statistics are rounded to two decimal places. To get a better overview of the data, we then use `arrange` to sort the result by the total number of articles.
 
 ```{r}
-df %>%
-  group_by(journal_full_title) %>%
+df |>
+  group_by(journal_full_title) |>
   summarise(
     n_articles = n(),
     sum_apc = sum(euro),
     avg_apc = round(sum_apc / n_articles, 2),
     sd_apc = round(sd(euro), 2)
-  ) %>%
+  ) |>
   arrange(desc(n_articles))
 ```
 
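A small sketch of what the pipe swap in this hunk amounts to, run on made-up toy data rather than the OpenAPC dataset (the data frame `toy` and the function `summarise_apc` are illustrative names, not from the notebook). It assumes *dplyr* is installed and R is at least version 4.1.0, where the native pipe `|>` was introduced.

```r
library(dplyr)

# Toy stand-in for the OpenAPC data; journal names and amounts are invented.
toy <- data.frame(
  journal_full_title = c("Journal A", "Journal A", "Journal B"),
  euro = c(1000, 2000, 1500)
)

# Same pipeline as in the notebook chunk, written with the native pipe.
summarise_apc <- function(data) {
  data |>
    group_by(journal_full_title) |>
    summarise(
      n_articles = n(),
      sum_apc = sum(euro),
      avg_apc = round(sum_apc / n_articles, 2),
      sd_apc = round(sd(euro), 2)
    ) |>
    arrange(desc(n_articles))
}

result <- summarise_apc(toy)

# The magrittr pipe (re-exported by dplyr) and the native pipe should give
# the same grouped counts for a simple chain like this one.
old_style <- toy %>% group_by(journal_full_title) %>% summarise(n_articles = n())
new_style <- toy |> group_by(journal_full_title) |> summarise(n_articles = n())
same_result <- isTRUE(all.equal(old_style, new_style))
```

One detail the toy data makes visible: `sd()` returns `NA` for a single value, so a journal with only one article gets `NA` in *sd_apc*.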
@@ -160,8 +165,8 @@ We can explore this variance further by visualizing the distribution of APC paym
 First, we `filter` the data to focus on the three journals with the most APC payments. We then pass this output to the `ggplot` function. We assign the column *euro* to the x axis and the column *journal_full_title* to the y axis. Next, we define the type of plot we want - here, we choose a box plot to visualize a distribution. Finally, we choose the theme *theme_minimal* for a simple layout.
 
 ```{r}
-df %>%
-  filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) %>%
+df |>
+  filter(journal_full_title %in% c("Scientific Reports", "PLOS ONE", "Frontiers in Immunology")) |>
   ggplot(aes(x = euro, y = journal_full_title)) +
   geom_boxplot() +
   theme_minimal()
@@ -176,9 +181,9 @@ We can combine *dplyr* and *ggplot2* to visualize the development of APC payment
 First, we group the data by *period* ("Year of APC payment (YYYY)") and *is_hybrid* ("Determines if the article has been published in a hybrid journal (TRUE) or in fully/Gold OA journal (FALSE)"). Just like above, we will generate the aggregate statistic *n_articles*. In the `ggplot` function, we assign the column *period* to the x axis, *n_articles* to the y axis, and *is_hybrid* to fill - this means that different colours will be assigned to the open access types. We choose the graph type *geom_col*, a simple bar chart.
 
 ```{r}
-df %>%
-  group_by(period, is_hybrid) %>%
-  summarise(n_articles = n()) %>%
+df |>
+  group_by(period, is_hybrid) |>
+  summarise(n_articles = n()) |>
   ggplot(aes(x = period, y = n_articles, fill = is_hybrid)) +
   geom_col() +
   theme_minimal()
@@ -214,28 +219,28 @@ Above, we analyzed APC payments by journal. Here is a copy of that code block. S