Commit 1d1d431

change pipe and install of packages in OpenAlex notebook
1 parent e85939d commit 1d1d431

1 file changed, +54 -54 lines changed

OpenAlex.qmd

Lines changed: 54 additions & 54 deletions
@@ -25,7 +25,7 @@ execute:
   warning: false
   error: false
 webr:
-  packages: ['tidyverse']
+  packages: ['dplyr','tidyr','ggplot2']
 lightbox: true
 ---

@@ -154,7 +154,7 @@ The first argument within the `head` function is our data frame *df* and the sec

 Before analysing the OpenAlex data, we will have a closer look at the publisher and apc columns and perform some data wrangling tasks.

-To get an overview of the publishers present in our dataframe and the number of articles per publisher, we will use three tidyverse functions and the pipe operator `%>%` ([magrittr pipe](https://magrittr.tidyverse.org/reference/pipe.html)) or `|>` (base R pipe).
+To get an overview of the publishers present in our dataframe and the number of articles per publisher, we will use three tidyverse functions and the base R pipe operator `|>`. The [magrittr package](https://magrittr.tidyverse.org/reference/pipe.html), which is part of the tidyverse, provides another pipe operator (`%>%`), however, we will use the base R pipe since you won't need to install an additional package to use it.

 The pipe operator allows us to take e.g. a data frame or the result of a function and pass it to another function. If we type the name of our data frame followed by the pipe operator that means we don't have to specify which data we want a function to be performed on, when we call it after the pipe operator.

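To make the change in this hunk concrete, here is a minimal sketch of the base R pipe. It uses base R functions only, so it runs without installing any package. The `toy` data frame and its values are invented for illustration and are not the notebook's OpenAlex data; the native pipe assumes R 4.1 or later.

```r
# Toy data standing in for the notebook's `df` (values are invented).
toy <- data.frame(
  publisher = c("A", "B", "A", "C", "A", "B"),
  is_oa     = c(TRUE, FALSE, TRUE, TRUE, NA, TRUE)
)

# Nested calls are read from the inside out.
nested <- nrow(subset(toy, !is.na(is_oa)))

# The base R pipe passes the left-hand side as the first argument
# of the call on the right, so the steps read top to bottom.
piped <- toy |>
  subset(!is.na(is_oa)) |>
  nrow()

# Both forms count the rows with a known open access status.
identical(nested, piped)
```

Both expressions evaluate the same calls; the piped form only changes how the steps are written down.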
@@ -166,9 +166,9 @@ The `arrange` function is used to sort the data based on the n variable and the


 ```{r}
-df %>%
-  group_by(host_organization, host_organization_name) %>%
-  summarise(n = n()) %>%
+df |>
+  group_by(host_organization, host_organization_name) |>
+  summarise(n = n()) |>
   arrange(desc(n))
 ```

@@ -180,32 +180,32 @@ The output shows us some important things about the data:
 As an example we will look at the publisher names in our data frame that contain *Springer* or *Nature*. We will use the `filter` function from the tidyverse that lets us filter articles that fulfil the condition we specify in combination with the `grepl` function that allows to search for patterns. In this case we will use a regular expression as the pattern to search for.

 ```{r}
-df %>%
-  group_by(host_organization, host_organization_name) %>%
-  summarise(n = n()) %>%
-  arrange(desc(n)) %>%
+df |>
+  group_by(host_organization, host_organization_name) |>
+  summarise(n = n()) |>
+  arrange(desc(n)) |>
   filter(grepl("^Springer|Nature", host_organization_name))
 ```

 If we want to replace the Springer name variants with *Springer Nature* as publisher name, we can use the `mutate` function that lets us transform a column in our data frame in combination with the `str_replace_all` function that lets us replace string values that follow a pattern we specify to do so.

 ```{r}
-df %>%
-  mutate(host_organization_name = str_replace_all(host_organization_name, "^Springer.*$|Nature.*$", "Springer Nature")) %>%
-  group_by(host_organization_name) %>%
-  summarise(n = n()) %>%
+df |>
+  mutate(host_organization_name = str_replace_all(host_organization_name, "^Springer.*$|Nature.*$", "Springer Nature")) |>
+  group_by(host_organization_name) |>
+  summarise(n = n()) |>
   arrange(desc(n))
 ```

 Be aware that this data transformation only applies to the publisher name column and not the OpenAlex provided publisher id column. We would need to perform a separate transformation step to align both. Additionally, we did not permanently rename the publisher name values in our data frame. To do this we can either override our data frame *df* or create a new data frame by assigning the output with the `<-` operator.

 ```{r, eval=FALSE}
 # overriding our data frame
-df <- df %>%
+df <- df |>
   mutate(host_organization_name = str_replace_all(host_organization_name, "^Springer.*$|Nature.*$", "Springer Nature"))

 # creating a new data frame df2
-df2 <- df %>%
+df2 <- df |>
   mutate(host_organization_name = str_replace_all(host_organization_name, "^Springer.*$|Nature.*$", "Springer Nature"))
 ```

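The two regular expressions in this hunk can be tried in isolation. The sketch below is illustrative only: the publisher names are hypothetical, and base R `gsub()` is used as a stand-in for `stringr::str_replace_all()` so that no package is needed.

```r
# Hypothetical publisher names, chosen only to exercise the patterns.
publishers <- c(
  "Springer Science+Business Media",
  "Nature Portfolio",
  "Springer Nature",
  "Julius Springer",
  "Elsevier BV"
)

# `^Springer|Nature` matches names that start with "Springer"
# OR contain "Nature" anywhere; `^` binds only to the first alternative.
is_match <- grepl("^Springer|Nature", publishers)

# Replacement pattern from the hunk: the matched part, up to the end
# of the string, is replaced by "Springer Nature".
renamed <- gsub("^Springer.*$|Nature.*$", "Springer Nature", publishers)

data.frame(publishers, is_match, renamed)
```

Because `Nature.*$` is not anchored to the start of the string, a name with text before "Nature" would keep that leading text after the replacement.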
@@ -215,26 +215,26 @@ To access the apc values in our data frame we will use the `unnest_wider`, `unne
 The `unnest_wider` function allows us to turn each element of a list-column into a column. We will further use the `select` function to print only the resulting apc columns.

 ```{r}
-df %>%
-  unnest_wider(apc, names_sep = ".") %>%
+df |>
+  unnest_wider(apc, names_sep = ".") |>
   select(starts_with("apc"))
 ```

 The output shows that the values in the apc columns are lists. This is because OpenAlex provides information for APC list prices and prices of APCs that were actually paid, when available. To transform the lists into single value cells we will use the `unnest_longer` function that does precisely that.

 ```{r}
-df %>%
-  unnest_wider(apc, names_sep = ".") %>%
-  unnest_longer(c(apc.type, apc.value, apc.currency, apc.value_usd, apc.provenance)) %>%
+df |>
+  unnest_wider(apc, names_sep = ".") |>
+  unnest_longer(c(apc.type, apc.value, apc.currency, apc.value_usd, apc.provenance)) |>
   select(id, starts_with("apc"))
 ```

 The output shows that we now have multiple rows for the same id. To transform the data frame to single rows for each id we will use the `pivot_wider` function that allows us to increase the number of columns and decrease the number of rows.

 ```{r}
-df %>%
-  unnest_wider(apc, names_sep = ".") %>%
-  unnest_longer(c(apc.type, apc.value, apc.currency, apc.value_usd, apc.provenance)) %>%
+df |>
+  unnest_wider(apc, names_sep = ".") |>
+  unnest_longer(c(apc.type, apc.value, apc.currency, apc.value_usd, apc.provenance)) |>
   pivot_wider(id_cols = id, names_from = apc.type, values_from = c(apc.value, apc.currency, apc.value_usd, apc.provenance))
 ```

@@ -280,14 +280,14 @@ round(sum(df$is_oa, na.rm = T) / n_distinct(df$id) * 100, 2)
 To analyse the distribution of open access articles across journals we will calculate the total number of articles (n_articles), the total number of open access articles (n_oa_articles) and the total number of closed articles (n_closed_articles) per journal. We will again use the `group_by`, `summarise` and `arrange` functions. However, since we noticed that `r sum(is.na(df$is_oa))` articles have an undetermined open access status, we will first filter out all rows with *NA* values in the *is_oa* column. We will also group the data by the *source_display_name* column to generate aggregate statistics for the journals.

 ```{r}
-df %>%
-  filter(!is.na(is_oa)) %>%
-  group_by(source_display_name) %>%
+df |>
+  filter(!is.na(is_oa)) |>
+  group_by(source_display_name) |>
   summarise(
     n_articles = n(),
     n_oa_articles = sum(is_oa),
     n_closed_articles = n_articles - n_oa_articles
-  ) %>%
+  ) |>
   arrange(desc(n_oa_articles))
 ```

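The reason for filtering out *NA* values before summing can be shown on a small scale. This is a sketch in base R under stated assumptions: the `toy` data frame is invented, and `tapply()` stands in for the `group_by`/`summarise` steps of the notebook.

```r
# Invented journal data; NA marks an undetermined open access status.
toy <- data.frame(
  source_display_name = c("J1", "J1", "J1", "J2", "J2"),
  is_oa               = c(TRUE, FALSE, NA, TRUE, TRUE)
)

# Without filtering, a single NA turns the whole sum into NA.
sum(toy$is_oa)

# Drop rows with unknown status first, as the notebook does.
known <- toy[!is.na(toy$is_oa), ]

# Count all articles and open access articles per journal.
n_articles    <- tapply(known$is_oa, known$source_display_name, length)
n_oa_articles <- tapply(known$is_oa, known$source_display_name, sum)

result <- data.frame(
  source_display_name = names(n_articles),
  n_articles          = as.vector(n_articles),
  n_oa_articles       = as.vector(n_oa_articles)
)
result$n_closed_articles <- result$n_articles - result$n_oa_articles

# Sort by open access volume, largest first.
result[order(-result$n_oa_articles), ]
```

Summing a logical column counts its `TRUE` values, which is why `sum(is_oa)` yields the number of open access articles once the *NA* rows are gone.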
@@ -296,9 +296,9 @@ The results show that 414 articles have no journal assigned within our data fram
 We can further explore the open access status distribution for the articles. We will do this on the example of the top three journals in terms of open access publication volume.

 ```{r}
-df %>%
-  filter(source_display_name %in% c("Scientific Reports", "Astronomy and Astrophysics", "Journal of High Energy Physics")) %>%
-  group_by(source_display_name, is_oa, oa_status) %>%
+df |>
+  filter(source_display_name %in% c("Scientific Reports", "Astronomy and Astrophysics", "Journal of High Energy Physics")) |>
+  group_by(source_display_name, is_oa, oa_status) |>
   summarise(n = n())
 ```

@@ -311,9 +311,9 @@ We can combine functions from the tidyverse and ggplot2 to visualise the develop
 First, we group the data by publication year and open access status. We will then compute the number of articles in each group. In the ggplot function, we assign the publication year column to the x axis and the number of articles to the y axis. We choose point (geom_point) and line (geom_line) graph types to mark the distinct values of n and have them connected by lines. Both are provided with a colour aesthetic which is set to our open access status column. This will result in different colours being assigned to the points and lines for the different open access status values. With the *theme_minimal* option we are applying a minimal theme for the plot appearance.

 ```{r}
-df %>%
-  group_by(publication_year, oa_status) %>%
-  summarise(n = n(), .groups = "keep") %>%
+df |>
+  group_by(publication_year, oa_status) |>
+  summarise(n = n(), .groups = "keep") |>
   ggplot(aes(x = publication_year, y = n)) +
   geom_line(aes(colour = oa_status)) +
   geom_point(aes(colour = oa_status)) +
@@ -325,10 +325,10 @@ The plot shows use the total number of publications for each publication year an
 We can further choose to visualise the open access distribution over time in terms of percentages. For this we first calculate the share of each open access status per publication year and create a new column using the `mutate` function that stores these values. We pipe the result through to the `ggplot` function assigning the publication year column to the x axis and the share to the y axis. We assign the *oa_status* column to the fill argument which will plot a different colour for each open access status. For this plot we choose a bar chart (geom_bar) graph type and again apply the minimal theme.

 ```{r}
-df %>%
-  group_by(publication_year, oa_status) %>%
-  summarise(count = n()) %>%
-  mutate(perc = count / sum(count) * 100) %>%
+df |>
+  group_by(publication_year, oa_status) |>
+  summarise(count = n()) |>
+  mutate(perc = count / sum(count) * 100) |>
   ggplot(aes(x = publication_year, y = perc, fill = oa_status)) +
   geom_bar(stat = "identity") +
   theme_minimal()
@@ -363,28 +363,28 @@ Before, we analysed the distribution of open access articles across journals. Be
 ## {{< iconify proicons:code >}}&ensp;Interactive editor

 ```{webr-r}
-df %>%
-  filter(!is.na(is_oa)) %>%
-  group_by(source_display_name) %>%
+df |>
+  filter(!is.na(is_oa)) |>
+  group_by(source_display_name) |>
   summarise(
     n_articles = n(),
     n_oa_articles = sum(is_oa),
     n_closed_articles = n_articles - n_oa_articles
-  ) %>%
+  ) |>
   arrange(desc(n_oa_articles))
 ```

 ## {{< iconify proicons:checkmark-circle >}}&ensp;Solution

 ```{webr-r}
-df %>%
-  filter(!is.na(is_oa)) %>%
-  group_by(host_organization_name) %>%
+df |>
+  filter(!is.na(is_oa)) |>
+  group_by(host_organization_name) |>
   summarise(
     n_articles = n(),
     n_oa_articles = sum(is_oa),
     n_closed_articles = n_articles - n_oa_articles
-  ) %>%
+  ) |>
   arrange(desc(n_oa_articles))
 ```

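Several hunks in this commit touch the step `mutate(perc = count / sum(count) * 100)`. In dplyr, `summarise()` drops the last grouping variable by default, so after grouping by two columns the result is still grouped by the first one, and `sum(count)` is taken within each remaining group rather than over the whole table. The sketch below reproduces that per-group denominator in base R with `ave()`; the `counts` data frame and its numbers are invented for illustration.

```r
# Invented counts per publication year and open access status.
counts <- data.frame(
  publication_year = c(2020, 2020, 2021, 2021),
  oa_status        = c("gold", "closed", "gold", "closed"),
  count            = c(30, 70, 50, 50)
)

# ave() returns, for every row, the sum of `count` within its year,
# which plays the role of sum(count) on data grouped by year.
year_total <- ave(counts$count, counts$publication_year, FUN = sum)

# Share of each status within its publication year, in percent.
counts$perc <- counts$count / year_total * 100

counts
```

With this denominator the shares add up to 100 within each year, which is what a stacked bar chart of percentages expects.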
@@ -402,10 +402,10 @@ Now, see if you can adapt the code we used to visualise the open access share ov
 ## {{< iconify proicons:code >}}&ensp;Interactive editor

 ```{webr-r}
-df %>%
-  group_by(publication_year, oa_status) %>%
-  summarise(count = n()) %>%
-  mutate(perc = count / sum(count) * 100) %>%
+df |>
+  group_by(publication_year, oa_status) |>
+  summarise(count = n()) |>
+  mutate(perc = count / sum(count) * 100) |>
   ggplot(aes(x = publication_year, y = perc, fill = oa_status)) +
   geom_bar(stat = "identity") +
   theme_minimal()
@@ -418,11 +418,11 @@ df %>%
 ## {{< iconify proicons:checkmark-circle >}}&ensp;Solution

 ```{webr-r}
-df %>%
-  filter(host_organization_name %in% c("Wiley", "Elsevier BV", "Springer Science+Business Media")) %>%
-  group_by(host_organization_name, oa_status) %>%
-  summarise(count = n()) %>%
-  mutate(perc = count / sum(count) * 100) %>%
+df |>
+  filter(host_organization_name %in% c("Wiley", "Elsevier BV", "Springer Science+Business Media")) |>
+  group_by(host_organization_name, oa_status) |>
+  summarise(count = n()) |>
+  mutate(perc = count / sum(count) * 100) |>
   ggplot(aes(x = host_organization_name, y = perc, fill = oa_status)) +
   geom_bar(stat = "identity") +
   theme_minimal()
