
Commit

incorporated updates/fixes from coreR 10/2024
oharac committed Oct 24, 2024
1 parent 47618e1 commit 3d99f23
Showing 8 changed files with 246 additions and 223 deletions.
74 changes: 53 additions & 21 deletions materials/sections/clean-wrangle-data.qmd
@@ -21,12 +21,12 @@ Suppose you have the following `data.frame` called `length_data` with data about

| year| length\_cm|
|-----:|-----------:|
- | 1990| 5.673318|
- | 1991| 3.081224|
- | 1991| 4.592696|
- | 1992| 4.381523|
- | 1992| 5.597777|
- | 1992| 4.900052|
+ | 1990| 5.6|
+ | 1991| 3.0|
+ | 1991| 4.5|
+ | 1992| 4.3|
+ | 1992| 5.5|
+ | 1992| 4.9|

Before thinking about the code, let's think about the steps we need to take to get to the answer (aka pseudocode).
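
A minimal `dplyr` sketch of those steps (assuming the goal is the mean `length_cm` per `year`, as the table suggests) might look like:

```{r}
#| eval: false
library(dplyr)

length_data %>%
  group_by(year) %>%                          # one group of rows per year
  summarize(mean_length_cm = mean(length_cm)) # average length within each group
```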

@@ -50,10 +50,10 @@ length_data %>%

| site | 1990 | 1991 | ... | 1993 |
|--------|------|------|-----|------|
- | gold   | 100  | 118  | ... | 112  |
- | lake   | 100  | 118  | ... | 112  |
+ | gold   | 101  | 109  | ... | 112  |
+ | lake   | 104  | 98   | ... | 102  |
  | ...    | ...  | ...  | ... | ...  |
- | dredge | 100  | 118  | ... | 112  |
+ | dredge | 144  | 118  | ... | 145  |

You are probably familiar with data in the above format, where values of the variable being observed are spread out across columns.
In this example, we have a separate column for each year.
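
To reshape such a wide table into long format, `tidyr::pivot_longer()` (used later in this lesson) gathers the year columns into a name/value pair of columns. A sketch, assuming the table above is a data frame called `site_data` holding counts:

```{r}
#| eval: false
library(dplyr)
library(tidyr)

site_data_long <- site_data %>%
  pivot_longer(-site,               # gather every column except `site`
               names_to = "year",   # old column names become a `year` column
               values_to = "count") # cell values become a `count` column
```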
@@ -178,14 +178,15 @@ The code chunk you use to read in the data should look something like this:
catch_original <- read_csv("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1")
```

- **Note for Windows users:** Keep in mind, if you want to replicate this workflow in your local computer you also need to use the `url()` function here with the argument `method = "libcurl"`.
+ <!-- I think this is not true, at least on my Windows machine - is this a holdover from `read.csv` instead of `read_csv`? -->
+ <!-- **Note for Windows users:** Keep in mind, if you want to replicate this workflow in your local computer you also need to use the `url()` function here with the argument `method = "libcurl"`. -->

- It would look like this:
+ <!-- It would look like this: -->

- ```{r}
- #| eval: false
- catch_original <- read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1", method = "libcurl"))
- ```
+ <!-- ```{r} -->
+ <!-- #| eval: false -->
+ <!-- catch_original <- read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1", method = "libcurl")) -->
+ <!-- ``` -->

:::

@@ -271,7 +272,7 @@ If you think of the assignment operator (`<-`) as reading like "gets", then the

So you might think of the above chunk being translated as:

- > The cleaned data frame gets the original data, and then a filter (of the original data), and then a select (of the filtered data).
+ > The cleaned data frame **gets** the original data, and **then** a filter (of the original data), and **then** a select (of the filtered data).

The benefit of using pipes is that you don't have to keep track of (or overwrite) intermediate data frames. The drawback is that it can be more difficult to explain the reasoning behind each step, especially when many operations are chained together. It is good to strike a balance between writing efficient code (chaining operations) and clearly explaining, both to your future self and others, what you are doing and why you are doing it.
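
For example, a hypothetical two-step cleanup could be written either way (the `Chinook` column here is illustrative):

```{r}
#| eval: false
## without pipes: intermediate data frames to name and keep track of
catch_filtered <- filter(catch_original, Region == "SSE")
catch_cleaned  <- select(catch_filtered, Region, Year, Chinook)

## with pipes: the same steps, with no intermediate objects
catch_cleaned <- catch_original %>%
  filter(Region == "SSE") %>%
  select(Region, Year, Chinook)
```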

@@ -565,6 +566,28 @@ sse_catch <- catch_long %>%
head(sse_catch)
```

::: {.callout-important}

## `==` and `%in%` operators

The `filter()` function applies a logical test to every row of a data frame, and keeps each row where the test is `TRUE`. The `==` operator tests whether the left-hand side and the right-hand side match: in the example above, does the value of the `Region` variable match the value `"SSE"`?

But if you want to test whether a variable's value falls within a set of possible values, *do not* use the `==` operator: it will very likely give incorrect results! Instead, use the `%in%` operator:
```{r}
## == recycles the comparison vector: some matching rows are silently dropped
catch_long %>% 
  filter(Region == c("SSE", "ALU")) %>% 
  nrow()

## %in% tests set membership for every row: all SSE and ALU rows are kept
catch_long %>% 
  filter(Region %in% c("SSE", "ALU")) %>% 
  nrow()
```

This is because the `==` version "recycles" the vector of allowed values, so it tests whether the first row matches `"SSE"` (yep!), whether the second matches `"ALU"` (nope! this row gets dropped!), and then whether the third is `"SSE"` again and so on.
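
You can see the recycling directly with a small vector (a toy example, not from the dataset):

```{r}
regions <- c("SSE", "ALU", "ALU", "SSE")
regions == c("SSE", "ALU")   # recycled comparison:  TRUE TRUE FALSE FALSE
regions %in% c("SSE", "ALU") # set membership:       TRUE TRUE TRUE  TRUE
```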

Note that the `%in%` operator actually works for single values too, so you can never go wrong with that!
:::

::: {.callout-note icon=false}
## Exercise

@@ -581,13 +604,13 @@ catch_million <- catch_long %>%
  filter(catch > 1000000)

## Chinook from SSE data
- chinook_see <- catch_long %>%
+ chinook_sse <- catch_long %>%
  filter(Region == "SSE",
         species == "Chinook")

- ## OR
- chinook_see <- catch_long %>%
-   filter(Region == "SSE" & species == "Chinook")
+ ## OR combine tests with & ("and") or | ("or")... also, we can swap == for %in%
+ chinook_sse <- catch_long %>%
+   filter(Region %in% "SSE" & species %in% "Chinook")
```
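
A sketch of the `|` ("or") version mentioned in the comment above, keeping Chinook rows from either region:

```{r}
#| eval: false
## rows where Region is SSE *or* ALU
chinook_either <- catch_long %>%
  filter(Region == "SSE" | Region == "ALU",
         species == "Chinook")
```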
:::

@@ -700,14 +723,23 @@ mean_region <- catch_original %>%
  pivot_longer(-c(Region, Year),
               names_to = "species",
               values_to = "catch") %>%
-  mutate(catch = catch*1000) %>%
+  mutate(catch = catch * 1000) %>%
  group_by(Region) %>%
  summarize(mean_catch = mean(catch)) %>%
  arrange(desc(mean_catch))

head(mean_region)
```

## Write out the results with `readr::write_csv()`

Now that we have performed all this data wrangling, we can save the results for future use with `readr::write_csv()`.

```{r}
#| eval: false
write_csv(mean_region, here::here("data/mean_catch_by_region.csv"))
```
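
To confirm the file was written as expected, you can read it back in (path assumed from the chunk above):

```{r}
#| eval: false
mean_region_check <- read_csv(here::here("data/mean_catch_by_region.csv"))
head(mean_region_check)
```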


We have completed our lesson on Cleaning and Wrangling data. Before we break, let's practice our Git workflow.

6 changes: 3 additions & 3 deletions materials/sections/data-management-essentials.qmd
@@ -139,11 +139,11 @@ The article *Ten Simple Rules for Creating a Good Data Management Plan* (@michen

#### Define how the data will be organized

- - Once you know the data you will be using (rule #2) it is time to define how are you going to work with your data. Where will the raw data live? How are the different collaborators going to access the data? The needs vary widely from one project to another depending on the data. When drafting your DMP is helpful to focus on identifying what products and software you will be using. When collaborating with a team it is important to identify f there are any limitations to accessing any software or tool.
+ - Once you know the data you will be using (rule #2), it is time to define how you are going to work with your data. Where will the raw data live? How will the different collaborators access the data? The needs vary widely from one project to another depending on the data. When drafting your DMP, it is helpful to focus on identifying what products and software you will be using. When collaborating with a team, it is important to identify whether there are any limitations to accessing any software or tool.

- Resource

- - [Here is an example](https://nceas.github.io/scicomp.github.io/tutorial_server.html) from the LTER Scientific Computing Support Team on working on NCEAS Server.
+ - [Here is an example](https://lter.github.io/workshop-github/server.html) from the LTER Scientific Computing Support Team on working on the NCEAS server.

#### Explain how the data will be documented

@@ -283,7 +283,7 @@ So, **how does a computer organize all this information?** There are a number of
- [Ecological Metadata Language (EML)](https://eml.ecoinformatics.org/)
- [Geospatial Metadata Standards (ISO 19115 and ISO 19139)](https://www.fgdc.gov/metadata/iso-standards)
- See [NOAA's ISO Workbook](http://www.ncei.noaa.gov/sites/default/files/2020-04/ISO%2019115-2%20Workbook_Part%20II%20Extentions%20for%20imagery%20and%20Gridded%20Data.pdf)
- - [Biological Data Profile (BDP)](chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/https://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/biometadata/biodatap.pdf)
+ - [Biological Data Profile (BDP)](https://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/biometadata/biodatap.pdf)
- [Dublin Core](https://www.dublincore.org/)
- [Darwin Core](https://dwc.tdwg.org/)
- [PREservation Metadata: Implementation Strategies (PREMIS)](https://www.loc.gov/standards/premis/)
