diff --git a/images/XML_series_pbject.png b/images/XML_series_pbject.png new file mode 100644 index 00000000..cc6ffc3a Binary files /dev/null and b/images/XML_series_pbject.png differ diff --git a/images/physical_screenshot.png b/images/physical_screenshot.png new file mode 100644 index 00000000..403bccef Binary files /dev/null and b/images/physical_screenshot.png differ diff --git a/training/01_introduction.Rmd b/training/01_introduction.Rmd index 53db01d4..012e4df8 100644 --- a/training/01_introduction.Rmd +++ b/training/01_introduction.Rmd @@ -22,13 +22,13 @@ Read Matt Jones et al.'s paper on education resources related to data management. +You may also want to explore the DataONE education resources related to data management. ## Using DataONE **Data Observation Network for Earth** (DataONE) is a community-driven initiative that provides access to data across multiple member repositories, supporting enhanced search and discovery of Earth and environmental data. -Read more about what DataONE is here and about DataONE member node (MN) guidelines here. Please feel free to ask Jeanette any questions you have about DataONE. +Read more about what DataONE is here and about DataONE member node (MN) guidelines here. Please feel free to ask Jeanette any questions you have about DataONE. We will be applying these concepts in the next chapter. @@ -36,10 +36,13 @@ We will be applying these concepts in the next chapter. All of the work that we do at NCEAS is done on our remote server, datateam.nceas.ucsb.edu. If you have never worked on a remote server before, you can think of it like working on a different computer via the internet. -We access RStudio on our server through this link. This is the same as your desktop version of RStudio with one main difference is that files are on the server. Please do all your work here. This way you can share your code with the rest of us. +We access RStudio on our server through this link. This is the same as your desktop version of RStudio, with one main difference: the files are on the server. **Please do all your work here, and bookmark this link. Do not use RStudio on your local computer.** By only using the RStudio server, it is easier to share your code with the rest of us. + +### Check your understanding {.exercise} +* Open a new tab in your browser and try logging into the [remote server](https://datateam.nceas.ucsb.edu/rstudio/) using your SSH credentials. ```{block, type = "note"} -If you R session is frozen and unresponsive check out [the guide](https://help.nceas.ucsb.edu/NCEAS/Computing/rstudio_server.html) on how to fix it. +If your R session is frozen and unresponsive, check out [the guide](https://help.nceas.ucsb.edu/NCEAS/Computing/rstudio_server.html) on how to fix it. ``` ## A note on paths @@ -52,14 +55,17 @@ When you write scripts, try to avoid writing relative paths (which rely on what ## A note on R -This training assumes basic knowledge of R and RStudio. If you want a quick R refresher, walk through Jenny Bryan's excellent materials [here](http://stat545.com/block002_hello-r-workspace-wd-project.html). +This training assumes basic knowledge of R and RStudio. Spend at least 30 minutes walking through Jenny Bryan's excellent materials [here](http://stat545.com/block002_hello-r-workspace-wd-project.html) for a refresher. Throughout this training we will occasionally use the namespace syntax `package_name::function_name()` when writing a function. This syntax denotes which package a function came from.
For example `dataone::getSystemMetadata` selects the `getSystemMetadata` function from the `dataone` R package. More detailed information on namespaces can be found [here](http://r-pkgs.had.co.nz/namespace.html). ## A note on effective troubleshooting in R -We suggest using a combination of **m**inimal **r**eproducible **e**xamples (MRE) and the package `reprex` to create **rep**roducible **ex**amples. This will allow others to better help you if we can run the code on our own computers. -A MRE is stripping down your code to only the parts that cause the bug. +One of the advantages of using the R programming language is the extensive documentation that is available for R packages. The R help operator `?` can be used to learn more about functions from all of the R packages we use. You can put the operator before the name of any function to view its documentation in RStudio: `?function_name` + +When asking for help in the `#datateam` channel in Slack, we suggest using a combination of **m**inimal **r**eproducible **e**xamples (MRE) and the package `reprex` to create **rep**roducible **ex**amples. This will allow others to better help you, since we can run the code on our own computers. + +An MRE strips your code down to only the parts that cause the bug. When troubleshooting errors over Slack, send the code that returned an error **and** the error message itself. How to generate a reprex: @@ -90,7 +96,7 @@ att_list <- set_attributes(attributes) doc_ex <- list(packageId = "id", system = "system", - dataset = list(title = "A Mimimal Valid EML Dataset", + dataset = list(title = "A Minimal Valid EML Dataset", creator = me, contact = me, dataTable = list(entityName = "data table", attributeList = att_list)) @@ -103,17 +109,19 @@ The rest of the training has a series of exercises. These are meant to take you Please note that you will be completing everything on the site for the training. In the future, if you are unsure about doing anything with a dataset, the test site is a good place to try things out! ## Exercise 1 {.exercise} -This part of the exercise walks you through submitting data through the web form on "test.arcticdata.io" +This part of the exercise walks you through submitting data through the web form on "test.arcticdata.io". In addition to learning to use the webform, this exercise will also help you practice sleuthing for information in order to provide complete metadata. Most datasets do not come with all contextual information, so you will need to skim cited literature and search Google for definitions of discipline-specific jargon. Don't be afraid to use the internet as a resource! ### Part 1 * Download the [csv](data/Loranty_2016_Environ._Res._Lett._11_095008.csv) of Table 1 from this paper. -* Reformat the table to meet the guidelines outlined in the journal article on effective data management (this might be easier to do in an interactive environment like Excel). -* Note - we usually don't edit the content in data submissions so don't stress over this part too much +* Reformat the table to meet the guidelines outlined in the journal article on effective data management (this might be easier to do in an interactive environment like Excel). + + Hint: This table is in wide format and can be made [longer](https://arcticdata.io/submit/#file-content-guidelines). 
+* Note: we usually don't edit the content in data submissions, so don't stress over this part too much. ### Part 2 * Go to "test.arcticdata.io" and submit your reformatted file with appropriate metadata that you derive from the text of the paper: - + list yourself as the first 'Creator' so your test submission can easily be found, - + for the purposes of this training exercise, not every single author needs to be listed with full contact details, listing the first two authors is fine, - + directly copying and pasting sections from the paper (abstract, methods, etc.) is also fine, - + attributes (column names) should be defined, including correct units and missing value codes. - + submit the dataset + + List yourself as the first 'Creator' so your test submission can easily be found. + + For the purposes of this training exercise, not every single author needs to be listed with full contact details, listing the first two authors is fine. + + Directly copying and pasting sections from the paper (abstract, methods, etc.) is also fine. + + Attributes (column names) should be defined, including correct units and missing value codes. + * Click "describe" to the right of the file name in order to add file-specific information. The title and description can be edited in the "Overview" tab, while attributes are defined in the "Attributes" tab. + + Submit the dataset and post a message to the #datateam channel with a link to your package. diff --git a/training/02_creating_a_data_package.Rmd b/training/02_creating_a_data_package.Rmd index a687219f..b294119d 100644 --- a/training/02_creating_a_data_package.Rmd +++ b/training/02_creating_a_data_package.Rmd @@ -6,17 +6,17 @@ This chapter will teach you how to create and submit a data package to a DataONE A data package generally consists of at least 3 components. -1. Metadata: One object is the metadata file itself. In case you are unfamiliar with metadata, metadata are information that describe data (e.g. who made the data, how were the data made, etc.). The metadata file will be in an XML format, and have the extension `.xml` (extensible markup language). We often refer to this file as the EML, which is the metadata standard that it uses. This is also what you see when you click on a page in the Arctic Data Center. +1. Metadata: One object is the metadata file itself. In case you are unfamiliar with metadata, metadata are information that describe data (e.g. who made the data, how the data were made, etc.). The metadata file will be in an XML format, and have the extension `.xml` (extensible markup language). We often refer to this file as the EML (Ecological Metadata Language), which is the metadata standard that it uses. Each dataset page in the Arctic Data Center is a direct representation of an EML document, made to look prettier for the web. 2. Data: Other objects in a package are the data files themselves. Most commonly these are data tables (`.csv`), but they can also be audio files, NetCDF files, plain text files, PDF documents, image files, etc. -3. Resource Map: The final object is the resource map. This object is a plain text file with the extension `.rdf` (Resource Description Framework) that defines the relationships between all of the other objects in the data package. It says things like "this metadata file describes this data file," and is critical to making a data package render correctly on the website with the metadata file and all of the data files together in the correct place. 
Fortunately, we rarely, if ever, have to actually look at the contents of resource maps; they are generated for us using tools in R. +3. Resource Map: The final object is the resource map. This object is a plain text file with the extension `.rdf` (Resource Description Framework) that defines the relationships between all of the other objects in the data package. You can think of it like a "basket" that holds the metadata file and all data files together. It says things like "this metadata file describes this data file," and is critical to making a data package render correctly on the website. Fortunately, we rarely, if ever, have to actually look at the contents of resource maps; they are generated for us using tools in R. ![From the DataOne Community Meeting (Session 7)](images/data-submission-workflow2.png) ## Packages on the Website -All of the package information is represented when you go to the landing page for a dataset. When you make changes through R those published changes will be reflected here. Although you can edit the metadata directly from the webpage but we recommend to use R in most cases. +All of the package information is represented when you go to the landing page for a dataset. In the previous section, you uploaded a data file and made edits to the metadata using the web editor. When you make changes to the metadata and data files through R, those published changes will also be reflected here. ![](images/arctic_data_center_web.png) @@ -34,7 +34,7 @@ Different versions of a package are linked together by what we call the "version ## Upload a package -We will be using R to connect to the NSF Arctic Data Center (ADC) data repository to push and pull edits in actual datasets. To identify yourself as an admin you will need to pass a 'token' into R. Do this by signing in to the ADC with your ORCid and password, then hovering over your name in the top right corner and clicking on "My profile", then navigating to "Settings" and "Authentication Token", copying the "Token for DataONE R", and finally pasting and running it in your *R console*. +We will be using R to connect to the NSF Arctic Data Center (ADC) data repository to push and pull edits in actual datasets. To identify yourself as an admin, you will need to pass a 'token' into R. Do this by signing in to the ADC with your ORCID and password, then hovering over your name in the top right corner and clicking on "My profile", then navigating to "Settings" and "Authentication Token", copying the "Token for DataONE R", and finally pasting and running it in your *R console*. The console is the bottom left window in RStudio. ```{block, type = "warning"} **This token is your identity on these sites, please treat it as you would a password** (i.e. don't paste into scripts that will be shared). The easiest way to do this is to always run the token in the *console*. There's no need to keep it in your script since it's temporary anyway. ``` @@ -42,6 +42,8 @@ We will be using R to connect to the +Answer +
+`otherEntity` requires `entityType` and `entityName` children, or alternatively will accept only `references`. It is a series object, so there can be multiple `otherEntities`. Along with `otherEntity` and `creator`, `dataTable` and `attribute` can also be series objects. + + ```{r, child = '../workflows/explore_eml/access_specific_elements.Rmd'} ``` diff --git a/training/04_editing_eml.Rmd b/training/04_editing_eml.Rmd index 251a897e..4c41310a 100644 --- a/training/04_editing_eml.Rmd +++ b/training/04_editing_eml.Rmd @@ -7,6 +7,9 @@ Most of the functions you will see in this chapter will use the `arcticdatautils ```{block, type = "note"} This chapter will be the longest of all the sections! This is a reminder to take frequent breaks when completing this section. ``` +```{block, type = "note"} +When using R to edit EML documents, run each line individually by highlighting the line and pressing CTRL+ENTER. Many EML functions only need to be run once, and will either produce errors or make the EML invalid if run multiple times. +``` ```{r, child = '../workflows/edit_eml/edit_an_eml_element.Rmd'} ``` @@ -64,7 +67,7 @@ resource_map_pid <- ... dp <- getDataPackage(d1c_test, identifier=resource_map_pid, lazyLoad=TRUE, quiet=FALSE) # get metadata pid -mo <- selectMember(...) +metadataId <- selectMember(...) # read in EML doc <- read_eml(getObject(...)) @@ -95,10 +98,16 @@ You should see something like this if everything passes: >attr(,"errors") >character(0) +```{block, type = "note"} +When troubleshooting EML errors, it is helpful to run `eml_validate()` after every edit to the EML document in order to pinpoint the problematic code. +``` + + Then save your EML to a path of your choice or a temp file. You will later pass this path as an argument to update the package. ```{r, eval = F} -eml_path <- "path/to/save/eml.xml" +# Create a standardized EML file name from the dataset title +eml_path <- arcticdatautils::title_to_file_name(doc$dataset$title) write_eml(doc, eml_path) ``` @@ -111,7 +120,7 @@ After adding more metadata, we want to publish the dataset onto `test.arcticdata * Validate your metadata using `eml_validate`. * Use the [checklist](#final-checklist) to review your submission. -* Make edits where necessary +* Make edits where necessary (e.g. physicals) Once `eml_validate` returns `TRUE` go ahead and run `write_eml`, `replaceMember`, and `uploadDataPackage`. There might be a small lag for your changes to appear on the website. This part of the workflow will look roughly like this: @@ -121,7 +130,7 @@ eml_validate(...) write_eml(...) # replace the old metadata file with the new one in the local package -dp <- replaceMember(dp, ...) +dp <- replaceMember(dp, metadataId, replacement = eml_path) # upload the data package packageId <- uploadDataPackage(...) diff --git a/training/06_editing_sysmeta.Rmd b/training/06_editing_sysmeta.Rmd index 2bdaa356..c0223bc3 100644 --- a/training/06_editing_sysmeta.Rmd +++ b/training/06_editing_sysmeta.Rmd @@ -14,4 +14,4 @@ Sometimes the system doesn't recognize the file types properly. For example you * Read the system metadata in from the data file you uploaded [previously](#exercise-4). * Check to make sure the `fileName` and `formatId` are set correctly (the extension in `fileName` should match the `formatId`). -* Update the system metadata if necessary. +* Update the system metadata if necessary. CSVs have the formatId "text/csv". 
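Putting those exercise steps together, a minimal sketch of the check-and-fix workflow might look like this (assuming `pid` holds the PID of the CSV you uploaded):

```{r, eval = FALSE}
# Read in the system metadata for the uploaded file
sysmeta <- getSystemMetadata(d1c_test@mn, pid)

# The extension in fileName should match the formatId
sysmeta@fileName
sysmeta@formatId

# If the CSV was recognized as something else, correct the formatId and push the change
sysmeta@formatId <- "text/csv"
updateSystemMetadata(d1c_test@mn, pid, sysmeta)
```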
diff --git a/training/08_using_git.Rmd b/training/08_using_git.Rmd index afba70db..000caaee 100644 --- a/training/08_using_git.Rmd +++ b/training/08_using_git.Rmd @@ -6,7 +6,7 @@ We use git and GitHub to manage our packages (i.e. datamgmt, arcticdatautils) and ## What is Git? -
Git is a distributed version control system. +Git is a distributed version control system. The NCEAS GitHub repository contains many processing scripts and templates that can help when working with datasets. Important! If you have never used Git before, or only used it a little, or have no idea what it is, check out this intro to Git put together by the ecodatascience group at UCSB. Don't worry too much about the forking and branching sections, as we will primarily be using the basic commit-pull-push commands. After you have read through that presentation, come back to this chapter. @@ -89,7 +89,7 @@ If you are prompted to save your workspace during this process, make sure all of ### Adding a new script -If you have been working on a script that you want to put in the arctic-data GitHub repo, you first need to save it somewhere in the arctic-data folder you cloned to your account on the Datateam server. You can do this by either moving your script into the folder or using the save-as functionality. Note that Git will try and version anything that you save in this folder, so you should be careful about what you save here. For our purposes, things that probably shouldn't be saved in this folder include: +If you have been working on a script that you want to put in the arctic-data GitHub repo, you first need to save it somewhere in the arctic-data folder you cloned to your account on the Datateam server (/home/username/...). You can do this by either moving your script into the folder or using the save-as functionality. Note that Git will try and version anything that you save in this folder, so you should be careful about what you save here. For our purposes, things that probably shouldn't be saved in this folder include: - **Tokens**: Any token file or script with a token in it should NOT be saved in the repository. Others could steal your login credentials if you put a token in GitHub. - **Data files**: Git does not version data files very well. You shouldn't save any .csv files or any other data files (including metadata). diff --git a/workflows/edit_data_packages/01_datapack_background.Rmd b/workflows/edit_data_packages/01_datapack_background.Rmd index ccbc42d1..025b8a88 100644 --- a/workflows/edit_data_packages/01_datapack_background.Rmd +++ b/workflows/edit_data_packages/01_datapack_background.Rmd @@ -39,10 +39,17 @@ dp <- dataone::getDataPackage(d1c, "resource_map_urn:uuid:1f9eee7e-2d03-43c4-ad7 ``` ### Data Objects + +You can see what slots are in an S4 object after typing the slot operator `@`, or by pressing TAB with the cursor after an existing `@`. Try viewing the slots of the data package. +```{r, eval=F} +dp@ +``` + Check out the `objects` slot ```{r, eval=F} dp@objects ``` +The `objects` slot contains a list of the package's `DataObject`s, named by their PIDs, which can be accessed using the `$` subsetting operator. Both operators come up constantly when navigating the structure of data packages in R. 
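For example, a quick way to see which members a package contains is to look at the names of that list (a small sketch; the PIDs shown will differ for your package):

```{r, eval = FALSE}
# Each element of dp@objects is a DataObject, named by its PID
names(dp@objects)
```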
Get the number of data and metadata files associated with this data package: ```{r, eval=F} @@ -63,7 +70,7 @@ metadataId <- selectMember(dp, name="sysmeta@ADD THE NAME OF THE SLOT", value="P **Example:** ```{r, eval = F} -selectMember(dp, name="sysmeta@formatId", value="image/tiff") +selectMember(dp, name="sysmeta@formatId", value="image/tiff") selectMember(dp, name="sysmeta@fileName", value="filename.csv") ``` diff --git a/workflows/edit_data_packages/02_create_package_data_pack.Rmd b/workflows/edit_data_packages/02_create_package_data_pack.Rmd index f627dbca..acf93fc0 100644 --- a/workflows/edit_data_packages/02_create_package_data_pack.Rmd +++ b/workflows/edit_data_packages/02_create_package_data_pack.Rmd @@ -25,13 +25,13 @@ This is a bit of an unusual way to reference a local file path, but all this doe emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone") ``` -Create a new `DataObject` and add it to the package. +Create a new `DataObject` for the metadata and add it to the package. ```{r, eval = F} metadataObj <- new("DataObject", format="https://eml.ecoinformatics.org/eml-2.2.0", filename=emlFile) dp <- addMember(dp, metadataObj) ``` -Check the dp object to see if the `DataObject` was added correctly. +Check the dp object to see if the metadata was added correctly. ```{r, eval = F} dp ``` @@ -40,7 +40,7 @@ dp ```{r, eval = F} sourceData <- system.file("extdata/OwlNightj.csv", package="dataone") sourceObj <- new("DataObject", format="text/csv", filename=sourceData) -dp <- addMember(dp, sourceObj, metadataObj) +dp <- addMember(dp, sourceObj, metadataObj) # The third argument of addMember() associates the new DataObject with the metadata that was just added. ``` diff --git a/workflows/edit_data_packages/edit_sysmeta.Rmd b/workflows/edit_data_packages/edit_sysmeta.Rmd index 16f96463..18773071 100644 --- a/workflows/edit_data_packages/edit_sysmeta.Rmd +++ b/workflows/edit_data_packages/edit_sysmeta.Rmd @@ -3,7 +3,7 @@ To edit the sysmeta of an object (data file, EML, or resource map, etc.) with a `PID`, first load the sysmeta into R using the following command: ```{r, eval = FALSE} -sysmeta <- getSystemMetadata(mn, pid) +sysmeta <- getSystemMetadata(d1c_test@mn, pid) ``` Then edit the sysmeta slots by using `@` functionality. For example, to change the `fileName` use the following command: diff --git a/workflows/edit_data_packages/update_a_package.Rmd b/workflows/edit_data_packages/update_a_package.Rmd index 65423302..fa2639c5 100644 --- a/workflows/edit_data_packages/update_a_package.Rmd +++ b/workflows/edit_data_packages/update_a_package.Rmd @@ -9,14 +9,14 @@ Make sure you have the package you want to update loaded into R using `dataone:: Now we can update your data package to include the new data object. Assuming you have updated your data package earlier, something like the below: ```{r, eval = FALSE} d1c_test <- dataone::D1Client("STAGING", "urn:node:mnTestARCTIC") -packageId <- "the resource map" +packageId <- "resource_map_urn:uuid..." 
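# lazyLoad=TRUE downloads only the members' system metadata, not the data files themselves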
dp <- getDataPackage(d1c_test, identifier=packageId, lazyLoad=TRUE, quiet=FALSE) metadataId <- selectMember(dp, name="sysmeta@formatId", value="https://eml.ecoinformatics.org/eml-2.2.0") #some modification to the EML here -eml_path <- "path/to/your/saved/eml.xml" +eml_path <- arcticdatautils::title_to_file_name(doc$dataset$title) write_eml(doc, eml_path) dp <- replaceMember(dp, metadataId, replacement=eml_path) @@ -30,7 +30,7 @@ packageId <- uploadDataPackage(d1c_test, dp, public=FALSE, accessRules=myAccessR If a package is ready to be public, you can change the `public` argument in the `datapack::uploadDataPackage()` call to `TRUE`. -If you want to publish with a DOI (Digital Object Identifier) instead of a UUID (Universally Unique Identifier), you need to do this when replacing the metadata. **This should only be done after the package is finalized and has been thoroughly reviewed!** +If you want to publish with a DOI (Digital Object Identifier) instead of a UUID (Universally Unique Identifier), you need to do this when replacing the metadata using the optional `newId` argument in `replaceMember()`. **This should only be done after the package is finalized and has been thoroughly reviewed!** ```{r, eval = FALSE} doi <- dataone::generateIdentifier(d1c_test@mn, "DOI") dp <- replaceMember(dp, metadataId, replacement=eml_path, newId=doi) @@ -40,7 +40,7 @@ newPackageId <- uploadDataPackage(d1c_test, dp, public=TRUE, quiet=FALSE) If there is a pre-issued DOI (researcher requested the DOI for the publication first), please do the following: ```{r, eval = FALSE} -dp <- replaceMember(dp, metadataId, replacement=eml_path, newId="your pre-issued doi previously generated") +dp <- replaceMember(dp, metadataId, replacement=eml_path, newId="doi:10.../...") newPackageId <- uploadDataPackage(d1c_test, dp, public=TRUE, quiet=FALSE) ``` diff --git a/workflows/edit_data_packages/update_an_object.Rmd b/workflows/edit_data_packages/update_an_object.Rmd index 393a0787..739997e6 100644 --- a/workflows/edit_data_packages/update_an_object.Rmd +++ b/workflows/edit_data_packages/update_an_object.Rmd @@ -1,4 +1,4 @@ -## Update a data object +## Update an object To update a data file associated with a data package, you need to do three things: @@ -8,12 +8,14 @@ To update a data file associated with a data package, you need to do three thing The `datapack::replaceMember` function takes care of the first two of these tasks. First you need to get the pid of the file you want to replace by using `datapack::selectMember`. ```{r, eval = F} -metadataId <- selectMember(dp, name="sysmeta@formatId", value="https://eml.ecoinformatics.org/eml-2.2.0") +# Select the PID of the data file to replace +dataId <- selectMember(dp, name="sysmeta@formatId", value="text/csv") ``` Then use `replaceMember`: ```{r, eval = F} +# Replace the old file with the new one +dp <- replaceMember(dp, dataId, replacement=file_path) ``` diff --git a/workflows/edit_eml/edit_an_eml_element.Rmd b/workflows/edit_eml/edit_an_eml_element.Rmd index 067be927..380b8230 100644 --- a/workflows/edit_eml/edit_an_eml_element.Rmd +++ b/workflows/edit_eml/edit_an_eml_element.Rmd @@ -19,7 +19,7 @@ doc$dataset$title <- list("New Title", "Second New Title") ``` -However, this isn't always the best method to edit the EML, particularly if the element has sub-elements. +However, this isn't always the best method to edit the EML, particularly if the element has sub-elements. 
Assigning directly to `doc` without a helper function can overwrite parts of the document that we need. ### Edit EML with the "EML" package diff --git a/workflows/edit_eml/edit_attributelists.Rmd b/workflows/edit_eml/edit_attributelists.Rmd index 590c79d5..16a43150 100644 --- a/workflows/edit_eml/edit_attributelists.Rmd +++ b/workflows/edit_eml/edit_attributelists.Rmd @@ -1,28 +1,33 @@ ## Edit attributeLists -Attributes are descriptions of variables, typically columns or column names in tabular data. Attributes are stored in an attributeList. When editing attributes in R, you need to create one to three objects: +Attributes are descriptions of variables, typically columns or column names in tabular data. Attributes are stored in an attributeList. When editing attributes in R, we convert the attribute list information to data frame (table) format so that it is easier to edit. When editing attributes you will need to create one to three data frame objects: 1. A data.frame of attributes 2. A data.frame of custom units (if applicable) +3. A data.frame of factors (if applicable) -The `attributeList` is an element within one of 4 different types of entity objects. An entity corresponds to a file, typically. Multiple entities (files) can exist within a dataset. The 4 different entity types are `dataTable` (most common for us), `spatialVector`, `spatialRaster`, and `otherEntity` +The `attributeList` is an element within one of 4 different types of entity objects. An entity corresponds to a file, typically. Multiple entities (files) can exist within a dataset. The 4 different entity types are `dataTable` (most common for us), `spatialVector`, `spatialRaster`, and `otherEntity`. -Please note that submitting attribute information through the website will store them in an `otherEntity` object by default. We prefer to store them in a `dataTable` object for tabular data or a `spatialVector` object for spatial data. +Please note that submitting attribute information through the website will store them in an `otherEntity` object by default. We prefer to store them in a `dataTable` object for tabular data or a `spatialVector` or `spatialRaster` object for spatial data. -To edit or examine an existing attribute table already in an EML file, you can use the following commands, where `i` represents the index of the series element you are interested in. Note that if there is only one item in the series (ie there is only one `dataTable`), you should just call `doc$dataset$dataTable`, as in this case `doc$dataset$dataTable[[1]]` will return the first sub-element of the `dataTable` (the `entityName`) +To edit or examine an existing attribute list already in an EML file, you can use the following commands, where `i` represents the index of the series element you are interested in. 
Note that if there is only one item in the series (i.e. there is only one `dataTable`), you should just call `doc$dataset$dataTable`, as in this case `doc$dataset$dataTable[[1]]` will return the first sub-element of the `dataTable` (the `entityName`). ```{r, eval = FALSE} # If they are stored in an otherEntity (submitted from the website by default) -attributeList <- EML::get_attributes(doc$dataset$otherEntity[[i]]$attributeList) +attribute_tables <- EML::get_attributes(doc$dataset$otherEntity[[i]]$attributeList) # Or if they are stored in a dataTable (usually created by a datateam member) -attributeList <- EML::get_attributes(doc$dataset$dataTable[[i]]$attributeList) -# Or if they are stored in a spatialVector (usually created by a datateam member) -attributeList <- EML::get_attributes(doc$dataset$spatialVector[[i]]$attributeList) +attribute_tables <- EML::get_attributes(doc$dataset$dataTable[[i]]$attributeList) ``` -attributes <- attributeList$attributes -print(attributes) +The `get_attributes()` function returns the `attribute_tables` object, which is a list of the three data frames mentioned above. The data frame with the attributes is called `attribute_tables$attributes`. +```{r, eval = FALSE} +# View the attributes data frame +print(attribute_tables$attributes) ``` + + + ### Edit attributes Attribute information should be stored in a `data.frame` with the following columns: + *dateTimeDomain*: `dateTime` attributes + *numericDomain*: attributes that are numbers (either `ratio` or `interval`) * **formatString**: Required for `dateTime`, NA otherwise. Format string for dates, e.g. "DD/MM/YYYY". -* **definition**: Required for `textDomain`, NA otherwise. Definition for attributes that are a character string, matches attribute definition in most cases. +* **definition**: Required for `textDomain`, NA otherwise. Defines a format for attributes that are a character string, e.g. "Any text" or "7-digit alphanumeric code". * **unit**: Required for `numericDomain`, NA otherwise. Unit string. If the unit is not a standard unit, a warning will appear when you create the attribute list, saying that it has been forced into a custom unit. Use caution here to make sure the unit really needs to be a custom unit. A list of standard units can be found using: `standardUnits <- EML::get_unitList()` then running `View(standardUnits$units)`. * **numberType**: Required for `numericDomain`, NA otherwise. Options are `real`, `natural`, `whole`, and `integer`. + *real*: positive and negative fractions and integers (...-1,-0.25,0,0.25,1...) @@ -62,14 +67,16 @@ attributes <- data.frame( measurementScale = c('dateTime', 'nominal','nominal', 'nominal', 'ratio', 'ratio', 'interval', 'nominal'), domain = c('dateTimeDomain', 'enumeratedDomain','enumeratedDomain', 'textDomain', 'numericDomain', 'numericDomain', 'numericDomain', 'textDomain'), formatString = c('MM-DD-YYYY', NA,NA,NA,NA,NA,NA,NA), - definition = c(NA,NA,NA,'Sample number', NA, NA, NA, 'comments about sampling process'), + definition = c(NA,NA,NA,'Six-digit code', NA, NA, NA, 'Any text'), unit = c(NA, NA, NA, NA,'milliliter', 'dimensionless', 'celsius', NA), numberType = c(NA, NA, NA,NA, 'real', 'real', 'real', NA), missingValueCode = c(NA, NA, NA,NA, NA, NA, NA, 'NA'), missingValueCodeExplanation = c(NA, NA, NA,NA, NA, NA, NA, 'no sampling comments')) ``` -However, typing this out in R can be a major pain. 
Luckily, there's a Shiny app that you can use to build attribute information. You can use the app to build attributes from a data file loaded into R (recommended as the app will auto-fill some fields for you) to edit an existing attribute table, or to create attributes from scratch. Use the following commands to create or modify attributes (these commands will launch a Shiny app in your web browser): +However, typing this out in R can be a major pain. Luckily, there's an app that you can use to build attribute information. You can use the app to build attributes from a data file loaded into R (recommended, as the app will auto-fill some fields for you), to edit an existing attribute table, or to create attributes from scratch. + +Use the following commands to create or modify attributes. These commands will launch a "Shiny" app in your web browser. You must select "Quit App" in order to save your changes, and R will not run code while the app is open. ```{r, eval = FALSE} #first download the CSV in your data package from Exercise #2 data <- read.csv(text=rawToChar(getObject(d1c_test@mn, data_pid))) ``` ```{r, eval = FALSE} # From data (recommended) -EML::shiny_attributes(data = data) - -# From an existing attribute table -attributeList <- get_attributes(doc$dataset$dataTable[[i]]$attributeList) -EML::shiny_attributes(data = NULL, attributes = attributeList$attributes) +attribute_tables <- EML::shiny_attributes(data = data) # From scratch -atts <- EML::shiny_attributes() +attribute_tables <- EML::shiny_attributes() + +# From an existing attribute list +attribute_tables <- get_attributes(doc$dataset$dataTable[[i]]$attributeList) +attribute_tables <- EML::shiny_attributes(attributes = attribute_tables$attributes) ``` -Once you are done editing a table in the app, quit the app and the tables will be assigned to the `atts` variable as a list of data frames (one for attributes, factors, and units). Alternatively, each table can be to exported to a csv file by clicking the `Download` button. +Once you are done editing a table in the app, quit the app and the tables will be assigned to the `attribute_tables` variable as a list of data frames (one for attributes, factors, and units). Be careful not to overwrite your completed `attribute_tables` object when trying to make edits. The last line of code above can be used to make edits to an existing `attribute_tables` object. -If you downloaded the table, read the table back into your R session and assign it to a variable in your script (e.g. `attributes <- data.frame(...)`), or just use the variable that `shiny_attributes` returned. +Alternatively, each table can be exported to a csv file by clicking the `Download` button. If you downloaded the table, read the table back into your R session and assign it to a variable in your script (e.g. `attributes <- data.frame(...)`), or just use the variable that `shiny_attributes` returned. For simple attribute corrections, `datamgmt::edit_attribute()` allows you to edit the slots of a single attribute within an attribute list. To use this function, pass an attribute through `datamgmt::edit_attribute()` and fill out the parameters you wish to edit/update. An example is provided below where we are changing `attributeName`, `domain`, and `measurementScale` in the first attribute of a dataset. After completing the edits, insert the new version of the attribute back into the EML document. 
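A sketch of that workflow might look like the following (the attribute values shown are hypothetical; the argument names follow the parameters described above):

```{r, eval = FALSE}
# Edit selected slots of the first attribute (example values are hypothetical)
new_attribute <- datamgmt::edit_attribute(
  doc$dataset$dataTable[[1]]$attributeList$attribute[[1]],
  attributeName = "site_id",
  domain = "textDomain",
  measurementScale = "nominal")

# Insert the edited attribute back into the EML document
doc$dataset$dataTable[[1]]$attributeList$attribute[[1]] <- new_attribute
```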
@@ -113,11 +120,13 @@ View(standardUnits$units) Search the units list for your unit before attempting to create a custom unit. You can look up part of the unit, i.e. `meters`, in the table to see if there are any matches. -If you have units that are not in the standard EML unit list, you will need to build a custom unit list. A unit typically consists of the following fields: +If you have units that are not in the standard EML unit list, you will need to build a custom unit list. Attribute tables with custom units listed will generate a warning indicating that a custom unit will need to be described. + +A unit typically consists of the following fields: * **id**: The `unit id` (ids are camelCased) * **unitType**: The `unitType` (run `View(standardUnits$unitTypes)` to see standard `unitType`s) -* **parentSI**: The `parentSI` unit (e.g. for kilometer `parentSI` = "meter") +* **parentSI**: The `parentSI` unit (e.g. for kilometer `parentSI` = "meter"). The parentSI does not need to be part of the unitList. * **multiplierToSI**: Multiplier to the `parentSI` unit (e.g. for kilometer `multiplierToSI` = 1000) * **name**: Unit abbreviation (e.g. for kilometer `name` = "km") * **description**: Text defining the unit (e.g. for kilometer `description` = "1000 meters") @@ -135,6 +144,13 @@ custom_units <- data.frame( description = c('siemens per meter', 'decibar')) ``` +Custom units can also be created in the shiny app, under the "units" tab. They cannot be edited again in the shiny app once created. +```{r, eval=FALSE} +attribute_tables <- EML::shiny_attributes() + +custom_units <- attribute_tables$units +``` + Using `EML::get_unit_id` for custom units will also generate valid EML unit ids. Custom units are then added to `additionalMetadata` using the following command: @@ -144,6 +160,7 @@ unitlist <- set_unitList(custom_units, as_metadata = TRUE) doc$additionalMetadata <- list(metadata = list(unitList = unitlist)) ``` + ### Edit factors For attributes that are `enumeratedDomains`, a table is needed with three columns: `attributeName`, `code`, and `definition`. @@ -166,13 +183,27 @@ factors <- rbind(data.frame(attributeName = 'Location', code = names(Location), data.frame(attributeName = 'Region', code = names(Region), definition = unname(Region))) +Factors can also be created in the shiny app, under the "factors" tab. They cannot be edited again in the shiny app once created. +```{r, eval=FALSE} +attribute_tables <- EML::shiny_attributes() + +attribute_tables$factors +``` + + ### Finalize attributeList -Once you have built your attributes, factors, and custom units, you can add them to EML objects. Attributes and factors are combined to form an `attributeList` using the following command: +Once you have built your attributes, factors, and custom units, you can add them to EML objects. Attributes and factors are combined to form an `attributeList` using `set_attributes()`: + +```{r, eval = FALSE} +# Create an attributeList object +attributeList <- EML::set_attributes(attributes = attribute_tables$attributes, + factors = attribute_tables$factors) +``` +This `attributeList` object can then be checked for errors and [added to a `dataTable`](#edit-datatables) in the EML document. ```{r, eval = FALSE} -attributeList <- EML::set_attributes(attributes = attributes, - factors = factors) +# Edit EML document with object +doc$dataset$dataTable[[i]]$attributeList <- attributeList ``` -This `attributeList` must then be [added to a `dataTable`](#edit-datatables). 
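A quick way to check the result for errors is to validate the document right after the assignment:

```{r, eval = FALSE}
# Any problems with the new attributeList will show up in the validation errors
eml_validate(doc)
```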
diff --git a/workflows/edit_eml/edit_datatables.Rmd b/workflows/edit_eml/edit_datatables.Rmd index 8f99c94b..e0940f60 100644 --- a/workflows/edit_eml/edit_datatables.Rmd +++ b/workflows/edit_eml/edit_datatables.Rmd @@ -1,7 +1,7 @@ ## Edit dataTables -To edit a `dataTable`, first [edit/create an `attributeList`](#edit-attributelists) and [set the physical](#set-physical). -Then create a new `dataTable` using the `eml$dataTable()` helper function as below: + +Entities that are `dataTables` require an attribute list. To edit a `dataTable`, first [edit/create an `attributeList`](#edit-attributelists) and [set the physical](#set-physical). Then create a new `dataTable` using the `eml$dataTable()` helper function as below: ```{r, eval = FALSE} dataTable <- eml$dataTable(entityName = "A descriptive name for the data (does not need to be the same as the data file)", @@ -50,9 +50,11 @@ After getting a list of `dataTables`, assign the resulting list to `dataTable` E doc$dataset$dataTable <- dts ``` -By default, the online submission form adds all entities as `otherEntity`, even when most should probably be `dataTable`. You can use `eml_otherEntity_to_dataTable` to easily move items in `otherEntity` over to `dataTable`. Most tabular data or data that contain variables should be listed as a `dataTable`. Data that do not contain variables (eg: plain text readme files, pdfs, jpegs) should be listed as `otherEntity`. +By default, the online submission form adds all entities as `otherEntity`, even when most should probably be `dataTable`. You can use `eml_otherEntity_to_dataTable` to easily move items in `otherEntity` over to `dataTable`, and delete the old `otherEntity`. + +Most tabular data or data that contain variables should be listed as a `dataTable`. Data that do not contain variables (eg: plain text readme files, pdfs, jpegs) should be listed as `otherEntity`. ```{r, eval = FALSE} eml_otherEntity_to_dataTable(doc, - 1, # which otherEntities you want to convert, for multiple use - 1:5 + 1, # Indexes of otherEntities you want to convert; for multiple, use 1:5 or c(1,3,5) validate_eml = F) # set this to FALSE if the physical or attributes are not added ``` diff --git a/workflows/edit_eml/edit_semantic_annotation.Rmd b/workflows/edit_eml/edit_semantic_annotation.Rmd index 63d14f58..c0dbffcd 100644 --- a/workflows/edit_eml/edit_semantic_annotation.Rmd +++ b/workflows/edit_eml/edit_semantic_annotation.Rmd @@ -3,7 +3,7 @@ For a brief overview of what a semantic annotation is, and why we use them check out [this video.](https://drive.google.com/file/d/1Err-fL8O21kd1NzHJ9HJkK_B2fpt4sQW/view?usp=sharing) Even more information on how to add semantic annotations to EML 2.2.0 can be found - here. Currently metacatUI does not support the editing of semantic annotations on the website so all changes will have to be done in R. + here. There are several elements in the EML 2.2.0 schema that can be annotated: * dataset * entity (eg: `otherEntity` or `dataTable`) * `attribute` -On the datateam, we will only be adding annotations to attributes for now. +Attribute annotations can be edited in R and also on the website. Dataset and entity annotations are only done in R. ### How annotations are used On the website you can see annotations in each of the attributes. You can click on any one of them to search for more datasets with that same annotation. 
![](../images/annotations_web_use.png) +```{block, type = "note"} +Semantic attribute annotations can be applied to spatialRasters, spatialVectors and dataTables. +``` + -#### Attribute-level annotations +#### Attribute-level annotations on the website editor -To add annotations to the `attributeList` you will need information about the `propertyURI` and `valueURI` +The website has a searchable list of attribute annotations that are grouped by category and specificity. Open your dataset on the test website from earlier and enter the attribute editor. Look through all of the available annotations. + +Adding attribute annotations using the website is the easiest way; however, adding them using R and/or the Shiny app may be quicker for very large datasets. + +#### Attribute-level annotations in R + +To manually add annotations to the `attributeList` in R you will need information about the `propertyURI` and `valueURI`. Annotations are essentially composed of a sentence, which contains a subject (the attribute), predicate (`propertyURI`), and object (`valueURI`). Because of the way our search interface is built, for now we will be using attribute annotations that have a `propertyURI` label of "contains measurements of type". @@ -55,9 +65,6 @@ $valueURI$valueURI [1] "http://purl.dataone.org/odo/ECSO_00002617" -```{block, type = "note"} -Semantic attribute annotations can be applied to spatialRasters, spatialVectors and dataTables -``` ### How to add an annotation @@ -80,12 +87,12 @@ There are several ontologies to search in. In order of most to least likely to b * [Information Artifact Ontology (IAO)](http://bioportal.bioontology.org/ontologies/IAO/?p=summary) - this ontology contains terms related to information entities (eg: journals, articles, datasets, identifiers) To search, navigate through the "classes" until you find an appropriate term. When we are picking terms, it is important that we not just pick a similar term or a term that seems close - we want a term that is 100% accurate. For example, if you have an attribute for carbon tetroxide flux and an ontology with a class hierarchy like this: -- carbon flux |---- carbon dioxide flux -Our exact attribute, carbon tetroxide flux is not listed. In this case, we should pick "carbon flux" as it's completely correct and not "carbon dioxide flux" because it's more specific but not quite right. +Our exact attribute, carbon tetroxide flux, is not listed. In this case, we should pick "carbon flux" as it's completely correct/accurate and not "carbon dioxide flux" because it's more specific/precise but not quite right. ```{block, type = "note"} For general attributes (such as ones named depth or length), it is important to be as specific as possible about what is being measured. ``` @@ -140,11 +147,15 @@ doc$dataset$dataTable[[3]]$attributeList$attribute[[6]]$annotation$valueURI <- l On the far right of the table of `shiny_attributes` there are 5 columns: `id`, `propertyURI`, `propertyLabel`, `valueURI`, `valueLabel` that can be filled out. -### Annotating sensitive data +### Dataset Annotations + +Dataset annotations can only be made using R. 
There are several helper functions that assist with making dataset annotations. + +#### Data Sensitivity Sensitive datasets that might cover protected characteristics (human subjects data, endangered species locations, etc.) should be annotated using the data sensitivity ontology: https://bioportal.bioontology.org/ontologies/SENSO/?p=classes&conceptid=root. -#### Dataset Annotations +#### Dataset Discipline As a final step in the data processing pipeline, we will categorize the dataset. We are trying to categorize datasets so we can have a general idea of what kinds of data we have at the Arctic Data Center. Datasets will be categorized using the [Academic Ontology](https://bioportal.bio Be sure to ask your peers in the #datateam slack channel whether they agree with the themes you think best fit your dataset. Once there is consensus, use the following line of code: ```{r, eval = F} -doc <- datamgmt::eml_categorize_dataset(doc, c("list", "of", "themes")) +doc <- arcticdatautils::eml_categorize_dataset(doc, c("Soil Science", "Plant Science", "Ecology")) +``` + +Be careful not to duplicate dataset annotations. The above code does not check for existing dataset annotations, so running it more than once will create duplicates. Duplicate annotations can be removed by setting them to `NULL`. +```{r, eval = F} +doc$dataset$annotation[[i]] <- NULL ``` diff --git a/workflows/edit_eml/set_physical.Rmd b/workflows/edit_eml/set_physical.Rmd index fa72822c..6e15b92e 100644 --- a/workflows/edit_eml/set_physical.Rmd +++ b/workflows/edit_eml/set_physical.Rmd @@ -2,17 +2,28 @@ To set the `physical` aspects of a data object, use the following commands to build a `physical` object from a data `PID` that exists in your package. **Remember to set the member node to test.arcticdata.io!** +Every entity that we upload needs a physical description added. When replacing files, the physical must be replaced as well. + ```{block, type = "note"} -The word ‘physical’ derives from database systems, which distinguish the ‘logical’ model (e.g., what attributes are in a table, etc) from the physical model (how the data are written to a physical hard disk (basically, the serialization). so, we grouped metadata about the file (eg. dataformat, file size, file name) as written to disk in physical. +The word ‘physical’ derives from database systems, which distinguish the ‘logical’ model (e.g., what attributes are in a table, etc) from the physical model (how the data are written to a physical hard disk (basically, the serialization)). So, we grouped metadata about the file (eg. data format, file size, file name) as written to disk in physical. For CSV files, the physical describes the number of header lines and the attribute orientation. ``` ```{r, eval = FALSE} +# Get the PID of a file data_pid <- selectMember(dp, name = "sysmeta@fileName", value = "your_file_name.csv") +# Get the physical info and store it in an object physical <- arcticdatautils::pid_to_eml_physical(mn, data_pid) ``` -The `physical` must then be assigned to the data object. +The `physical` object can then be checked for errors and added to the EML document. ```{r, eval=FALSE} +# Edit EML document with object +doc$dataset$dataTable[[i]]$physical <- physical +``` Note that the above workflow only works if your data object already exists on the member node. Physicals can be seen in the website representation of the EML below the entity description. 
![](../images/physical_screenshot.png) + diff --git a/workflows/explore_eml/navigate_through_eml.Rmd b/workflows/explore_eml/navigate_through_eml.Rmd index 999530d2..071fcf2a 100644 --- a/workflows/explore_eml/navigate_through_eml.Rmd +++ b/workflows/explore_eml/navigate_through_eml.Rmd @@ -27,6 +27,11 @@ Just like you navigate in a `data.frame`, you can use the `$` operator to naviga ![](../images/rstudio_autocomplete.png) -Note that if you hit tab, and nothing pops up, this most likely implies that you are trying to go into an EML element that can take a series items. For example ```doc$dataset$creator$``` will not show a pop-up menu. This is because `creator` is a series-type object (i.e. you can have multiple `creator`s). If you want to go deeper into `creator`, you first must tell R which `creator` you are interested in. Do this by writing `[[i]]` first where `i` is the index of the `creator` you are concerned with. For example, if you want to look at the first `creator` i = 1. Now ```doc$dataset$creator[[1]]$``` will give you many more options. Note, an empty autocomplete result sometimes means you have reached the end of a branch in the EML structure. +Note that if you hit tab, and nothing pops up, this most likely implies that you are trying to go into an EML element that can take a series of items. For example ```doc$dataset$creator$``` will not show a pop-up menu. This is because `creator` is a **series-type object** (i.e. you can have multiple `creator`s). If you want to go deeper into `creator`, you first must tell R which `creator` you are interested in. Do this by writing `[[i]]` first, where `i` is the index of the `creator` you are concerned with. For example, if you want to look at the first `creator`, i = 1. Now ```doc$dataset$creator[[1]]$``` will give you many more options. Note, an empty autocomplete result sometimes means you have reached the end of a branch in the EML structure. + + +Below is the structure of `doc$dataset`. It contains a series of multiple `creator`s, which can be accessed individually by index: `doc$dataset$creator[[#]]`. +![](images/XML_series_pbject.png) + At this point stop and take a deep breath. The key takeaway is that EML is a hierarchical tree structure. The best way to get familiar with it is to explore the structure. Try entering `doc$dataset` into your console, and print it. Now make the search more specific, for instance: `doc$dataset$abstract`. diff --git a/workflows/explore_eml/understand_eml_schema.Rmd b/workflows/explore_eml/understand_eml_schema.Rmd index 785385c1..49dc3153 100644 --- a/workflows/explore_eml/understand_eml_schema.Rmd +++ b/workflows/explore_eml/understand_eml_schema.Rmd @@ -1,6 +1,6 @@ ## Understand the EML schema -Another great resource for navigating the EML structure is looking at the schema which defines the structure. The schema diagrams on this page are interactive. Further explanations of the symbology can be found here. The schema is complicated and may take some time to get familiar with before you will be able to fully understand it. +Another great resource for navigating the EML structure is looking at the schema which defines the structure. The schema diagrams on this page are interactive. Further explanations of the symbology can be found here. The schema is complicated and may take some time to get familiar with before you will be able to fully understand it. Use your browser's "search in page" function (usually CTRL-F or Command-F) to navigate the EML schema page quickly. 
For example, let's take a look at eml-party. To start off, notice that some elements have bolded lines leading to them.