From ce253e935a8a7ebb0dbc38787301419b9569dd4c Mon Sep 17 00:00:00 2001 From: gbganalyst Date: Thu, 22 Feb 2024 17:49:25 +0100 Subject: [PATCH] worked on package website --- NEWS.md | 32 ++- README.Rmd | 301 +---------------------- README.md | 502 +-------------------------------------- _pkgdown.yml | 62 ++++- man/bulkreadr-package.Rd | 1 + vignettes/bulkreadr.Rmd | 25 +- 6 files changed, 87 insertions(+), 836 deletions(-) diff --git a/NEWS.md b/NEWS.md index 994de90..b9651ec 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,45 +1,43 @@ -# What is New in *bulkreadr*? - -## bulkreadr 1.1.0 (2023-11-13) +# bulkreadr 1.1.0 (2023-11-13) This update includes the following new features: -- `generate_dictionary()`: This function is designed to automatically create a comprehensive data dictionary from labelled datasets. The generated dictionary provides detailed insights into each variable, aiding in better data understanding and management. +* `generate_dictionary()`: This function is designed to automatically create a comprehensive data dictionary from labelled datasets. The generated dictionary provides detailed insights into each variable, aiding in better data understanding and management. -- `look_for()`: This enhances the capability to efficiently search within labelled datasets. It allows users to quickly find variable names and their descriptions by searching for specific keywords. This feature streamlines data exploration and analysis, particularly in large datasets with extensive variables. +* `look_for()`: This enhances the capability to efficiently search within labelled datasets. It allows users to quickly find variable names and their descriptions by searching for specific keywords. This feature streamlines data exploration and analysis, particularly in large datasets with extensive variables. These enhancements aim to improve the user experience in data management and exploration within `bulkreadr`. 
We hope these new features will assist our users in more effectively navigating and understanding their labelled datasets. -## bulkreadr 1.0.0 (2023-09-20) +# bulkreadr 1.0.0 (2023-09-20) This update includes the following new features and improvements: -- Developed `read_stata_data()` to import Stata data file (`.dta`) into an R data frame, converting labeled variables into factors. +* Developed `read_stata_data()` to import Stata data file (`.dta`) into an R data frame, converting labeled variables into factors. -- Reduced dependency packages to optimize efficiency. +* Reduced dependency packages to optimize efficiency. -## 0.2.0 (2023-09-11) +# 0.2.0 (2023-09-11) This update includes the following new features and improvements: -- Developed bulkreadr vignette +* Developed bulkreadr vignette -- Developed `read_spss_data()` to seamlessly import data from an SPSS data (`.sav` or `.zsav`) files and converting labelled variables into factors, a crucial step that enhances the ease of data manipulation and analysis within the R programming environment. +* Developed `read_spss_data()` to seamlessly import data from an SPSS data (`.sav` or `.zsav`) files and converting labelled variables into factors, a crucial step that enhances the ease of data manipulation and analysis within the R programming environment. -- Added more unit tests +* Added more unit tests -## 0.1.0 (2023-07-24) +# 0.1.0 (2023-07-24) This update includes the following new features and improvements: -- Improved error handling by adding meaningful error messages for all functions within `bulkreadr` package. This will make it easier for users to identify and troubleshoot issues that may arise during their use of the package. +* Improved error handling by adding meaningful error messages for all functions within `bulkreadr` package. This will make it easier for users to identify and troubleshoot issues that may arise during their use of the package. -- Added package-level documentation. 
The user can now use `?bulkreadr::bulkreadr` for basic package-level documentation. +* Added package-level documentation. The user can now use `?bulkreadr::bulkreadr` for basic package-level documentation. -- Added `inspect_na()` to summarize missingness in data frame columns and `fill_missing_values()` to impute missing values in a dataframe. +* Added `inspect_na()` to summarize missingness in data frame columns and `fill_missing_values()` to impute missing values in a dataframe. -## 0.0.0.9 (2023-07-03) +# 0.0.0.9 (2023-07-03) The development version of bulkreadr is now on Githhub. diff --git a/README.Rmd b/README.Rmd index 7b36c3b..0110e21 100644 --- a/README.Rmd +++ b/README.Rmd @@ -25,6 +25,7 @@ options(tibble.print_min = 5, tibble.print_max = 5) [![R-CMD-check](https://github.com/gbganalyst/bulkreadr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/gbganalyst/bulkreadr/actions/workflows/R-CMD-check.yaml) [![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/bulkreadr)](https://cran.r-project.org/package=bulkreadr) +[![metacran downloads](https://cranlogs.r-pkg.org/badges/bulkreadr)](https://cran.r-project.org/package=bulkreadr) [![metacran downloads](https://cranlogs.r-pkg.org/badges/grand-total/bulkreadr)](https://cran.r-project.org/package=bulkreadr) [![Codecov test coverage](https://codecov.io/gh/gbganalyst/bulkreadr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/gbganalyst/bulkreadr?branch=main) @@ -75,306 +76,6 @@ library(dplyr) ``` -## Functions in bulkreadr package - -This section provides a concise overview of the different functions available in the `bulkreadr` package. These functions serve various purposes and are designed to handle importing of data in bulk. 
- -| Functions to Import Data | Other Functions | -|--------------------------------------|-----------------------------------------| -| [`read_excel_workbook()`](#read_excel_workbook) | [`generate_dictionary()`](#generate_dictionary) | -| [`read_excel_files_from_dir()`](#read_csv_files_from_dir) | [`look_for()`](#look_for) | -| [`read_csv_files_from_dir()`](#read_csv_files_from_dir) | [`pull_out()`](#pull_out) | -| [`read_gsheets()`](#read_gsheets) | [`convert_to_date()`](#convert_to_date) | -| [`read_spss_data()`](#read_spss_data) | [`inspect_na()`](#inspect_na) | -| [`read_stata_data()`](#read_stata_data) | [`fill_missing_values()`](#fill_missing_values) | - - -**Note:** - -> For the majority of functions within this package, we will utilize data stored in the system file by the `bulkreadr`, which can be accessed using the `system.file()` function. If you wish to utilize your own data stored in your local directory, please ensure that you have set the appropriate file path prior to using any functions provided by the bulkreadr package. - - -## `read_excel_workbook()` - -`read_excel_workbook()` reads all the data from the sheets of an Excel workbook and return an appended dataframe. - -```{r example1} - -# path to the xls/xlsx file. - -path <- system.file("extdata", "Diamonds.xlsx", package = "bulkreadr", mustWork = TRUE) - -# read the sheets - -read_excel_workbook(path = path) - -``` - -## `read_excel_files_from_dir()` - -`read_excel_files_from_dir()` reads all Excel workbooks in the `"~/data"` directory and returns an appended dataframe. - -```{r example1a} - -# path to the directory containing the xls/xlsx files. - -directory <- system.file("xlsxfolder", package = "bulkreadr") - -# import the workbooks - -read_excel_files_from_dir(dir_path = directory) - -``` - -## `read_csv_files_from_dir()` - -`read_csv_files_from_dir()` reads all csv files from the `"~/data"` directory and returns an appended dataframe. 
The resulting dataframe will be in the same order as the CSV files in the directory. - -```{r example2} -# path to the directory containing the CSV files. - -directory <- system.file("csvfolder", package = "bulkreadr") - -# import the csv files - -read_csv_files_from_dir(dir_path = directory) - -``` - -## `read_gsheets()` - -The `read_gsheets()` function imports data from multiple sheets in a Google Sheets spreadsheet and appends the resulting dataframes from each sheet together to create a single dataframe. This function is a powerful tool for data analysis, as it allows you to easily combine data from multiple sheets into a single dataset. - -```{r, include=FALSE} -googlesheets4::gs4_deauth() -``` - -```{r example3} - -# Google Sheet ID or the link to the sheet - -sheet_id <- "1izO0mHu3L9AMySQUXGDn9GPs1n-VwGFSEoAKGhqVQh0" - -# read all the sheets - -read_gsheets(ss = sheet_id) -``` - -## `read_spss_data()` - -`read_spss_data()` is designed to seamlessly import data from an SPSS data (`.sav` or `.zsav`) files. It converts labelled variables into factors, a crucial step that enhances the ease of data manipulation and analysis within the R programming environment. - -```{r spssdata1} - -# Read an SPSS data file without converting variable labels as column names - -file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr") - -data <- read_spss_data(file = file_path) - -data - -``` - -```{r spssdata2} - -# Read an SPSS data file and convert variable labels as column names - -data <- read_spss_data(file = file_path, label = TRUE) - -data - -``` - -## read_stata_data() - -`read_stata_data()` reads Stata data file (`.dta`) into an R data frame, converting labeled variables into factors. 
- -**Read the Stata data file without converting variable labels as column names** - -```{r statadata1} - -file_path <- system.file("extdata", "Wages.dta", package = "bulkreadr") - -data <- read_stata_data(file = file_path) - -data - -``` - -**Read the Stata data file and convert variable labels as column names** - -```{r statadata2} - -data <- read_stata_data(file = file_path, label = TRUE) - -data - -``` - - -## `generate_dictionary()` - -`generate_dictionary()` creates a data dictionary from a specified data frame. This function is particularly useful for understanding and documenting the structure of your dataset, similar to data dictionaries in Stata or SPSS. - -```{r} - -# Creating a data dictionary from an SPSS file - -file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr") - -wage_data <- read_spss_data(file = file_path) - -generate_dictionary(wage_data) -``` - - -## `look_for()` - -The `look_for()` function is designed to emulate the functionality of the Stata `lookfor` command in R. It provides a powerful tool for searching through large datasets, specifically targeting variable names, variable label descriptions, factor levels, and value labels. This function is handy for users working with extensive and complex datasets, enabling them to quickly and efficiently locate the variables of interest. - - -```{r} - -# Look for a single keyword. - -look_for(wage_data, "south") -``` - - -## `pull_out()` - -`pull_out()` is similar to `[`. It acts on vectors, matrices, arrays and lists to extract or replace parts. It is pleasant to use with the magrittr (`⁠%>%`⁠) and base(`|>`) operators. 
- -```{r example4} - -top_10_richest_nig <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze", "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor", "Jimoh Ibrahim", "Tony Elumelu") - -top_10_richest_nig %>% - pull_out(c(1, 5, 2)) -``` - -```{r} -top_10_richest_nig %>% - pull_out(-c(1, 5, 2)) -``` - - -## `convert_to_date()` - -`convert_to_date()` parses an input vector into POSIXct date-time object. It is also powerful to convert from excel date number like `42370` into date value like `2016-01-01`. - -```{r example 5} - -## ** heterogeneous dates ** - -dates <- c( - 44869, "22.09.2022", NA, "02/27/92", "01-19-2022", - "13-01- 2022", "2023", "2023-2", 41750.2, 41751.99, - "11 07 2023", "2023-4" - ) - -# Convert to POSIXct or Date object - -convert_to_date(dates) - -# It can also convert date time object to date object - -convert_to_date(lubridate::now()) - -``` - - -```{r example5} -# With dataframe - -file_path <- system.file("extdata", "OGD.xlsx", package = "bulkreadr") - -ogd_data <- read_excel_workbook(path = file_path) - - -ogd_data %>% head() - -# Convert to POSIXct or Date object - -modified_ogd_data <- ogd_data %>% - mutate(Date_format = convert_to_date(Date)) - -modified_ogd_data %>% head() - -``` - - -## `inspect_na()` - -`inspect_na()` summarizes the rate of missingness in each column of a data frame. For a grouped data frame, the rate of missingness is summarized separately for each group. - -```{r example 6a} - -# dataframe summary - -inspect_na(airquality) - -# grouped dataframe summary - -airquality %>% - group_by(Month) %>% - inspect_na() - -``` - -## `fill_missing_values()` - -`fill_missing_values()` in an efficient function that addresses missing values in a dataframe. It uses imputation by function, meaning it replaces missing data in numeric variables with either the mean or the median, and in non-numeric variables with the mode. 
The function takes a column-based imputation approach, ensuring that replacement values are derived from the respective columns, resulting in accurate and consistent data. This method enhances the integrity of the dataset and promotes sound decision-making and analysis in data processing workflows. - -```{r example 6} - -df <- tibble::tibble( - Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5), - Sepal.Width = c(4.1, 3.6, 3, 3, 2.9, 2.5, 2.4), - Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7), - Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA), - Species = c("setosa", NA, "versicolor", "setosa", - NA, "virginica", "setosa" - ) -) - -df - -# Using mean to fill missing values for numeric variables - -result_df_mean <- fill_missing_values(df, use_mean = TRUE) - -result_df_mean - -# Using median to fill missing values for numeric variables - -result_df_median <- fill_missing_values(df, use_mean = FALSE) - -result_df_median -``` - -### Impute missing values (NAs) in a grouped data frame - -You can use the `fill_missing_values()` in a grouped data frame by using other grouping and map functions. Here is an example of how to do this: - -```{r} -sample_iris <- tibble::tibble( -Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5), -Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7), -Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA), -Species = c("setosa", "setosa", "versicolor", "setosa", - "virginica", "virginica", "setosa") -) - -sample_iris - -sample_iris %>% - group_by(Species) %>% - group_split() %>% - map_df(fill_missing_values) -``` - ## Context bulkreadr draws on and complements / emulates other packages such as readxl, readr, and googlesheets4 to read bulk data in R. 
diff --git a/README.md b/README.md index 9c1584f..833e73d 100644 --- a/README.md +++ b/README.md @@ -10,6 +10,8 @@ [![R-CMD-check](https://github.com/gbganalyst/bulkreadr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/gbganalyst/bulkreadr/actions/workflows/R-CMD-check.yaml) [![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/bulkreadr)](https://cran.r-project.org/package=bulkreadr) [![metacran +downloads](https://cranlogs.r-pkg.org/badges/bulkreadr)](https://cran.r-project.org/package=bulkreadr) +[![metacran downloads](https://cranlogs.r-pkg.org/badges/grand-total/bulkreadr)](https://cran.r-project.org/package=bulkreadr) [![Codecov test coverage](https://codecov.io/gh/gbganalyst/bulkreadr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/gbganalyst/bulkreadr?branch=main) @@ -70,506 +72,6 @@ library(bulkreadr) library(dplyr) ``` -## Functions in bulkreadr package - -This section provides a concise overview of the different functions -available in the `bulkreadr` package. These functions serve various -purposes and are designed to handle importing of data in bulk. 
- -| Functions to Import Data | Other Functions | -|-----------------------------------------------------------|-------------------------------------------------| -| [`read_excel_workbook()`](#read_excel_workbook) | [`generate_dictionary()`](#generate_dictionary) | -| [`read_excel_files_from_dir()`](#read_csv_files_from_dir) | [`look_for()`](#look_for) | -| [`read_csv_files_from_dir()`](#read_csv_files_from_dir) | [`pull_out()`](#pull_out) | -| [`read_gsheets()`](#read_gsheets) | [`convert_to_date()`](#convert_to_date) | -| [`read_spss_data()`](#read_spss_data) | [`inspect_na()`](#inspect_na) | -| [`read_stata_data()`](#read_stata_data) | [`fill_missing_values()`](#fill_missing_values) | - -**Note:** - -> For the majority of functions within this package, we will utilize -> data stored in the system file by the `bulkreadr`, which can be -> accessed using the `system.file()` function. If you wish to utilize -> your own data stored in your local directory, please ensure that you -> have set the appropriate file path prior to using any functions -> provided by the bulkreadr package. - -## `read_excel_workbook()` - -`read_excel_workbook()` reads all the data from the sheets of an Excel -workbook and return an appended dataframe. - -``` r - -# path to the xls/xlsx file. - -path <- system.file("extdata", "Diamonds.xlsx", package = "bulkreadr", mustWork = TRUE) - -# read the sheets - -read_excel_workbook(path = path) -#> # A tibble: 260 × 9 -#> carat color clarity depth table price x y z -#> -#> 1 2 I SI1 65.9 60 13764 7.8 7.73 5.12 -#> 2 0.7 H SI1 65.2 58 2048 5.49 5.55 3.6 -#> 3 1.51 E SI1 58.4 70 11102 7.55 7.39 4.36 -#> 4 0.7 D SI2 65.5 57 1806 5.56 5.43 3.6 -#> 5 0.35 F VVS1 54.6 59 1011 4.85 4.79 2.63 -#> # ℹ 255 more rows -``` - -## `read_excel_files_from_dir()` - -`read_excel_files_from_dir()` reads all Excel workbooks in the -`"~/data"` directory and returns an appended dataframe. - -``` r - -# path to the directory containing the xls/xlsx files. 
- -directory <- system.file("xlsxfolder", package = "bulkreadr") - -# import the workbooks - -read_excel_files_from_dir(dir_path = directory) -#> # A tibble: 260 × 10 -#> cut carat color clarity depth table price x y z -#> -#> 1 Fair 2 I SI1 65.9 60 13764 7.8 7.73 5.12 -#> 2 Fair 0.7 H SI1 65.2 58 2048 5.49 5.55 3.6 -#> 3 Fair 1.51 E SI1 58.4 70 11102 7.55 7.39 4.36 -#> 4 Fair 0.7 D SI2 65.5 57 1806 5.56 5.43 3.6 -#> 5 Fair 0.35 F VVS1 54.6 59 1011 4.85 4.79 2.63 -#> # ℹ 255 more rows -``` - -## `read_csv_files_from_dir()` - -`read_csv_files_from_dir()` reads all csv files from the `"~/data"` -directory and returns an appended dataframe. The resulting dataframe -will be in the same order as the CSV files in the directory. - -``` r -# path to the directory containing the CSV files. - -directory <- system.file("csvfolder", package = "bulkreadr") - -# import the csv files - -read_csv_files_from_dir(dir_path = directory) -#> # A tibble: 260 × 10 -#> cut carat color clarity depth table price x y z -#> -#> 1 Fair 2 I SI1 65.9 60 13764 7.8 7.73 5.12 -#> 2 Fair 0.7 H SI1 65.2 58 2048 5.49 5.55 3.6 -#> 3 Fair 1.51 E SI1 58.4 70 11102 7.55 7.39 4.36 -#> 4 Fair 0.7 D SI2 65.5 57 1806 5.56 5.43 3.6 -#> 5 Fair 0.35 F VVS1 54.6 59 1011 4.85 4.79 2.63 -#> # ℹ 255 more rows -``` - -## `read_gsheets()` - -The `read_gsheets()` function imports data from multiple sheets in a -Google Sheets spreadsheet and appends the resulting dataframes from each -sheet together to create a single dataframe. This function is a powerful -tool for data analysis, as it allows you to easily combine data from -multiple sheets into a single dataset. 
- -``` r - -# Google Sheet ID or the link to the sheet - -sheet_id <- "1izO0mHu3L9AMySQUXGDn9GPs1n-VwGFSEoAKGhqVQh0" - -# read all the sheets - -read_gsheets(ss = sheet_id) -#> # A tibble: 260 × 9 -#> carat color clarity depth table price x y z -#> -#> 1 2 I SI1 65.9 60 13764 7.8 7.73 5.12 -#> 2 0.7 H SI1 65.2 58 2048 5.49 5.55 3.6 -#> 3 1.51 E SI1 58.4 70 11102 7.55 7.39 4.36 -#> 4 0.7 D SI2 65.5 57 1806 5.56 5.43 3.6 -#> 5 0.35 F VVS1 54.6 59 1011 4.85 4.79 2.63 -#> # ℹ 255 more rows -``` - -## `read_spss_data()` - -`read_spss_data()` is designed to seamlessly import data from an SPSS -data (`.sav` or `.zsav`) files. It converts labelled variables into -factors, a crucial step that enhances the ease of data manipulation and -analysis within the R programming environment. - -``` r - -# Read an SPSS data file without converting variable labels as column names - -file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr") - -data <- read_spss_data(file = file_path) - -data -#> # A tibble: 400 × 9 -#> id educ south sex exper wage occup marr ed -#> -#> 1 3 12 does not live in South Male 17 7.5 Other Married High s… -#> 2 4 13 does not live in South Male 9 13.1 Other Not married Some c… -#> 3 5 10 lives in South Male 27 4.45 Other Not married Less t… -#> 4 12 9 lives in South Male 30 6.25 Other Not married Less t… -#> 5 13 9 lives in South Male 29 20.0 Other Married Less t… -#> # ℹ 395 more rows -``` - -``` r - -# Read an SPSS data file and convert variable labels as column names - -data <- read_spss_data(file = file_path, label = TRUE) - -data -#> # A tibble: 400 × 9 -#> `Worker ID` `Number of years of education` `Live in south` Gender -#> -#> 1 3 12 does not live in South Male -#> 2 4 13 does not live in South Male -#> 3 5 10 lives in South Male -#> 4 12 9 lives in South Male -#> 5 13 9 lives in South Male -#> # ℹ 395 more rows -#> # ℹ 5 more variables: `Number of years of work experience` , -#> # `Wage (dollars per hour)` , Occupation , `Marital status` 
, -#> # `Highest education level` -``` - -## read_stata_data() - -`read_stata_data()` reads Stata data file (`.dta`) into an R data frame, -converting labeled variables into factors. - -**Read the Stata data file without converting variable labels as column -names** - -``` r - -file_path <- system.file("extdata", "Wages.dta", package = "bulkreadr") - -data <- read_stata_data(file = file_path) - -data -#> # A tibble: 400 × 9 -#> id educ south sex exper wage occup marr ed -#> -#> 1 3 12 does not live in South Male 17 7.5 Other Married High s… -#> 2 4 13 does not live in South Male 9 13.1 Other Not married Some c… -#> 3 5 10 lives in South Male 27 4.45 Other Not married Less t… -#> 4 12 9 lives in South Male 30 6.25 Other Not married Less t… -#> 5 13 9 lives in South Male 29 20.0 Other Married Less t… -#> # ℹ 395 more rows -``` - -**Read the Stata data file and convert variable labels as column names** - -``` r - -data <- read_stata_data(file = file_path, label = TRUE) - -data -#> # A tibble: 400 × 9 -#> `Worker ID` `Number of years of education` `Live in south` Gender -#> -#> 1 3 12 does not live in South Male -#> 2 4 13 does not live in South Male -#> 3 5 10 lives in South Male -#> 4 12 9 lives in South Male -#> 5 13 9 lives in South Male -#> # ℹ 395 more rows -#> # ℹ 5 more variables: `Number of years of work experience` , -#> # `Wage (dollars per hour)` , Occupation , `Marital status` , -#> # `Highest education level` -``` - -## `generate_dictionary()` - -`generate_dictionary()` creates a data dictionary from a specified data -frame. This function is particularly useful for understanding and -documenting the structure of your dataset, similar to data dictionaries -in Stata or SPSS. 
- -``` r - -# Creating a data dictionary from an SPSS file - -file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr") - -wage_data <- read_spss_data(file = file_path) - -generate_dictionary(wage_data) -#> # A tibble: 9 × 6 -#> position variable description `column type` missing levels -#> -#> 1 1 id Worker ID dbl 0 -#> 2 2 educ Number of years of education dbl 0 -#> 3 3 south Live in south fct 0 -#> 4 4 sex Gender fct 0 -#> 5 5 exper Number of years of work experi… dbl 0 -#> # ℹ 4 more rows -``` - -## `look_for()` - -The `look_for()` function is designed to emulate the functionality of -the Stata `lookfor` command in R. It provides a powerful tool for -searching through large datasets, specifically targeting variable names, -variable label descriptions, factor levels, and value labels. This -function is handy for users working with extensive and complex datasets, -enabling them to quickly and efficiently locate the variables of -interest. - -``` r - -# Look for a single keyword. - -look_for(wage_data, "south") -#> pos variable label col_type missing values -#> 3 south Live in south fct 0 does not live in South -#> lives in South -``` - -## `pull_out()` - -`pull_out()` is similar to `[`. It acts on vectors, matrices, arrays and -lists to extract or replace parts. It is pleasant to use with the -magrittr (`⁠%>%`⁠) and base(`|>`) operators. - -``` r - -top_10_richest_nig <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze", "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor", "Jimoh Ibrahim", "Tony Elumelu") - -top_10_richest_nig %>% - pull_out(c(1, 5, 2)) -#> [1] "Aliko Dangote" "Abdulsamad Rabiu" "Mike Adenuga" -``` - -``` r -top_10_richest_nig %>% - pull_out(-c(1, 5, 2)) -#> [1] "Femi Otedola" "Arthur Eze" "Cletus Ibeto" "Orji Uzor Kalu" -#> [5] "ABC Orjiakor" "Jimoh Ibrahim" "Tony Elumelu" -``` - -## `convert_to_date()` - -`convert_to_date()` parses an input vector into POSIXct date-time -object. 
It is also powerful to convert from excel date number like -`42370` into date value like `2016-01-01`. - -``` r - -## ** heterogeneous dates ** - -dates <- c( - 44869, "22.09.2022", NA, "02/27/92", "01-19-2022", - "13-01- 2022", "2023", "2023-2", 41750.2, 41751.99, - "11 07 2023", "2023-4" - ) - -# Convert to POSIXct or Date object - -convert_to_date(dates) -#> [1] "2022-11-04" "2022-09-22" NA "1992-02-27" "2022-01-19" -#> [6] "2022-01-13" "2023-01-01" "2023-02-01" "2014-04-21" "2014-04-22" -#> [11] "2023-07-11" "2023-04-01" - -# It can also convert date time object to date object - -convert_to_date(lubridate::now()) -#> [1] "2023-11-25" -``` - -``` r -# With dataframe - -file_path <- system.file("extdata", "OGD.xlsx", package = "bulkreadr") - -ogd_data <- read_excel_workbook(path = file_path) - - -ogd_data %>% head() -#> # A tibble: 6 × 2 -#> PID Date -#> -#> 1 NIG-CON-002 22.09.2022 -#> 2 NIG-CON-004 44569 -#> 3 NIG-CON-007 44569 -#> 4 NIG-CON-009 44569 -#> 5 NIG-CON-010 44569 -#> # ℹ 1 more row - -# Convert to POSIXct or Date object - -modified_ogd_data <- ogd_data %>% - mutate(Date_format = convert_to_date(Date)) - -modified_ogd_data %>% head() -#> # A tibble: 6 × 3 -#> PID Date Date_format -#> -#> 1 NIG-CON-002 22.09.2022 2022-09-22 -#> 2 NIG-CON-004 44569 2022-01-08 -#> 3 NIG-CON-007 44569 2022-01-08 -#> 4 NIG-CON-009 44569 2022-01-08 -#> 5 NIG-CON-010 44569 2022-01-08 -#> # ℹ 1 more row -``` - -## `inspect_na()` - -`inspect_na()` summarizes the rate of missingness in each column of a -data frame. For a grouped data frame, the rate of missingness is -summarized separately for each group. 
- -``` r - -# dataframe summary - -inspect_na(airquality) -#> # A tibble: 6 × 3 -#> col_name cnt pcnt -#> -#> 1 Ozone 37 24.2 -#> 2 Solar.R 7 4.58 -#> 3 Wind 0 0 -#> 4 Temp 0 0 -#> 5 Month 0 0 -#> # ℹ 1 more row - -# grouped dataframe summary - -airquality %>% - group_by(Month) %>% - inspect_na() -#> # A tibble: 25 × 4 -#> # Groups: Month [5] -#> Month col_name cnt pcnt -#> -#> 1 5 Ozone 5 16.1 -#> 2 5 Solar.R 4 12.9 -#> 3 5 Wind 0 0 -#> 4 5 Temp 0 0 -#> 5 5 Day 0 0 -#> # ℹ 20 more rows -``` - -## `fill_missing_values()` - -`fill_missing_values()` in an efficient function that addresses missing -values in a dataframe. It uses imputation by function, meaning it -replaces missing data in numeric variables with either the mean or the -median, and in non-numeric variables with the mode. The function takes a -column-based imputation approach, ensuring that replacement values are -derived from the respective columns, resulting in accurate and -consistent data. This method enhances the integrity of the dataset and -promotes sound decision-making and analysis in data processing -workflows. 
- -``` r - -df <- tibble::tibble( - Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5), - Sepal.Width = c(4.1, 3.6, 3, 3, 2.9, 2.5, 2.4), - Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7), - Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA), - Species = c("setosa", NA, "versicolor", "setosa", - NA, "virginica", "setosa" - ) -) - -df -#> # A tibble: 7 × 5 -#> Sepal_Length Sepal.Width Petal_Length Petal_Width Species -#> -#> 1 5.2 4.1 1.5 NA setosa -#> 2 5 3.6 1.4 0.2 -#> 3 5.7 3 4.2 1.2 versicolor -#> 4 NA 3 1.4 0.2 setosa -#> 5 6.2 2.9 NA 1.3 -#> # ℹ 2 more rows - -# Using mean to fill missing values for numeric variables - -result_df_mean <- fill_missing_values(df, use_mean = TRUE) - -result_df_mean -#> # A tibble: 7 × 5 -#> Sepal_Length Sepal.Width Petal_Length Petal_Width Species -#> -#> 1 5.2 4.1 1.5 0.94 setosa -#> 2 5 3.6 1.4 0.2 setosa -#> 3 5.7 3 4.2 1.2 versicolor -#> 4 5.72 3 1.4 0.2 setosa -#> 5 6.2 2.9 3 1.3 setosa -#> # ℹ 2 more rows - -# Using median to fill missing values for numeric variables - -result_df_median <- fill_missing_values(df, use_mean = FALSE) - -result_df_median -#> # A tibble: 7 × 5 -#> Sepal_Length Sepal.Width Petal_Length Petal_Width Species -#> -#> 1 5.2 4.1 1.5 1.2 setosa -#> 2 5 3.6 1.4 0.2 setosa -#> 3 5.7 3 4.2 1.2 versicolor -#> 4 5.6 3 1.4 0.2 setosa -#> 5 6.2 2.9 2.6 1.3 setosa -#> # ℹ 2 more rows -``` - -### Impute missing values (NAs) in a grouped data frame - -You can use the `fill_missing_values()` in a grouped data frame by using -other grouping and map functions. 
Here is an example of how to do this: - -``` r -sample_iris <- tibble::tibble( -Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5), -Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7), -Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA), -Species = c("setosa", "setosa", "versicolor", "setosa", - "virginica", "virginica", "setosa") -) - -sample_iris -#> # A tibble: 7 × 4 -#> Sepal_Length Petal_Length Petal_Width Species -#> -#> 1 5.2 1.5 0.3 setosa -#> 2 5 1.4 0.2 setosa -#> 3 5.7 4.2 1.2 versicolor -#> 4 NA 1.4 0.2 setosa -#> 5 6.2 NA 1.3 virginica -#> # ℹ 2 more rows - -sample_iris %>% - group_by(Species) %>% - group_split() %>% - map_df(fill_missing_values) -#> # A tibble: 7 × 4 -#> Sepal_Length Petal_Length Petal_Width Species -#> -#> 1 5.2 1.5 0.3 setosa -#> 2 5 1.4 0.2 setosa -#> 3 5.23 1.4 0.2 setosa -#> 4 5.5 3.7 0.233 setosa -#> 5 5.7 4.2 1.2 versicolor -#> # ℹ 2 more rows -``` - ## Context bulkreadr draws on and complements / emulates other packages such as diff --git a/_pkgdown.yml b/_pkgdown.yml index 319a932..99d7f22 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -2,15 +2,69 @@ url: https://gbganalyst.github.io/bulkreadr/ template: bootstrap: 5 +reference: + +- title: Spreadsheets + desc: > + Functions that operate on spreadsheets + contents: + - read_excel_workbook + - read_excel_files_from_dir + +- title: Google Sheets + desc: > + A function that operates on Google Sheets + contents: + - read_gsheets + +- title: Flat files + desc: > + A function that operates on CSV files + contents: + - read_csv_files_from_dir + +- title: Labelled data + desc: > + Functions that read labelled data and make it easier to work with.
+ contents: + - read_spss_data + - read_stata_data + +- title: Data dictionary + desc: > + Functions that provide descriptions of the labelled data + contents: + - generate_dictionary + - look_for + +- title: Other functions in bulkreadr + desc: > + Unlike other functions in bulkreadr, these functions operate on individual + vectors, not on data frames, with the exceptions of `inspect_na()` and + `fill_missing_values()`. + contents: + - pull_out + - convert_to_date + - fill_missing_values + - inspect_na + + +articles: +- title: Get started + navbar: ~ + contents: + - bulkreadr + navbar: - title: "bayesplot" + title: "bulkreadr" left: - - icon: fa-home fa-lg + - text: Home + icon: fa-home href: index.html - - text: "Vignettes" - href: articles/index.html - text: "Functions" href: reference/index.html + - text: "Vignettes" + href: articles/index.html - text: "News" href: news/index.html - text: "Other Packages" diff --git a/man/bulkreadr-package.Rd b/man/bulkreadr-package.Rd index f6b5aba..5c2298b 100644 --- a/man/bulkreadr-package.Rd +++ b/man/bulkreadr-package.Rd @@ -14,6 +14,7 @@ Designed to simplify and streamline the process of reading and processing large Useful links: \itemize{ \item \url{https://github.com/gbganalyst/bulkreadr} + \item \url{https://gbganalyst.github.io/bulkreadr/} \item Report bugs at \url{https://github.com/gbganalyst/bulkreadr/issues} } diff --git a/vignettes/bulkreadr.Rmd b/vignettes/bulkreadr.Rmd index 618a29f..392f106 100644 --- a/vignettes/bulkreadr.Rmd +++ b/vignettes/bulkreadr.Rmd @@ -1,9 +1,16 @@ --- -title: "Introduction to bulkreadr package" +title: "Introduction to bulkreadr" output: rmarkdown::html_vignette +description: > + Start here if this is your first time using bulkreadr. You'll learn how to + use functions like `read_excel_workbook()` and `read_excel_files_from_dir()` + for importing data from Excel and `read_gsheets()` for Google Sheets, + allowing for data importation from multiple sheets. 
For handling CSV + files, `read_csv_files_from_dir()` reads all CSV files from a specified + directory. author: "Ezekiel Ogundepo and Ernest Fokoué" vignette: > - %\VignetteIndexEntry{bulkreadr} + %\VignetteIndexEntry{Introduction to bulkreadr} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: @@ -243,7 +250,7 @@ The `look_for()` function is designed to emulate the functionality of the Stata look_for(wage_data, "south") -look_for(wage_data, "e") +look_for(wage_data, "s") ``` ## pull_out() @@ -377,15 +384,3 @@ sample_iris %>% map_df(fill_missing_values) ``` -## Inspiration - -bulkreadr draws on and complements / emulates other packages such as `readxl`, `readr`, and `googlesheets4` to read bulk data in R. - - * [readxl](https://readxl.tidyverse.org) is the tidyverse package for reading Excel files (xls or xlsx) into an R data frame. - - * [readr](https://readr.tidyverse.org) is the tidyverse package for reading delimited files (e.g., csv or tsv) into an R data frame. - - * [googlesheets4](https://cran.r-project.org/package=googlesheets) is the package to interact with Google Sheets through the Sheets API v4 . - -