
Commit

updated the vignettes files
gbganalyst committed Feb 25, 2024
1 parent ce253e9 commit c7b5953
Showing 6 changed files with 302 additions and 258 deletions.
8 changes: 1 addition & 7 deletions README.Rmd
@@ -41,12 +41,8 @@ Additionally, the package seamlessly works with labelled data from SPSS and Stat

## The Concept Map



![](man/figures/concept-map.png)



## Installation

You can install the `bulkreadr` package from [CRAN](https://cran.r-project.org/) with:
@@ -72,13 +68,11 @@ Now that you have installed `bulkreadr` package, you can simply load it by using

```{r pkgload}
library(bulkreadr)
library(dplyr)
```


## Context

bulkreadr draws on and complements / emulates other packages such as readxl, readr, and googlesheets4 to read bulk data in R.
bulkreadr is designed to integrate with and augment the capabilities of established packages such as `readxl`, `readr`, and `googlesheets4`, offering enhanced functionality for reading bulk data within the R programming environment.

* [readxl](https://readxl.tidyverse.org) is the tidyverse package for reading Excel files (xls or xlsx) into an R data frame.

7 changes: 4 additions & 3 deletions README.md
@@ -69,13 +69,14 @@ by using:

``` r
library(bulkreadr)
library(dplyr)
```

## Context

bulkreadr draws on and complements / emulates other packages such as
readxl, readr, and googlesheets4 to read bulk data in R.
bulkreadr is designed to integrate with and augment the capabilities of
established packages such as `readxl`, `readr`, and `googlesheets4`,
offering enhanced functionality for reading bulk data within the R
programming environment.

- [readxl](https://readxl.tidyverse.org) is the tidyverse package for
reading Excel files (xls or xlsx) into an R data frame.
2 changes: 2 additions & 0 deletions _pkgdown.yml
@@ -54,6 +54,8 @@ articles:
navbar: ~
contents:
- bulkreadr
- labelled-data
- other-functions

navbar:
title: "bulkreadr"
251 changes: 3 additions & 248 deletions vignettes/bulkreadr.Rmd
@@ -64,43 +64,15 @@ Now that you have installed `bulkreadr` package, you can simply load it by using

```{r pkgload}
library(bulkreadr)
library(dplyr)
```

## Functions in bulkreadr package

This section provides a concise overview of the different functions available in the `bulkreadr` package. These functions serve various purposes and are designed to handle importing of data in bulk.
This section provides a concise overview of the different functions available in the `bulkreadr` package for importing bulk data in R.

- [`read_excel_workbook()`](#read_excel_workbook)

- [`read_excel_files_from_dir()`](#read_excel_files_from_dir)

- [`read_csv_files_from_dir()`](#read_csv_files_from_dir)

- [`read_gsheets()`](#read_gsheets)

- [`read_spss_data()`](#read_spss_data)

- [`read_stata_data()`](#read_stata_data)

## Other functions in bulkreadr package:

- [`generate_dictionary()`](#generate_dictionary)

- [`look_for()`](#look_for)

- [`pull_out()`](#pull_out)

- [`convert_to_date()`](#convert_to_date)

- [`inspect_na()`](#inspect_na)

- [`fill_missing_values()`](#fill_missing_values)

**Note:**

Most examples for the functions in this package use sample data shipped with `bulkreadr`, which is located with the `system.file()` function. To use your own data from a local directory, set the appropriate file path before calling any of the functions provided by the bulkreadr package.
## Note

> Most examples for the functions in this package use sample data shipped with `bulkreadr`, which is located with the `system.file()` function. To use your own data from a local directory, set the appropriate file path before calling any of the functions provided by the bulkreadr package.

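As a rough illustration of the note above, the sketch below shows how `system.file()` resolves a file bundled with an installed package and what it returns when nothing matches. It uses base R only: the `stats` package and its `DESCRIPTION` file are stand-ins chosen so the snippet runs without bulkreadr installed, and `my_path` is a hypothetical local path.

```r
# Resolve a file shipped inside an installed package.
# "DESCRIPTION" in the base "stats" package is a stand-in (assumption);
# the vignette itself uses package = "bulkreadr" with files under "extdata".
pkg_file <- system.file("DESCRIPTION", package = "stats")
file.exists(pkg_file)

# system.file() returns an empty string when nothing matches,
# so it is worth checking the result before trying to read it.
missing_file <- system.file("extdata", "no-such-file.csv", package = "stats")
nzchar(missing_file)

# For your own data, build the path yourself (hypothetical location).
my_path <- file.path("data", "my-workbook.xlsx")
```

Checking `nzchar()` on the result is a cheap guard against passing an empty path to a reader function.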
## read_excel_workbook()

@@ -167,220 +139,3 @@ sheet_id <- "1izO0mHu3L9AMySQUXGDn9GPs1n-VwGFSEoAKGhqVQh0"
read_gsheets(ss = sheet_id)
```

## read_spss_data()

`read_spss_data()` imports data from SPSS data files (`.sav` or `.zsav`). It converts labelled variables into factors, which makes the data easier to manipulate and analyse in R.

**Read the SPSS data file without using variable labels as column names**

```{r spssdata1}
file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")
data <- read_spss_data(file = file_path)
data
```


**Read the SPSS data file and use variable labels as column names**

```{r spssdata2}
data <- read_spss_data(file = file_path, label = TRUE)
data
```


## read_stata_data()

`read_stata_data()` reads a Stata data file (`.dta`) into an R data frame, converting labelled variables into factors.

**Read the Stata data file without using variable labels as column names**

```{r statadata1}
file_path <- system.file("extdata", "Wages.dta", package = "bulkreadr")
data <- read_stata_data(file = file_path)
data
```

**Read the Stata data file and use variable labels as column names**

```{r statadata2}
data <- read_stata_data(file = file_path, label = TRUE)
data
```


## generate_dictionary()

`generate_dictionary()` creates a data dictionary from a specified data frame. This function is particularly useful for understanding and documenting the structure of your dataset, similar to data dictionaries in Stata or SPSS.

```{r}
# Creating a data dictionary from an SPSS file
file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")
wage_data <- read_spss_data(file = file_path)
generate_dictionary(wage_data)
```


## look_for()

The `look_for()` function is designed to emulate the functionality of the Stata `lookfor` command in R. It provides a powerful tool for searching through large datasets, specifically targeting variable names, variable label descriptions, factor levels, and value labels. This function is handy for users working with extensive and complex datasets, enabling them to quickly and efficiently locate the variables of interest.


```{r}
# Look for a single keyword.
look_for(wage_data, "south")
look_for(wage_data, "s")
```

## pull_out()

`pull_out()` is similar to `[`. It acts on vectors, matrices, arrays and lists to extract or replace parts. It works well with the magrittr (`%>%`) and base (`|>`) pipe operators.

```{r example4}
top_10_richest_nig <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola", "Arthur Eze", "Abdulsamad Rabiu", "Cletus Ibeto", "Orji Uzor Kalu", "ABC Orjiakor", "Jimoh Ibrahim", "Tony Elumelu")
top_10_richest_nig %>%
pull_out(c(1, 5, 2))
```

```{r}
top_10_richest_nig %>%
pull_out(-c(1, 5, 2))
```
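
For comparison, here is a minimal base R sketch of the same positive and negative indexing with `[` itself. It does not call `pull_out()`; the shortened `people` vector is an illustrative stand-in for the one above.

```r
# Base R indexing with `[`, on a shortened stand-in vector.
people <- c("Aliko Dangote", "Mike Adenuga", "Femi Otedola",
            "Arthur Eze", "Abdulsamad Rabiu")

# Positive indices select elements in the order given.
picked <- people[c(1, 5, 2)]
picked

# Negative indices drop those elements and keep the rest in their original order.
kept <- people[-c(1, 5, 2)]
kept
```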


## convert_to_date()

`convert_to_date()` parses an input vector into a POSIXct date-time object. It can also convert an Excel date number such as `42370` into a date value such as `2016-01-01`.

```{r example 5}
## ** heterogeneous dates **
dates <- c(
44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
"13-01- 2022", "2023", "2023-2", 41750.2, 41751.99,
"11 07 2023", "2023-4"
)
# Convert to POSIXct or Date object
convert_to_date(dates)
# It can also convert a date-time object to a date object
convert_to_date(lubridate::now())
```
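
To make the Excel date-number conversion concrete, the sketch below does the arithmetic in base R. It is not how `convert_to_date()` is implemented; it assumes Excel's 1900 date system, whose serial numbers count days from an origin of 1899-12-30.

```r
# Excel's 1900 date system counts days from 1899-12-30 (assumption: the
# workbook uses this system rather than the older Mac 1904 system).
excel_serial <- c(42370, 44869)
dates <- as.Date(excel_serial, origin = "1899-12-30")
dates

# A fractional serial carries a time of day; convert days to seconds for POSIXct.
noon <- as.POSIXct(41750.5 * 86400, origin = "1899-12-30", tz = "UTC")
noon
```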

## inspect_na()

`inspect_na()` summarizes the rate of missingness in each column of a data frame. For a grouped data frame, the rate of missingness is summarized separately for each group.

```{r example 6a}
# dataframe summary
inspect_na(airquality)
```

**Grouped dataframe summary**

```{r}
airquality %>%
group_by(Month) %>%
inspect_na()
```
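
The rate of missingness itself is simple to compute by hand. The base R sketch below is only an approximation of the idea, not the output format of `inspect_na()`: it returns the share of `NA` values per column, overall and per month.

```r
# Share of missing values in each column of the built-in airquality data.
na_rate <- colMeans(is.na(airquality))
round(100 * na_rate, 1)

# The same rate computed separately for each month
# (rows are variables, columns are months).
by_month <- sapply(
  split(airquality, airquality$Month),
  function(d) colMeans(is.na(d))
)
round(100 * by_month, 1)
```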

## fill_missing_values()

`fill_missing_values()` is a function that handles missing values in a data frame. It imputes by function: missing values in numeric variables are replaced with either the mean or the median, and missing values in non-numeric variables are replaced with the mode. Imputation is column-based, so each replacement value is derived from the column it fills.

```{r example 6}
df <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Sepal.Width = c(4.1, 3.6, 3, 3, 2.9, 2.5, 2.4),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", NA, "versicolor", "setosa",
NA, "virginica", "setosa"
)
)
```

```{r}
df
```


**Using mean to fill missing values for numeric variables**


```{r}
result_df_mean <- fill_missing_values(df, use_mean = TRUE)
result_df_mean
```


**Using median to fill missing values for numeric variables**

```{r}
result_df_median <- fill_missing_values(df, use_mean = FALSE)
result_df_median
```
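
The column-based rule described in this section (mean or median for numeric columns, mode otherwise) can be sketched in base R as follows. `impute_column()` is a hypothetical helper written for this illustration and is not part of bulkreadr; its `use_mean` argument simply mirrors the argument used in the examples above.

```r
# Hypothetical helper (not part of bulkreadr): impute a single column.
impute_column <- function(x, use_mean = TRUE) {
  if (!anyNA(x)) return(x)
  if (is.numeric(x)) {
    fill <- if (use_mean) mean(x, na.rm = TRUE) else median(x, na.rm = TRUE)
  } else {
    # Mode: the most frequent non-missing value
    # (ties go to the first value in sorted order).
    counts <- table(x)
    fill <- names(counts)[which.max(counts)]
  }
  x[is.na(x)] <- fill
  x
}

toy <- data.frame(
  length  = c(5.2, 5, NA, 6.2),
  species = c("setosa", NA, "setosa", "virginica")
)

# Apply the rule column by column, keeping the data frame structure.
toy[] <- lapply(toy, impute_column)
toy
```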


### Impute missing values (NAs) in a grouped data frame

You can apply `fill_missing_values()` to a grouped data frame by combining it with grouping and map functions. Here is an example:

```{r}
sample_iris <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", "setosa", "versicolor", "setosa",
"virginica", "virginica", "setosa")
)
```

```{r}
sample_iris
```

```{r}
# map_df() comes from purrr, which is not attached above, so it is namespaced here
sample_iris %>%
  group_by(Species) %>%
  group_split() %>%
  purrr::map_df(fill_missing_values)
```
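
The split-then-impute pattern can also be sketched without any packages. The example below uses base R's `ave()` to fill each missing value with the mean of its own group; it is an illustrative alternative, not what `fill_missing_values()` does internally, and the `grp` data frame is made up for the example.

```r
# Made-up data: one missing petal width in the "setosa" group.
grp <- data.frame(
  species     = c("setosa", "setosa", "setosa", "virginica", "virginica"),
  petal_width = c(0.3, 0.2, NA, 1.3, 1.8)
)

# ave() applies FUN within each group and returns values in the original row order.
grp$petal_width <- ave(grp$petal_width, grp$species, FUN = function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
grp
```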
