Commit 0815267

Merge pull request #741 from SebKrantz/development
Development
2 parents 95a1667 + 3008cee commit 0815267

11 files changed, +38 -37 lines changed

README.md

+7 -5
@@ -16,12 +16,14 @@
1616
[![arXiv](https://img.shields.io/badge/arXiv-2403.05038-0969DA.svg)](https://arxiv.org/abs/2403.05038)
1717
<!-- badges: end -->
1818

19-
*collapse* is a C/C++ based package for data transformation and statistical computing in R. Its aims are:
19+
*collapse* is a large C/C++-based package for data transformation and statistical computing in R. It aims to:
20+
21+
* Facilitate complex data transformation, exploration and computing tasks in R.
22+
* Help make R code fast, flexible, parsimonious and programmer friendly.
23+
24+
Its flexible [class-agnostic architecture](https://sebkrantz.github.io/collapse/articles/collapse_object_handling.html) supports operations on all basic R objects and their popular extensions, including, but not limited to, *units*, *integer64*, *xts*/*zoo*, *tibble*, *grouped_df*, *data.table*, *sf*, *pseries* and *pdata.frame*.
2025

21-
* To facilitate complex data transformation, exploration and computing tasks in R.
22-
* To help make R code fast, flexible, parsimonious and programmer friendly.
2326

24-
It further implements a [class-agnostic approach to R programming](https://sebkrantz.github.io/collapse/articles/collapse_object_handling.html), supporting base R, *tibble*, *grouped_df* (*tidyverse*), *data.table*, *sf*, *units*, *pseries*, *pdata.frame* (*plm*), *xts*/*zoo* and variable labels.
2527

2628
**Key Features:**
2729

@@ -49,7 +51,7 @@ It further implements a [class-agnostic approach to R programming](https://sebkr
4951
* **Advanced data exploration**: Fast (grouped, weighted, panel-decomposed)
5052
summary statistics and descriptive tools.
5153

52-
*collapse* is written in C and C++, with algorithms much faster than base R's, scales well (benchmarks: [linux](https://duckdblabs.github.io/db-benchmark/) | [windows](https://github.com/AdrianAntico/Benchmarks?tab=readme-ov-file#benmark-results)), and very efficient for complex tasks (e.g., quantiles, weighted stats, mode/counting/deduplication, joins, pivots). Optimized R code ensures minimal evaluation overheads. <!-- , but imports C/C++ functions from *fixest*, *weights*, *RcppArmadillo*, and *RcppEigen* for certain statistical tasks. -->
54+
*collapse* is written in C and C++, with algorithms much faster than base R's, has extremely low evaluation overheads, and scales well (benchmarks: [linux](https://duckdblabs.github.io/db-benchmark/) | [windows](https://github.com/AdrianAntico/Benchmarks?tab=readme-ov-file#benmark-results)). It excels on complex statistical tasks. <!--, such as weighted statistics, mode/counting/deduplication, joins, pivots, panel data. Optimized R code ensures minimal evaluation overheads. , but imports C/C++ functions from *fixest*, *weights*, *RcppArmadillo*, and *RcppEigen* for certain statistical tasks. -->
5355

5456
## Installation
5557

man/GRP.Rd

+6 -6
@@ -24,7 +24,7 @@
2424
\alias{as_factor_GRP}
2525
\title{Fast Grouping / \emph{collapse} Grouping Objects}
2626
\description{
27-
\code{GRP} performs fast, ordered and unordered, groupings of vectors and data frames (or lists of vectors) using \code{\link{radixorderv}} or \code{\link{group}}. The output is a list-like object of class 'GRP' which can be printed, plotted and used as an efficient input to all of \emph{collapse}'s fast statistical and transformation functions and operators (see macros \code{.FAST_FUN} and \code{.OPERATOR_FUN}), as well as to \code{\link{collap}}, \code{\link{BY}} and \code{\link{TRA}}.
27+
\code{GRP} performs fast, ordered and unordered, groupings of vectors and data frames (or lists of vectors) using \code{\link{radixorder}} or \code{\link{group}}. The output is a list-like object of class 'GRP' which can be printed, plotted and used as an efficient input to all of \emph{collapse}'s fast statistical and transformation functions and operators (see macros \code{.FAST_FUN} and \code{.OPERATOR_FUN}), as well as to \code{\link{collap}}, \code{\link{BY}} and \code{\link{TRA}}.
2828
2929
\code{fgroup_by} is similar to \code{dplyr::group_by} but faster and class-agnostic. It creates a grouped data frame with a 'GRP' object attached - for fast dplyr-like programming with \emph{collapse}'s fast functions.
3030

@@ -99,15 +99,15 @@ fungroup(X, \dots)
9999
\item{sort}{logical. If \code{FALSE}, groups are not ordered but simply grouped in the order of first appearance of unique elements / rows. This often provides a performance gain if the data was not sorted beforehand. See also \code{method}.}
100100
\item{ordered}{logical. \code{TRUE} adds a class 'ordered' i.e. generates an ordered factor.}
101101

102-
\item{decreasing}{logical. Should the sort order be increasing or decreasing? Can be a vector of length equal to the number of arguments in \code{X} / \code{by} (argument passed to \code{\link{radixorderv}}).}
102+
\item{decreasing}{logical. Should the sort order be increasing or decreasing? Can be a vector of length equal to the number of arguments in \code{X} / \code{by} (argument passed to \code{\link{radixorder}}).}
103103

104-
\item{na.last}{logical. If missing values are encountered in grouping vector/columns, assign them to the last group (argument passed to \code{\link{radixorderv}}).}
104+
\item{na.last}{logical. If missing values are encountered in grouping vector/columns, assign them to the last group (argument passed to \code{\link{radixorder}}).}
105105

106106
\item{return.groups}{logical. Include the unique groups in the created GRP object.}
107107

108-
\item{return.order}{logical. If \code{sort = TRUE}, include the output from \code{\link{radixorderv}} in the created GRP object. This brings performance improvements in \code{gsplit} (and thus also benefits grouped execution of base R functions). }
108+
\item{return.order}{logical. If \code{sort = TRUE}, include the output from \code{\link{radixorder}} in the created GRP object. This brings performance improvements in \code{gsplit} (and thus also benefits grouped execution of base R functions). }
109109

110-
\item{method}{character. The algorithm to use for grouping: either \code{"radix"}, \code{"hash"} or \code{"auto"}. \code{"auto"} will chose \code{"radix"} when \code{sort = TRUE}, yielding ordered grouping via \code{\link{radixorderv}}, and \code{"hash"}-based grouping in first-appearance order via \code{\link{group}} otherwise. It is possibly to put \code{method = "radix"} and \code{sort = FALSE}, which will group character data in first appearance order but sort numeric data (a good hybrid option). \code{method = "hash"} currently does not support any sorting, thus putting \code{sort = TRUE} will simply be ignored.}
110+
\item{method}{character. The algorithm to use for grouping: either \code{"radix"}, \code{"hash"} or \code{"auto"}. \code{"auto"} will chose \code{"radix"} when \code{sort = TRUE}, yielding ordered grouping via \code{\link{radixorder}}, and \code{"hash"}-based grouping in first-appearance order via \code{\link{group}} otherwise. It is possibly to put \code{method = "radix"} and \code{sort = FALSE}, which will group character data in first appearance order but sort numeric data (a good hybrid option). \code{method = "hash"} currently does not support any sorting, thus putting \code{sort = TRUE} will simply be ignored.}
111111

112112
\item{group.sizes}{logical. \code{TRUE} tabulates factor levels using \code{\link{tabulate}} to create a vector of group sizes; \code{FALSE} leaves that slot empty when converting from factors.}
113113

@@ -195,7 +195,7 @@ Creating a factor from a 'GRP' object using \code{as_factor_GRP} does not involv
195195
[[5]] \tab\tab group.vars \tab\tab \code{character} \tab\tab The names of the grouping variables \cr\cr
196196
[[6]] \tab\tab ordered \tab\tab \code{logical(2)} \tab\tab \code{[1]} Whether the groups are ordered: equal to the \code{sort} argument in the default method, or \code{TRUE} if converted objects inherit a class \code{"ordered"} and \code{NA} otherwise, \code{[2]} Whether the data (\code{X}) is already sorted: the result of \code{!is.unsorted(group.id)}. If \code{sort = FALSE} (default method) the second entry is \code{NA}. \cr\cr
197197
198-
[[7]] \tab\tab order \tab\tab \code{integer(NROW(X))} or \code{NULL} \tab\tab Ordering vector from \code{radixorderv} (with \code{"starts"} attribute), or \code{NULL} if \code{return.order = FALSE} \cr\cr
198+
[[7]] \tab\tab order \tab\tab \code{integer(NROW(X))} or \code{NULL} \tab\tab Ordering vector from \code{radixorder} (with \code{"starts"} attribute), or \code{NULL} if \code{return.order = FALSE} \cr\cr
199199
[[8]] \tab\tab group.starts \tab\tab \code{integer(N.groups)} or \code{NULL} \tab\tab The first-occurrence positions/rows of the groups. Useful e.g. with \code{ffirst(x, g, na.rm = FALSE)}. \code{NULL} if \code{return.groups = FALSE}. \cr\cr
200200
201201
[[9]] \tab\tab call \tab\tab \code{match.call()} or \code{NULL} \tab\tab The \code{GRP()} call, obtained from \code{match.call()}, or \code{NULL} if \code{call = FALSE}

man/fast-data-manipulation.Rd

+12 -13
@@ -6,28 +6,27 @@
66
\description{
77
\emph{collapse} provides the following functions for fast manipulation of (mostly) data frames.
88
\itemize{
9-
\item \code{\link{fselect}} is a much faster alternative to \code{dplyr::select} to select columns using expressions involving column names. \code{\link{get_vars}} is a more versatile and programmer friendly function to efficiently select and replace columns by names, indices, logical vectors, regular expressions or using functions to identify columns.
9+
\item \code{\link{fselect}} is a much faster alternative to \code{dplyr::select} to select columns using expressions involving column names. \code{\link{get_vars}} is a more versatile and programmer friendly function to efficiently select and replace columns by names, indices, logical vectors, regular expressions, or using functions to identify columns.
1010

11-
\item The functions \code{\link{num_vars}}, \code{\link{cat_vars}}, \code{\link{char_vars}}, \code{\link{fact_vars}}, \code{\link{logi_vars}} and \code{\link{date_vars}} are convenience functions to efficiently select and replace columns by data type.
11+
\item \code{\link{num_vars}}, \code{\link{cat_vars}}, \code{\link{char_vars}}, \code{\link{fact_vars}}, \code{\link{logi_vars}} and \code{\link{date_vars}} are convenience functions to efficiently select and replace columns by data type.
1212

13-
\item \code{\link{add_vars}} efficiently adds new columns at any position within a data frame (default at the end). This can be done vie replacement (i.e. \code{add_vars(data) <- newdata}) or returning the appended data (i.e. \code{add_vars(data, newdata1, newdata2, \dots)}). Because of the latter, \code{add_vars} is also a more efficient alternative to \code{cbind.data.frame}.
13+
\item \code{\link{add_vars}} efficiently adds new columns at any position within a data frame (default at the end). This can be done vie replacement (i.e. \code{add_vars(data) <- newdata}) or returning the appended data, e.g., \code{add_vars(data, newdata1, newdata2, \dots)}. It is thus also an efficient alternative to \code{\link{cbind.data.frame}}.
1414

15-
\item \code{\link{rowbind}} efficiently combines data frames / lists row-wise. The implementation is derived from \code{data.table::rbindlist}, it is also a fast alternative to \code{rbind.data.frame}.
15+
\item \code{\link{rowbind}} efficiently combines data frames / lists row-wise. The implementation is derived from \code{data.table::rbindlist}, it is also a fast alternative to \code{\link{rbind.data.frame}}.
1616

17-
\item \code{\link{join}} provides fast class-agnostic and verbose table joins.
17+
\item \code{\link{join}} provides fast, class-agnostic, and verbose table joins.
1818

19-
\item \code{\link{pivot}} efficiently reshapes data, supporting longer, wider and recast pivoting, as well as multi-column-pivots and taking along variable labels.
19+
\item \code{\link{pivot}} efficiently reshapes data, supporting longer, wider and recast pivoting, as well as multi-column-pivots and pivots taking along variable labels.
2020

21-
\item \code{\link{fsubset}} is a much faster version of \code{\link{subset}} to efficiently subset vectors, matrices and data frames. If the non-standard evaluation offered by \code{\link{fsubset}} is not needed, the function \code{\link{ss}} is a much faster and also more secure alternative to \code{[.data.frame}.
21+
\item \code{\link{fsubset}} is a much faster version of \code{\link{subset}} to efficiently subset vectors, matrices and data frames. If the non-standard evaluation offered by \code{\link{fsubset}} is not needed, the function \code{\link{ss}} is a much faster and more secure alternative to \code{[.data.frame}.
2222

23-
\item \code{\link{fslice}} is a much faster alternative to \code{dplyr::slice_[head|tail|min|max]} for filtering/deduplicating matrix-like objects (by groups).
23+
\item \code{\link[=fslice]{fslice(v)}} is a much faster alternative to \code{dplyr::slice_[head|tail|min|max]} for filtering/deduplicating matrix-like objects (by groups).
2424

25-
\item \code{\link{fsummarise}} is a much faster version of \code{dplyr::summarise} when used together with the \link[=fast-statistical-functions]{Fast Statistical Functions} and \code{\link{fgroup_by}}, with whom it also supports super fast weighted aggregation.
25+
\item \code{\link{fsummarise}} is a much faster version of \code{dplyr::summarise}, especially when used together with the \link[=fast-statistical-functions]{Fast Statistical Functions} and \code{\link{fgroup_by}}.
2626

27-
\item \code{\link{fmutate}} is a much faster version of \code{dplyr::mutate} when used together with the \link[=fast-statistical-functions]{Fast Statistical Functions} as well as fast \link[=data-transformations]{Data Transformation Functions} and \code{\link{fgroup_by}}.
27+
\item \code{\link{fmutate}} is a much faster version of \code{dplyr::mutate}, especially when used together with the \link[=fast-statistical-functions]{Fast Statistical Functions}, the fast \link[=data-transformations]{Data Transformation Functions}, and \code{\link{fgroup_by}}.
2828

29-
30-
\item \code{\link{ftransform}} is a much faster version of \code{\link{transform}}, which also supports list input and nested pipelines. \code{\link{settransform}} does all of that by reference, i.e. it modifies the data frame in the global environment. \code{\link{fcompute}} is similar to \code{\link{ftransform}} but only returns modified and computed columns in a new data frame. %As a new feature, it is now possible to bulk-process columns with \code{\link{ftransform}}, i.e. \code{ftransform(data, fscale(data[1:2]))} is the same as \code{ftransform(data, col1 = fscale(col1), col2 = fscale(col2))}, and \code{ftransform(data) <- fscale(data[1:2]))} or \code{settransform(data, fscale(data[1:2]))} are both equivalent to \code{data[1:2] <- fscale(data[1:2]))}. Non-matching columns are added to the data.frame.
29+
\item \code{\link[=ftransform]{ftransform(v)}} is a much faster version of \code{\link{transform}}, which also supports list input and nested pipelines. \code{\link[=ftransform]{settransform(v)}} does all of that by reference, i.e. it assigns to the calling environment. \code{\link[=fcompute]{fcompute(v)}} is similar to \code{\link[=ftransform]{ftransform(v)}} but only returns modified/computed columns. %As a new feature, it is now possible to bulk-process columns with \code{\link{ftransform}}, i.e. \code{ftransform(data, fscale(data[1:2]))} is the same as \code{ftransform(data, col1 = fscale(col1), col2 = fscale(col2))}, and \code{ftransform(data) <- fscale(data[1:2]))} or \code{settransform(data, fscale(data[1:2]))} are both equivalent to \code{data[1:2] <- fscale(data[1:2]))}. Non-matching columns are added to the data.frame.
3130

3231
\item \code{\link{roworder}} is a fast substitute for \code{dplyr::arrange}, but the syntax is inspired by \code{data.table::setorder}.
3332

@@ -49,7 +48,7 @@
4948
\code{\link{ss}} \tab\tab No methods, for data frames \tab\tab Fast subset data frames \cr
5049
\code{\link[=fslice]{fslice(v)}} \tab\tab No methods, for matrices and data frames\tab\tab Fast slicing of rows \cr
5150
\code{\link{fsummarise}} \tab\tab No methods, for data frames \tab\tab Fast data aggregation \cr
52-
\code{\link{fmutate}}, \code{\link[=ftransform]{(f/set)ftransform(<-)}} \tab\tab No methods, for data frames \tab\tab Compute, modify or delete columns (non-standard evaluation) \cr
51+
\code{\link{fmutate}}, \code{\link[=ftransform]{(f/set)transform(v)(<-)}} \tab\tab No methods, for data frames \tab\tab Compute, modify or delete columns (non-standard evaluation) \cr
5352
%\code{\link{settransform}} \tab\tab No methods, for data frames \tab\tab Compute, modify or delete columns by reference (non-standard evaluation) \cr
5453
\code{\link[=fcompute]{fcompute(v)}} \tab\tab No methods, for data frames \tab\tab Compute or modify columns, returned in a new data frame (non-standard evaluation) \cr
5554
\code{\link[=roworder]{roworder(v)}} \tab\tab No methods, for data frames incl. pdata.frame \tab\tab Reorder rows and return data frame (standard and non-standard evaluation) \cr
