dplyr 0.2
Piping
dplyr now imports %>% from magrittr (#330). I recommend that you use this instead of %.% because it is easier to type (since you can hold down the shift key) and is more flexible. With %>%, you can control which argument on the RHS receives the LHS by using the . pronoun. This makes %>% more useful with base R functions because they don't always take the data frame as the first argument. For example, you could pipe mtcars to xtabs() with:
mtcars %>% xtabs( ~ cyl + vs, data = .)
Thanks to @smbache for the excellent magrittr package. dplyr only provides %>% from magrittr, but magrittr contains many other useful functions. To use them, load magrittr explicitly: library(magrittr). For more details, see vignette("magrittr").
%.% will be deprecated in a future version of dplyr, but it won't happen for a while. I've also deprecated chain() to encourage a single style of dplyr usage: please use %>% instead.
Do
do() has been completely overhauled. There are now two ways to use it, either with multiple named arguments or a single unnamed argument. group_by() + do() is equivalent to plyr::dlply, except it always returns a data frame.
If you use named arguments, each argument becomes a list-variable in the output. A list-variable can contain any arbitrary R object so it's particularly well suited for storing models.
library(dplyr)
models <- mtcars %>% group_by(cyl) %>% do(lm = lm(mpg ~ wt, data = .))
models %>% summarise(rsq = summary(lm)$r.squared)
If you use an unnamed argument, the result should be a data frame. This allows you to apply arbitrary functions to each group.
mtcars %>% group_by(cyl) %>% do(head(., 1))
Note the use of the . pronoun to refer to the data in the current group.
do() also has an automatic progress bar. It appears if the computation takes longer than 5 seconds and lets you know (approximately) how much longer the job will take to complete.
New verbs
dplyr 0.2 adds three new verbs:
- glimpse() makes it possible to see all the columns in a tbl, displaying as much data for each variable as can be fit on a single line.
- sample_n() randomly samples a fixed number of rows from a tbl; sample_frac() randomly samples a fixed fraction of rows. Only works for local data frames and data tables (#202).
- summarise_each() and mutate_each() make it easy to apply one or more functions to multiple columns in a tbl (#178).
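As a rough illustration of the sampling verbs and glimpse(), here is a minimal sketch. It assumes the built-in mtcars dataset (32 rows); the seed and the variable names five_rows and quarter are arbitrary choices made for this example only.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the sample is repeatable

# Draw exactly 5 rows at random
five_rows <- mtcars %>% sample_n(5)

# Draw a quarter of the rows at random (mtcars has 32 rows)
quarter <- mtcars %>% sample_frac(0.25)

nrow(five_rows)
nrow(quarter)

# One line per column, showing as many values as fit on screen
glimpse(mtcars)
```

Which rows are drawn depends on the seed and the R version, so only the row counts are stable across runs.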
Minor improvements
- If you load plyr after dplyr, you'll get a message suggesting that you load plyr first (#347).
- as.tbl_cube() gains a method for matrices (#359, @paulstaab).
- compute() gains temporary argument so you can control whether the results are temporary or permanent (#382, @cpsievert).
- group_by() now defaults to add = FALSE so that it sets the grouping variables rather than adding to the existing list. I think this is how most people expected group_by to work anyway, so it's unlikely to cause problems (#385).
- Support for MonetDB tables with src_monetdb() (#8, thanks to @hannesmuehleisen).
- New vignettes:
  - memory vignette which discusses how dplyr minimises memory usage for local data frames (#198).
  - new-sql-backend vignette which discusses how to add a new SQL backend/source to dplyr.
- changes() output more clearly distinguishes which columns were added or deleted.
- explain() is now generic.
- dplyr is more careful when setting the keys of data tables, so it never accidentally modifies an object that it doesn't own. It also avoids unnecessary key setting which negatively affected performance (#193, #255).
- print() methods for tbl_df, tbl_dt and tbl_sql gain n argument to control the number of rows printed (#362). They also work better when you have columns containing lists of complex objects.
- row_number() can be called without arguments, in which case it returns the same as 1:n() (#303).
- "comment" attribute is allowed (white listed) as well as names (#346).
- hybrid versions of min, max, mean, var, sd and sum handle the na.rm argument (#168). This should yield substantial performance improvements for those functions.
- Special case for call to arrange() on a grouped data frame with no arguments (#369).
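Two of the changes above are easy to check interactively. The sketch below assumes the built-in mtcars dataset; the names by_cyl, by_gear, group_names and numbered are made up for the example.

```r
library(dplyr)

# group_by() replaces the existing grouping by default,
# rather than adding to it
by_cyl <- mtcars %>% group_by(cyl)
by_gear <- by_cyl %>% group_by(gear)
group_names <- vapply(groups(by_gear), as.character, character(1))
group_names

# row_number() with no arguments numbers the rows within each group,
# the same as 1:n()
numbered <- mtcars %>%
  group_by(cyl) %>%
  mutate(id = row_number(), id_check = 1:n())
all(numbered$id == numbered$id_check)
```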
Bug fixes
- Code adapted to Rcpp > 0.11.1
- internal DataDots class protects against missing variables in verbs (#314), including the case where ... is missing (#338).
- all.equal.data.frame from base is no longer bypassed. We now have all.equal.tbl_df and all.equal.tbl_dt methods (#332).
- arrange() correctly handles NA in numeric vectors (#331) and 0 row data frames (#289).
- copy_to.src_mysql() now works on Windows (#323).
- *_join() doesn't reorder column names (#324).
- rbind_all() is stricter and only accepts list of data frames (#288).
- rbind_* propagates time zone information for POSIXct columns (#298).
- rbind_* is less strict about type promotion. The numeric Collecter allows collection of integer and logical vectors. The integer Collecter also collects logical values (#321).
- internal sum correctly handles integer (under/over)flow (#308).
- summarise() checks consistency of outputs (#300) and drops names attribute of output columns (#357).
- join functions throw error instead of crashing when there are no common variables between the data frames, and also give a better error message when only one data frame has a by variable (#371).
- top_n() returns n rows instead of n - 1 (@leondutoit, #367).
- SQL translation always evaluates subsetting operators ($, [, [[) locally (#318).
- select() now renames variables in remote sql tbls (#317) and implicitly adds grouping variables (#170).
- internal grouped_df_impl function errors if there are no variables to group by (#398).
- n_distinct did not treat NA correctly in the numeric case (#384).
- Some compiler warnings triggered by -Wall or -pedantic have been eliminated.
- group_by only creates one group for NA (#401).
- Hybrid evaluator did not evaluate expression in correct environment (#403).
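A quick way to see the top_n() and NA-grouping fixes in action is sketched below. The data frames scores and df are made-up toy data for this example, and the top_n() input deliberately has no ties so that exactly n rows come back.

```r
library(dplyr)

# Toy data with no ties in x
scores <- data.frame(x = c(10, 4, 1, 6, 3))

# top_n() keeps the n rows with the largest values of x
top_two <- scores %>% top_n(2, x)
nrow(top_two)

# Missing values in a grouping variable form a single group
df <- data.frame(x = c(1, NA, NA, 2), y = 1:4)
counts <- df %>% group_by(x) %>% summarise(n = n())
counts
```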