Improve performance (for big datasets) #673

hms1 · 2021-12-20T18:51:34Z

First of all thanks for making such a useful package!

However I have been trying to skim a dataset with lots of columns (thousands) and noticed very poor performance. The full dataset with >20,000 columns ran overnight without completing, so it seems the performance gets exponentially worse. I found similar issues being discussed in #370 and elsewhere.

I had a look to see where the bottleneck was and the skim_by_type methods and the build_results function called are both very slow when there are lots of columns. ~~This looks to me like an issue with dplyr::across when running summarize with lots of column / function pairs. Notwithstanding that~~ I refactored skim_by_type to improved performance by a factor of 25 for a 100,000 x 1,500 dataset. The larger dataset I am working with (which previously did not complete overnight) runs in ~1 minute. I may be missing something here so apologies in advance if so.

I am happy to make a branch to demonstrate this properly/ open for improvement but for now see below for a reproducible example showing the relatively performance for the refactored skim_by_type function, which should be able to replace all 3 of the current methods:

library(tidyverse)
library(skimr)
library(stringi)
library(microbenchmark)

#### Make some wide test data
number_rows <- 100000
coltypes <- c(rep("numeric",500), rep("date",500), rep("char", 500))

large_test_df <- bind_cols(
	map(seq_len(length(coltypes)), function(col){
		if(coltypes[col] == "numeric") content <- round(runif(number_rows, 0, col), sample(c(0,1,2,3), 1))
		if(coltypes[col] == "date")    content <- sample(seq(as.Date('1950-01-01'), Sys.Date(), by="day"), number_rows, replace = T)
		if(coltypes[col] == "char")    content <- sample(stri_rand_strings(500, sample(c(1,3,5), 1)), number_rows, replace = T)
		tibble(!!sym(str_c("col",col)) := content)
	}
	))

#### save the original function and define refactored function
skim_by_type_original <- skimr:::skim_by_type

skim_by_type_refactor <- function(mangled_skimmers, variable_names, data){
	group_columns <- groups(data)
	data <- as_tibble(data)
	map_df(variable_names, function(skim_var){
		pivot_longer(select(data, !! skim_var, !!! group_columns), cols = all_of(skim_var), names_to = "skim_variable", values_to = "values") %>%
			group_by(skim_variable, !!! group_columns) %>%
			summarize(across(values, mangled_skimmers$funs), .groups = "drop")
	})	%>%
		rename_with(~str_remove(.x,"values_~!@#\\$%\\^\\&\\*\\(\\)-\\+"), everything()) %>%
		arrange(match(skim_variable, variable_names))
}

#### run test
microbenchmark(
	{
		assignInNamespace("skim_by_type", skim_by_type_original, ns="skimr")
		full_version <- skim(large_test_df)
	},
	{
		assignInNamespace("skim_by_type", skim_by_type_refactor, ns="skimr")
		full_version <- skim(large_test_df)
	},
	times = 5
)

Unit: seconds
                                                                                                                                   expr
 {     assignInNamespace("skim_by_type", skim_by_type_original, ns = "skimr")     full_version <- skimr::skim(large_test_df) }
 {     assignInNamespace("skim_by_type", skim_by_type_refactor, ns = "skimr")     full_version <- skimr::skim(large_test_df) }
      min        lq      mean    median        uq       max neval
 251.1866 256.76540 256.68652 256.94793 258.26105 260.27161     5
  10.1157  10.18126  10.30293  10.28105  10.35256  10.58411     5

The text was updated successfully, but these errors were encountered:

elinw · 2021-12-21T06:47:50Z

Wow yes we have really been trying to think about the bottle necks, we have known that the processing was very slow, so this is really helpful! @michaelquinn32 thoughts?
In the original code we have skim_by_type.grouped_df() and skim_by_type.data.frame() (among others). But I don't think in your code you are using groups? Are you suggesting change to several functions or that we drop the differentiation?

hms1 · 2021-12-21T08:46:16Z

Hey @elinw, yes I saw that in the original code skim_by_type is split into three (?) methods to handle grouped_df, data.table and plain data.frames. The single function above should handle each of these classes without issue but of course it could be split into different methods to handle each class separately. It just seemed easier to demonstrate like the above. If you are interested I'd be happy to formalise things in a new branch when I find the time but I don't permissions to do so at the moment.

elinw · 2021-12-21T15:06:00Z

Well to create a new branch you would fork the project and create the branch in your repo.
We use the S3 approach because, among other things, it lets people develop skim methods for different kinds of objects. Now that you have raised this I think we are really at the point of maturity of skimr where working on performance (and, related, pushing into a database when appropriate) is really the biggest thing we can do.

michaelquinn32 · 2021-12-21T20:14:45Z

That's awesome Henry. Thanks!

We would love a pull request. I would be happy to review your code.

As @elinw pointed out, create a branch with your change and we can go from there.

If you haven't yet, take a look at our contributor's guide
Code should pass a run of lintr with the default settings
And you should run styler::style_pkg() before we merge

I'm surprised that dplyr::across() performs so much worse than pivoting the data first, but I'm thrilled to have discovered this opportunity for improving the user experience.

Thanks!

hms1 · 2021-12-23T15:17:17Z

Hey @michaelquinn32 ,

I had a chance to look at this properly rather than just re-writing the function that was going slow.

First off I should clarify that it is not dplyr::across that is causing issues. I am afraid to even think what the punishment is for people who bring dplyr into disrepute.

Amazingly it is the call to as.data.frame buried inside skimr:::reshape_skimmed that causes the majority of slowness (see here). My explanation is that as.data.frame gets slower as the object being converted gets larger. In combination with the increased calls to reshape_skimmed (once per variable) this leads to an exponential performance degradation as you add columns.

Anyway, I have made an initial PR that simply changes as.data.frame to as_tibble. What I suggested above in the thread is faster and is quite a lot cleaner (in my opinion) so I can make a subsequent PR to rewrite the methods. Just this initial change is a very low hanging fruit.

elinw · 2021-12-24T04:05:44Z

Let's look at the timing for both grouped and ungrouped data with the as_tibble change. I'm wondering what we are thinking right there that we used as.data.frame() -- was there some kind of edge case? It's right in the middle of a bunch of tidyverse code.
This is the commit 81b7ddd
which essentially added the as.data.frame().

Should we be coercing to a tibble after we check that the input to skim is a data frame or coercible to a data frame?

elinw · 2022-01-01T03:12:50Z

Even when we merge that change, let's keep this issue open for the general topic of making skimr faster for different situations.

elinw · 2022-01-01T21:27:44Z

Also let's consider performance without histograms as th base since we know they have a big impact.

astraetech · 2022-09-02T04:49:58Z

Hi, love skimr and use it a lot. Are there any updates on the performance issue? My datasets are 50-300+ vars and 8-60 mlns observations and skimr takes a long time to do summaries. Thank you!

michaelquinn32 · 2022-09-19T23:47:10Z

Thanks for using skimr!

We do support summaries of data.table objects. Have you tried that? Is it faster?

Otherwise, scaling up to that size is a big challenge for the design of skimr, and I'm not exactly sure what we could do. We've both been very stretched for time recently, so a big change (like support parallel backends, etc.) might be a ways away still.

jxu · 2023-10-23T19:56:16Z

skim takes about ten minutes for my dataset, a tibble with 150k rows and 1000 columns. This is a lot of columns but not unreasonable for raw medical data. For comparison, summary does not do as much, but finishes in 30 seconds.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance (for big datasets) #673

Improve performance (for big datasets) #673

hms1 commented Dec 20, 2021 •

edited

Loading

elinw commented Dec 21, 2021

hms1 commented Dec 21, 2021 •

edited

Loading

elinw commented Dec 21, 2021

michaelquinn32 commented Dec 21, 2021

hms1 commented Dec 23, 2021

elinw commented Dec 24, 2021 •

edited

Loading

elinw commented Jan 1, 2022

elinw commented Jan 1, 2022

astraetech commented Sep 2, 2022

michaelquinn32 commented Sep 19, 2022

jxu commented Oct 23, 2023

Improve performance (for big datasets) #673

Improve performance (for big datasets) #673

Comments

hms1 commented Dec 20, 2021 • edited Loading

elinw commented Dec 21, 2021

hms1 commented Dec 21, 2021 • edited Loading

elinw commented Dec 21, 2021

michaelquinn32 commented Dec 21, 2021

hms1 commented Dec 23, 2021

elinw commented Dec 24, 2021 • edited Loading

elinw commented Jan 1, 2022

elinw commented Jan 1, 2022

astraetech commented Sep 2, 2022

michaelquinn32 commented Sep 19, 2022

jxu commented Oct 23, 2023

hms1 commented Dec 20, 2021 •

edited

Loading

hms1 commented Dec 21, 2021 •

edited

Loading

elinw commented Dec 24, 2021 •

edited

Loading