Description
Feature description
I analysed a tree in PyROOT with RDataFrame. Statistics for most variables were calculated correctly, but for one column I got NaNs. On closer inspection, it turned out that one event out of 8 million contained a NaN in that column. I tried both Stats and Mean, and each returned NaN. NaN is the mathematically correct answer for that data, but it is not practical.
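To illustrate why a single bad event is enough, here is a minimal pure-Python sketch (no ROOT involved; the `values` list and the `plain_mean` helper are made up for illustration). It only shows the arithmetic: any sum that includes a NaN is itself NaN, so the mean is NaN regardless of how many valid entries there are.

```python
import math

# Hypothetical data: a few ordinary values and one NaN entry.
values = [1.0, 2.0, 3.0, 4.0, float("nan")]

def plain_mean(xs):
    """Arithmetic mean with no special treatment of NaN."""
    xs = list(xs)
    return sum(xs) / len(xs)

# NaN propagates through the sum, so the result is NaN.
raw_mean = plain_mean(values)

# Dropping the NaN entry first gives the mean of the remaining values.
clean_values = [v for v in values if not math.isnan(v)]
clean_mean = plain_mean(clean_values)

print(raw_mean, clean_mean)
```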
The possible solutions could be:
- RDataFrame could optionally filter irregular numbers (like NaN or inf).
- The statistics functions of RDataFrame could optionally support that. Tree Drawer simply ignored that value and calculated the mean (already noted here). One could provide an option to Stats() or Mean() to ignore NaNs.
- Documentation could be added to help users handle this. The RDataFrame reference already has a similar section, "Working with missing values in the dataset". NaNs are common in statistical analysis, so they would be worth explaining too.
The approach at the RDataFrame level looks cleaner; at the same time, adding another method such as "FilterNaNs" to RDataFrame looks like feature creep. In Python this could be solved with a keyword argument (an option); for C++ it is less clear to me (five overloaded functions do not look good either). Improved guidance in the documentation is always possible.
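As a rough picture of what the keyword-argument idea could look like from the Python side, here is a pure-Python sketch. The function `mean_of` and the keyword `skip_nonfinite` are hypothetical names chosen for this example; they are not part of ROOT or RDataFrame, and the sketch says nothing about how such an option would be implemented there.

```python
import math

def mean_of(values, skip_nonfinite=False):
    """Arithmetic mean of `values`.

    With skip_nonfinite=True, NaN and +/-inf entries are dropped first.
    The default keeps the strict behaviour, where they propagate.
    (Hypothetical helper for illustration, not a ROOT API.)
    """
    if skip_nonfinite:
        values = [v for v in values if math.isfinite(v)]
    else:
        values = list(values)
    if not values:
        # Nothing left to average: report NaN rather than raising.
        return math.nan
    return sum(values) / len(values)

data = [1.0, 3.0, math.nan, math.inf]
strict = mean_of(data)                        # NaN propagates
lenient = mean_of(data, skip_nonfinite=True)  # mean of 1.0 and 3.0
```

Keeping the strict behaviour as the default means existing results would not change silently; ignoring non-finite values stays an explicit choice by the user.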
Alternatives considered
I could solve it in PyROOT with
ROOT.gInterpreter.Declare("auto myisnan = static_cast<bool (*)(float)>(&isnan);")
dr_cmx_no_cut = df.Filter("!myisnan(displacement.dr_proj_cmx_cm)")\
    .Stats("displacement.dr_proj_cmx_cm")
However, that is (a) not obvious, especially for Python users, and (b) redundant, since the same column has to be named both in the filter and in the call to the statistics function (maybe there is a better syntax for that).
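One way to avoid naming the column twice on the user side is a small wrapper. The sketch below is only an illustration: `stats_ignoring_nan` is a hypothetical helper, and it assumes that `myisnan` has already been declared to the interpreter as in the snippet above and that `df` provides `Filter` and `Stats` as RDataFrame does.

```python
def stats_ignoring_nan(df, column, isnan_name="myisnan"):
    """Book Stats on `column` after filtering out entries where it is NaN.

    Hypothetical helper: `isnan_name` must be the name of a function
    already known to the interpreter (here the `myisnan` declared above).
    """
    # Build the filter expression from the column name, so the caller
    # writes the column only once.
    filter_expr = f"!{isnan_name}({column})"
    return df.Filter(filter_expr).Stats(column)
```

With this, the call from the example would become `stats_ignoring_nan(df, "displacement.dr_proj_cmx_cm")`.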
Additional context
Not sure whether it is connected, but there was a similar topic on the ROOT Forum.