add user-defined handling of NaN and inf to RDataFrame statistics

### Feature description

I analysed a tree in PyROOT with RDataFrame. Statistics for most columns were calculated correctly, but for one column the results were NaN. On closer inspection, it turned out that one event out of 8 million contained a NaN in that column. I tried *Stats* and *Mean*, and both returned NaN. NaN is the mathematically correct answer for that data, but it is not useful in practice.
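To illustrate why a single bad entry is enough, here is a minimal pure-Python sketch (no ROOT involved). The helper names `mean` and `mean_finite` and the sample values are made up for illustration; they are not part of any ROOT API.

```python
import math


def mean(values):
    """Plain arithmetic mean; a single NaN in the input makes the result NaN."""
    return sum(values) / len(values)


def mean_finite(values):
    """Mean over finite entries only (NaN and +/-inf are skipped)."""
    finite = [v for v in values if math.isfinite(v)]
    # With no finite entries left there is nothing to average, so return NaN.
    return sum(finite) / len(finite) if finite else math.nan


# Hypothetical column: three regular values and one NaN.
column = [1.0, 2.0, float("nan"), 3.0]

plain = mean(column)           # NaN: the bad entry propagates through the sum
cleaned = mean_finite(column)  # 2.0: the bad entry is ignored
print(plain, cleaned)
```

Both results are defensible; the request is about letting the user choose between them.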

Possible solutions:

1. RDataFrame could optionally filter out non-finite values (NaN or inf).
2. The statistics functions of RDataFrame could optionally support that. The tree drawer simply ignored that value and calculated the mean (already noted [here](https://root-forum.cern.ch/t/new-defined-column-in-rdataframe-return-nan-value/42403/6)). One could add an option to Stats() or Mean() to ignore NaNs.
3. Documentation could be added to show users how to handle this. The RDataFrame reference already has a similar section, [Working with missing values in the dataset](https://root.cern/doc/master/classROOT_1_1RDataFrame.html#missing-values). NaNs are common in statistical analysis, so they would be worth explaining too.

The approach at the RDataFrame level looks cleaner; at the same time, a separate RDataFrame method like "FilterNaNs" looks like feature creep. In Python this could be solved with a keyword argument (an option); for C++ it is less clear to me (five overloaded functions would not look good either). Improving the documentation is always possible.

### Alternatives considered

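I was able to solve it in PyROOT with:

```python
import ROOT
# Select the float overload of isnan so that the Filter expression can call it.
ROOT.gInterpreter.Declare("auto myisnan = static_cast<bool (*)(float)>(&isnan);")
# df is the RDataFrame of the tree; drop NaN entries, then compute the statistics.
dr_cmx_no_cut = df.Filter("!myisnan(displacement.dr_proj_cmx_cm)")\
                  .Stats("displacement.dr_proj_cmx_cm")
```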

However, that is a) not obvious, especially for Python users, and b) redundant, since the same column has to be filtered with a dedicated function before its statistics are computed (maybe there is a better syntax for that).
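One possibly shorter variant is sketched below. It wraps the filter in a small Python helper with a keyword argument. The helper names `finite_filter_expr` and `finite_stats` and the `skip_nonfinite` option are hypothetical, not existing RDataFrame API. The sketch also assumes that `std::isfinite` can be called directly inside a jitted `Filter` expression, which I have not verified for every ROOT version or column type.

```python
def finite_filter_expr(column):
    """Build a Filter expression that keeps only finite values of `column`.

    Assumption: std::isfinite is usable from RDataFrame's jitted expressions.
    """
    return f"std::isfinite({column})"


def finite_stats(df, column, skip_nonfinite=True):
    """Hypothetical helper: Stats() on `column`, optionally skipping NaN and inf.

    `df` is expected to behave like an RDataFrame node, i.e. to provide
    Filter(expression, name) and Stats(column).
    """
    if skip_nonfinite:
        # The second argument only names the filter (it shows up in cut-flow reports).
        df = df.Filter(finite_filter_expr(column), f"finite {column}")
    return df.Stats(column)


# Intended usage in PyROOT (not executed here; tree and file names are placeholders):
#   import ROOT
#   df = ROOT.RDataFrame("tree", "file.root")
#   stats = finite_stats(df, "displacement.dr_proj_cmx_cm").GetValue()
```

This still filters and then computes statistics on the same column, but it avoids the `gInterpreter.Declare` step and keeps the choice in a single keyword argument.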

### Additional context

Not sure whether it is connected, but there was a similar [topic](https://root-forum.cern.ch/t/rdataframe-with-branch-contain-nan/41653) on the ROOT Forum.

add user-defined handling of NaN and inf to RDataFrame statistics #19879
