
[BUG] Slow Performance of cuDF Pandas on L4 #17140

Open
ericphan-nv opened this issue Oct 22, 2024 · 2 comments · May be fixed by #17222
@ericphan-nv

Describe the bug
Low performance with cuDF Pandas and XGBoost using this dataset and notebook.

Performance is slower than the CPU equivalents. Tested on Colab with an L4 GPU and on local WSL with an RTX 4090.

Steps/Code to reproduce bug
Open the notebook and run through the cells. Observe the slow performance compared to CPU pandas and XGBoost.

Expected behavior
With cuDF Pandas and XGBoost, performance is expected to be significantly faster than on CPU.

Environment overview (please complete the following information)

  • Environment location: Cloud (Colab L4 instance)
  • Method of cuDF install: Native RAPIDS installed on Colab

Additional context

Timings, Colab L4 vs. Colab CPU:

                      Colab L4    Colab CPU
Loading time          46 s        23 s
Preprocessing time    476 s       47 s
Training time         240 s       252 s

@ericphan-nv ericphan-nv added the bug Something isn't working label Oct 22, 2024
@bdice (Contributor) commented Oct 22, 2024

It seems like the slowdowns are due to from_pandas spending lots of time in _has_any_nan.

def _has_any_nan(arbitrary: pd.Series | np.ndarray) -> bool:
    """Check if an object dtype Series or array contains NaN."""
    return any(
        isinstance(x, (float, np.floating)) and np.isnan(x)
        for x in np.asarray(arbitrary)
    )

It seems like maybe this is happening in the cells that call replace?

# Apply the consolidation
df['Company'] = df['Company'][df['Company'].isin(name_mapping.keys())].replace(name_mapping).astype('category')

takes ~75 seconds in DataFrame.__getitem__. I think this is related to the _has_any_nan call?
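As a possible user-side workaround (a sketch with toy data, not tested against this notebook), the masked `__getitem__` can be avoided with `Series.where`, which yields the same NaN-for-unmapped-rows result that index alignment produces, without the expensive indexing path:

```python
import pandas as pd

# Toy stand-ins for the notebook's DataFrame and mapping.
df = pd.DataFrame({'Company': ['IBM Corp.', 'ibm', 'Acme Inc.']})
name_mapping = {'IBM Corp.': 'IBM', 'ibm': 'IBM'}

# Equivalent to df['Company'][mask].replace(...) assigned back: rows whose
# company is not in the mapping become NaN either way, but .where skips the
# masked __getitem__ entirely.
mask = df['Company'].isin(name_mapping)
df['Company'] = df['Company'].where(mask).replace(name_mapping).astype('category')
```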

I am not able to dig any further on this at the moment but perhaps @galipremsagar or @mroeschke would have insight.
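For anyone wanting to confirm the attribution themselves, a generic profiling sketch (the `profile_top` helper is hypothetical, not something from this thread) is to run the slow cell under `cProfile` and check whether `_has_any_nan` dominates the cumulative time:

```python
import cProfile
import io
import pstats

def profile_top(fn, *args, limit=10, **kwargs):
    """Run fn under cProfile and print the `limit` costliest calls by cumulative time."""
    prof = cProfile.Profile()
    result = prof.runcall(fn, *args, **kwargs)
    stream = io.StringIO()
    pstats.Stats(prof, stream=stream).sort_stats("cumulative").print_stats(limit)
    print(stream.getvalue())
    return result

# e.g. profile_top(lambda: df['Company'][df['Company'].isin(name_mapping)])
```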

@galipremsagar galipremsagar self-assigned this Oct 23, 2024
@galipremsagar (Contributor) commented

I found the bug, working on a fix.

@galipremsagar galipremsagar linked a pull request Oct 31, 2024 that will close this issue