
[BUG] Slow Performance of cuDF Pandas on L4 #17140

Open
ericphan-nv opened this issue Oct 22, 2024 · 2 comments · May be fixed by #17222
@ericphan-nv

Describe the bug
Low performance with cuDF Pandas and XGBoost using this dataset and notebook.

Performance is slower than the CPU equivalents. Tested on Colab with an L4 GPU and on local WSL with an RTX 4090.

Steps/Code to reproduce bug
Open the notebook and run through the cells. Observe the slow performance compared to CPU pandas and XGBoost.

Expected behavior
With cuDF Pandas and XGBoost, performance is expected to be significantly faster than on CPU.

Environment overview (please complete the following information)

  • Environment location: Cloud (Colab L4 instance)
  • Method of cuDF install: Native RAPIDS installed on Colab

Additional context

Timings, Colab L4 vs. Colab CPU:

                      Colab L4    Colab CPU
Loading time          46 s        23 s
Preprocessing time    476 s       47 s
Training time         240 s       252 s

@ericphan-nv ericphan-nv added the bug Something isn't working label Oct 22, 2024
@bdice (Contributor) commented Oct 22, 2024

It seems like the slowdowns are due to from_pandas spending lots of time in _has_any_nan.

def _has_any_nan(arbitrary: pd.Series | np.ndarray) -> bool:
    """Check if an object dtype Series or array contains NaN."""
    return any(
        isinstance(x, (float, np.floating)) and np.isnan(x)
        for x in np.asarray(arbitrary)
    )

It seems like maybe this is happening in the cells that call replace?

# Apply the consolidation
df['Company'] = df['Company'][df['Company'].isin(name_mapping.keys())].replace(name_mapping).astype('category')

takes ~75 seconds in DataFrame.__getitem__. I think this is related to the _has_any_nan call?
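As a possible user-side workaround (a sketch with toy data, not tested against this notebook), the masked `__getitem__` can be avoided with `Series.where`, which yields the same NaN-for-unmapped-rows result that index alignment produces, without the expensive indexing path:

```python
import pandas as pd

# Toy stand-ins for the notebook's DataFrame and mapping.
df = pd.DataFrame({'Company': ['IBM Corp.', 'ibm', 'Acme Inc.']})
name_mapping = {'IBM Corp.': 'IBM', 'ibm': 'IBM'}

# Equivalent to df['Company'][mask].replace(...) assigned back: rows whose
# company is not in the mapping become NaN either way, but .where skips the
# masked __getitem__ entirely.
mask = df['Company'].isin(name_mapping)
df['Company'] = df['Company'].where(mask).replace(name_mapping).astype('category')
```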

I am not able to dig any further on this at the moment but perhaps @galipremsagar or @mroeschke would have insight.
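For anyone wanting to confirm the attribution themselves, a generic profiling sketch (the `profile_top` helper is hypothetical, not something from this thread) is to run the slow cell under `cProfile` and check whether `_has_any_nan` dominates the cumulative time:

```python
import cProfile
import io
import pstats

def profile_top(fn, *args, limit=10, **kwargs):
    """Run fn under cProfile and print the `limit` costliest calls by cumulative time."""
    prof = cProfile.Profile()
    result = prof.runcall(fn, *args, **kwargs)
    stream = io.StringIO()
    pstats.Stats(prof, stream=stream).sort_stats("cumulative").print_stats(limit)
    print(stream.getvalue())
    return result

# e.g. profile_top(lambda: df['Company'][df['Company'].isin(name_mapping)])
```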

@galipremsagar galipremsagar self-assigned this Oct 23, 2024
@galipremsagar (Contributor) commented

I found the bug, working on a fix.

@galipremsagar galipremsagar linked a pull request Oct 31, 2024 that will close this issue