[BUG] Inconsistence issue between pandas and cudf #17654

Open
BetterAndBetterII opened this issue Dec 23, 2024 · 2 comments
Labels
0 - Waiting on Author Waiting for author to respond to review

Comments

@BetterAndBetterII

Describe the bug

After switching from pandas to cudf, I found an inconsistency in the merge() function.

Running the same code with pandas and with cudf gives different results.

Steps/Code to reproduce bug

Demo code:

import pandas as pd
import cudf


def demonstrate_merge_issue():
    # Print the pandas and cudf versions
    print("=== Version ===")
    print(f"pandas Version: {pd.__version__}")
    print(f"cudf Version: {cudf.__version__}\n")

    # Create demo data
    data = {
        'time': ['2023-01-01', '2023-01-01', '2023-01-02'],
        'symbol': ['A', 'B', 'C'],
        'close': [100, 200, 300],
        'rank': [1, 2, 1]
    }

    # Use pandas to create DataFrame
    pandas_long_df = pd.DataFrame(data)
    pandas_short_df = pd.DataFrame({
        'time': ['2023-01-01', '2023-01-02'],
        'other_column': ['X', 'Y']
    })

    # pandas Operation
    print("=== pandas Operation ===")
    pandas_long_select_num = pandas_long_df.groupby('time')['symbol'].size().to_frame()
    pandas_long_select_num = pandas_long_select_num.rename(columns={'symbol': 'long_num'}).reset_index()
    print("pandas long_select_num columns:", pandas_long_select_num.columns)

    pandas_short_df = pandas_short_df.merge(pandas_long_select_num, on='time', how='left')
    print("pandas short_df columns after merge:", pandas_short_df.columns)

    # Use cudf to create DataFrame
    cudf_long_df = cudf.DataFrame.from_pandas(pandas_long_df)
    cudf_short_df = cudf.DataFrame.from_pandas(pandas_short_df)

    # cudf Operation
    print("\n=== cudf Operation ===")
    cudf_long_select_num = cudf_long_df.groupby('time')['symbol'].size().to_frame()
    cudf_long_select_num = cudf_long_select_num.rename(columns={'symbol': 'long_num'}).reset_index()
    print("cudf long_select_num columns:", cudf_long_select_num.columns)

    cudf_short_df = cudf_short_df.merge(cudf_long_select_num, on='time', how='left')
    print("cudf short_df columns after merge:", cudf_short_df.columns)


if __name__ == "__main__":
    demonstrate_merge_issue()

Result:

=== Version ===
pandas Version: 2.2.3
cudf Version: 24.12.00

=== pandas Operation ===
pandas long_select_num columns: Index(['time', 'long_num'], dtype='object')
pandas short_df columns after merge: Index(['time', 'other_column', 'long_num'], dtype='object')

=== cudf Operation ===
cudf long_select_num columns: Index(['time', 0], dtype='object')
cudf short_df columns after merge: Index(['time', 'other_column', 'long_num', 0], dtype='object')
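The stray `0` column above appears because the `rename(columns={'symbol': 'long_num'})` step only works when the size Series keeps the name `symbol`. As a minimal pandas-only sketch (not run against cudf, and the variable names are illustrative), naming the count column explicitly avoids relying on that inherited name:

```python
import pandas as pd

long_df = pd.DataFrame({
    "time": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "symbol": ["A", "B", "C"],
})

# In pandas, size() on a selected column keeps that column's name,
# which is why the rename on 'symbol' in the demo succeeds there.
sizes = long_df.groupby("time")["symbol"].size()
print(sizes.name)

# Naming the count column explicitly does not depend on the Series name.
long_select_num = sizes.reset_index(name="long_num")
print(list(long_select_num.columns))
```

Whether `reset_index(name=...)` behaves identically on a given cudf release is an assumption that should be checked against that release.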

Expected behavior

I think the same operation should give the same result in both packages, so that more developers can easily switch to cudf to speed up computation.

I am not familiar with the cudf internals, so I don't know whether this is expected behavior. Thanks!
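One way to make this kind of parity check explicit is to compare column labels directly. The helper below is a hypothetical sketch (`column_diff` is not part of pandas or cudf); it only assumes that both frames expose an iterable `.columns`, and it is exercised here with pandas frames that mimic the reported output:

```python
import pandas as pd


def column_diff(expected, actual):
    """Return (missing, unexpected) column labels, preserving order."""
    expected_labels = list(expected.columns)
    actual_labels = list(actual.columns)
    missing = [c for c in expected_labels if c not in actual_labels]
    unexpected = [c for c in actual_labels if c not in expected_labels]
    return missing, unexpected


# Stand-ins for the two merge results reported above.
pandas_like = pd.DataFrame(columns=["time", "other_column", "long_num"])
cudf_like = pd.DataFrame(columns=["time", "other_column", "long_num", 0])

missing, unexpected = column_diff(pandas_like, cudf_like)
print("missing:", missing)
print("unexpected:", unexpected)
```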

Environment overview (please complete the following information)

  • Environment location: Python 3.12 in WSL
  • Method of cuDF install: from pip; pypi.nvidia.com

Environment details

Click here to see environment details
 ***git***
 Not inside a git repository
 
 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=24.04
 DISTRIB_CODENAME=noble
 DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"
 PRETTY_NAME="Ubuntu 24.04.1 LTS"
 NAME="Ubuntu"
 VERSION_ID="24.04"
 VERSION="24.04.1 LTS (Noble Numbat)"
 VERSION_CODENAME=noble
 ID=ubuntu
 ID_LIKE=debian
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 UBUNTU_CODENAME=noble
 LOGO=ubuntu-logo
 Linux Night 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
 
 ***GPU Information***
 Tue Dec 24 00:07:15 2024
 +-----------------------------------------------------------------------------------------+
 | NVIDIA-SMI 565.77.01              Driver Version: 566.36         CUDA Version: 12.7     |
 |-----------------------------------------+------------------------+----------------------+
 | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
 |                                         |                        |               MIG M. |
 |=========================================+========================+======================|
 |   0  NVIDIA GeForce RTX 4070 ...    On  |   00000000:01:00.0  On |                  N/A |
 |  0%   39C    P8             10W /  220W |    2814MiB /  12282MiB |      4%      Default |
 |                                         |                        |                  N/A |
 +-----------------------------------------+------------------------+----------------------+
 
 +-----------------------------------------------------------------------------------------+
 | Processes:                                                                              |
 |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
 |        ID   ID                                                               Usage      |
 |=========================================================================================|
 |    0   N/A  N/A         1      C   /python3.10                                 N/A      |
 +-----------------------------------------------------------------------------------------+
 
 ***CPU***
 Architecture:                         x86_64
 CPU op-mode(s):                       32-bit, 64-bit
 Address sizes:                        48 bits physical, 48 bits virtual
 Byte Order:                           Little Endian
 CPU(s):                               16
 On-line CPU(s) list:                  0-15
 Vendor ID:                            AuthenticAMD
 Model name:                           AMD Ryzen 7 7800X3D 8-Core Processor
 CPU family:                           25
 Model:                                97
 Thread(s) per core:                   2
 Core(s) per socket:                   8
 Socket(s):                            1
 Stepping:                             2
 BogoMIPS:                             8399.81
 Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
 Virtualization:                       AMD-V
 Hypervisor vendor:                    Microsoft
 Virtualization type:                  full
 L1d cache:                            256 KiB (8 instances)
 L1i cache:                            256 KiB (8 instances)
 L2 cache:                             8 MiB (8 instances)
 L3 cache:                             96 MiB (1 instance)
 Vulnerability Gather data sampling:   Not affected
 Vulnerability Itlb multihit:          Not affected
 Vulnerability L1tf:                   Not affected
 Vulnerability Mds:                    Not affected
 Vulnerability Meltdown:               Not affected
 Vulnerability Mmio stale data:        Not affected
 Vulnerability Reg file data sampling: Not affected
 Vulnerability Retbleed:               Not affected
 Vulnerability Spec rstack overflow:   Mitigation; safe RET
 Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
 Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
 Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
 Vulnerability Srbds:                  Not affected
 Vulnerability Tsx async abort:        Not affected
 
 ***CMake***
 /usr/bin/cmake
 cmake version 3.28.3
 
 CMake suite maintained and supported by Kitware (kitware.com/cmake).
 
 ***g++***
 /usr/bin/g++
 g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
 Copyright (C) 2023 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
 
 ***nvcc***
 
 ***Python***
 
 ***Environment Variables***
 PATH                            : /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/mnt/c/Program Files/NVIDIA/CUDNN/v9.6/bin:/mnt/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.6/bin:/mnt/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.6/libnvvp:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program Files/dotnet/:/mnt/d/Program Files/Bandizip/:/mnt/d/Program Files/nodejs/:/mnt/c/Program Files (x86)/Windows Kits/10/Windows Performance Toolkit/:/mnt/d/sqlite-tools-win-x64-3470100:/mnt/c/Program Files/Git/cmd:/Docker/host/bin:/mnt/c/Program Files/NVIDIA Corporation/NVIDIA app/NvDLISR:/mnt/c/Program Files (x86)/NVIDIA Corporation/PhysX/Common:/mnt/c/Program Files/NVIDIA Corporation/Nsight Compute 2024.3.2/:/mnt/c/Users/Administrator/AppData/Local/Programs/Python/Python311/Scripts/:/mnt/c/Users/Administrator/AppData/Local/Programs/Python/Python311/:/mnt/c/Users/Administrator/AppData/Local/Microsoft/WindowsApps:/mnt/c/Users/Administrator/AppData/Local/JetBrains/Toolbox/scripts:/mnt/c/Users/Administrator/AppData/Local/GitHubDesktop/bin:/mnt/c/Users/Administrator/AppData/Roaming/npm:/snap/bin
 LD_LIBRARY_PATH                 :
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    :
 PYTHON_PATH                     :
 
 conda not found
 ***pip packages***
 /usr/bin/pip
 Package             Version
 ------------------- -------------
 attrs               23.2.0
 Automat             22.10.0
 Babel               2.10.3
 bcrypt              3.2.2
 blinker             1.7.0
 certifi             2023.11.17
 chardet             5.2.0
 click               8.1.6
 cloud-init          24.4
 colorama            0.4.6
 command-not-found   0.3
 configobj           5.0.8
 constantly          23.10.4
 cryptography        41.0.7
 dbus-python         1.3.2
 distro              1.9.0
 distro-info         1.7+build1
 httplib2            0.20.4
 hyperlink           21.0.0
 idna                3.6
 incremental         22.10.0
 Jinja2              3.1.2
 jsonpatch           1.32
 jsonpointer         2.0
 jsonschema          4.10.3
 launchpadlib        1.11.0
 lazr.restfulclient  0.14.6
 lazr.uri            1.0.6
 markdown-it-py      3.0.0
 MarkupSafe          2.1.5
 mdurl               0.1.2
 netifaces           0.11.0
 oauthlib            3.2.2
 pip                 24.0
 pyasn1              0.4.8
 pyasn1-modules      0.2.8
 pycurl              7.45.3
 Pygments            2.17.2
 PyGObject           3.48.2
 PyHamcrest          2.1.0
 PyJWT               2.7.0
 pyOpenSSL           23.2.0
 pyparsing           3.1.1
 pyrsistent          0.20.0
 pyserial            3.5
 python-apt          2.7.7+ubuntu3
 pytz                2024.1
 PyYAML              6.0.1
 requests            2.31.0
 rich                13.7.1
 service-identity    24.1.0
 setuptools          68.1.2
 six                 1.16.0
 systemd-python      235
 Twisted             24.3.0
 typing_extensions   4.10.0
 ubuntu-pro-client   8001
 unattended-upgrades 0.1
 urllib3             2.0.7
 wadllib             1.3.6
 wheel               0.42.0
 zope.interface      6.1


@BetterAndBetterII BetterAndBetterII added the bug Something isn't working label Dec 23, 2024
@galipremsagar
Contributor

Hi @BetterAndBetterII,

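Thank you for raising this issue. The cudf behavior you hit is already fixed as of cudf-25.02. Separately, there is a bug in your demonstrate_merge_issue function: it assigns the merge result back to pandas_short_df, and cudf_short_df is then created from that already-merged frame. Here is the diff that fixes demonstrate_merge_issue, followed by the output on version 25.02: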

(cudfdev) pgali@viking-prod-206:/raid/pgali/cudf$ git diff
diff --git a/test.py b/test.py
index 3e0115caf8..6f0ace419b 100644
--- a/test.py
+++ b/test.py
@@ -29,8 +29,8 @@ def demonstrate_merge_issue():
     pandas_long_select_num = pandas_long_select_num.rename(columns={'symbol': 'long_num'}).reset_index()
     print("pandas long_select_num columns:", pandas_long_select_num.columns)
 
-    pandas_short_df = pandas_short_df.merge(pandas_long_select_num, on='time', how='left')
-    print("pandas short_df columns after merge:", pandas_short_df.columns)
+    pandas_short_df_merged = pandas_short_df.merge(pandas_long_select_num, on='time', how='left')
+    print("pandas short_df columns after merge:", pandas_short_df_merged.columns)
 
     # Use cudf to create DataFrame
     cudf_long_df = cudf.DataFrame.from_pandas(pandas_long_df)

output:

=== Version ===
pandas Version: 2.2.3
cudf Version: 25.02.00

=== pandas Operation ===
pandas long_select_num columns: Index(['time', 'long_num'], dtype='object')
pandas short_df columns after merge: Index(['time', 'other_column', 'long_num'], dtype='object')

=== cudf Operation ===
cudf long_select_num columns: Index(['time', 'long_num'], dtype='object')
cudf short_df columns after merge: Index(['time', 'other_column', 'long_num'], dtype='object')
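To illustrate why the reassignment in the original script matters, here is a pandas-only sketch (the frame names are illustrative, and no cudf is involved). Merging a frame that has already been merged brings the count column in a second time, either with suffixes or next to an unnamed `0` column:

```python
import pandas as pd

short_df = pd.DataFrame({
    "time": ["2023-01-01", "2023-01-02"],
    "other_column": ["X", "Y"],
})
counts = pd.DataFrame({
    "time": ["2023-01-01", "2023-01-02"],
    "long_num": [2, 1],
})

# First merge: the intended result.
merged_once = short_df.merge(counts, on="time", how="left")
print(list(merged_once.columns))

# Merging the already-merged frame again duplicates long_num with suffixes.
merged_twice = merged_once.merge(counts, on="time", how="left")
print(list(merged_twice.columns))

# If the count column is labelled 0 instead, it lands next to long_num,
# matching the column list reported in the issue.
counts_unnamed = counts.rename(columns={"long_num": 0})
merged_unnamed = merged_once.merge(counts_unnamed, on="time", how="left")
print(list(merged_unnamed.columns))
```

Keeping the merge result in a separate variable, as the diff above does, leaves the input frame untouched for the cudf half of the comparison.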

@galipremsagar galipremsagar added Python Affects Python cuDF API. and removed bug Something isn't working labels Dec 23, 2024
@galipremsagar galipremsagar added 0 - Waiting on Author Waiting for author to respond to review and removed Python Affects Python cuDF API. labels Dec 23, 2024
@BetterAndBetterII
Author

thank you so much!!
