Closed
Description
Describe the bug
When switching from pandas to cudf, I found an inconsistency in the merge() function.
Running the same code in pandas and in cudf gives different results.
Steps/Code to reproduce bug
Demo code:
import pandas as pd
import cudf


def demonstrate_merge_issue():
    # Print the pandas and cudf versions
    print("=== Version ===")
    print(f"pandas Version: {pd.__version__}")
    print(f"cudf Version: {cudf.__version__}\n")

    # Create demo data
    data = {
        'time': ['2023-01-01', '2023-01-01', '2023-01-02'],
        'symbol': ['A', 'B', 'C'],
        'close': [100, 200, 300],
        'rank': [1, 2, 1]
    }

    # Use pandas to create DataFrame
    pandas_long_df = pd.DataFrame(data)
    pandas_short_df = pd.DataFrame({
        'time': ['2023-01-01', '2023-01-02'],
        'other_column': ['X', 'Y']
    })

    # pandas Operation
    print("=== pandas Operation ===")
    pandas_long_select_num = pandas_long_df.groupby('time')['symbol'].size().to_frame()
    pandas_long_select_num = pandas_long_select_num.rename(columns={'symbol': 'long_num'}).reset_index()
    print("pandas long_select_num columns:", pandas_long_select_num.columns)
    pandas_short_df = pandas_short_df.merge(pandas_long_select_num, on='time', how='left')
    print("pandas short_df columns after merge:", pandas_short_df.columns)

    # Use cudf to create DataFrame
    # (pandas_short_df has already been merged at this point)
    cudf_long_df = cudf.DataFrame.from_pandas(pandas_long_df)
    cudf_short_df = cudf.DataFrame.from_pandas(pandas_short_df)

    # cudf Operation
    print("\n=== cudf Operation ===")
    cudf_long_select_num = cudf_long_df.groupby('time')['symbol'].size().to_frame()
    cudf_long_select_num = cudf_long_select_num.rename(columns={'symbol': 'long_num'}).reset_index()
    print("cudf long_select_num columns:", cudf_long_select_num.columns)
    cudf_short_df = cudf_short_df.merge(cudf_long_select_num, on='time', how='left')
    print("cudf short_df columns after merge:", cudf_short_df.columns)


if __name__ == "__main__":
    demonstrate_merge_issue()
Result:
=== Version ===
pandas Version: 2.2.3
cudf Version: 24.12.00
=== pandas Operation ===
pandas long_select_num columns: Index(['time', 'long_num'], dtype='object')
pandas short_df columns after merge: Index(['time', 'other_column', 'long_num'], dtype='object')
=== cudf Operation ===
cudf long_select_num columns: Index(['time', 0], dtype='object')
cudf short_df columns after merge: Index(['time', 'other_column', 'long_num', 0], dtype='object')
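Reading the output above, the first difference already appears before merge(): the cudf frame has a column named 0 where pandas has long_num. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the Series returned by groupby(...).size() carries the name 'symbol' in pandas but no name in cudf, so to_frame() yields a column 0 and rename(columns={'symbol': ...}) silently matches nothing. The demo also builds cudf_short_df from the already-merged pandas_short_df, which would explain why long_num is present next to 0. The sketch below uses pandas only and simulates the unnamed case by clearing the name by hand; it does not run cudf.

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "symbol": ["A", "B", "C"],
})

# In pandas, the size() result keeps the name of the selected column.
sizes = df.groupby("time")["symbol"].size()
named_cols = list(
    sizes.to_frame().rename(columns={"symbol": "long_num"}).reset_index().columns
)

# Simulated case (assumption: this mirrors the Series cudf returned in the report).
unnamed = sizes.copy()
unnamed.name = None
# to_frame() on an unnamed Series produces a column labelled 0, and
# rename() ignores the missing 'symbol' key by default, so 0 survives.
unnamed_cols = list(
    unnamed.to_frame().rename(columns={"symbol": "long_num"}).reset_index().columns
)

print(sizes.name)
print(named_cols)
print(unnamed_cols)
```

If this reading is right, the difference lies in the naming of the groupby result rather than in merge() itself.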
Expected behavior
I think running the same operations should give the same result in both packages, so that more developers can easily switch to cudf to speed up computation.
I'm not familiar with cudf's internals, so I don't know whether this is expected behavior. Thanks!
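Until the two libraries agree, one way to avoid depending on the name of the intermediate Series is to name the column explicitly with to_frame(name=...). The sketch below is a minimal example run against pandas only; the helper count_per_group is hypothetical, and the assumption that the same calls behave identically on a cudf DataFrame has not been verified on cudf 24.12.

```python
import pandas as pd


def count_per_group(df, key="time", value="symbol", out="long_num"):
    """Count rows per group and return a frame with columns [key, out].

    Hypothetical helper: naming the column in to_frame() means the result
    does not depend on what name the size() Series happens to carry.
    """
    return df.groupby(key)[value].size().to_frame(name=out).reset_index()


long_df = pd.DataFrame({
    "time": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "symbol": ["A", "B", "C"],
})
short_df = pd.DataFrame({
    "time": ["2023-01-01", "2023-01-02"],
    "other_column": ["X", "Y"],
})

counts = count_per_group(long_df)
merged = short_df.merge(counts, on="time", how="left")
print(list(merged.columns))
print(merged["long_num"].tolist())
```

When comparing the two libraries, it also helps to convert the unmerged short frame to cudf, so that each merge starts from the same input columns.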
Environment overview (please complete the following information)
- Environment location: Python 3.12 in WSL
- Method of cuDF install: from pip; pypi.nvidia.com
Environment details
***git***
Not inside a git repository
***OS Information***
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
Linux Night 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
***GPU Information***
Tue Dec 24 00:07:15 2024
NVIDIA-SMI 565.77.01    Driver Version: 566.36    CUDA Version: 12.7
GPU 0: NVIDIA GeForce RTX 4070 ...
  Persistence-M: On    Bus-Id: 00000000:01:00.0    Disp.A: On    Volatile Uncorr. ECC: N/A
  Fan: 0%    Temp: 39C    Perf: P8    Pwr:Usage/Cap: 10W / 220W
  Memory-Usage: 2814MiB / 12282MiB    GPU-Util: 4%    Compute M.: Default    MIG M.: N/A
Processes:
  GPU: 0    GI ID: N/A    CI ID: N/A    PID: 1    Type: C    Process name: /python3.10    GPU Memory Usage: N/A
***CPU***
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 7800X3D 8-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 2
BogoMIPS: 8399.81
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Virtualization: AMD-V
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 256 KiB (8 instances)
L1i cache: 256 KiB (8 instances)
L2 cache: 8 MiB (8 instances)
L3 cache: 96 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
***CMake***
/usr/bin/cmake
cmake version 3.28.3
CMake suite maintained and supported by Kitware (kitware.com/cmake).
***g++***
/usr/bin/g++
g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
***nvcc***
***Python***
***Environment Variables***
PATH : /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/mnt/c/Program Files/NVIDIA/CUDNN/v9.6/bin:/mnt/c/Program: Files/NVIDIA GPU : Computing Toolkit/CUDA/v12.6/bin:/mnt/c/Program: Files/NVIDIA GPU : Computing Toolkit/CUDA/v12.6/libnvvp:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program: Files/dotnet/:/mnt/d/Program Files/Bandizip/:/mnt/d/Program : Files/nodejs/:/mnt/c/Program Files : (x86)/Windows Kits/10/Windows : Performance Toolkit/:/mnt/d/sqlite-tools-win-x64-3470100:/mnt/c/Program: Files/Git/cmd:/Docker/host/bin:/mnt/c/Program Files/NVIDIA : Corporation/NVIDIA app/NvDLISR:/mnt/c/Program : Files (x86)/NVIDIA : Corporation/PhysX/Common:/mnt/c/Program Files/NVIDIA : Corporation/Nsight Compute : 2024.3.2/:/mnt/c/Users/Administrator/AppData/Local/Programs/Python/Python311/Scripts/:/mnt/c/Users/Administrator/AppData/Local/Programs/Python/Python311/:/mnt/c/Users/Administrator/AppData/Local/Microsoft/WindowsApps:/mnt/c/Users/Administrator/AppData/Local/JetBrains/Toolbox/scripts:/mnt/c/Users/Administrator/AppData/Local/GitHubDesktop/bin:/mnt/c/Users/Administrator/AppData/Roaming/npm:/snap/bin
LD_LIBRARY_PATH :
NUMBAPRO_NVVM :
NUMBAPRO_LIBDEVICE :
CONDA_PREFIX :
PYTHON_PATH :
conda not found
***pip packages***
/usr/bin/pip
Package             Version
------------------- -------------
attrs               23.2.0
Automat             22.10.0
Babel               2.10.3
bcrypt              3.2.2
blinker             1.7.0
certifi             2023.11.17
chardet             5.2.0
click               8.1.6
cloud-init          24.4
colorama            0.4.6
command-not-found   0.3
configobj           5.0.8
constantly          23.10.4
cryptography        41.0.7
dbus-python         1.3.2
distro              1.9.0
distro-info         1.7+build1
httplib2            0.20.4
hyperlink           21.0.0
idna                3.6
incremental         22.10.0
Jinja2              3.1.2
jsonpatch           1.32
jsonpointer         2.0
jsonschema          4.10.3
launchpadlib        1.11.0
lazr.restfulclient  0.14.6
lazr.uri            1.0.6
markdown-it-py      3.0.0
MarkupSafe          2.1.5
mdurl               0.1.2
netifaces           0.11.0
oauthlib            3.2.2
pip                 24.0
pyasn1              0.4.8
pyasn1-modules      0.2.8
pycurl              7.45.3
Pygments            2.17.2
PyGObject           3.48.2
PyHamcrest          2.1.0
PyJWT               2.7.0
pyOpenSSL           23.2.0
pyparsing           3.1.1
pyrsistent          0.20.0
pyserial            3.5
python-apt          2.7.7+ubuntu3
pytz                2024.1
PyYAML              6.0.1
requests            2.31.0
rich                13.7.1
service-identity    24.1.0
setuptools          68.1.2
six                 1.16.0
systemd-python      235
Twisted             24.3.0
typing_extensions   4.10.0
ubuntu-pro-client   8001
unattended-upgrades 0.1
urllib3             2.0.7
wadllib             1.3.6
wheel               0.42.0
zope.interface      6.1