Open
Description
Describe the issue:
dask_ml.compose.ColumnTransformer does not work with objects of type dask_expr._collection.DataFrame or dask.dataframe.core.DataFrame.
Minimal Complete Verifiable Example:
import numpy as np
import pandas as pd
from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import StandardScaler
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
# Create a sample dataframe
df = pd.DataFrame({"A": np.random.rand(1000)})
ddf = dd.from_pandas(df, npartitions=2)
ColumnTransformer, specifying the columns using strings:
scaler = ColumnTransformer(
    transformers=[("StandardScaler", StandardScaler(), ["A"])],
    remainder="passthrough",
)
scaler.fit_transform(ddf)  # or scaler.fit_transform(ddf.to_legacy_dataframe())
Out:
ValueError: Specifying the columns using strings is only supported for dataframes.
ColumnTransformer, specifying the columns using integers:
scaler = ColumnTransformer(
    transformers=[("StandardScaler", StandardScaler(), [0])],
    remainder="passthrough",
)
scaler.fit_transform(ddf)  # or scaler.fit_transform(ddf.to_legacy_dataframe())
Out:
AttributeError: 'DataFrame' object has no attribute 'take'
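The second message suggests that the integer path ends up calling a pandas-style take method on the input. That is an inference from the error text alone, not something checked against the scikit-learn or Dask ML source. A minimal sketch under that assumption, using only pandas and a hypothetical stand-in class (NoTakeFrame is not a real Dask type):

```python
import pandas as pd

pdf = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# pandas DataFrames support positional column selection through take().
first_col = pdf.take([0], axis=1)
print(list(first_col.columns))


class NoTakeFrame:
    """Hypothetical stand-in for a frame-like object without take()."""

    columns = ["A", "B"]


# Calling take() on an object that lacks it raises the same kind of
# AttributeError as the one reported above.
try:
    NoTakeFrame().take([0], axis=1)
except AttributeError as exc:
    message = str(exc)
    print(message)
```

On the Dask side, hasattr(ddf, "take") would be the analogous quick check.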
Anything else we need to know?:
With pandas DataFrames, i.e. scaler.fit_transform(ddf.compute()), everything works as expected.
This could be related to #962 and #887. If it is indeed the same issue and there are no plans to fix it in the foreseeable future, would it be better to remove ColumnTransformer from the Dask ML API?
Environment:
- Dask version: 2024.4.1
- Dask ML version: 2024.4.1
- Scikit-learn version: 1.4.0
- Python version: 3.10.13
- Operating System: macOS
- Install method (conda, pip, source): pip
ruetzmax