Description
The dask-ml ColumnTransformer stacks the outputs of the different transformers row-wise instead of joining them column-wise. The following code (essentially #365) gives an undesirable output:
import pandas as pd
import dask.dataframe as dd
import dask_ml.compose
import dask_ml.preprocessing
df = pd.DataFrame({"A": pd.Categorical(["a", "a", "b", "a"]), "B": [1.0, 2, 4, 5]})
ddf = dd.from_pandas(df, npartitions=2).reset_index(drop=True)
ct = dask_ml.compose.ColumnTransformer([
    ("A", dask_ml.preprocessing.OneHotEncoder(dtype='uint8'), ['A']),  # Example categorical feature
    ("B", dask_ml.preprocessing.RobustScaler(), ['B']),  # Numeric features
])
ct.fit_transform(ddf).compute()
The output I get is:
   A_a  A_b         B
0  1.0    0       NaN
1  1.0    0       NaN
0    0  1.0       NaN
1  1.0    0       NaN
0  NaN  NaN -1.000000
1  NaN  NaN -0.666667
0  NaN  NaN  0.000000
1  NaN  NaN  0.333333
The output should be like that of #365:
   A_a  A_b         B
0  1.0  0.0 -1.000000
1  1.0  0.0 -0.666667
0  0.0  1.0  0.000000
1  1.0  0.0  0.333333
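To make the difference between the two layouts concrete, here is a small pandas-only sketch. It does not use dask-ml and is not meant to describe what ColumnTransformer does internally; the frames `onehot` and `scaled` are hypothetical stand-ins that hold the values from the outputs above, with the same repeating 0, 1 index. Concatenating them along axis 0 produces the NaN-padded 8-row shape I am seeing, while concatenating along axis 1 (after resetting the index) produces the 4-row shape I expect.

```python
import pandas as pd

# Hypothetical stand-ins for the two transformer outputs (values copied from the report).
onehot = pd.DataFrame(
    {"A_a": [1.0, 1.0, 0.0, 1.0], "A_b": [0.0, 0.0, 1.0, 0.0]},
    index=[0, 1, 0, 1],
)
scaled = pd.DataFrame(
    {"B": [-1.000000, -0.666667, 0.000000, 0.333333]},
    index=[0, 1, 0, 1],
)

# Row-wise concatenation: each frame only fills its own columns, the rest is NaN.
stacked = pd.concat([onehot, scaled], axis=0)

# Column-wise concatenation on a fresh positional index: one row per input row.
side_by_side = pd.concat(
    [onehot.reset_index(drop=True), scaled.reset_index(drop=True)],
    axis=1,
)

print(stacked)
print(side_by_side)
```

A quick sanity check along these lines on the real result would be to compare `len(ct.fit_transform(ddf).compute())` with `len(df)`; in the failing case the first is twice the second.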
Environment:
- dask-ml version: 2024.4.4
- dask version: 2024.8.1
- Python version: 3.10.14
- Operating System: Ubuntu 23.04
- Install method (conda, pip, source): pip