Labels: bug (Something isn't working)
Description
When a categorical column contains unused categories, the scatter plot colors and legend are wrong. Calling df["cat"].cat.remove_unused_categories() on the column resolves the issue, and I wonder whether jscatter should call it as well?
See the code to reproduce:
import pandas as pd
import jscatter
import numpy as np
def keep_largest_categories(column: pd.Series, threshold: int) -> pd.Series:
    """Keep only the categories whose value count is at least `threshold`."""
    # 0. Make a copy of the column
    column = column.copy()
    # 1. Calculate the value counts for the column
    category_counts = column.value_counts()
    # 2. Identify categories with counts below the threshold
    low_count_categories = category_counts[category_counts < threshold].index
    # 3. Create a boolean mask for rows whose category is in low_count_categories
    mask = column.isin(low_count_categories)
    # 4. Use the mask to set those values to NaN (the categories themselves stay declared)
    column[mask] = None
    return column
n = 50
categories = [f"cat_{i}" for i in range(300)]
categories = pd.Categorical(categories)
df = pd.DataFrame({
    "x": np.random.rand(n),
    "y": np.random.rand(n),
    "cat": np.random.choice(categories, size=n),
})
df["cat"] = df["cat"].astype("category")
# "cat" now has one category per distinct sampled value (drawn from 300 candidates)
# now we only keep the categories that occur at least twice; the rest become NaN
df["cat"] = keep_largest_categories(df["cat"], 2)
# if you don't remove the unused categories, the colors and legend will be wrong
# df["cat"] = df["cat"].cat.remove_unused_categories()
scatter = jscatter.Scatter(
    data=df,
    x="x",
    y="y",
    color_by="cat",
    size=10,
    legend=True,
    tooltip=True,
    tooltip_properties=["cat"],
    tooltip_size="large",
    height=200,
    width=500,
    legend_size="large",
)
scatter.show()
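The leftover categories can also be seen without jscatter. Below is a minimal pandas-only sketch with a small made-up column (the values "a", "b", "c" are illustrative, not taken from the data above). My guess is that jscatter builds its color map and legend from the declared categories rather than from the values actually present, but that is an assumption I have not checked against its source.

```python
import pandas as pd

# Small stand-in for the "cat" column (illustrative values only)
column = pd.Series(["a", "b", "a", "c", "a"], dtype="category")

# Mask out the rare values, as keep_largest_categories does
column[column.isin(["b", "c"])] = None

# The declared categories still include the now-unused "b" and "c"
declared = list(column.cat.categories)

# After dropping unused categories, only "a" remains
used = list(column.cat.remove_unused_categories().cat.categories)

print(declared)  # ['a', 'b', 'c']
print(used)      # ['a']
```

So setting values to NaN leaves the category list untouched, which would explain a legend with entries that no point uses.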
Without df["cat"].cat.remove_unused_categories(): (screenshot)
With df["cat"].cat.remove_unused_categories(): (screenshot)
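In the meantime, a caller-side workaround is to drop unused categories from every categorical column before building the scatter. A rough sketch follows; drop_unused_categories is a hypothetical helper name of mine, not part of jscatter or pandas, and the small DataFrame is made up for illustration.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose categorical columns declare only categories that occur.

    Hypothetical helper for illustration; not part of jscatter or pandas.
    """
    out = df.copy()
    for name in out.select_dtypes(include="category").columns:
        out[name] = out[name].cat.remove_unused_categories()
    return out


# Made-up example: "b" and "c" are declared but never used
df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3],
    "cat": pd.Categorical(["a", None, "a"], categories=["a", "b", "c"]),
})

cleaned = drop_unused_categories(df)
print(list(cleaned["cat"].cat.categories))
```

The cleaned frame can then be passed as data= to jscatter.Scatter in place of the original. If jscatter did this internally, it would presumably want to work on a copy in the same way, so that the user's DataFrame is left unchanged.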