Unused categories lead to wrong colors and legend #189

@hadim

When a categorical column contains unused categories, the scatter plot colors and legend are wrong. Calling df["cat"].cat.remove_unused_categories() on the column before plotting resolves the issue, and I wonder whether jscatter should call it as well.
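To make the underlying pandas behavior concrete, here is a minimal pandas-only sketch, independent of jscatter. The values are made up for illustration and are not taken from the report. It shows that masking out a value leaves its category declared on the column, so the declared categories and the values actually present can disagree.

```python
import pandas as pd

# Made-up values for illustration only.
s = pd.Series(["a", "b", "b", "c"], dtype="category")

# Setting entries to None turns them into NaN but keeps "a" as a declared category.
s[s == "a"] = None

declared = list(s.cat.categories)   # categories declared on the dtype
used = sorted(set(s.dropna()))      # values that actually occur in the data
cleaned = list(s.cat.remove_unused_categories().cat.categories)

print(declared, used, cleaned)
```

Anything that builds a color map or legend from the declared categories rather than from the values present would see the extra entry, which may be what happens here.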

See the code to reproduce:

import pandas as pd
import jscatter
import numpy as np


def keep_largest_categories(column: pd.Series, threshold: int) -> pd.Series:
    """Keep only the categories whose value count is at least `threshold`; set the rest to NaN."""

    # 0. Make a copy of the column
    column = column.copy()

    # 1. Calculate the value counts for the specified column
    category_counts = column.value_counts()

    # 2. Identify categories with counts below the threshold
    low_count_categories = category_counts[category_counts < threshold].index

    # 3. Create a boolean mask to identify rows where the category is in low_count_categories
    mask = column.isin(low_count_categories)

    # 4. Use the mask to set values in the specified column to NaN
    column[mask] = None

    return column


n = 50
categories = [f"cat_{i}" for i in range(300)]
categories = pd.Categorical(categories)

df = pd.DataFrame({
    "x": np.random.rand(n),
    "y": np.random.rand(n),
    "cat": np.random.choice(categories, size=n),
})

df["cat"] = df["cat"].astype("category")

# "cat" holds at most 50 distinct categories, sampled from 300 candidates
# now we only keep the categories that occur at least twice
df["cat"] = keep_largest_categories(df["cat"], 2)

# if you don't remove the unused categories then the color and legend will be wrong
# df["cat"] = df["cat"].cat.remove_unused_categories()

scatter = jscatter.Scatter(
    data=df,
    x="x",
    y="y",
    color_by="cat",
    size=10,
    legend=True,
    tooltip=True,
    tooltip_properties=["cat"],
    tooltip_size="large",
    height=200,
    width=500,
    legend_size="large",
)

scatter.show()

Without df["cat"].cat.remove_unused_categories():

Image

With df["cat"].cat.remove_unused_categories():

Image
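Until it is decided whether jscatter should handle this itself, the workaround can be applied to every categorical column at once. The following is a minimal sketch under that assumption. The helper name drop_unused_categories is hypothetical and not part of jscatter or pandas, and the demo frame uses made-up values.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with unused categories removed from each categorical column."""
    out = df.copy()
    for name in out.select_dtypes(include="category").columns:
        out[name] = out[name].cat.remove_unused_categories()
    return out


# Made-up demo frame: "c" is declared as a category but never used.
demo = pd.DataFrame({
    "x": [0.1, 0.2, 0.3],
    "cat": pd.Categorical(["a", "b", None], categories=["a", "b", "c"]),
})
cleaned = drop_unused_categories(demo)
print(list(cleaned["cat"].cat.categories))
```

In the reproduction above, this would amount to passing data=drop_unused_categories(df) to jscatter.Scatter instead of data=df. The helper works on a copy, so the original frame keeps its declared categories.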

Labels: bug (Something isn't working)
