feat: add unk_filter_by_count function and corresponding tests; update .gitignore and GitHub Actions workflow #30

haru-256 · 2025-04-20T02:20:37Z

This pull request introduces several changes, primarily focusing on enhancing data preprocessing for the Amazon Reviews dataset, adding a utility function for filtering low-frequency items, and updating configurations and tests. Below is a summary of the most important changes grouped by theme.

Data Preprocessing Enhancements:

Added a new utility function, unk_filter_by_count, to filter out items, users, or categories whose frequency in the dataset falls below a specified threshold. The function calculates cumulative counts and filters based on a configurable percentage threshold. (ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py, R69-R98)
Updated the seq_rec_preprocess_dataset function to use unk_filter_by_count for filtering users, items, and categories based on their occurrence counts. This replaces the previous placeholder logic and ensures that low-frequency entities are handled consistently. (ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py, [1] [2])
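
For readers without the diff open, here is a minimal sketch of the cumulative-count idea described above. It is not the code from this PR: the name unk_filter_by_count_sketch, the coverage parameter, the "<unk>" placeholder, and the choice to replace rare values rather than drop rows are all assumptions made for illustration, and it uses only the standard library instead of whatever dataframe library the repository uses.

```python
from collections import Counter
from typing import Hashable, List, Sequence

UNK = "<unk>"  # hypothetical placeholder token, not taken from the PR


def unk_filter_by_count_sketch(
    values: Sequence[Hashable],
    coverage: float = 0.9,
    unk: Hashable = UNK,
) -> List[Hashable]:
    """Keep the most frequent values until they cover `coverage` of all
    occurrences, and replace every remaining value with `unk`."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")

    counts = Counter(values)
    total = sum(counts.values())

    kept = set()
    cumulative = 0
    # most_common() yields (value, count) pairs sorted by count, descending.
    for value, count in counts.most_common():
        # Stop once the values kept so far already reach the target share.
        if cumulative / total >= coverage:
            break
        kept.add(value)
        cumulative += count

    return [v if v in kept else unk for v in values]


if __name__ == "__main__":
    item_ids = ["a"] * 6 + ["b"] * 3 + ["c"]
    # "a" and "b" together cover 9 of 10 occurrences, so "c" becomes UNK.
    print(unk_filter_by_count_sketch(item_ids, coverage=0.9))
```

With a threshold expressed as a share of total occurrences, the number of kept IDs adapts to how skewed the distribution is, which a fixed minimum count would not do.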

Configuration Updates:

Modified the .gitignore file to allow tracking of the src/ml_sandbox_libs/data directory while ignoring other data/ directories. (ml_sandbox_libs/.gitignore, R2)
Added an environment variable, RUFF_OUTPUT_FORMAT, set to github in the Python linting workflow configuration. With this format, Ruff reports lint results as GitHub Actions annotations on the workflow run. (.github/workflows/python-lint-test.yml, R84-R92)

Testing Improvements:

Added a unit test for the unk_filter_by_count function to validate its behavior with a simple example. This ensures that the function filters out low-frequency items correctly based on the specified threshold. (ml_sandbox_libs/tests/test_data/test_amazon_reviews_dataset.py, R1-R11)

github-actions · 2025-04-20T02:22:10Z

Failed to generate code suggestions for PR

feat: add unk_filter_by_count function and corresponding tests; updat…

b7cc15c

…e .gitignore and GitHub Actions workflow

fix: remove unnecessary lines in seq_rec_preprocess_dataset function

d1d0d34

haru-256 merged commit 895ca15 into main Apr 20, 2025
2 checks passed

haru-256 deleted the feat/unk-id branch April 20, 2025 02:49
