feat: add unk_filter_by_count function and corresponding tests; update .gitignore and GitHub Actions workflow #30

Merged
haru-256 merged 2 commits into main from feat/unk-id
Apr 20, 2025

Conversation

haru-256
Owner

This pull request introduces several changes, primarily focusing on enhancing data preprocessing for the Amazon Reviews dataset, adding a utility function for filtering low-frequency items, and updating configurations and tests. Below is a summary of the most important changes grouped by theme.

Data Preprocessing Enhancements:

  • Added a new utility function, unk_filter_by_count, to filter out items, users, or categories that fall below a specified frequency threshold in the dataset. This function calculates cumulative counts and filters based on a configurable percentage threshold. (ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py, R69-R98)
  • Updated the seq_rec_preprocess_dataset function to use unk_filter_by_count for filtering users, items, and categories based on their occurrence counts. This replaces previous placeholder logic and ensures that low-frequency entities are handled consistently. (ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py, [1] [2])
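
The cumulative-count filtering described above can be sketched in plain Python. This is an illustration only, not the code merged in this PR: the names frequent_values, replace_rare, keep_ratio, and the "<unk>" token are all hypothetical, and the exact cutoff rule (keep the most frequent values until they cover keep_ratio of all occurrences) is an assumption about what "configurable percentage threshold" means.

```python
from collections import Counter
from typing import Hashable, Iterable


def frequent_values(values: Iterable[Hashable], keep_ratio: float) -> set[Hashable]:
    """Hypothetical sketch: return the most frequent values that together
    cover at least `keep_ratio` of all occurrences."""
    counts = Counter(values)
    total = sum(counts.values())
    kept: set[Hashable] = set()
    cumulative = 0
    # most_common() yields (value, count) pairs from most to least frequent.
    for value, count in counts.most_common():
        # Stop once the values kept so far already cover the requested share.
        if cumulative >= keep_ratio * total:
            break
        kept.add(value)
        cumulative += count
    return kept


def replace_rare(values: list, kept: set, unk_token: str = "<unk>") -> list:
    """Map every value outside `kept` to a single unknown token (assumed name)."""
    return [v if v in kept else unk_token for v in values]


item_ids = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
kept_ids = frequent_values(item_ids, keep_ratio=0.75)
filtered = replace_rare(item_ids, kept_ids)
```

In this sketch the same two helpers could be applied separately to the user, item, and category columns, which is one way to get the consistent handling of low-frequency entities that the description mentions.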

Configuration Updates:

  • Modified the .gitignore file to allow tracking of the src/ml_sandbox_libs/data directory while ignoring other data/ directories. (ml_sandbox_libs/.gitignore, R2)
  • Added an environment variable, RUFF_OUTPUT_FORMAT, set to github in the Python linting workflow configuration. With this format, Ruff reports lint findings as GitHub Actions annotations. (.github/workflows/python-lint-test.yml, R84-R92)

Testing Improvements:

  • Added a unit test for the unk_filter_by_count function to validate its behavior with a simple example. This ensures that the function filters out low-frequency items correctly based on the specified threshold. (ml_sandbox_libs/tests/test_data/test_amazon_reviews_dataset.py, R1-R11)
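
A test of this kind might look roughly like the following pytest-style sketch. It does not reproduce the test added in this PR: keep_frequent is a hypothetical stand-in for the function under test, defined inline so the example runs on its own, and its cutoff rule is an assumption.

```python
from collections import Counter
from itertools import accumulate


def keep_frequent(values, keep_ratio):
    """Hypothetical stand-in for the function under test (not the PR's code)."""
    ranked = Counter(values).most_common()  # most frequent first
    total = sum(count for _, count in ranked)
    # Number of occurrences already covered before each value is considered.
    seen_before = [0, *accumulate(count for _, count in ranked)][:-1]
    return {
        value
        for (value, _), seen in zip(ranked, seen_before)
        if seen < keep_ratio * total
    }


def test_keep_frequent_drops_rare_ids():
    values = ["a"] * 6 + ["b"] * 3 + ["c"]
    # A low ratio keeps only the dominant id.
    assert keep_frequent(values, keep_ratio=0.5) == {"a"}
    # A higher ratio also keeps the second most frequent id.
    assert keep_frequent(values, keep_ratio=0.8) == {"a", "b"}
    # A ratio of 1.0 keeps everything.
    assert keep_frequent(values, keep_ratio=1.0) == {"a", "b", "c"}
```

Small hand-countable inputs like this make the expected result at each threshold easy to verify by inspection.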

@haru-256 merged commit 895ca15 into main on Apr 20, 2025
2 checks passed
@haru-256 deleted the feat/unk-id branch on April 20, 2025, 02:49