feat: add unk_filter_by_count function and corresponding tests; update .gitignore and GitHub Actions workflow #30
This pull request introduces several changes, primarily focusing on enhancing data preprocessing for the Amazon Reviews dataset, adding a utility function for filtering low-frequency items, and updating configurations and tests. Below is a summary of the most important changes grouped by theme.
Data Preprocessing Enhancements:
* Added a new function, `unk_filter_by_count`, to filter out items, users, or categories that fall below a specified frequency threshold in the dataset. The function calculates cumulative counts and filters based on a configurable percentage threshold. (ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py, R69-R98)
* Updated the `seq_rec_preprocess_dataset` function to use `unk_filter_by_count` for filtering users, items, and categories based on their occurrence counts. This replaces previous placeholder logic and ensures that low-frequency entities are handled consistently. (ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py, [1] [2])
Configuration Updates:
* Updated the `.gitignore` file to allow tracking of the `src/ml_sandbox_libs/data` directory while ignoring other `data/` directories. (ml_sandbox_libs/.gitignore, R2)
* Set `RUFF_OUTPUT_FORMAT` to `github` in the Python linting workflow configuration. This likely standardizes the output format for linting results in GitHub Actions. (.github/workflows/python-lint-test.yml, R84-R92)
Testing Improvements:
* Added a test for the `unk_filter_by_count` function that validates its behavior with a simple example, checking that the function filters out low-frequency items correctly based on the specified threshold. (ml_sandbox_libs/tests/test_data/test_amazon_reviews_dataset.py, R1-R11)
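For reviewers who want a concrete picture of the cumulative-count filtering described under Data Preprocessing Enhancements, here is a minimal sketch. It is not the code in this PR: the function name `sketch_unk_filter_by_count`, the `keep_ratio` and `unk_token` parameters, and the choice to replace rare values with a placeholder token (rather than drop rows) are all assumptions made for illustration, and it works on a plain Python list rather than the dataset's actual data structure.

```python
from collections import Counter
from typing import Hashable, Iterable


def sketch_unk_filter_by_count(
    values: Iterable[Hashable],
    keep_ratio: float = 0.9,
    unk_token: Hashable = "<unk>",
) -> list[Hashable]:
    """Hypothetical sketch: keep the most frequent values that together
    cover `keep_ratio` of all occurrences; map the rest to `unk_token`."""
    values = list(values)
    if not values:
        return []

    counts = Counter(values)
    total = len(values)

    kept = set()
    cumulative = 0
    # most_common() yields (value, count) pairs sorted by descending count.
    for value, count in counts.most_common():
        # Stop once the values kept so far already cover the target share.
        if cumulative >= keep_ratio * total:
            break
        kept.add(value)
        cumulative += count

    # Assumption: low-frequency values become a placeholder, not removed.
    return [v if v in kept else unk_token for v in values]


sample = ["a"] * 6 + ["b"] * 3 + ["c"]
filtered = sketch_unk_filter_by_count(sample, keep_ratio=0.8)
print(Counter(filtered))
```

With this sample, "a" and "b" together cover 9 of 10 occurrences, which exceeds the 0.8 target, so only "c" is mapped to the placeholder. The real function's signature and threshold semantics may differ; see the linked diff for the actual implementation.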