
Handle multiple files in previous uploads, remove CSVs #171


Merged
dogversioning merged 5 commits into main from mg/csv_duplicates on May 12, 2025

Conversation

dogversioning
Contributor

Broken up into three commits for my own sanity/testing isolation:

Commit one:
  • Removes CSV generation, and updates infrastructure to not care about CSV paths/not expect CSVs during tests
  • Adds a migration for cleaning up buckets

Commit two:
  • Changes how archiving works during powerset merge to archive everything in a folder, rather than expecting a filename pattern (see the sketch after this list)

Commit three:
  • Removes the CSV enums, and also some ruff formatting on push (probably fine to skim for review)
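
As a rough illustration of the commit two change (archiving everything under a folder rather than matching a filename pattern), here is a minimal sketch assuming a boto3 client. The helper name archive_folder and the "archive" prefix are made up for illustration; the real logic in src/shared/s3_manager.py and the powerset merge lambda may be structured differently:

    import boto3

    def archive_folder(bucket: str, folder: str, archive_prefix: str = "archive") -> None:
        """Move every object under `folder` into the archive prefix (illustrative sketch)."""
        client = boto3.client("s3")
        paginator = client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=folder):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                # Copy first, then delete, so a failed copy never loses data
                client.copy_object(
                    Bucket=bucket,
                    CopySource={"Bucket": bucket, "Key": key},
                    Key=f"{archive_prefix}/{key}",
                )
                client.delete_object(Bucket=bucket, Key=key)

The point of the change is that nothing here depends on how individual files are named: any object that lands under the folder gets archived.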


github-actions bot commented May 12, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines: 867 | Covered: 847 | Coverage: 98% | Threshold: 90% | Status: 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
src/shared/awswrangler_functions.py 100% 🟢
src/shared/enums.py 100% 🟢
src/shared/functions.py 97% 🟢
src/shared/s3_manager.py 100% 🟢
src/site_upload/powerset_merge/powerset_merge.py 96% 🟢
src/site_upload/process_flat/process_flat.py 100% 🟢
src/site_upload/study_period/study_period.py 100% 🟢
TOTAL 99% 🟢

updated for commit: cd488b0 by action🐍

# If the latest uploads don't include this site, we'll use the last-valid
# one instead
try:
    if not any(x.endswith(site_specific_name) for x in latest_file_list):
        df = expand_and_concat_powersets(df, last_valid_path, last_valid_site)
    if not any(subbucket_path in x for x in latest_file_list):
Contributor

Is x.startswith(subbucket_path) equivalent for your purposes here? You know I hate a bare in check for strings.

Contributor Author

No, because the latest file will also have the basedir at the front. I could make this a regex, but I personally hate that more than this kind of substring check.
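
A toy example of the distinction, with made-up key layouts (the real prefixes come from the upload bucket structure, which isn't shown in this thread):

    # Hypothetical keys; the latest files carry a base directory prefix
    latest_file_list = ["latest/study/encounter/site_a/encounter.parquet"]
    subbucket_path = "study/encounter/site_a"

    # startswith() never matches because of the leading base dir
    any(x.startswith(subbucket_path) for x in latest_file_list)  # False

    # the substring check matches the subbucket anywhere in the key
    any(subbucket_path in x for x in latest_file_list)  # True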

# otherwise, this is the first instance - after it's in the database,
# we'll generate a new list of valid tables for the dashboard
else:
    is_new_data_package = True
    df = expand_and_concat_powersets(df, latest_path, manager.site)
    filename = functions.get_s3_filename(latest_path)
Contributor

nit: is it worth adding filename to the s3 key parser method?

Contributor Author

I thought about this - the only time we ever really care (right now) is in this archive/unarchive case, but we could combine them.
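
For reference, a sketch of what a filename helper along the lines of functions.get_s3_filename could boil down to; this is an assumed implementation, not the repo's actual code, and folding it into the S3 key parser (per the nit above) would just mean returning this value alongside the other parsed fields:

    def get_s3_filename(s3_path: str) -> str:
        # Assumed implementation: return the final path component of an S3 key
        return s3_path.rstrip("/").split("/")[-1]

    # e.g. get_s3_filename("s3://bucket/latest/study/site/file.parquet") -> "file.parquet"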

dogversioning merged commit b4e7e86 into main on May 12, 2025
2 checks passed
dogversioning deleted the mg/csv_duplicates branch on May 12, 2025 at 20:21