
Handle multiple files in previous uploads, remove CSVs #171


Merged
dogversioning merged 5 commits into main from mg/csv_duplicates on May 12, 2025

Conversation

dogversioning
Contributor

Broken up into three commits for my own sanity/testing isolation:

Commit one:
  • Removes CSV generation, and updates infrastructure to not care about CSV paths/not expect CSVs during tests
  • Adds a migration for cleaning up buckets

Commit two:
  • Changes how archiving works during powerset merge to archive everything in a folder, rather than expecting a filename pattern (see the sketch after this list)

Commit three:
  • Removes the CSV enums, and also some ruff formatting on push (probably fine to skim for review)
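
As a rough illustration of the commit two change (archiving everything under a folder rather than matching a filename pattern), here is a minimal sketch assuming a boto3 client. The helper name archive_folder and the "archive" prefix are made up for illustration; the real logic in src/shared/s3_manager.py and the powerset merge lambda may be structured differently:

    import boto3

    def archive_folder(bucket: str, folder: str, archive_prefix: str = "archive") -> None:
        """Move every object under `folder` into the archive prefix (illustrative sketch)."""
        client = boto3.client("s3")
        paginator = client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=folder):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                # Copy first, then delete, so a failed copy never loses data
                client.copy_object(
                    Bucket=bucket,
                    CopySource={"Bucket": bucket, "Key": key},
                    Key=f"{archive_prefix}/{key}",
                )
                client.delete_object(Bucket=bucket, Key=key)

The point of the change is that nothing here depends on how individual files are named: any object that lands under the folder gets archived.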


github-actions bot commented May 12, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines: 867 | Covered: 847 | Coverage: 98% | Threshold: 90% | Status: 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
src/shared/awswrangler_functions.py 100% 🟢
src/shared/enums.py 100% 🟢
src/shared/functions.py 97% 🟢
src/shared/s3_manager.py 100% 🟢
src/site_upload/powerset_merge/powerset_merge.py 96% 🟢
src/site_upload/process_flat/process_flat.py 100% 🟢
src/site_upload/study_period/study_period.py 100% 🟢
TOTAL 99% 🟢

updated for commit: cd488b0 by action🐍

# If the latest uploads don't include this site, we'll use the last-valid
# one instead
try:
    if not any(x.endswith(site_specific_name) for x in latest_file_list):
        df = expand_and_concat_powersets(df, last_valid_path, last_valid_site)
    if not any(subbucket_path in x for x in latest_file_list):
Contributor

Is x.startswith(subbucket_path) equivalent for your purposes here? You know I hate a bare in check for strings.

Contributor Author

No, because the latest file will also have the basedir at the front. I could make this a regex, but I personally hate that more than this kind of substring check.
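
A toy example of the distinction, with made-up key layouts (the real prefixes come from the upload bucket structure, which isn't shown in this thread):

    # Hypothetical keys; the latest files carry a base directory prefix
    latest_file_list = ["latest/study/encounter/site_a/encounter.parquet"]
    subbucket_path = "study/encounter/site_a"

    # startswith() never matches because of the leading base dir
    any(x.startswith(subbucket_path) for x in latest_file_list)  # False

    # the substring check matches the subbucket anywhere in the key
    any(subbucket_path in x for x in latest_file_list)  # True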

# otherwise, this is the first instance - after it's in the database,
# we'll generate a new list of valid tables for the dashboard
else:
    is_new_data_package = True
    df = expand_and_concat_powersets(df, latest_path, manager.site)
    filename = functions.get_s3_filename(latest_path)
Contributor

nit: is it worth adding filename to the s3 key parser method?

Contributor Author

I thought about this - the only time we ever really care (right now) is in this archive/unarchive case, but we could combine them.
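
For reference, a sketch of what a filename helper along the lines of functions.get_s3_filename could boil down to; this is an assumed implementation, not the repo's actual code, and folding it into the S3 key parser (per the nit above) would just mean returning this value alongside the other parsed fields:

    def get_s3_filename(s3_path: str) -> str:
        # Assumed implementation: return the final path component of an S3 key
        return s3_path.rstrip("/").split("/")[-1]

    # e.g. get_s3_filename("s3://bucket/latest/study/site/file.parquet") -> "file.parquet"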

dogversioning merged commit b4e7e86 into main on May 12, 2025
2 checks passed
dogversioning deleted the mg/csv_duplicates branch on May 12, 2025 at 20:21