Switch to more stable test data path for `CellLocations` tests #375
Comments
@kenibrewer thanks for checking, this isn't yet resolved. We're using the following file:

```shell
$ aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite --no-sign-request --human-readable
2022-10-02 07:47:03    3.6 GiB BR00126114.sqlite
```

I also just noticed there's a reference to the previous file in the readme and some other spots (my apologies) (search link here).
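For reference, here is a minimal sketch of checking and fetching that file programmatically, assuming anonymous access via `s3fs` (mirroring `--no-sign-request` above); the local filename is just an example:

```python
import s3fs

# anonymous access, mirroring `--no-sign-request` in the aws CLI call above
fs = s3fs.S3FileSystem(anon=True)

remote_path = (
    "cellpainting-gallery/cpg0016-jump/source_4/workspace/"
    "backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite"
)

# confirm the object exists and report its size before committing to a download
info = fs.info(remote_path)
print(f"{remote_path}: {info['size'] / 1024 ** 3:.1f} GiB")

# download the file for local testing (roughly 3.6 GiB per the listing above)
fs.get(remote_path, "BR00126114.sqlite")
```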
I looked into this briefly today and noticed the file sizes differ:

```shell
$ aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite --no-sign-request --human-readable
2022-10-02 07:47:03    3.6 GiB BR00126114.sqlite

$ aws s3 ls s3://cellpainting-gallery/cpg0017-rohban-pathways/broad/workspace/backend/2013_10_11_SIGMA2_Pilot/41744/41744.sqlite --no-sign-request --human-readable
2023-02-16 17:47:43    7.7 GiB 41744.sqlite
```

Switching to the larger file would likely cause the GitHub Actions-based tests to hang longer during downloads of the data (testing this locally was noticeably slower). Looking through the other data within the `cellpainting-gallery` bucket: would it make sense to instead produce a "shrunken" copy of one of these databases for use in the tests?
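As one illustration of the "shrunken" copy idea, the sketch below uses only Python's built-in `sqlite3` and assumes the full `.sqlite` file has already been downloaded locally; the `Image` and `Cells` table names follow the schema referenced later in this thread, and other tables would be handled the same way. This is a rough sketch rather than a tested implementation.

```python
import sqlite3

# open the locally downloaded source database (path is an example)
conn = sqlite3.connect("BR00126114.sqlite")

# attach an empty database file and copy over a small, deterministic
# subset of rows; CREATE TABLE ... AS SELECT does not carry over
# indexes or constraints, which may be acceptable for test fixtures
conn.executescript(
    """
    ATTACH DATABASE 'test_BR00126114.sqlite' AS shrunken;

    CREATE TABLE shrunken.Image AS
        SELECT * FROM main.Image WHERE ImageNumber IN (1, 2, 3);
    CREATE TABLE shrunken.Cells AS
        SELECT * FROM main.Cells WHERE ImageNumber IN (1, 2, 3);
    -- ...repeat for any other tables the tests rely on

    DETACH DATABASE shrunken;
    """
)
conn.close()
```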
Adding more thoughts here: we could intermix the capabilities by performing a SQLite in-memory database reduction combined with something like `s3sqlite` to read the S3-hosted database directly. Roughly, I think this would look something like the following. While this would maybe mean less upfront complexity, there are likely tradeoffs to consider.

```python
import apsw
import s3fs
import s3sqlite

# Create an S3 filesystem. Check the s3fs docs for more examples:
# https://s3fs.readthedocs.io/en/latest/
# (anonymous access here, matching the `--no-sign-request` CLI calls above)
s3 = s3fs.S3FileSystem(anon=True)

# Register an S3-backed SQLite VFS for read-only access to the remote file
s3vfs = s3sqlite.S3VFS(name="s3-vfs", fs=s3)

# Define the S3 location
key_prefix = "bucketname/BR00126114.sqlite"

# Open the remote database read-only through the S3 VFS and query it
with apsw.Connection(
    key_prefix, vfs=s3vfs.name, flags=apsw.SQLITE_OPEN_READONLY
) as conn:
    cursor = conn.cursor()

    # extract the table schema from the S3-hosted database ("main"),
    # skipping SQLite's internal tables
    cursor.execute(
        "SELECT sql FROM main.sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%';"
    )
    # prefix each CREATE TABLE statement so the tables are created
    # within the attached in-memory schema rather than in "main"
    create_stmts = "; ".join(
        row[0].replace("CREATE TABLE ", "CREATE TABLE test_BR00126114.", 1)
        for row in cursor.fetchall()
    )

    # form an in-memory database with a limited subset of data from the s3 database
    cursor.execute(
        f"""
        ATTACH DATABASE ':memory:' AS test_BR00126114;
        /* we need to create identical table schema within the new database
           in order for the inserts below to process correctly */
        {create_stmts};
        /* we use limited selections from the AWS database to populate the in-mem database */
        INSERT INTO test_BR00126114.Image SELECT * FROM main.Image WHERE ImageNumber IN (1, 2, 3) LIMIT 1000;
        INSERT INTO test_BR00126114.Cells SELECT * FROM main.Cells WHERE ImageNumber IN (1, 2, 3) LIMIT 1000;
        -- ...repeat for the remaining compartment tables
        /* leverage VACUUM to export the in-mem data to a file */
        VACUUM test_BR00126114 INTO 'test_BR00126114.sqlite';
        """
    )
```
Hi @gwaybio and @kenibrewer, could I ask for your thoughts and input on what might be best moving forward for this issue? Some of the options I feel might be possible are (open to others too):

- switch the tests to the alternative `41744.sqlite` file referenced below;
- keep the current `BR00126114.sqlite` reference but produce a "shrunken" copy of the database (for example, via an approach like the sketch above) and use that for the tests.
Regarding the `CellLocations` tests:

> If you're looking for an alternative file, I think this one sounds reasonable (and likely more stable)
>
> `s3://cellpainting-gallery/cpg0017-rohban-pathways/broad/workspace/backend/2013_10_11_SIGMA2_Pilot/41744/41744.sqlite`

Originally posted by @gwaybio in #374 (comment)