There is some light refactoring that could be done to make the code easier to read.
Currently
Loader: a class that imports input data from a variety of places (local i2b2, bulk FHIR server export, etc.) into a standard format of FHIR ndjson, sitting in a local temporary directory.
Format: a class that exports output data into the final location, in a few different formats (json file tree, ndjson files, parquet files, and soon a delta lake).
Root: a class that abstracts filesystem access across cloud and local disk (basically a light wrapper around fsspec, used by both Loaders and Formats).
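To make the three roles concrete, here is a minimal sketch of how they relate. It is not the actual Cumulus ETL code: the method names (`load_all`, `write_records`, `joinpath`, `makedirs`) and the two concrete subclasses are hypothetical, and this `Root` only handles local disk via `os` rather than wrapping fsspec, so the example stays self-contained.

```python
import abc
import json
import os
import tempfile


class Root:
    """Hypothetical stand-in for Root: a base location plus path helpers.

    The real class is described as a light wrapper around fsspec; this
    sketch only handles local disk.
    """

    def __init__(self, path: str):
        self.path = path

    def joinpath(self, *parts: str) -> str:
        return os.path.join(self.path, *parts)

    def makedirs(self) -> None:
        os.makedirs(self.path, exist_ok=True)


class Loader(abc.ABC):
    """Input step: bring source data into a local folder of FHIR ndjson."""

    def __init__(self, root: Root):
        self.root = root

    @abc.abstractmethod
    def load_all(self) -> str:
        """Return the path of a local directory holding ndjson files."""


class Format(abc.ABC):
    """Output step: write finished records to their final location."""

    def __init__(self, root: Root):
        self.root = root

    @abc.abstractmethod
    def write_records(self, name: str, records: list) -> None:
        """Persist one batch of records under the given name."""


class NdjsonFolderLoader(Loader):
    """Hypothetical Loader: copies ndjson files into a temporary folder."""

    def load_all(self) -> str:
        tmpdir = tempfile.mkdtemp()
        for filename in sorted(os.listdir(self.root.path)):
            if filename.endswith(".ndjson"):
                with open(self.root.joinpath(filename), encoding="utf8") as src:
                    with open(os.path.join(tmpdir, filename), "w", encoding="utf8") as dst:
                        dst.write(src.read())
        return tmpdir


class NdjsonFormat(Format):
    """Hypothetical Format: writes one ndjson file per batch name."""

    def write_records(self, name: str, records: list) -> None:
        self.root.makedirs()
        with open(self.root.joinpath(f"{name}.ndjson"), "w", encoding="utf8") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
```

Note that both the input class and the output class take a `Root`, which is why its name matters to both halves of the naming question.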
The Problem
Cumulus ETL is an ETL, which stands for extract, transform, load. So it's arguably confusing that we use Loaders to do the "extraction." (Though I think it's crazy that the output step of ETL is called load in the first place. So... maybe it's better to just avoid the word "load"?)
I don't hate Format and Root. But they could maybe be clearer, especially in contrast to whatever we call the input step. And Root is defined in a file called store.py, which is another "output" word we are throwing into the mix.
One Solution
Loader -> Reader
Format -> Writer
I dunno on Root, maybe it stays as is?
Read/Write are very overloaded terms though. Even though that's what's happening here, there's plenty of other reading and writing happening in the ETL. It might be nice to have a more specific term of art? An insane suggestion would be something very specific like Ingester and Disgorger -- not ideal words, but they would become terms of art instead of generic words... Dunno.
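If a rename like this went ahead, one way to do it without breaking existing callers in one go is to leave the old names behind as thin deprecated subclasses. This is only a sketch under assumptions: the constructor signature is made up, and nothing here reflects how the real classes are built.

```python
import warnings


class Reader:
    """Hypothetical new name for the input class."""

    def __init__(self, location: str):
        self.location = location


class Writer:
    """Hypothetical new name for the output class."""

    def __init__(self, location: str):
        self.location = location


class Loader(Reader):
    """Old name, kept temporarily so existing imports keep working."""

    def __init__(self, *args, **kwargs):
        warnings.warn("Loader is deprecated; use Reader", DeprecationWarning, stacklevel=2)
        super().__init__(*args, **kwargs)


class Format(Writer):
    """Old name, kept temporarily so existing imports keep working."""

    def __init__(self, *args, **kwargs):
        warnings.warn("Format is deprecated; use Writer", DeprecationWarning, stacklevel=2)
        super().__init__(*args, **kwargs)
```

This trick only works when the old names are being retired. A scheme that reuses "Loader" for the output step could not keep the old `Loader` as an alias, since the same name would then mean two different things.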
ETL Solution
We can lean into the text of ETL:
Loader -> Extractor
Format -> Loader
I personally think it's best to avoid the use of "loader". But I could probably be convinced otherwise.
Better Ideas
Anyone got better naming ideas? Naming is hard. This might be a good thinker for Matt when he starts, and a good way to poke around the code base safely.