Command line tools to extract and load SQL and Carto tables using JSON Table Schema.
Use this tool to extract data from a source such as an Oracle SDE database, load it to S3, and then load it from S3 to a destination such as PostGIS or Carto. For this to work, a JSON table schema for the table you are working with needs to reside in S3, in the bucket `citygeo-airflow-databridge2` under the `schemas/` folder.
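As a quick illustration, the sketch below (Python with boto3) checks that a schema file is present before an ETL run and uploads a stub if it is missing. The bucket name comes from this README and the key path mirrors the `extract-json-schema` example later on; the stub contents are an illustrative assumption, not a real schema.

```python
# Sketch only: verify a JSON table schema exists at the expected S3 location
# before kicking off an extract/load, and upload a stub if it is missing.
# The bucket name comes from this README; the stub contents are illustrative.
import json

import boto3
from botocore.exceptions import ClientError

BUCKET = 'citygeo-airflow-databridge2'
KEY = 'schemas/lni/li_appeals_type.json'  # key layout taken from the examples below

s3 = boto3.client('s3')
try:
    s3.head_object(Bucket=BUCKET, Key=KEY)
    print(f'Found schema at s3://{BUCKET}/{KEY}')
except ClientError:
    # Minimal JSON Table Schema stub; a real schema lists every column.
    stub = {'fields': [{'name': 'objectid', 'type': 'integer'}]}
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(stub))
    print(f'Uploaded placeholder schema to s3://{BUCKET}/{KEY}')
```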
The tool can be used either with Docker or as a standalone Python package.
To run as a standalone Python package:

- Python `>=3.7, <3.10` (due to dependencies in the `arcgis` package)
- Pip
- AWS CLI
- Oracle 11g Client
- Postgres
- Access to the citygeo-oracle-instant-client S3 bucket

To run with Docker:

- Docker
- Access to the citygeo-oracle-instant-client S3 bucket
Perform the following if installing this package onto a server that has not already had these steps completed:

- Install the AWS CLI if you don't already have it:

  ```bash
  sudo apt-get install awscli
  ```

- Install alien if you don't already have it:

  ```bash
  sudo apt-get install alien
  ```

- Run:

  ```bash
  bash ./scripts/pull-oracle-rpm.sh
  ```

  You can then verify with `ls` that the Oracle Instant Client was downloaded into the project folder.

- Run:

  ```bash
  alien -i oracle-instantclient18.5-basiclite-18.5.0.0.0-3.x86_64.rpm \
    && rm oracle-instantclient18.5-basiclite-18.5.0.0.0-3.x86_64.rpm
  ```
- Source a virtual environment, then:

  ```bash
  pip install git+https://github.com/CityOfPhiladelphia/databridge-etl-tools.git
  ```

- Set environment variables for Oracle. If the lines below are not already in your `~/.bashrc` file, add them there and run them in the terminal as well (a quick sanity check for this setup is sketched after this list):

  ```bash
  export ORACLE_HOME=/usr/lib/oracle/18.5/client64
  export LD_LIBRARY_PATH=$ORACLE_HOME/lib
  export PATH="$PATH:$ORACLE_HOME"
  ```

- Create a virtual environment if one does not already exist
- Source your virtual environment
- On Python 3.9 (and possibly other versions), remove the version dependencies on the following packages: `pyproj`, `arcgis`
- Install the following necessary system packages (note that this is copied from the Dockerfile, which performs essentially the same process):

  ```bash
  sudo apt-get install --no-install-recommends python3-dev \
    libkrb5-dev \
    libsasl2-dev \
    libssl-dev \
    libffi-dev \
    build-essential \
    libblas-dev \
    liblapack-dev \
    libpq-dev \
    netbase \
    apt-utils \
    unzip \
    curl \
    netcat \
    locales \
    git \
    alien \
    libgdal-dev \
    libgeos-dev \
    binutils \
    libproj-dev \
    gdal-bin \
    libspatialindex-dev \
    libaio1 \
    freetds-dev
  ```

  Attempt to install the `python3-dev` that matches your Python minor version, for example:

  ```bash
  sudo apt-get install python3.9-dev
  ```

- Run `pip install .`, or `pip install -e ./` if you are contributing to the repository
  - Note: you do not need the file `docker-fast-requirements.txt`
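To sanity-check the Oracle Instant Client install and the environment variables above, one hedged option is the snippet below. It assumes `cx_Oracle` ends up in your virtual environment as part of the install, which is not documented here and may not hold.

```python
# Quick check that the Oracle Instant Client shared libraries are discoverable.
# Assumes cx_Oracle is present in the environment (an assumption, not a
# documented dependency); a missing client or a bad LD_LIBRARY_PATH raises a
# DatabaseError here instead of failing later mid-ETL.
import cx_Oracle

print(cx_Oracle.clientversion())  # e.g. (18, 5, 0, 0, 0)
```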
- Until the Dockerfile is able to auto-pull AWS credentials from `~/.aws/credentials`, substitute the correct keys and run (a small helper for this is sketched after these steps):

  ```bash
  export aws_access_key_id=<aws_access_key_id>
  export aws_secret_access_key=<aws_secret_access_key>
  ```

- Ensure Docker is installed on the system; it may be necessary to run:

  ```bash
  sudo apt install docker.io
  ```

- Download the Oracle client (`oracle-instantclient18.5-basiclite-18.5.0.0.0-3.x86_64.rpm`) to the project folder from the citygeo-oracle-instant-client bucket
- Run:

  ```bash
  docker build -f Dockerfile.fast -t dbtools \
    --build-arg AWS_ACCESS_KEY_ID="$aws_access_key_id" \
    --build-arg AWS_SECRET_ACCESS_KEY="$aws_secret_access_key" ./
  ```
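Until that auto-pull exists, a small helper like the sketch below can print the `export` lines from your `~/.aws/credentials` default profile so you don't have to copy keys by hand. It relies only on the standard AWS credentials file format; the script name and profile choice are up to you.

```python
# Sketch: read the default profile from ~/.aws/credentials and print the export
# lines expected by the docker build step above. Standard AWS credentials files
# are INI-formatted with aws_access_key_id / aws_secret_access_key keys.
import configparser
import os

config = configparser.ConfigParser()
config.read(os.path.expanduser('~/.aws/credentials'))
profile = config['default']  # swap in another profile name if needed

print(f"export aws_access_key_id={profile['aws_access_key_id']}")
print(f"export aws_secret_access_key={profile['aws_secret_access_key']}")
```

Paste (or `eval`) the printed lines into your shell before running the `docker build` command above.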
This package uses a nested series of commands (via sub-modules) to implement separation of concerns. This makes it easier to isolate any bugs and offer additional functionality over time. At any time, add `--help` to the command to review the help guide for that command or sub-group. In `click`, commands and sub-groups are internally the same thing.
All commands take the form:

```bash
databridge_etl_tools \
    GROUP \
    GROUP_ARGS \
    COMMAND or SUB_GROUP1 \
    [COMMAND ARGS or SUB_GROUP1 ARGS] \
    [COMMAND or SUB_GROUP2] \
    [COMMAND ARGS or SUB_GROUP2 ARGS] ...
```
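The sketch below is not the project's actual source, but it shows how `click` produces exactly this GROUP → GROUP_ARGS → SUB_GROUP → COMMAND shape: each group parses its own options, stashes them on the context, and hands off to nested groups or commands. In `click` a `Group` is itself a `Command`, which is why the nesting is uniform. The option and command names are borrowed from the examples below; the wiring itself is an assumption.

```python
# Minimal sketch (not the actual databridge_etl_tools source) of nested click
# groups: a top-level group, a group holding shared options, a sub-group, and
# a leaf command that reads everything accumulated in the context.
import click


@click.group()
def cli():
    """Top-level entry point (the GROUP level)."""
    pass


@cli.group()
@click.option('--s3_bucket', required=True)
@click.option('--s3_key', required=True)
@click.pass_context
def ago(ctx, s3_bucket, s3_key):
    """GROUP_ARGS are parsed here and stashed for downstream commands."""
    ctx.ensure_object(dict)
    ctx.obj.update(s3_bucket=s3_bucket, s3_key=s3_key)


@ago.group('append-group')
@click.option('--batch_size', type=int, default=None)
@click.pass_context
def append_group(ctx, batch_size):
    """A SUB_GROUP: in click a Group is itself a Command, so it nests freely."""
    ctx.obj['batch_size'] = batch_size


@append_group.command()
@click.option('--primary_key', required=True)
@click.pass_context
def upsert(ctx, primary_key):
    """A leaf COMMAND; reads the accumulated options from the context."""
    click.echo(f'upserting on {primary_key} using {ctx.obj}')


if __name__ == '__main__':
    cli()
```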
See GROUPS, ARGS, SUB-GROUPS and COMMANDS at the end of this README for full documentation of all possible methods.

```bash
# Remember to add --help after any command for an explanation and its parameters

# Upsert a record into AGO using a primary key
databridge_etl_tools \
ago \
--ago_org_url test \
--ago_user test \
--ago_pw test \
--ago_item_name test \
--s3_bucket test \
--s3_key test \
append-group \
--in_srid 0 \
--clean_columns test \
--batch_size 0 \
upsert \
--primary_key test

# Extract a table from Oracle SDE to S3
databridge_etl_tools \
oracle \
--table_name li_appeals_type \
--table_schema gis_lni \
--connection_string <user>/<password>@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP) (HOST=<host_name>)(PORT=<port>))(CONNECT_DATA=(SID=<dbname>))) \
--s3_bucket s3-bucket \
--s3_key s3_key \
extract

# Load a table from S3 to Carto
databridge_etl_tools \
carto \
--table_name test \
--connection_string carto://user:apikey \
--s3_bucket test \
--s3_key test \
--select_users optional \
--index_fields optional \
update

# Load a table from S3 to Postgres
databridge_etl_tools \
postgres \
--table_name li_appeals_type \
--table_schema lni \
--connection_string postgresql://user:password@host:port/db_name \
--s3_bucket test \
--s3_key test \
load \
--json_schema_s3_key test

# Extract a json schema from Oracle (right now needed for carto uploads)
databridge_etl_tools \
oracle \
--table_name li_appeals_type \
--table_schema gis_lni \
--connection_string <user>/<password>@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP) (HOST=<host_name>)(PORT=<port>))(CONNECT_DATA=(SID=<dbname>))) \
--s3_bucket=test \
--s3_key=schemas/lni/li_appeals_type.json \
extract-json-schema

# Extract a json schema from DB2 (needed for carto uploads)
databridge_etl_tools \
postgres \
--table_name fema_floodplain_2023 \
--table_schema viewer_planning \
--connection_string postgresql://postgres:<password>@<hostname>:5432/<dbname> \
--s3_bucket=test \
--s3_key=schemas/planning/fema_floodplain_2023.json \
extract-json-schema

# Extract from Airtable
databridge_etl_tools \
airtable \
--app_id <some_app_id> \
--api_key <key_for_some_app> \
--table_name immigrant_services \
--s3_bucket test \
--s3_key staging/test/immigrant_services.csv \
extract

# Extract from Knack
databridge_etl_tools \
knack \
--app_id <some_app_id> \
--api_key <key_for_some_app> \
--knack_objectid <some_id_corresponding_to_table> \
--s3_bucket test \
--s3_key staging/test/candidates.csv \
extract

# Extract from Sharepoint
databridge_etl_tools \
sharepoint \
--graphapi_tenant_id <Tenant ID from Keeper> \
--graphapi_application_id <Application ID from Keeper> \
--graphapi_secret_value <Secret Value from Keeper> \
--site_name ps360-metrics-share \
--file_path etl_tools_test_workbook.xlsx \
--s3_bucket test \
--s3_key staging/test/sharepoint_xlsx_test.csv \
--sheet_name Dataset \
extract
```

To manually test while developing, the package can be entered using the `-m` module flag (due to the presence of the `__main__.py` file):

```bash
python -m databridge_etl_tools \
oracle \
--table_name li_appeals_type \
--table_schema gis_lni \
--connection_string <user>/<password>@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<host_name>)(PORT=<port>))(CONNECT_DATA=(SID=<dbname>))) \
--s3_bucket s3-bucket \
--s3_key s3_key \
extract
```
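The `-m` invocation works because the package ships a `__main__.py`; a minimal sketch of that pattern looks like the following. This is illustrative only, and the import path is an assumption rather than the repository's actual file.

```python
# databridge_etl_tools/__main__.py -- illustrative sketch, not the real file.
# Shipping this module lets `python -m databridge_etl_tools ...` behave like
# the installed databridge_etl_tools console script.
from databridge_etl_tools.cli import main  # assumed entry-point location

if __name__ == '__main__':
    main()
```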
To run one suite of tests locally, from the root databridge-etl-tools folder, run:

```bash
pytest tests/<test_database.py> \
    --user $USER \
    --password $PASSWORD \
    --host $HOST \
    --database $DATABASE \
    -xv \
    --showlocals \
    --disable-warnings \
    --show-capture stdout
```

To run just one test, add `::test_name` to the end of `test_database.py`.
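The `--user`/`--password`/`--host`/`--database` flags above are not built into pytest; they have to be registered by the test suite itself. A minimal sketch of how such options are typically wired up in a `conftest.py` is shown below; it is an illustration, not the repository's actual file.

```python
# conftest.py sketch (illustrative): register the custom CLI options the test
# commands above pass, and expose them to tests as a fixture.
import pytest


def pytest_addoption(parser):
    parser.addoption('--user', action='store')
    parser.addoption('--password', action='store')
    parser.addoption('--host', action='store')
    parser.addoption('--database', action='store')


@pytest.fixture
def db_credentials(request):
    return {
        'user': request.config.getoption('--user'),
        'password': request.config.getoption('--password'),
        'host': request.config.getoption('--host'),
        'database': request.config.getoption('--database'),
    }
```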
When a commit is pushed to the master branch, GitHub Actions will automatically run the tests defined in `.github/workflows/test_pr_build.yml`, using the secrets stored in the repository under Settings > Secrets and Variables.
For this reason, make your changes on a test branch and make sure they pass the automated tests and manual QA testing before merging to master.
- `ago`: Run ETL commands for AGO
  - Args:
    - `--ago_org_url` TEXT [required]
    - `--ago_user` TEXT [required]
    - `--ago_pw` TEXT [required]
    - `--ago_item_name` TEXT [required]
    - `--s3_bucket` TEXT [required]
    - `--s3_key` TEXT [required]
  - Commands:
    - `export`: Export from an AGO dataset into a CSV file in S3
    - `post-index-fields`: Post index fields to AGO
  - Sub-Group:
    - `ago-append`: Use this group for any commands that utilize append
      - Args:
        - `--in_srid` INTEGER: The SRID of the source dataset's geometry features.
        - `--clean_columns` TEXT: Column, or comma-separated list of column names, to clean of AGO-invalid characters.
        - `--batch_size` INTEGER: Size of batch updates to send to AGO
      - Commands:
        - `append`: Appends records to AGO without truncating. NOTE that this is NOT an upsert and will absolutely duplicate rows if you run it multiple times.
        - `truncate-append`: Truncates a dataset in AGO and appends to it from a CSV.
        - `upsert`: Upserts records to AGO; requires a primary key. Upserts the entire CSV into AGO; it does not look for changes or differences.
          - Args:
            - `--primary_key` TEXT [required]
- `carto`: Run ETL commands for Carto
  - Args:
    - `--table_name` TEXT [required]
    - `--connection_string` TEXT [required]
    - `--s3_bucket` TEXT [required]
    - `--s3_key` TEXT [required]
    - `--select_users` TEXT
    - `--index_fields` TEXT
  - Commands:
    - `update`: Loads a dataset from S3 into Carto
- `db2`: Run ETL commands for DB2
  - Args:
    - `--table_name` TEXT [required]
    - `--account_name` TEXT [required]
  - Commands:
    - `copy-staging-to-enterprise`: Copies from etl_staging to the specified enterprise authoritative dataset.
    - `update-oracle-scn`: WRONG Creates a staging table in etl_staging from the specified enterprise authoritative dataset.
      - Args:
        - `--oracle_conn_string` TEXT
  - Sub-Group:
    - `libpq`: Use this group for any commands that utilize libpq
      - Args:
        - `--enterprise_schema` TEXT [required]
        - `--libpq_conn_string` TEXT [required]
      - Commands:
        - `copy-dept-to-enterprise`: Copy from the dept table directly to an enterprise table in a single transaction that can roll back if it fails.
        - `create-staging-from-enterprise`: Creates a staging table in etl_staging from the specified enterprise authoritative dataset.
- `opendata`: Run ETL commands for OpenData
  - Args:
    - `--table_name` TEXT [required]
    - `--table_schema` TEXT [required]
    - `--s3_bucket` TEXT [required]
    - `--s3_key` TEXT [required]
    - `--libpq_conn_string` TEXT [required]
    - `--opendata_bucket` TEXT [required]
  - Commands:
    - `upload`: Takes a CSV from S3, runs some transformations, and then uploads to the specified opendata bucket
- `oracle`: Run ETL commands for Oracle
  - Args:
    - `--table_name` TEXT [required]
    - `--table_schema` TEXT [required]
    - `--connection_string` TEXT [required]
    - `--s3_bucket` TEXT [required]
    - `--s3_key` TEXT [required]
  - Commands:
    - `extract`: Extracts a dataset in Oracle into a CSV file in S3
    - `extract-json-schema`: Extracts a dataset's schema in Oracle into a JSON file in S3
- `postgres`: Run ETL commands for Postgres
  - Args:
    - `--table_name` TEXT [required]
    - `--table_schema` TEXT [required]
    - `--connection_string` TEXT [required]
    - `--s3_bucket` TEXT
    - `--s3_key` TEXT
  - Commands:
    - `extract`: Extracts data from a Postgres table into a CSV file in S3. Has spatial and SRID detection and will output it in a way that the AGO append commands will recognize.
      - Args:
        - `--with_srid` BOOLEAN: Likely only needed for certain views. This controls whether the geopetl `frompostgis()` function exports with geom_with_srid. That won't work for some views, so just export without it. [default: True]
    - `extract-json-schema`: Extracts a dataset's schema in Postgres into a JSON file in S3
    - `load`: Prepare and COPY a CSV from S3 to a Postgres table. The keyword arguments "column_mappings" or "mappings_file" can be used to map data file columns to database table columns with different names (see the sketch after this list). Only one of column_mappings or mappings_file should be provided. Note that only the columns whose headers differ between the data file and the database table need to be included. All column names must be quoted.
      - Args:
        - `--column_mappings` TEXT: A string that can be read as a dictionary using `ast.literal_eval()`. It should take the form `"{'data_col': 'db_table_col', 'data_col2': 'db_table_col2', ...}"`
        - `--mappings_file` TEXT: A text file that can be opened with `open()` and that contains one Python dict that can be read with `ast.literal_eval()`. The file should take the form `{"data_col": "db_table_col", "data_col2": "db_table_col2", ...}`. Note no quotes around the curly braces `{}`.
    - `upsert-csv`: Upserts data from a CSV to a Postgres table, which must have at least one primary key. The keyword arguments "column_mappings" or "mappings_file" can be used to map data file columns to database table columns with different names. Only one of column_mappings or mappings_file should be provided. Note that only the columns whose headers differ between the data file and the database table need to be included. All column names must be quoted.
      - Args:
        - `--column_mappings` TEXT: A string that can be read as a dictionary using `ast.literal_eval()`. It should take the form `"{'data_col': 'db_table_col', 'data_col2': 'db_table_col2', ...}"`
        - `--mappings_file` TEXT: A text file that can be opened with `open()` and that contains one Python dict that can be read with `ast.literal_eval()`. The file should take the form `{"data_col": "db_table_col", "data_col2": "db_table_col2", ...}`. Note no quotes around the curly braces `{}`.
    - `upsert_table`: Upserts data from a Postgres table to a Postgres table in the same database, which must have at least one primary key. The keyword arguments "column_mappings" or "mappings_file" can be used to map data file columns to database table columns with different names. Only one of column_mappings or mappings_file should be provided. Note that only the columns whose headers differ between the data file and the database table need to be included. All column names must be quoted.
      - Args:
        - `--column_mappings` TEXT: A string that can be read as a dictionary using `ast.literal_eval()`. It should take the form `"{'data_col': 'db_table_col', 'data_col2': 'db_table_col2', ...}"`
        - `--mappings_file` TEXT: A text file that can be opened with `open()` and that contains one Python dict that can be read with `ast.literal_eval()`. The file should take the form `{"data_col": "db_table_col", "data_col2": "db_table_col2", ...}`. Note no quotes around the curly braces `{}`.
        - `--other_schema` TEXT: Schema of the Postgres table to upsert from. If None or absent, assumes the same schema as the table being upserted to
        - `--other_table` TEXT: Name of the Postgres table to upsert from [required]
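As a short illustration of the `--column_mappings` and `--mappings_file` formats described above (the file name and column names are made up for the example):

```python
# Illustration of the two mapping formats the postgres load/upsert commands
# accept. Both are parsed with ast.literal_eval(); only columns whose names
# differ between the CSV header and the database table need to appear.
import ast

# 1) As a --column_mappings string (single quotes inside, double quotes around):
column_mappings = "{'data_col': 'db_table_col', 'data_col2': 'db_table_col2'}"
print(ast.literal_eval(column_mappings))

# 2) As a --mappings_file: a text file holding one dict literal, with no quotes
#    around the curly braces themselves.
with open('mappings.txt', 'w') as f:
    f.write('{"data_col": "db_table_col", "data_col2": "db_table_col2"}')

with open('mappings.txt') as f:
    print(ast.literal_eval(f.read()))
```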
- `sharepoint`: Run ETL commands for Sharepoint
  - Args:
    - `--graphapi_tenant_id` TEXT [required]
    - `--graphapi_application_id` TEXT [required]
    - `--graphapi_secret_value` TEXT [required]
    - `--site_name` TEXT [required]
    - `--file_path` TEXT [required]
    - `--s3_bucket` TEXT [required]
    - `--s3_key` TEXT [required]
    - `--sheet_name` TEXT
  - Commands:
    - `extract`: Extracts a dataset in Sharepoint into a CSV file in S3