This repository contains the code for a machine learning data ingestion pipeline for flight data. The pipeline uses Google Cloud Platform (GCP) functions and interacts with the FlightAware API.
The data is ingested as 'snapshots' of the flight statuses provided by the API. Each snapshot carries two timestamps: the time of the current snapshot run and the time of the previous snapshot run. Each new snapshot is appended to a Google BigQuery table when this cloud function is invoked.
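As a rough illustration of the snapshot idea, the sketch below tags each flight record with the two run timestamps before it would be appended to BigQuery. The function name `build_snapshot_rows`, the column names `snapshot_ts` and `previous_snapshot_ts`, and the sample payload are assumptions made for this example, not the names used in `src/ingest.py`.

```python
from datetime import datetime, timezone


def build_snapshot_rows(flights, current_run, previous_run):
    """Tag each flight status record with the current and previous run times."""
    rows = []
    for flight in flights:
        row = dict(flight)  # copy so the API payload is not mutated
        row["snapshot_ts"] = current_run.isoformat()
        # The very first run has no predecessor, so allow None here.
        row["previous_snapshot_ts"] = (
            previous_run.isoformat() if previous_run else None
        )
        rows.append(row)
    return rows


# Hypothetical payload; a real FlightAware response has many more fields.
flights = [{"ident": "UAL123", "status": "En Route"}]
previous_run = datetime(2023, 6, 1, 12, 0, tzinfo=timezone.utc)
current_run = datetime(2023, 6, 1, 12, 15, tzinfo=timezone.utc)

rows = build_snapshot_rows(flights, current_run, previous_run)
print(rows[0]["snapshot_ts"])  # 2023-06-01T12:15:00+00:00
```

Rows shaped like this could then be loaded into a DataFrame and appended with `pandas_gbq.to_gbq(..., if_exists="append")`, which is one way to get the append-only behaviour described above.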
The repository is structured as follows:
- `src/main.py`: The main execution point of the Python cloud function. It imports and calls the `src/ingest.py` script.
- `src/ingest.py`: Gets flight data from the FlightAware API, given a flight identifier and a time range, then stores the data in a BigQuery table.
- `src/ingest.ipynb`: Where the `ingest.py` file can be developed and debugged. All execution must occur in the `main()` function. The script, `src/ingest.py`, should be an identical copy of this notebook.
- `src/convert_to_py.ipynb`: A single-cell notebook which converts `src/ingest.ipynb` to a Python script. Run it after making changes to the `src/ingest.ipynb` notebook.
- `src/utils.py`: Utility functions which are imported into ingest. Includes a class for encoding JSON objects to strings and decoding strings back to JSON objects. This is useful for storing JSON objects in environment variables, rather than importing them from a JSON file. This script also contains a helper class for the FlightAware API.
- `.github/workflows/deploy.yaml`: Defines a GitHub Actions workflow for deploying (or updating) the Cloud Function on GCP, along with the Cloud Function's configuration options.
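To make the notebook-to-script relationship concrete, here is a minimal sketch of how a notebook's code cells can be flattened into a single Python script using only the standard library. It is not the contents of `src/convert_to_py.ipynb`, which may well use `jupyter nbconvert --to script` instead; the function name `notebook_to_script` and the tiny example notebook are assumptions for illustration.

```python
import json


def notebook_to_script(notebook_json: str) -> str:
    """Concatenate the code cells of an nbformat 4 notebook into one script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # nbformat stores cell source as either a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip())
    return "\n\n".join(chunks) + "\n"


# Hypothetical two-cell notebook standing in for src/ingest.ipynb.
example = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Notes"]},
        {"cell_type": "code", "source": ["def main():\n", "    return 'ok'\n"]},
        {"cell_type": "code", "source": "result = main()"},
    ],
})
script = notebook_to_script(example)
print(script)
```

Because every code cell ends up in the script, keeping all execution inside `main()` in the notebook is what makes the generated file safe to import from `src/main.py`.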
Additional files in the repository include:
- `src/requirements.txt`: Lists the Python dependencies required by the project.
- `.env` (add this yourself): Contains authentication keys.
To set up and run the project, you typically need to do the following:
- Create a virtual environment with Python 3.10.0 for development. If using Miniconda, you can do this by running `conda create -n <env_name> python=3.10.0`.
- Install the required Python dependencies listed in `src/requirements.txt`. You can do this by running `pip install -r src/requirements.txt`.
- Create a `.env` file in the base directory. Two environment variable keys will be stored here:
  - `FLIGHTAWARE_API_KEY` - API key for the FlightAware API.
  - `GCP_CREDENTIALS_JSON_ENCODED` - A GCP service account key, encoded as a string.
    - This key is used to authenticate with GCP services such as the BigQuery client and pandas_gbq. To encode the JSON key as a string, you can use the `JSON_EncoderDecoder` class in the `src/utils.py` script.
    - The easiest way to do this is in the `src/ingest.ipynb` notebook: in a new cell, paste your service key as a JSON object, encode it using `JSON_EncoderDecoder(json_object).encode().get()`, and copy the encoded key to the `.env` file. Don't forget to delete the cell after you're done.
- (Optional) Debug the `src/ingest.ipynb` notebook to ensure that the data is being ingested correctly.
  - If your goal is to test, debug, or modify this app, run the `src/ingest.ipynb` notebook. Before committing your changes, run `src/convert_to_py.ipynb` to copy the changes to `src/ingest.py`. This allows the project to be run as a Python script, as opposed to a Jupyter notebook, while still allowing for easy testing and debugging.
- Run the main Python script with `python src/main.py`.
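The encoding step for `GCP_CREDENTIALS_JSON_ENCODED` can be pictured with the sketch below. It mirrors the chained `.encode().get()` call style mentioned in the setup steps, but the class `JsonStringCodec` is a hypothetical stand-in, and the choice of base64 is an assumption: the actual `JSON_EncoderDecoder` in `src/utils.py` may use a different scheme, so use that class for real keys.

```python
import base64
import json


class JsonStringCodec:
    """Hypothetical stand-in for JSON_EncoderDecoder; the real class may differ."""

    def __init__(self, value):
        self._value = value

    def encode(self):
        # dict -> compact JSON -> base64 text that fits on a single .env line
        raw = json.dumps(self._value, separators=(",", ":")).encode("utf-8")
        self._value = base64.b64encode(raw).decode("ascii")
        return self

    def decode(self):
        # base64 text -> JSON -> dict
        raw = base64.b64decode(self._value.encode("ascii"))
        self._value = json.loads(raw.decode("utf-8"))
        return self

    def get(self):
        return self._value


# Placeholder key; a real service account key has more fields.
service_key = {"type": "service_account", "project_id": "my-project"}
encoded = JsonStringCodec(service_key).encode().get()
decoded = JsonStringCodec(encoded).decode().get()
print(decoded == service_key)  # True
```

Encoding to a single-line string avoids the quoting and newline problems that a raw JSON object (with its multi-line private key) tends to cause in `.env` files and environment variables.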
- GitHub Actions Authentication
  - The project is set up for deployment with GitHub Actions, as defined in the `.github/workflows/deploy.yaml` file.
  - To deploy the project, add a `GCP_SA_KEY` secret to your GitHub Actions environment. This allows the deployment workflow to authenticate with GCP services. Make sure to copy and paste the entire service account key JSON object, including the enclosing `{}`.
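Since a partially pasted key is an easy mistake to make, a quick local check before saving the `GCP_SA_KEY` secret can help. The sketch below is an optional helper, not part of this repository; the function name `check_service_account_key` is hypothetical, and the field list covers only a few of the fields that service account key files normally contain.

```python
import json

# A subset of the fields found in a GCP service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(text):
    """Return a list of problems found in a pasted service account key."""
    try:
        key = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level value is not a JSON object"]
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in key
    ]
    if key.get("type") not in (None, "service_account"):
        problems.append("type is not 'service_account'")
    return problems


# Placeholder values only; never commit a real key.
good = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "placeholder",
    "client_email": "placeholder@example.com",
})
truncated = good[:-1]  # simulates a paste that lost the closing brace

print(check_service_account_key(good))       # []
print(check_service_account_key(truncated))
```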
- Cloud Function Authentication
  - You will need to add the following environment variables to your GCP Cloud Function:
    - `FLIGHTAWARE_API_KEY`
    - `GCP_CREDENTIALS_JSON_ENCODED`
    - These should be stored in the GCP Secret Manager, then referenced in the Cloud Function configuration.
    - These secrets are assigned to environment variable names in the `deploy.yaml` workflow, using the `secret_environment_variables` parameter.
      - For the deployed app to access these secrets, you must specify a `service-account-email` which the deployed app will belong to. This `service-account-email` must have the `Secret Manager Secret Accessor` role, which can be granted in GCP's IAM & Admin section.
    - List of Deployment Configuration Parameters: GitHub Actions for Google Cloud Functions
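Once the secrets are mapped to environment variables, the function can read them at startup and fail early if either one is missing. The following is a minimal sketch of that check, assuming nothing beyond the two variable names listed above; the helper `load_required_env` is hypothetical and not taken from `src/ingest.py`.

```python
import os

REQUIRED_ENV_VARS = ("FLIGHTAWARE_API_KEY", "GCP_CREDENTIALS_JSON_ENCODED")


def load_required_env(environ=os.environ):
    """Return the required settings, or raise if any are missing or empty."""
    missing = [name for name in REQUIRED_ENV_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in REQUIRED_ENV_VARS}


# Passing a plain dict keeps the example independent of the real environment.
settings = load_required_env({
    "FLIGHTAWARE_API_KEY": "placeholder-api-key",
    "GCP_CREDENTIALS_JSON_ENCODED": "placeholder-encoded-key",
})
print(sorted(settings))
```

A missing variable in the deployed function usually points to either an unmapped secret in `deploy.yaml` or a service account without the `Secret Manager Secret Accessor` role.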