To contribute to this project:
- Fork the repository on GitHub.
- On your development machine, clone your forked repo and add the official repo as a remote.
  - Tip: by convention, the official repo is added with the name `upstream`. This can be done with the command `git remote add upstream git@github.com:SatcherInstitute/<repo>.git`
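As a sketch, the add-remote step can be tried in a throwaway repo (`example-repo` below is a placeholder for the actual repo name):

```shell
# Create a throwaway repo to illustrate adding the official remote.
cd "$(mktemp -d)"
git init -q
# Add the official repo under the conventional name "upstream".
# (example-repo is a placeholder; use the real repo name.)
git remote add upstream git@github.com:SatcherInstitute/example-repo.git
# List remotes to confirm the new entry.
git remote -v
```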
When you're ready to make changes:
- Pull the latest changes from the official repo.
  - Tip: If your official remote is named `upstream`, run `git pull upstream master`
- Create a local branch, make changes, and commit to your local branch. Repeat until changes are ready for review.
- [Optional] Rebase your commits so you have few commits with clear commit messages.
- Push your branch to your remote fork, use the github UI to open a pull request (PR), and add reviewer(s).
- Push new commits to your remote branch as you respond to reviewer comments.
- Note: once a PR is under review, don't rebase changes you've already pushed to the PR. This can confuse reviewers.
- When ready to submit, use the "Squash and merge" option. This maintains linear history and ensures your entire PR is merged as a single commit, while being simple to use in most cases. If there are conflicts, pull the latest changes from master, merge them into your PR, and try again.
Note that there are a few downsides to "Squash and merge":
- The official repo will not show commits from collaborators if the PR is a collaborative branch.
- Working off the same branch or a dependent branch duplicates commits on the dependent branch and can cause repeated merge conflicts. To work around this, if you have a PR `my_branch_1` and you want to start work on a new PR that is dependent on `my_branch_1`, you can do the following:
  - Create a new local branch `my_branch_2` based on `my_branch_1`. Continue to develop on `my_branch_2`.
  - If `my_branch_1` is updated (including by merging changes from master), switch to `my_branch_2` and run `git rebase -i my_branch_1` to incorporate the changes into `my_branch_2` while maintaining the branch dependency.
  - When review is done, squash and merge `my_branch_1`. Don't delete `my_branch_1` yet.
  - From your local client, check out master and pull to update your local master branch with the squashed change.
  - From your local client, run `git rebase --onto master my_branch_1 my_branch_2`. This tells git to move all the commits between `my_branch_1` and `my_branch_2` onto master. You can now delete `my_branch_1`.
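The dependent-branch workflow above can be sketched end-to-end in a throwaway local repo. Branch names match the example; file names and commit messages are illustrative, and the GitHub "Squash and merge" is simulated with `git merge --squash`:

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.email dev@example.com
git config user.name dev
git symbolic-ref HEAD refs/heads/master
echo base > base.txt && git add . && git commit -qm "base"
# PR 1: my_branch_1
git checkout -qb my_branch_1
echo one > one.txt && git add . && git commit -qm "work for PR 1"
# Dependent PR 2: my_branch_2, based on my_branch_1
git checkout -qb my_branch_2
echo two > two.txt && git add . && git commit -qm "work for PR 2"
# Simulate "Squash and merge" of my_branch_1 into master
git checkout -q master
git merge -q --squash my_branch_1
git commit -qm "work for PR 1 (#1)"
# Move only my_branch_2's own commits onto the squashed master
git rebase --onto master my_branch_1 my_branch_2
# my_branch_1 is no longer needed
git branch -D my_branch_1
```

After the rebase, `my_branch_2` contains the squashed PR 1 commit plus its own commits, with no duplicates.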
Read more about the forking workflow here. For details on "Squash and merge" see here
Install Cloud SDK (Quickstart)
Install Terraform (Getting started)
Install Docker Desktop (Get Docker)
```
gcloud config set project <project-id>
```
- Create a virtual environment in your project directory, for example: `python3 -m venv .venv`
- Activate the venv: `source .venv/bin/activate`
- Install pip-tools and other packages as needed: `pip install pip-tools`
This may be useful either:
- until Terraform and fully automated deployment are set up, or
- for manual testing/experimentation. Different cloud functions can be deployed from the same source code, so you can deploy to a test function without affecting any of the other resources.
Although a function can be created via the `gcloud functions deploy` command, there are some options you need to configure the first time it is deployed. It is much easier to create the function from the Cloud Console, and then use the command line to deploy source code updates.
Once a function is created, to deploy it from the command line:
- Navigate to the directory the `main.py` function is in
- Run `gcloud functions deploy fn_name`

Note that this deploys the contents of the current directory to the cloud function specified by `fn_name`. Be careful, as this will overwrite the contents of `fn_name` with the contents of the current directory. You can use this for testing and development by deploying the source code to a test function.
To change configuration details, you have to specify these options in the `deploy` command. For example:
- If you need to change the entrypoint, use the `--entry-point` option.
- If you need to change the trigger topic, use the `--trigger-topic` option.
A full list of options can be found here. Changing configuration of the function is usually easier from the cloud console UI.
To test a Cloud Function or Cloud Run service triggered by a Pub/Sub topic, run:

```
gcloud pubsub topics publish projects/<project-id>/topics/<your_topic_name> --message "your_message"
```

- `your_topic_name` is the name of the topic the function specifies as a trigger.
- `your_message` is the JSON message that will be serialized and passed to the `'data'` property of the event.
Note that this method will work for the upload-to-GCS function or service, which expects to read information from the `'data'` field. The GCS-to-BQ function or service expects to read from the `'attributes'` field, so the `--attribute` flag should be used instead. See Documentation for details.
For example, you can use the following command to trigger ingestion for the list of state names and state codes. Note that the backslashes are required on Windows to escape the inner quotes correctly; OS X or Linux may not require them.
```
gcloud pubsub topics publish projects/temporary-sandbox-290223/topics/{upload_to_gcs_topic_name} --message "{\"id\":\"STATE_NAMES\", \"url\":\"https://api.census.gov/data/2010/dec/sf1\", \"gcs_bucket\":{gcs_landing_bucket}, \"filename\":\"state_names.json\"}"
```
where `upload_to_gcs_topic_name` and `gcs_landing_bucket` are the same as the terraform variables of the same name.
Most python code should go in the `/python` directory, which contains packages that can be installed into any service. Each sub-directory of `/python` is a package with an `__init__.py` file, a `setup.py` file, and a `requirements.in` file. Shared code should go in one of these packages. If a new sub-package is added:
- Create a folder `/python/<new_package>`. Inside, add:
  - An empty `__init__.py` file
  - A `setup.py` file with options: `name=<new_package>`, `package_dir={'<new_package>': ''}`, and `packages=['<new_package>']`
  - A `requirements.in` file with the necessary dependencies
- For each service that depends on `/python/<new_package>`, follow instructions at Adding an internal dependency
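The steps above can be sketched for a hypothetical package named `example_pkg` (the name and the `requests` dependency are placeholders):

```shell
# Scaffold a hypothetical /python/example_pkg package.
cd "$(mktemp -d)"
mkdir -p python/example_pkg
# An empty __init__.py marks the directory as a package.
touch python/example_pkg/__init__.py
# setup.py with the name, package_dir, and packages options described above.
cat > python/example_pkg/setup.py <<'EOF'
from setuptools import setup

setup(
    name='example_pkg',
    package_dir={'example_pkg': ''},
    packages=['example_pkg'],
)
EOF
# requirements.in lists the package's direct dependencies.
printf 'requests\n' > python/example_pkg/requirements.in
```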
To work with the code locally, run `pip install ./python/<package>` from the root project directory. If your IDE complains about imports after changing code in `/python`, re-run `pip install ./python/<package>`.
Note: the `/python` directory has three root-level files that aren't necessary: `main.py`, `requirements.in`, and `requirements.txt`. These exist purely so the whole `/python` directory can be deployed as a cloud function, in case people are relying on that for development/quick iteration. Due to limitations with cloud functions, these files have to exist directly in the root folder. We should eventually remove these.
- Add the dependency to the appropriate `requirements.in` file.
  - If the dependency is used by `/python/<package>`, add it to the `/python/<package>/requirements.in` file.
  - If the dependency is used directly by a service, add it to the `<service_directory>/requirements.in` file.
- For each service that needs the dependency (for deps in `/python/<package>` this means every service that depends on `/python/<package>`):
  - Run `cd <service_directory>`, then `pip-compile requirements.in`, where `<service_directory>` is the root-level directory for the service. This will generate a `requirements.txt` file.
  - Run `pip install -r requirements.txt` to ensure your local environment has the dependencies, or run `pip install <new_dep>` directly. Note, you'll first need to have followed the Python environment setup described above.
If a service adds a dependency on `/python/<some_package>`:
- Add `-r ../python/<some_package>/requirements.in` to the `<service_directory>/requirements.in` file. This will ensure that any deps needed for the package get installed for the service.
- Follow step 2 of Adding an external dependency to generate the relevant `requirements.txt` files.
- Add the line `RUN pip install ./python/<some_package>` to `<service_directory>/Dockerfile`
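The first and third steps can be sketched with hypothetical names (`some_pkg` and `my_service` are placeholders). Step 2's `pip-compile` is omitted here since it resolves dependencies over the network:

```shell
cd "$(mktemp -d)"
mkdir -p python/some_pkg my_service
printf 'pandas\n' > python/some_pkg/requirements.in
# Step 1: include the package's requirements in the service's requirements.in
printf -- '-r ../python/some_pkg/requirements.in\n' > my_service/requirements.in
# Step 3: install the package inside the service's Docker image
printf 'RUN pip install ./python/some_pkg\n' >> my_service/Dockerfile
```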
The Cloud Code plugin for VS Code and JetBrains IDEs lets you locally run and debug your container image in a Cloud Run emulator within your IDE. The emulator allows you to configure an environment that is representative of your service running on Cloud Run.
- Install Cloud Run for VS Code or a JetBrains IDE.
- Follow the instructions for locally developing and debugging within your IDE.
- VS Code: Locally developing and debugging
- IntelliJ: Locally developing and debugging
- After installing the VS Code plugin, a `Cloud Code` entry should be added to the bottom toolbar of your editor.
- Clicking on this and selecting the `Run on Cloud Run emulator` option will begin the process of setting up the configuration for your Cloud Run service.
- Give your service a name
- Set the service container image url with the following format: `gcr.io/<PROJECT_ID>/<NAME>`
- Make sure the builder is set to `Docker` and the correct Dockerfile path is selected, `prototype/run_ingestion/Dockerfile`
- Ensure the `Automatically re-build and re-run on changes` checkbox is selected for hot reloading.
- Click run
After your Docker container successfully builds and is running locally you can start sending requests.
- Open a terminal
- Send curl requests in the following format:

```
DATA=$(printf '{"id":<INGESTION_ID>,"url":<INGESTION_URL>,"gcs_bucket":<BUCKET_NAME>,"filename":<FILE_NAME>}' | base64) && curl --header "Content-Type: application/json" -d '{"message":{"data":"'$DATA'"}}' http://localhost:8080
```
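To see what the service will receive, you can round-trip a sample payload locally; the field values below are illustrative:

```shell
# Build a sample payload (values are illustrative) and base64-encode it,
# mirroring what the curl command above sends in message.data.
msg='{"id":"SAMPLE_ID","url":"https://example.com/data","gcs_bucket":"sample-bucket","filename":"sample.json"}'
DATA=$(printf '%s' "$msg" | base64)
# The service decodes message.data back into the original JSON:
printf '%s' "$DATA" | base64 -d
```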
- Create a service account in Pantheon
- Using IAM, grant the appropriate permissions to the service account
- Inside the `launch.json` file, set the `configuration->service->serviceAccountName` attribute to the service account email you just created.
Before deploying, make sure you have installed Terraform and a Docker client (e.g. Docker Desktop). See One time setup above.
- Create your own `terraform.tfvars` file in the same directory as the other terraform files. For each variable declared in `prototype_variables.tf` that doesn't have a default, add your own value for testing. Typically your own variables should be unique and can just be prefixed with your name or ldap. There are some that have specific requirements like project ids, code paths, and image paths.
- Configure docker to use credentials through gcloud.

```
gcloud auth configure-docker
```

- On the command line, navigate to your project directory and initialize terraform.

```
cd path/to/your/project
terraform init
```

- Build and push your Docker images to Google Container Registry. Select any unique identifier for `your-[ingestion|gcs-to-bq]-image-name`.

```
# Build the images locally
docker build -t gcr.io/<project-id>/<your-ingestion-image-name> -f run_ingestion/Dockerfile .
docker build -t gcr.io/<project-id>/<your-gcs-to-bq-image-name> -f run_gcs_to_bq/Dockerfile .
# Upload the images to Google Container Registry
docker push gcr.io/<project-id>/<your-ingestion-image-name>
docker push gcr.io/<project-id>/<your-gcs-to-bq-image-name>
```

- Deploy via Terraform.

```
# Get the latest image digests
export TF_VAR_ingestion_image_name=$(gcloud container images describe gcr.io/<project-id>/<your-ingestion-image-name> \
  --format="value(image_summary.digest)")
export TF_VAR_gcs_to_bq_image_name=$(gcloud container images describe gcr.io/<project-id>/<your-gcs-to-bq-image-name> \
  --format="value(image_summary.digest)")
# Deploy via terraform, providing the paths to the latest images so it knows to redeploy
terraform apply -var="ingestion_image_name=<your-ingestion-image-name>@$TF_VAR_ingestion_image_name" \
  -var="gcs_to_bq_image_name=<your-gcs-to-bq-image-name>@$TF_VAR_gcs_to_bq_image_name"
```

Alternatively, if you aren't familiar with bash or are on Windows, you can run the above `gcloud container images describe` commands manually and copy/paste the output into your tfvars file for the `ingestion_image_name` and `gcs_to_bq_image_name` variables.

- To redeploy, e.g. after making changes to a Cloud Run service, repeat steps 4-5. Make sure you run the commands from your base project dir.
Currently the setup deploys both a cloud function and a Cloud Run instance for each pipeline. These are duplicates of each other. Eventually, we will delete the cloud functions, but for now you can just comment out the setup for whichever one you don't want to use in `prototype.tf`.
Terraform doesn't automatically diff the contents of the functions/Cloud Run service, so simply calling `terraform apply` after making code changes won't upload your new changes. This is why Steps 4 and 5 are needed above. Here are several alternatives:
- Use `terraform taint` to mark a resource as requiring redeploy, e.g. `terraform taint google_cloud_run_service.ingestion_service`
  - For Cloud Run, you can then set the `run_ingestion_image_path` variable in your tfvars file to `gcr.io/<project-id>/<your-ingestion-image-name>` and `run_gcs_to_bq_image_path` to `gcr.io/<project-id>/<your-gcs-to-bq-image-name>`. Then replace Step 5 above with just `terraform apply`. Step 4 is still required.
  - For Cloud Functions, no extra work is needed; just run `terraform taint` and then `terraform apply`
- For Cloud Functions, call `terraform destroy` every time before `terraform apply`. This is slow but a good way to start from a clean slate. Note that this doesn't remove old container images, so it doesn't help for Cloud Run services.