Recidiviz Data Platform

At the center of Recidiviz is our platform for tracking granular criminal justice metrics in real time. It includes a system for the ingest of corrections records from different source data systems, and for calculation of various metrics from the ingested records.

Read more on data ingest in /recidiviz/ingest and calculation in /recidiviz/calculator.

License

This project is licensed under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Data Access

The data that we have gathered from criminal justice systems has been sanitized, de-duplicated, and standardized in a single schema. This processed data is central to our purposes but may be useful to others, as well. If you would like access to the processed data, in whole or in part, please reach out to us at [email protected]. We evaluate such requests on a case-by-case basis, in conjunction with our partners.

Calculated metrics can also be made available through the same process, though we anticipate publishing our analysis in various forms and channels over time.

Forking

The Recidiviz data system is provided as open source software - for transparency and collaborative development, to help jump-start similar projects in other spaces, and to ensure continuity if Recidiviz itself ever becomes inactive.

If you plan to fork the project for work in the criminal justice space (to ingest from the same systems we are, or similar), we ask that you first contact us for a quick consultation. We work carefully to ensure that our ingest activities don't disrupt other users' experiences with the public data services we read, but if multiple ingest processes are running against the same systems, without knowing about one another, it may place excessive strain on them and impact the services those systems provide.

If you have ideas or new work for the same data we're collecting, let us know and we'll work with you to find the best way to get it done.

Development

If you are contributing to this repository regularly for an extended period of time, request GitHub collaborator access to commit directly to the main repository.

Local Development

Environment setup

Option 1: Local Python installation

If you can install python3.9 locally, do so. For local Python development, you will also need to install the libpq PostgreSQL client library and openssl.

On a Mac with Homebrew, you can install python3.9 by first installing pyenv with:

brew install pyenv
brew install xz
mkdir ~/.pyenv

Then, add the following to your ~/.zshrc (or equivalent):

export PATH="$HOME/.local/bin:$PATH"
if command -v pyenv 1>/dev/null 2>&1; then
  eval "$(pyenv init -)"
fi

Then run:

pyenv install 3.9.12
pyenv global 3.9.12

Verify that you have the correct version of python across contexts by opening a new terminal window and running:

python -V
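
The same check can be scripted. The following is a minimal sketch, assuming only that the 3.9 series is what this README asks for; the helper name `is_required_python` is hypothetical and not part of this repository.

```python
import sys
from typing import Optional, Sequence

REQUIRED = (3, 9)  # major/minor version this README asks for


def is_required_python(version: Optional[Sequence[int]] = None) -> bool:
    """Return True when the major/minor version matches REQUIRED.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version is None:
        version = sys.version_info[:3]
    return tuple(version[:2]) == REQUIRED
```

Calling `is_required_python()` in a fresh terminal reports whether the interpreter on your PATH is in the 3.9 series.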

Once python is installed, you can install libpq and openssl with:

$ brew install postgresql@13 openssl

and add the following to your ~/.zshrc (or equivalent):

export PATH="/opt/homebrew/opt/postgresql@13/bin:$PATH"

On Ubuntu 18.04, openssl is installed by default. You can install python3.9 and libpq with:

$ apt update -y && apt install -y python3.9-dev python3-pip libpq-dev

You do not need to change your default python version, as pipenv will look for 3.9.

Upgrade your pip to the latest version:

$ pip install -U pip

NOTE: if you get ImportError: cannot import name 'main' after upgrading pip, follow the suggestions in this issue.

If you do not already have pip installed, you can install it on a Mac with these commands:

$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
$ python get-pip.py --user

On Ubuntu 18.04, you can install pip with:

$ sudo apt-get install python3-pip

Install pipenv:

$ pip install pipenv --user

Fork this repository, clone it locally, and enter its directory:

$ git clone [email protected]:your_github_username/pulse-data.git
$ cd pulse-data

To create a new pipenv environment and install all project and development dependencies on Mac and Debian machines, run the initial_pipenv_setup script.

NOTE: Installation of one of our dependencies (psycopg2) requires OpenSSL, and as OpenSSL is not linked on Macs by default, this script temporarily sets the necessary compiler flags and then runs pipenv sync --dev. After this initial installation all pipenv sync/installs should work without this script.

$ ./initial_pipenv_setup.sh

On a Linux machine, run the following:

$ pipenv sync --dev

NOTE: if you get pipenv: command not found, add the binary directory to your PATH as described here.

To activate your pipenv environment, run:

$ pipenv shell

On a Mac with Homebrew, you can install the JRE with:

$ brew install java

On Ubuntu 18.04, you can install the JRE with:

$ apt update -y && apt install -y default-jre

On a Mac with Homebrew, you can install jq (needed to deploy calculation pipelines) with:

$ brew install jq

On Ubuntu 18.04, you can install jq with:

$ apt update -y && apt install -y jq
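
Before running the tests, it can help to confirm that the command-line tools installed above are actually on your PATH. This is a minimal sketch using the standard library's `shutil.which`; the helper name `missing_tools` is hypothetical.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tool names that cannot be found on PATH.

    `which` is injectable so the lookup can be faked in tests.
    """
    return [name for name in names if which(name) is None]
```

For example, `missing_tools(["pipenv", "java", "jq"])` returns an empty list when all three tools are found.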

Finally, run pytest. As of Feb 2022, one might expect ~200 tests to fail locally, with errors mainly falling into one of two categories: Receiver() takes no arguments and Already initialized database/ValueError: Accessing SQLite in-memory database on multiple threads. The former error is due to an incompatibility with Cython that may be due to newer Mac models or python versions, and the latter is due to tests not properly cleaning up after themselves. All of these tests pass in CI. You can ignore any failing tests with (for example):

$ pytest --ignore=recidiviz/tests/path/to/tests
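
If you need to skip several directories at once, the `--ignore` flag can be repeated. The sketch below only builds the argument list; the helper name `pytest_args` and the paths in the usage note are hypothetical placeholders, not real test directories in this repository.

```python
from typing import Iterable, List


def pytest_args(target: str, ignored: Iterable[str]) -> List[str]:
    """Build a pytest command line with one --ignore flag per path."""
    args = ["pytest", target]
    args.extend(f"--ignore={path}" for path in ignored)
    return args
```

The resulting list can be passed to `subprocess.run`, for example `subprocess.run(pytest_args("recidiviz", ["recidiviz/tests/path/one", "recidiviz/tests/path/two"]))`.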

Option 2: Docker container

If you can't install python3.9 locally, you can use Docker instead.

See below for installation instructions. Once Docker is installed, fork this repository, clone it locally, and enter its directory:

$ git clone [email protected]:your_github_username/pulse-data.git
$ cd pulse-data

Build the image:

$ docker build -t recidiviz-image . --build-arg DEV_MODE=True

Stop and delete previous instances of the image if they exist:

$ docker stop recidiviz && docker rm recidiviz

Run a new instance, mounting the local working directory within the image:

$ docker run --name recidiviz -d -t -v $(pwd):/app recidiviz-image

Open a bash shell within the instance:

$ docker exec -it recidiviz bash

Once in the instance's bash shell, update your pipenv environment:

$ pipenv sync --dev

To activate your pipenv environment, run:

$ pipenv shell

Finally, run pytest. If no tests fail, you are ready to develop!

Using this Docker container, you can edit your local repository files and use git as usual within your local shell environment, but execute code and run tests within the Docker container's shell environment. Depending on your IDE, you may need to install additional plugins to allow running tests in the container from the IDE.

Google Cloud

Recidiviz interacts with Google Cloud services using google-cloud-* Python client libraries. During development, you may find it useful to verify the integration with these services. First, install the Google Cloud SDK, then log in to the SDK:

gcloud auth login --enable-gdrive-access --update-adc # Gets credentials to interact with services via the CLI
gcloud auth application-default login # Gets credentials which will be automatically read by our client libraries

Lastly, in a test script, use the local_project_id_override helper to override configuration used by our client library wrappers:

from recidiviz.utils.metadata import local_project_id_override
from recidiviz.utils.environment import GCP_PROJECT_STAGING

# Override configuration used by our client libraries
with local_project_id_override(GCP_PROJECT_STAGING):
    # Google Cloud client libraries will use `recidiviz-staging` in this context
    ...  # Replace with the code you want to run against staging

Now the code run in the above context will interact directly with our staging services. Use conservatively & exercise caution!
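
To make the scoping behavior concrete, here is a minimal sketch of how an override context manager of this kind might work. It is an illustration only and not the actual implementation in recidiviz.utils.metadata; the names `project_id_override`, `current_project_id`, and the `"local-project"` default are all hypothetical.

```python
from contextlib import contextmanager
from typing import Iterator, Optional

# Hypothetical module-level state; the real helper may store this differently.
_override: Optional[str] = None


def current_project_id(default: str = "local-project") -> str:
    """Return the overridden project id if one is active, else the default."""
    return _override if _override is not None else default


@contextmanager
def project_id_override(project_id: str) -> Iterator[None]:
    """Temporarily override the project id, restoring it on exit."""
    global _override
    previous = _override
    _override = project_id
    try:
        yield
    finally:
        # Restore even if the body raised, so the override never leaks.
        _override = previous
```

The `try`/`finally` is the important design choice: the override applies only inside the `with` block, which is why code outside that block does not touch staging.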

Terraform

Run the following to install Terraform:

brew tap hashicorp/tap
brew install hashicorp/tap/terraform

To test your installation, run:

terraform -chdir=recidiviz/tools/deploy/terraform init -backend-config "bucket=recidiviz-staging-tf-state"
recidiviz/tools/deploy/terraform_plan.sh recidiviz-staging

If the above commands succeed, the installation was successful. For employees, see more information on running Terraform at go/terraform.

Docker (🐳 go/docker)

Docker is needed for deploying new versions of our applications.

Follow these instructions to install Docker on Linux:

Go to this page to download Docker Desktop for Mac and Windows.

Once installed, increase the memory available to Docker to ensure it has enough resources to build the container. On Docker Desktop, you can do this by going to Settings > Resources and increasing Memory to 4GB.

Adding secrets

Recidiviz depends on sensitive information to run. This data is stored in Cloud Datastore, which should be added manually to your production environment (see utils/secrets for more information on the Datastore kind used).

Running tests

Individual tests can be run via pytest filename.py. To run all tests, go to the root directory and run pytest recidiviz.

The configuration in setup.cfg and .coveragerc will ensure the right code is tested and the proper code coverage metrics are displayed.

A bug in the google client requires that you have default application credentials. This should not be necessary in the future. For now, make sure that you have done both gcloud config set project recidiviz and gcloud auth application-default login.
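
If you are unsure whether application default credentials are present, a quick local check is possible. This sketch assumes the standard lookup locations on Linux and macOS (the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, then the gcloud well-known file under the home directory); it does not handle Windows or a custom gcloud config directory, and the helper name `find_adc_file` is hypothetical.

```python
from pathlib import Path
from typing import Mapping, Optional


def find_adc_file(env: Mapping[str, str], home: Path) -> Optional[Path]:
    """Return the credentials file that would likely be picked up, if any."""
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit and Path(explicit).is_file():
        return Path(explicit)
    # File written by `gcloud auth application-default login` on Linux/macOS.
    well_known = home / ".config" / "gcloud" / "application_default_credentials.json"
    return well_known if well_known.is_file() else None
```

Calling `find_adc_file(os.environ, Path.home())` returns `None` when no credentials file is found, which suggests the gcloud login step above still needs to be run.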

Checking code style

Run Pylint across the main body of code, in particular: pylint recidiviz.

The output will include individual lines for all style violations, followed by a handful of reports, and finally a general code score out of 10. Fix any new violations in your commit. If you believe there is cause for a rule change, e.g. if you believe a particular rule is inappropriate in the codebase, then submit that change as part of your inbound pull request.

Autoformatting

We use black to ensure consistent formatting across the code base and isort to sort imports. There is a pre-commit hook that will format all of your files automatically. It is defined in githooks/pre-commit and is installed by ./initial_pipenv_setup.sh.

You can also set up your editor to run black and isort on save. See the black docs for how to configure external tools (both black and isort) to run in PyCharm (more info in PyCQA/isort#258).

In VS Code, add the following to your .vscode/settings.json:

    "editor.formatOnSave": true,
    "python.formatting.provider": "black",
    "[python]": {
        "editor.codeActionsOnSave": {
            "source.organizeImports": true
        }
    },

Static type checking

Run Mypy across all code to check for static type errors: mypy recidiviz.
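
As an illustration of the kind of error Mypy catches, consider a value that may be `None`. The functions below are hypothetical and not taken from this codebase.

```python
from typing import Optional


def parse_year(raw: Optional[str]) -> Optional[int]:
    """Parse a year from raw text, returning None for missing values."""
    if raw is None or not raw.strip():
        return None
    return int(raw)


def years_since(raw: Optional[str], current_year: int) -> Optional[int]:
    """Return the number of years between the parsed year and current_year."""
    year = parse_year(raw)
    # Without this check, Mypy would reject the subtraction below because
    # `year` has type Optional[int] and may be None at runtime.
    if year is None:
        return None
    return current_year - year
```

Removing the `None` check still produces code that runs for well-formed input, which is why a static check is useful here: the failure would otherwise appear only when a missing value reaches the subtraction.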

Static security checking

We use bandit to check for static security errors within the recidiviz folder. This is run in the CI. Adding # nosec to the affected line suppresses false positives.
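
For example, bandit flags weak hash functions such as MD5 even when they are not used for security. The sketch below shows how the comment might be applied in that situation; the function `cache_key` is hypothetical and not part of this codebase.

```python
import hashlib


def cache_key(payload: bytes) -> str:
    """Return a short, stable key for payload.

    MD5 is used here only as a non-cryptographic fingerprint, so the
    bandit finding on this line would be a false positive.
    """
    return hashlib.md5(payload).hexdigest()  # nosec
```

Keep the comment on the exact line bandit reports, and say why the finding is a false positive nearby so reviewers can verify the exemption.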

Deployment

Install the GCloud SDK using the interactive installer.

Note: make sure the installer did not add google-cloud-sdk/platform/google_appengine or subdirectories thereof to your $PYTHONPATH, e.g. in your bash profile. This could break attempts to run tests within the pipenv shell by hijacking certain dependencies.
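
One way to check for this is to scan `PYTHONPATH` for App Engine entries. This is a minimal sketch; the helper name `appengine_entries` is hypothetical.

```python
import os
from typing import List


def appengine_entries(pythonpath: str, sep: str = os.pathsep) -> List[str]:
    """Return PYTHONPATH entries that point into google_appengine."""
    return [
        entry
        for entry in pythonpath.split(sep)
        if entry and "google_appengine" in entry
    ]
```

Running `appengine_entries(os.environ.get("PYTHONPATH", ""))` should return an empty list; any entry it does return is a candidate for removal from your shell profile.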

Make sure you have docker installed (see instructions above), then configure docker authentication:

$ gcloud auth login
$ gcloud auth configure-docker

Troubleshooting

If you see a pipenv error (either during install or sync) with the following:

An error occurred while installing psycopg2==...

On a Mac:

  1. Ensure postgresql and openssl are installed with: brew install postgresql openssl
  2. Run the initial pipenv setup script: ./initial_pipenv_setup.sh

On Linux: Ensure libpq is installed with: apt update -y && apt install -y libpq-dev
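
After applying the fix, you can confirm from within the pipenv shell that the package is now discoverable. This sketch uses the standard library's `importlib.util.find_spec`, which only locates a top-level module and does not prove that its compiled extension loads; the helper name `is_importable` is hypothetical.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True when a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None
```

For example, `is_importable("psycopg2")` should return `True` once the installation has succeeded.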