
GLS Confidence Intervals for 2020 Census Disclosure Avoidance

This repository contains a GLS (generalized least squares) estimator based on the noisy measurements generated by the 2020 Census Disclosure Avoidance System (DAS), as described in the paper Full-Information Estimation For Hierarchical Data.

The input parameters for the main algorithm (e.g., paths, Spark partition sizes) are specified in a short Python file called a driver; the drivers are located in GLS/drivers/. The example drivers currently in that directory all run the main estimation algorithm and then estimate confidence intervals (CIs) for a set of geographic levels and queries (a hypothetical sketch of a driver appears below). These drivers are invoked through the file run_cluster.sh.
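The exact parameter names vary by driver, so the following is only a minimal sketch of the kind of settings a driver specifies; every name and value here is illustrative, not the actual interface of the files in GLS/drivers/:

```python
# Hypothetical driver sketch: all names below are illustrative assumptions,
# not the real interface of the drivers in GLS/drivers/.

# Reader path: location of the downloaded noisy measurement files (NMFs).
READER_PATH = "s3://my-bucket/nmf/pr_pl94/"

# Writer path: where the GLS estimates and confidence intervals are written.
WRITER_PATH = "s3://my-bucket/gls_output/pr_pl94/"

# Spark partition size for the main algorithm.
NUM_PARTITIONS = 1024

# Geographic levels and queries for which confidence intervals are estimated.
GEOGRAPHIC_LEVELS = ["State", "County", "Tract", "Block"]
QUERIES = ["total", "votingage"]
```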

In addition to the main algorithm, this repo provides the script two_pass_gls_algorithm_cov_test.py. This script generates a random spine and random query variances, and then estimates the GLS estimator's variance matrix using the standard approach, i.e., by forming the full coefficient matrix that encodes the coefficient matrices of every geographic unit on the spine. It then iterates over all pairs of geographic units on the spine and verifies that the covariance between the detailed-cell estimates of each pair (read off the previously computed standard GLS covariance matrix) matches the derivation of this matrix provided in the paper.
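As a rough illustration of the check this script performs, the sketch below assumes a precomputed full GLS covariance matrix and a callable implementing the paper's closed-form derivation; the names and structure are assumptions, not the actual API of two_pass_gls_algorithm_cov_test.py:

```python
# Hypothetical sketch of the pairwise covariance check; names and structure
# are assumptions, not the actual code of two_pass_gls_algorithm_cov_test.py.
import itertools
import numpy as np

def check_pairwise_covariances(full_cov, block_index, derived_covariance,
                               atol=1e-8):
    """For every pair of geographic units, compare the covariance block of
    the standard (full-matrix) GLS covariance against the paper's
    closed-form derivation.

    full_cov:           (n, n) standard GLS covariance over all detailed cells
    block_index:        dict mapping geounit -> row/column indices in full_cov
    derived_covariance: callable (u, v) -> closed-form covariance block
    """
    for u, v in itertools.combinations(block_index, 2):
        standard_block = full_cov[np.ix_(block_index[u], block_index[v])]
        if not np.allclose(standard_block, derived_covariance(u, v), atol=atol):
            raise AssertionError(f"covariance mismatch for pair ({u}, {v})")
```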

How to Run

To run the drivers for this program:

  1. Have either an EMR cluster or another Spark environment set up. Important note: the spark-submit configuration used by this run is tuned for r6i.8xlarge instances. If your environment uses instances smaller than this, adjust the spark-submit settings in the run_cluster.sh script.
  2. Log into the environment and clone the repo.
  3. cd into the cloned repo.
  4. Run the environment setup: sudo bash setup_environment.sh. This sets up the virtual environment, installs the third-party requirements, and builds the .zip file for Spark.
  5. After this, setup is complete. All that's left is to download the required noisy measurement files (NMFs) from AWS Open Data, set the reader paths in the desired driver file (e.g., GLS/drivers/pr_pl94_2020.py) to the paths of the NMFs, set the writer paths in the driver file, and then run the driver. For example, to call the PL94 driver for Puerto Rico, you can run driver=GLS/drivers/pr_pl94_2020.py nohup bash run_cluster.sh 1> out_run_gls.log 2>&1 &.

Additional Notes

This repo runs considerably faster if NumPy is built against Intel MKL for its LAPACK and BLAS routines. A quick way to check which implementation your NumPy build is using is shown below.
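```python
# Check which BLAS/LAPACK implementation NumPy was built against; the
# printed configuration should mention "mkl" if MKL is in use.
import numpy as np
np.show_config()
```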
