A feature engineering pipeline for income prediction using the Ballet framework

thswear/ballet-predict-census-income


Predict Census Income

This is a collaborative predictive modeling project built on the ballet framework.

This project contains a feature engineering pipeline and associated models that can be used to predict personal income from raw survey responses to the US Census American Community Survey. The model built from features submitted by the community can then be used to optimize administration of the survey, direct public policy interventions, and assist empirical researchers.

See the dates and times of upcoming virtual collaboration hours, which are opportunities to work at the same time as other collaborators and to ask and answer questions in the chat.

Join the collaboration

Are you interested in joining the collaboration?

Your task

Your task is to create and submit one feature to the project.

  1. The easiest way to get started is to launch an interactive Jupyter Lab session to hack on this repository. You can read more about this development workflow here.

  2. Alternatively, you can use your preferred tools and development environment to create and submit a feature from your own machine. You can read about the local development workflow here.

Getting started

First, get acquainted with the Ballet framework if you are not yet familiar.

Virtual collaboration hours

We are hosting several Virtual Collaboration Hours (VCH) in which we will work together on feature engineering and help each other with ideas and implementation. Each VCH will start with a short video presentation for beginners that introduces Ballet and this predict-census-income project, followed by an opportunity for Q&A. We will then split off to work and keep in touch through the Gitter chat.

Schedule (check back here to confirm!)

Dataset

Input data

The input data is the raw survey responses to the 2018 US Census American Community Survey (ACS). This release is known as the "Public Use Microdata Sample" (PUMS): it contains individual-level responses, whereas most other numbers from the ACS are reported only in aggregate.

  • The data documentation can be viewed here
  • The data dictionary can be viewed here in PDF form, or here in CSV form.
  • Many additional resources about the ACS can be viewed here.
  • The dataset is created by merging the "household" and "person" parts of the survey. Thus one row of the dataset contains the responses for one person to both the household and person surveys. A person is identified by a unique SERIALNO. A set of "reasonable" rows is kept by applying three filters: (1) individuals older than 16; (2) personal income greater than $100; (3) hours worked in a typical week greater than 0.

The full script that minimally prepares the data is here.
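The merge and filter described above can be sketched in a few lines. This is only an illustration on toy records, not the project's preparation script: the column names AGEP (age), PINCP (personal income), WKHP (usual weekly hours), and NP (household size) are assumptions about the ACS layout, and the sample values are made up.

```python
# Toy stand-ins for the "household" and "person" parts of the survey.
# All column names except SERIALNO are assumptions; the values are invented.
households = {
    "A1": {"NP": 2},
    "A2": {"NP": 4},
    "A3": {"NP": 1},
    "A4": {"NP": 3},
}
persons = [
    {"SERIALNO": "A1", "AGEP": 34, "PINCP": 52000, "WKHP": 40},
    {"SERIALNO": "A2", "AGEP": 16, "PINCP": 3000, "WKHP": 10},   # not older than 16
    {"SERIALNO": "A3", "AGEP": 45, "PINCP": 100, "WKHP": 20},    # income not above $100
    {"SERIALNO": "A4", "AGEP": 70, "PINCP": 20000, "WKHP": 0},   # no hours worked
]

# Merge: attach the household responses to each person row via SERIALNO.
merged = [{**households[p["SERIALNO"]], **p} for p in persons]


def is_reasonable(row):
    """Apply the three row filters listed above (all strict inequalities)."""
    return row["AGEP"] > 16 and row["PINCP"] > 100 and row["WKHP"] > 0


kept = [row for row in merged if is_reasonable(row)]
kept_ids = [row["SERIALNO"] for row in kept]
print(kept_ids)
```

Real survey data also contains missing values, which this sketch does not handle.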

The resulting training dataset has 30085 rows (people) and 494 columns (raw).

Prediction target

The prediction target is whether an individual respondent will earn more than $84,770 in 2018. Though a bit contrived, this comes from adapting the classic ML "census" dataset to the modern era. The original prediction target is to

determine whether a person makes over 50K a year.

Thus we adjust for inflation from 1994 to 2018.
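As a rough sketch of how this target can be expressed, the snippet below derives the inflation factor implied by the two thresholds stated above and turns incomes into binary labels. The function name and sample incomes are hypothetical, and the project's own target encoding may differ.

```python
ORIGINAL_THRESHOLD_1994 = 50_000  # "over 50K a year" in the classic dataset
THRESHOLD_2018 = 84_770           # inflation-adjusted threshold used here

# Factor implied by the two thresholds above (not an official CPI figure).
implied_inflation_factor = THRESHOLD_2018 / ORIGINAL_THRESHOLD_1994


def earns_above_threshold(income, threshold=THRESHOLD_2018):
    """Return True if the income is strictly greater than the threshold."""
    return income > threshold


# Hypothetical incomes, including the boundary case of exactly $84,770.
incomes = [30_000, 84_770, 84_771, 150_000]
labels = [earns_above_threshold(x) for x in incomes]
print(round(implied_inflation_factor, 4), labels)
```

Because the target is "more than $84,770", an income of exactly $84,770 falls in the negative class.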

Getting help

Usage

To use the feature engineering pipeline:

from predict_census_income.api import api
X_df, y_df = api.load_data()
pipeline = api.pipeline
features = pipeline.fit_transform(X_df, y_df)

To use a sample model:

from predict_census_income.api import api  # needed for api.load_data() below
from predict_census_income.models import train, predict
from predict_census_income.models.logistic_regression import create_logistic_regression_model
X_df, y_df = api.load_data()
model, encoder = train(X_df, y_df, create_logistic_regression_model)  # loads the pipeline/encoder automatically
predict(model, X_df)  # make predictions on training data
