Contextualized Word Embeddings Contain Emergent Intersectional Biases in a Contextualized Distribution of Human-like Bias Scores
This repository is the official implementation of Contextualized Word Embeddings Contain Emergent Intersectional Biases in a Contextualized Distribution of Human-like Bias Scores.
To set up an environment for the project:
conda create --name ceat
conda activate ceat
To install requirements:
pip install -r requirements.txt
The details of the Word Embedding Association Test (WEAT) and the words used to validate CEAT, IBD, and UIBD are listed in supp.pdf
in this codebase.
We use the Reddit Comment Dataset 2014. Here's the link to the raw JSON comment files.
Because the raw dataset is too large to store here, we provide a pickle file containing the sentences from it that our experiments use. The link to this data file is in the data.md
file. The pickle file contains a single large dictionary with all the sentences needed for CEAT.
After downloading the data file, you can load it in Python:
import pickle
# Replace 'file' with the path to the downloaded pickle file.
with open('file', 'rb') as f:
    dataset = pickle.load(f)
If you prefer to download the raw data yourself, we also provide a code file to process the raw data.
python generate_txt.py
It extracts the comments, cleans the raw text, and saves the result as a pickle file.
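The following is a minimal sketch of the kind of processing described above, not the actual contents of generate_txt.py. It assumes each line of a raw file is a JSON object with a "body" field, as in the public Reddit comment dumps. The cleaning rules (dropping URLs, collapsing whitespace, skipping deleted comments) and the function names are illustrative assumptions.

```python
import json
import re

# Assumed cleaning rule: URLs carry no useful context for the embeddings.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_comment(text):
    """Remove URLs and collapse runs of whitespace into single spaces."""
    text = URL_PATTERN.sub(" ", text)
    return " ".join(text.split())


def extract_comments(lines):
    """Yield cleaned comment bodies from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if not isinstance(record, dict):
            continue
        body = record.get("body", "")
        if body and body != "[deleted]":
            yield clean_comment(body)


# Small in-memory sample standing in for one raw comment file.
sample_lines = [
    '{"body": "Check this out: http://example.com   great   read"}',
    '{"body": "[deleted]"}',
    "not json",
]
cleaned = list(extract_comments(sample_lines))
```

For a real file, the same generator can be fed an open file handle, since iterating over a text file yields one line at a time.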
Besides using the raw JSON files provided by the link, another way is to test with a small sample set. For this, you can use Google BigQuery to query the needed comments. Here's a sample BigQuery script that selects 10 comments from 2014.
select * from `fh-bigquery.reddit_comments.2014` limit 10
After you download the comments, store them in a pickle file as a dictionary. The keys are the target and attribute words, and each value is a list of the comments that contain the key word.
Please set the file paths as needed before running the scripts.
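As an illustration of the dictionary layout described above, here is a minimal sketch that groups comments by key word and pickles the result. The word list, the sample comments, the output file name, and the case-insensitive whole-word matching rule are all assumptions made for this example; the project's own scripts may match words differently.

```python
import os
import pickle
import re
import tempfile


def group_comments_by_word(comments, words):
    """Return {word: [comments containing that word]} using whole-word matching."""
    patterns = {
        w: re.compile(r"\b" + re.escape(w) + r"\b", re.IGNORECASE) for w in words
    }
    grouped = {w: [] for w in words}
    for comment in comments:
        for word, pattern in patterns.items():
            if pattern.search(comment):
                grouped[word].append(comment)
    return grouped


# Hypothetical target and attribute words, and a few made-up comments.
words = ["flower", "insect", "pleasant"]
comments = [
    "I planted a flower in the garden.",
    "That insect landed on the flower.",
    "What a pleasant afternoon.",
    "Flowers are not matched by the whole-word pattern.",
]

dataset = group_comments_by_word(comments, words)

# Hypothetical output path; adjust it to wherever the scripts expect the file.
path = os.path.join(tempfile.gettempdir(), "sample_sentences.pickle")
with open(path, "wb") as f:
    pickle.dump(dataset, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)
```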
In this step we generate contextualized word embeddings and store them in pickle files.
Each pickle file is a dictionary whose keys are the words in the tests and whose values are lists of 300-d contextualized word embeddings.
The generated embedding files should be named weat{test_number}_{model_name}.pickle
python generate_ebd_{model_name}.py
We used four models: BERT, GPT, GPT-2, and ELMo. You will find the four corresponding scripts in the code folder.
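To make the expected file layout concrete, the sketch below writes and reads back a placeholder embedding file that follows the naming pattern above. The random vectors stand in for real model output, the "bert" model-name string and the helper names are hypothetical, and plain Python lists are used here for simplicity; the actual scripts may store NumPy arrays instead.

```python
import os
import pickle
import random
import tempfile


def embedding_filename(test_number, model_name):
    """Build the file name weat{test_number}_{model_name}.pickle."""
    return f"weat{test_number}_{model_name}.pickle"


def placeholder_embedding(dim, rng):
    """Random vector standing in for one contextualized embedding."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)

# One list of embeddings per word: one vector per context the word appeared in.
embeddings = {
    "flower": [placeholder_embedding(300, rng) for _ in range(3)],
    "insect": [placeholder_embedding(300, rng) for _ in range(2)],
}

path = os.path.join(tempfile.gettempdir(), embedding_filename(1, "bert"))
with open(path, "wb") as f:
    pickle.dump(embeddings, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)
```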
Run the script to generate effect sizes and p-values over N=10000 (the default) rounds of sampling.
For CEAT (C1~C10):
python ceat.py
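The sampling idea can be sketched as follows, assuming the standard WEAT effect size: the difference between the mean associations of the two target sets, divided by the standard deviation of the associations over both sets. In each round, one embedding is drawn at random for every word and one effect size is computed. This is an illustrative sketch, not the code in ceat.py. The function names, the toy 2-d embeddings, and the use of the sample standard deviation are assumptions, and the p-value and combined effect size computations are not reproduced here.

```python
import math
import random
import statistics


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, A, B):
    """s(w, A, B): mean similarity of w to A minus mean similarity of w to B."""
    return (statistics.mean([cosine(w, a) for a in A])
            - statistics.mean([cosine(w, b) for b in B]))


def weat_effect_size(X, Y, A, B):
    """WEAT effect size for target sets X, Y and attribute sets A, B."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    # Assumption: sample standard deviation over the union of both target sets.
    return (statistics.mean(s_x) - statistics.mean(s_y)) / statistics.stdev(s_x + s_y)


def sample_effect_sizes(embeddings, X_words, Y_words, A_words, B_words,
                        n_samples=10000, seed=0):
    """Draw one embedding per word in each round and collect the effect sizes."""
    rng = random.Random(seed)

    def pick(words):
        return [rng.choice(embeddings[w]) for w in words]

    return [
        weat_effect_size(pick(X_words), pick(Y_words), pick(A_words), pick(B_words))
        for _ in range(n_samples)
    ]


# Toy 2-d embeddings (made up): x-words lie near a1, y-words lie near b1.
toy = {
    "x1": [[1.0, 0.1], [0.9, 0.2]],
    "x2": [[1.0, 0.0], [0.8, 0.1]],
    "y1": [[0.1, 1.0], [0.2, 0.9]],
    "y2": [[0.0, 1.0], [0.1, 0.8]],
    "a1": [[1.0, 0.0], [0.95, 0.05]],
    "b1": [[0.0, 1.0], [0.05, 0.95]],
}
effect_sizes = sample_effect_sizes(toy, ["x1", "x2"], ["y1", "y2"], ["a1"], ["b1"],
                                   n_samples=100)
```

Because the toy x-words point toward a1 and the y-words toward b1, every sampled effect size in this example is positive.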
The effect sizes and p-values of each test are stored as lists in separate pickle files, named {model_name}_weat{test_number}_pvalue.pickle and {model_name}_weat{test_number}_effectsize.pickle.
The script returns the Combined Effect Size (CES) and p-value of each test.
You can use the matplotlib library to plot the distributions of the sampled effect sizes and p-values.
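One possible way to load the result files and plot them is sketched below. It assumes the files follow the naming pattern above and each contain a flat list of numbers. The helper names, the directory argument, the "bert" model-name string in the usage comment, and the plot styling are hypothetical choices for this example.

```python
import os
import pickle


def result_paths(model_name, test_number, directory="."):
    """Return the (effect size, p-value) file paths for one model and test."""
    prefix = f"{model_name}_weat{test_number}"
    return (
        os.path.join(directory, prefix + "_effectsize.pickle"),
        os.path.join(directory, prefix + "_pvalue.pickle"),
    )


def load_results(model_name, test_number, directory="."):
    """Load the sampled effect sizes and p-values for one model and test."""
    es_path, p_path = result_paths(model_name, test_number, directory)
    with open(es_path, "rb") as f:
        effect_sizes = pickle.load(f)
    with open(p_path, "rb") as f:
        p_values = pickle.load(f)
    return effect_sizes, p_values


def plot_distributions(effect_sizes, p_values, out_path):
    """Save side-by-side histograms of effect sizes and p-values."""
    # Imported here so the loading helpers work without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, (ax_es, ax_p) = plt.subplots(1, 2, figsize=(10, 4))
    ax_es.hist(effect_sizes, bins=50)
    ax_es.set_xlabel("effect size")
    ax_es.set_ylabel("count")
    ax_p.hist(p_values, bins=50)
    ax_p.set_xlabel("p-value")
    ax_p.set_ylabel("count")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)


# Example usage (assumes the result files already exist in the current directory):
# effect_sizes, p_values = load_results("bert", 1)
# plot_distributions(effect_sizes, p_values, "bert_weat1_distribution.png")
```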
To detect intersectional biases of African American females (AF) and Latino American females (LF):
python ibd.py