Code to replicate The Undesirable Dependence on Frequency of Gender Bias Metrics Based on Word Embeddings (2022).

The following guide was run on Ubuntu 18.04.4 LTS with Python 3.9.12 and R 4.2.0. Setting up a conda environment is optional.


Install Python requirements:

python -m pip install -r requirements.txt

Install R requirements:

Rscript install_packages.R

Clone Stanford's GloVe repo into this repository:

git clone

Alternatively, add it as a submodule:

git submodule add

To build GloVe:

  • In Linux: cd GloVe && make

  • In Windows: make -C "GloVe"


Wikipedia corpus

  1. Download the 2021 English Wikipedia dump into the corpora directory:
mkdir -p corpora
wget -c -b -P corpora/ $WIKI_URL/$WIKI_FILE
# flag "-c": continue getting a partially-downloaded file
# flag "-b": go to background after startup. Output is redirected to wget-log.
  2. Extract the dump into a raw .txt file:
src/data/ corpora/enwiki-20210401-pages-articles.xml.bz2
  3. Create a text file with one line per sentence, removing articles with fewer than 50 words:
python3 -u src/data/ corpora/enwiki-20210401-pages-articles.txt
  4. Remove non-alphanumeric symbols from sentences, clean up whitespace, and convert uppercase to lowercase:
src/data/ $CORPUS_IN > $CORPUS_OUT
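
As a rough illustration of the cleaning step above, the following Python sketch lowercases a sentence, replaces non-alphanumeric symbols, and collapses whitespace. It is not the repository's script: the function name `clean_sentence`, the restriction to ASCII letters and digits, and the choice to replace symbols with a space (rather than delete them) are all assumptions.

```python
import re

# Assumption: only ASCII letters and digits count as alphanumeric here.
_NON_ALNUM = re.compile(r"[^a-z0-9\s]")
_WHITESPACE = re.compile(r"\s+")


def clean_sentence(sentence: str) -> str:
    """Lowercase, replace non-alphanumeric symbols with spaces, collapse whitespace."""
    lowered = sentence.lower()
    no_symbols = _NON_ALNUM.sub(" ", lowered)
    return _WHITESPACE.sub(" ", no_symbols).strip()


print(clean_sentence("  Hello,   World! It's 2021. "))
# → hello world it s 2021
```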

Check the number of lines, tokens, and bytes in the preprocessed corpus:

wc corpora/wiki2021.txt
# 78051838  1748884626 10453280228 corpora/wiki2021.txt

Shuffle corpus

Shuffle the corpus multiple times. Set the seeds in the script under src/data/. Each shuffled corpus is saved as corpora/wiki2021s<seed>.txt.

bash src/data/ $CORPUS_IN
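
To make the idea of a seeded shuffle concrete, here is a minimal in-memory Python sketch. It assumes the shuffle is applied to tokens across the whole corpus while keeping the number of tokens per line; the repository's script may shuffle a different unit and is certainly written to handle a corpus that does not fit in memory. The function name `shuffle_tokens` is hypothetical.

```python
import random


def shuffle_tokens(lines, seed):
    """Shuffle all tokens across the corpus, keeping each line's token count.

    Assumption: tokens are whitespace-separated and the corpus fits in memory.
    """
    tokenized = [line.split() for line in lines]
    tokens = [tok for toks in tokenized for tok in toks]
    # A dedicated Random instance makes the result reproducible for a given seed.
    random.Random(seed).shuffle(tokens)

    shuffled, start = [], 0
    for toks in tokenized:
        shuffled.append(" ".join(tokens[start:start + len(toks)]))
        start += len(toks)
    return shuffled


corpus = ["the cat sat", "on the mat"]
for seed in (1, 2):
    # Mirrors the naming scheme corpora/wiki2021s<seed>.txt
    print(f"corpora/wiki2021s{seed}.txt", shuffle_tokens(corpus, seed))
```

Word frequencies are unchanged by this kind of shuffle, while co-occurrences become random.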

Co-occurrence counts

  1. Create the vocabulary of the original and shuffled corpora using the GloVe module:
mkdir -p data/working &&
OUT_DIR=data/working &&
IDS=(wiki2021 wiki2021s1 wiki2021s2 wiki2021s3 wiki2021s4 wiki2021s5) && 
for id in ${IDS[@]}; do
    src/ $corpus $OUT_DIR $VOCAB_MINCOUNT
  2. Create co-occurrence matrices in scipy.sparse format (.npz file) using the GloVe module:
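
The repository relies on the GloVe tools for both steps. Purely as an illustration of what they compute, the Python sketch below builds a vocabulary with a minimum count and accumulates symmetric window-based co-occurrence counts. The names `build_vocab` and `cooccurrences`, the window size, and the unweighted counting are assumptions; converting the counts to a scipy.sparse matrix and saving it as .npz is left out.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Map each word to its frequency, keeping words seen at least min_count times."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return {w: c for w, c in counts.items() if c >= min_count}


def cooccurrences(sentences, vocab, window=2):
    """Count symmetric co-occurrences of in-vocabulary words within a window.

    Simplifications (assumptions): out-of-vocabulary tokens are dropped before
    windowing, and every pair counts as 1 regardless of distance.
    """
    pairs = Counter()
    for s in sentences:
        toks = [t for t in s.split() if t in vocab]
        for i, word in enumerate(toks):
            for j in range(max(0, i - window), i):
                context = toks[j]
                pairs[(word, context)] += 1
                pairs[(context, word)] += 1
    return pairs


sentences = ["she is a doctor", "he is a doctor"]
vocab = build_vocab(sentences, min_count=2)
counts = cooccurrences(sentences, vocab, window=2)
print(sorted(vocab))
print(counts[("doctor", "is")])
```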

Word embeddings

  1. Download the pretrained GloVe and word2vec embeddings:
python3 -u src/
  2. Train SGNS on the corpora with the gensim library. For each corpus, this saves a .model file with the trained model and a .npy file with the embeddings as an array. If the model is large, files with the extensions .trainables.syn1neg.npy and .wv.vectors.npy may be saved alongside the .model file.
bash src/
  3. Train GloVe on the corpora and save one .npy file per corpus with the vectors as an array.
bash src/

Bias quantification

  1. Compute female vs male $Bias_{WE}$ with pretrained word embeddings. The lists of context words are specified in words_lists/ as text files. Results are saved to results/bias_{modelname}_{A}_{B}.csv with one row per word in the vocabulary.
mkdir -p results &&
A="FEMALE" &&
B="MALE" &&
python3 -u src/ $A $B "glove-wiki-gigaword-300" &&
python3 -u src/ $A $B "word2vec-google-news-300"
  2. Compute female vs male $Bias_{WE}$ and $Bias_{PMI}$ for the original and shuffled corpora. The lists of context words are specified in words_lists/ as text files. Results are saved to results/bias_{modelname}_{A}_{B}.csv with one row per word in the vocabulary.
A="FEMALE" &&
B="MALE" &&
nohup src/ $A $B &
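
This README does not define the two metrics, so the Python sketch below uses one common formulation as an assumption: $Bias_{WE}$ as the difference between the mean cosine similarity of a word vector to the context words in A and to those in B, and $Bias_{PMI}$ as PMI(x, A) - PMI(x, B), which reduces to a log ratio of conditional co-occurrence frequencies. The function names and arguments are hypothetical, and the paper's exact definitions may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bias_we(x, vectors_a, vectors_b):
    """Mean cosine of x to group A minus mean cosine of x to group B (assumed form)."""
    mean_a = sum(cosine(x, a) for a in vectors_a) / len(vectors_a)
    mean_b = sum(cosine(x, b) for b in vectors_b) / len(vectors_b)
    return mean_a - mean_b


def bias_pmi(c_xa, c_a, c_xb, c_b):
    """PMI(x, A) - PMI(x, B) from co-occurrence counts (assumed form).

    c_xa, c_xb: co-occurrences of x with the words in A and in B.
    c_a, c_b: total co-occurrences of the words in A and in B.
    All four counts must be positive.
    """
    return math.log((c_xa / c_a) / (c_xb / c_b))


x = [1.0, 0.0]
print(bias_we(x, [[1.0, 0.0]], [[0.0, 1.0]]))  # positive: x is closer to A
print(bias_pmi(20, 100, 10, 100))              # positive: x co-occurs more with A
```

With A as the female list and B as the male list, positive values indicate a female association under both sketches.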


To replicate figures for $Bias_{WE}$ with pretrained word embeddings:

mkdir -p results/plots
R -e 'rmarkdown::render("plots_pretrained.Rmd", "html_document")'

To replicate tables and figures for $Bias_{WE}$ and $Bias_{PMI}$ with the original and shuffled 2021 Wikipedia corpora:

R -e 'rmarkdown::render("plots_trained.Rmd", "html_document")'

Results are saved as HTML documents.

conda environment

You can create a bias-frequency conda environment to install requirements and dependencies. This is not compulsory.

To install miniconda if needed, run:

# and follow stdout instructions to run commands with `conda`

To create a conda env with Python and R:

conda config --add channels conda-forge
conda create -n "bias-frequency" --channel=defaults python=3.9.12
conda install -n "bias-frequency" --channel=conda-forge r-base=4.2.0

Activate the environment with conda activate bias-frequency, then install pip with conda install pip.