BloomTech-Labs/human-rights-first-police-ds-a

Overview

The Human Rights First Organization is a US-based nonprofit, nonpartisan organization concerned with international human rights. At its forefront are American ideals and universal values. For nearly 40 years HRF has challenged the status quo by highlighting the global struggle for human rights and stepping in to demand reform, accountability, and justice. The goal of this project is to create a fully functioning web application capable of visually demonstrating valid, current incidents of police use of force within the United States. The information helps users, such as journalists and members of the public, form their own perspectives on current events. The user interface immediately captures attention with geotagged clusters of incidents.

This project has been worked on by many BloomTech Labs teams over the past 10 months. In the final month of development, Labs Cohort 36 was tasked with finalizing the codebase and architecture to deploy a production-ready app. This included: automating our collection of Twitter data, deploying to AWS Elastic Beanstalk, adapting our database architecture to the backend team's schema, labeling 5,000 tweets to retrain our BERT model, creating performance metrics for our model, cleaning our codebase, and updating the documentation.


Features

Deployed Product

Front End Dashboard | Data Science API


Twitter Scraper

  • Automated through the FastAPI framework in main.py to run every four hours
  • Each time it runs, it randomly selects a search query from a set of phrases (police, police brutality, police abuse, police violence) to use in the Twitter API search (see the sketch after this list)
  • Relevant functions for the scraper feature can be found in scraper.py
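As a rough illustration of how the scheduled scrape can be wired up with FastAPI and fastapi-utils, here is a minimal sketch. It is not the deployed implementation: scrape_twitter() is an assumed helper name standing in for the real functions in scraper.py.

import random

from fastapi import FastAPI
from fastapi_utils.tasks import repeat_every

from app.scraper import scrape_twitter  # assumed import path and helper name

app = FastAPI()

SEARCH_QUERIES = ["police", "police brutality", "police abuse", "police violence"]

@app.on_event("startup")
@repeat_every(seconds=60 * 60 * 4)  # run every four hours
def scheduled_scrape() -> None:
    # Pick one search phrase at random for this run's Twitter API query.
    query = random.choice(SEARCH_QUERIES)
    scrape_twitter(query)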

Twitter Bot

  • Invoked through main.py.form_out
  • main.py.advance_all must be run to advance each conversation one step (this is automated to run every four minutes in one of the pull requests)
  • main.py.advance_all runs automatically every hour; a distributed lock ensures only one worker runs it at a time
  • Code fragments were left in place so the conversational Twitter bot can be updated later
  • The checks-made count is updated for each check; check frequency should eventually use exponential backoff (see the sketch after this list)
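A minimal sketch of what exponential backoff on the check frequency could look like. This is illustrative only; checks_made stands in for the per-conversation counter mentioned above, and the base delay and cap are assumptions.

BASE_DELAY_MINUTES = 4
MAX_DELAY_MINUTES = 60 * 24  # cap the wait at one day

def next_check_delay(checks_made: int) -> int:
    """Double the wait after every unanswered check, up to the cap."""
    return min(BASE_DELAY_MINUTES * 2 ** checks_made, MAX_DELAY_MINUTES)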

Alembic

  • Allows developers to manage migrations safely
  • Connected to models.py through the declarative_base import (see the env.py sketch after this list)
  • Connected to the production DB through the .env file (not committed to the repo)
  • In the CLI, after creating a virtual environment from requirements.txt:
  • To generate a revision file, run alembic revision --autogenerate, then spot-check the revision file for errors
  • To apply that revision, run alembic upgrade head
  • To undo a revision, run alembic downgrade -1
  • Bear in mind that revisions do not preserve data you drop, so keep a pg_dump file on hand to recreate the DB with psql
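For orientation, the relevant part of alembic/env.py looks roughly like the sketch below. It assumes models.py exposes Base = declarative_base() and that the .env file defines DB_URL; the import path is an assumption.

import os

from alembic import context
from dotenv import load_dotenv

from app.models import Base  # assumed import path; Base = declarative_base()

load_dotenv()
config = context.config
config.set_main_option("sqlalchemy.url", os.getenv("DB_URL"))

# --autogenerate diffs this metadata against the production database.
target_metadata = Base.metadata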

BERT Model

BERT is an open-source, pre-trained natural language processing (NLP) model from Google. The role of BERT in our project is to take the tweets collected by our Twitter scraper and predict whether or not each tweet discusses police use of force and, if so, what type of force was used. BERT uses a 6-rank classification system as follows:

  • Rank 0: No police presence.
  • Rank 1: Police are present, but no force detected.
  • Rank 2: Open-hand: Officers use bodily force to gain control of a situation. Officers may use grabs, holds, and joint locks to restrain an individual.
  • Rank 3: Blunt Force: Officers use less-lethal technologies to gain control of a situation. For example, a baton or projectile may be used to immobilize a combative person.
  • Rank 4: Chemical & Electric: Officers use less-lethal technologies to gain control of a situation, such as chemical sprays, projectiles embedded with chemicals, or tasers to restrain an individual.
  • Rank 5: Lethal Force: Officers use lethal weapons (guns, explosives) to gain control of a situation.

The BERT model does not currently live in the GitHub repository due to its large file size. When running the app locally, it is best to manually store the saved_model file in the app directory.
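Once the saved_model directory is in place, loading it and ranking a tweet looks roughly like the sketch below. This is a minimal sketch, assuming saved_model was produced by save_pretrained() for both model and tokenizer; adjust the paths to your local setup.

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("saved_model")
model = BertForSequenceClassification.from_pretrained("saved_model")  # 6 labels
model.eval()

def predict_rank(tweet: str) -> int:
    """Return the predicted force rank (0-5) for a single tweet."""
    inputs = tokenizer(tweet, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=1))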

BERT Rankings

Taking a deeper dive, we can turn our eyes to the black box of our model. For this task we will use LIME, an acronym for Local Interpretable Model-agnostic Explanations. Local refers to local fidelity, meaning we want the explanation to really reflect the behavior of the classifier "around" the instance being predicted. Interpretable refers to producing explanations a human can make sense of. Lastly, model-agnostic refers to giving explanations without needing to 'peek' inside the model.

How does LIME work? For our problem we use the LIME TextExplainer. The TextExplainer generates many texts similar to the document (by removing some words), then trains a white-box classifier that predicts the output of the black-box classifier. This process can be broken down into three simple steps. First, generate perturbed texts. Second, predict probabilities for these generated texts with the black-box classifier. Third, train a white-box classifier to predict the black-box classifier's output. While a black box is hard to approximate globally, the algorithm works by approximating it with a white-box classifier in a small neighborhood around the given text. Finally, let's look at some visualizations! Below, LimeTextExplainer shows us the weight of each word in an incident report.

[Screenshot: LIME word weights for an incident report]

In the picture above, the model predicts class 5 with 100% probability. Within the incident report, the word "shot" has the highest weight for class 5, at 0.22. That means if we removed the word "shot" from the incident report, we would expect the model to predict class 5 with probability 100% - 22% = 78%. Conversely, the words "handgun" and "was" have small negative weights.
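An explanation like this can be produced in a few lines. The sketch below is illustrative only: predict_proba stands in for an assumed wrapper that runs the BERT model over a list of texts and returns an (n_texts, 6) array of class probabilities, and incident_text is the report being explained.

from lime.lime_text import LimeTextExplainer

class_names = [f"Rank {i}" for i in range(6)]
explainer = LimeTextExplainer(class_names=class_names)

exp = explainer.explain_instance(
    incident_text,    # the incident report being explained
    predict_proba,    # assumed black-box probability function
    num_features=10,
    labels=[5],       # explain the lethal-force class
)
print(exp.as_list(label=5))  # (word, weight) pairs such as ("shot", 0.22)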

Notebooks

There are two notebooks pertaining to the model:

  • FrankenBERT_Training.ipynb: trains a BERT instance based on the data given to it from the training table in our postgres AWS EB database and our generated tweets
  • FrankenBERT_Performance.ipynb: used for statistical analysis and to calculate model performance metrics (i.e. binary and multi-classification confusion matrices, accuracy, etc.)

There is a supplementary notebook for generating synthetic tweets with GPT-2:

  • Training_GPT_2_w_GPU.ipynb: trains GPT-2 on force rank classes based on the data given to it from our postgres AWS database before generating batches of synthetic tweets

These notebooks can be accessed from your virtual environment once all dependencies are installed within it. Two additional libraries, Transformers and psycopg2-binary, are installed by the first cell in each notebook.


DS Architecture

[Diagram: data science architecture]


Old Codebase

Old and currently undeployed code is stored in the archive folder of the repo. Some files are kept to show the evolution of the code from previous BloomTech Labs cohorts to the currently deployed code. Others are starter code that could provide inspiration for features that were deprioritized for the initial release (e.g. the conversational Twitter bot). A more in-depth description of each file is stored in a markdown file in the archive directory.


FastAPI Test Client

In the test folder there is a FastAPI TestClient script to test all API endpoints. FastAPI's TestClient lets developers check that application endpoints are working as expected, helps keep junk data out of the database, and makes debugging easier with custom pytest reports. To run the test client you must have pytest installed (https://www.guru99.com/pytest-tutorial.html). From the root directory, run: $ pytest
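For orientation, a test in that script looks roughly like the following. This is a sketch only; the root route and its expected status code are assumptions, so adjust them to the endpoints actually defined in app/main.py.

from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)

def test_root_returns_ok():
    response = client.get("/")
    assert response.status_code == 200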


Next Steps

For those interested in improving upon the data science codebase, here are some recommendations:

  • Explore the efficacy of separating the AWS 'postgres' database into two different databases. The first database would be the primary database for the Twitter scraper outputs and DS would redesign the schema to fit their needs. The second database would be the primary database for backend and they could extract data from the DS database and fit the schema to their needs. Currently, the primary AWS data table 'force_ranks' is accessible in both the data science and backend codebases.
  • Develop an evidence-based strategy to maximize the effectiveness of our Twitter queries in the scraper feature. Currently, the Twitter API has a 500 tweet limit per scraping. This would include developing metrics to compare querying methods. Metrics would allow us to determine which methods return a greater percentage of tweets describing police use-of-force in the United States.
  • The stakeholder would like us to filter out incidents based on location before an incident is put into the database. This means we would have to try to gather location from the initial tweet. The scraper function may need to be reworked slightly to accommodate this.



Labs 39 Contributors

  • Philip Feiran Lee (Technical Project Manager)
  • Michael Carrier (Machine Learning Ops)
  • Christopher Chilton (Outside Consultant)

Labs 38 Contributors

  • Christopher Chilton (DS Project Manager)
  • Ian Knight (Data Engineer)
  • Gabriel Nosek (Machine Learning Engineer)
  • Michael Carrier (Machine Learning Ops)

Labs 37 Contributors

  • Ryan Fikejs (Technical Project Manager)
  • Imani Kirika (Technical Project Manager)
  • Joshua Elamin (Technical Project Manager)
  • Rowen Witt (Data Engineer)
  • Brody Osterbuhr (Data Scientist: ML Ops)
  • Rhia George (Machine Learning Engineer)
  • Andrew Haney (Data Scientist: ML Ops)
  • Murat Benbanaste (Machine Learning Engineer)

Labs 36 Contributors

  • Hillary Khan (Data Scientist)
  • Marcos Morales (Data Scientist)
  • Eric Park (Data Scientist)


Getting Started

Dependencies

pandas numpy scikit-learn torch transformers spacy plotly tweepy beautifulsoup4 SQLAlchemy dataset python-dotenv uvicorn fastapi fastapi-utils


Environment Variables

In order for the app to function correctly, the user must set up their own environment variables. There should be a .env file containing the following:

1. Twitter API connection (through tweepy; use the HRF Twitter developer account)
	a. CONSUMER_KEY=
	b. CONSUMER_SECRET=
	c. ACCESS_KEY=
	d. ACCESS_SECRET=
2. Postgres database connection
	a. DB_URL= <currently pointing at the production database>
3. Map API credentials
	a. MAP_API= <credentials for the Google Maps API>
4. Bot variables
	a. BOT_NAME= <this can be anything; currently stored in the .env file but can be moved>
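These values are typically read at startup with python-dotenv (already in the dependency list). A minimal sketch, assuming the .env file lives in the project root:

import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

CONSUMER_KEY = os.getenv("CONSUMER_KEY")
CONSUMER_SECRET = os.getenv("CONSUMER_SECRET")
ACCESS_KEY = os.getenv("ACCESS_KEY")
ACCESS_SECRET = os.getenv("ACCESS_SECRET")
DB_URL = os.getenv("DB_URL")
MAP_API = os.getenv("MAP_API")
BOT_NAME = os.getenv("BOT_NAME")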

Installation Instructions and running API locally

For AWS deployment we used requirements.txt to store our dependencies. Here are the steps to create a virtual environment, install the dependencies from requirements.txt, and run the app locally. Alternative instructions for creating a Pipfile with pipenv follow.

MacOS:

  1. clone the repo
  2. cd into the repo
  3. create a virtual environment:
$ python3 -m venv name_for_env
  4. activate the virtual environment:
$ source name_for_env/bin/activate
  5. check activation:
$ which python
# should return:
#   name_for_env/bin/python
  6. install all dependencies from requirements.txt:
$ python3 -m pip install -r requirements.txt
  7. run the API locally on your machine:
$ gunicorn app.main:app -w 1 -k uvicorn.workers.UvicornWorker

Or

$ uvicorn app.main:app --reload
  8. close the app with control+c in the terminal
  9. deactivate the environment:
$ deactivate

If you prefer to use pipenv and create a pipfile from our requirements.txt:

  1. clone the repo
  2. cd into the repo
  3. install the pip environment (this will create a Pipfile for you):
$ pipenv install
  4. activate the environment:
$ pipenv shell
  5. run the API locally on your machine:
$ gunicorn app.main:app -w 1 -k uvicorn.workers.UvicornWorker

Or

$ uvicorn app.main:app --reload
  6. close the app with control+c in the terminal
  7. deactivate the environment:
$ exit

Windows:

  1. clone the repo
  2. cd into the repo
  3. create a virtual environment:
$ py -m venv env
  4. activate the virtual environment:
$ .\env\Scripts\activate
  5. check activation:
$ where python
# should return a path ending in:
#   \env\Scripts\python.exe
  6. install all dependencies from requirements.txt:
$ py -m pip install -r requirements.txt
  7. run the API locally on your machine:
$ uvicorn app.main:app --reload
  8. close the app with control+c in the terminal
  9. deactivate the environment:
$ deactivate

If you prefer to use pipenv and create a pipfile from our requirements.txt:

  1. clone the repo
  2. cd into the repo
  3. install the pip environment (this will create a Pipfile for you):
$ pipenv install
  4. activate the environment:
$ pipenv shell
  5. run the API locally on your machine:
$ uvicorn app.main:app --reload
  6. close the app with control+c in the terminal
  7. deactivate the environment:
$ exit

How to access DB from browser

[Screenshot: CredentialsMap]
