This repository contains source code, data, and calculated features for visualizing recent COVID-19 cases in the U.S.
The transformed data is uploaded to Tableau Public for creating interactive dashboards with line graphs, bar charts, and bubble maps of COVID-19 cases and deaths.
Link to Dashboard on Tableau Public
NEW 11/2022 The transformed data is visualized in an interactive Plotly Dash dashboard. The dashboard script src/covid_dashboard.py
can be deployed on a local server or on a cloud service such as Heroku and Google Cloud Platform.
Link to Dash dashboard deployed using GCP App Engine
Link to GitHub repo for this deployment using GCP App Engine
- Introduction
- Dependencies
- Data Sources
- Calculated Features
- SQL
- Tableau
- Plotly Dash Dashboard
- Future Improvements
This project presents a different aspect than those seen on New York Times [1] and U.S. CDC analysis [2] by analyzing totals of new COVID-19 cases reported in the past 14 days.
Estimates of the percentage of each state population that is currently infected with COVID-19 is calculated by normalizing the past 14 day new case totals to their respective state populations. The data is presented as a percentage because it may be more intuitively understood by the general population, as opposed to using a basis of per million or per 100,000 people.
The author selected time intervals of 14-days based on studies that show that an average period of COVID-19 infectiousness and risk of transmission is between 3 days before and 8 days after symptom onset, and COVID-19 RNA usually becomes undetectable from upper respiratory tract samples about 2 weeks after symptom onset[3, 4, 5].
[1] https://www.nytimes.com/interactive/2021/us/covid-cases.html
[2] https://covid.cdc.gov/covid-data-tracker/#datatracker-home
[3] https://www.cdc.gov/coronavirus/2019-ncov/your-health/quarantine-isolation.html
[4] Peeling RW, Heymann DL, Teo Y, Garcia PJ. Diagnostics for COVID-19: moving from pandemic response to control. Lancet. Published online December 20, 2021: https://doi.org/10.1016/S0140-6736(21)02346-1
[5] https://www.nytimes.com/interactive/2022/01/22/science/charting-omicron-infection.html
This repo requires a Google Cloud Platform account, a Tableau Public account, Python3, and the following Python libraries:
- SQLite3
- Requests
- pandas
- Numpy
- Plotly
- Dash
- Gunicorn
- Werkzeug==2.0.3
U.S. CDC reports COVID-19 cases and death counts online at this link.
The data is a collection of the most recent numbers reported by states, territories, and other jurisdictions to the CDC.
Notes:
- This dataset includes Confirmed Cases and Probable Cases, as defined by CSTE [6]. Confirmed cases meet molecular laboratory testing evidence, while Probable cases meet clinical criteria without laboratory evidence. Many jurisdictions include both their confirmed and probable cases ("pnew_case") into reported "Total Cases" and "New Case" counts.
- Counts for New York City and New York State are provided separately. This data must be recombined to analyze them as one New York state.
Update as of 2023
- This daily-updated dataset was discontinued October 20, 2022 and replaced by a weekly-updated dataset. This project supports only the daily-updated dataset.
- A copy of the daily-updated dataset retrieved on November 14, 2022 is located in
./datasets/United_States_COVID-19_Cases_and_Deaths_by_State_over_Time_Nov_14_22.csv
.
The population of each state is obtained from the most recent U.S. Census Bureau American Community Survey 5-Year Estimate 2015-2019 [7]. Detailed descriptions about this data can be found through this link
This data can be viewed and exported interactively on the U.S. Census Bureau website
The conversion from state names to state abbreviations is needed to combine the U.S. CDC COVID-19 and the U.S. Census ACS datasets.
This information can be found on the U.S. Census Bureau reference library: https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html
Download the "National FIPS and GNIS Codes File" from the reference library.
[6] https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/
[7] https://data.census.gov/cedsci/table?g=0100000U.S.&d=ACS%205-Year%20Estimates%20Subject%20Tables
The following 3 features are calculated from the datasets:
Daily New Cases as Percent of State Population = New Cases/State Population
New Cases in Last 14 Days = Sum(New Cases) Between Submit Date and Previous 13 Dates
New Cases in Last 14 Days as Percent of State Population = Cases Last 14 Days/State Population
Use SQL to execute a query for extracting, joining, and performing calculations on the three datasets and create a table with calculated features appended to the original CDC COVID-19 Cases and Deaths data. This new dataset will be uploaded to the Tableau Public Viz.
Method 1: Use Google Cloud Platform BigQuery to generate a CSV file with current U.S. CDC COVID-19 data:
- Download "CDC U.S. COVID-19 Cases and Deaths By State", "U.S. Census Bureau - 2019 American Community Survey 5-year Estimate", and "ANSI Codes for States" datasets.
- Create a project database in your GCP BigQuery SQL workspace.
- In the project database, Create a Dataset.
- In the new Dataset, Create individual Tables for each dataset. Copy and paste
src/import_CDC_data_schema.sql
to set up the Schema for CDC COVID-19 Cases data, importing all columns as string type. - Create a New Query. Copy and paste the
covid19_ETL.sql
into the blank query. - Run the Query. Save output as
datasets/Generated/US_MMM_DD.csv
where MMM is month as a string and DD is the current date.
Method 2: Use Python (w/ SQLite and Requests libraries) to generate a CSV file with current U.S. CDC COVID-19 data:
- Set up your Anaconda environment.
- Clone
https://github.com/ElliotY-ML/Covid_Cases_By_Percent_Population.git
GitHub repo to your local machine. - Create and activate a new environment, named
covid_data
with Python 3.8+. Be sure to run the command from the project root directory since the environment.yml and pkgs.txt files are there. If prompted to proceed with the install(Proceed [y]/n)
type y.conda env create -f environment.yml conda activate covid_data
- Open a conda terminal and cd into the project root folder.
- Execute
src/build_dataset.py
. This script will retreive current U.S. CDC COVID-19 data and generate an output dataset todatasets/Generated/US_MMM_DD.csv
. This dataset is the same as one that is obtained following Method 1.
The SQL query result dataset is visualized with charts, bubble maps, and interactive dashboards on Tableau Public.
Direct link to author's Tableau Public Viz: https://public.tableau.com/app/profile/ellioty.ml/viz/US_Covid19_Cases_Percent_Population/Dash14Day
To refresh the Tableau Viz with the lastest CDC COVID-19 Cases data:
- Follow the SQL section to generate a CSV file with the most recent CDC data.
- Replace existing Tableau visualization data source with
datasets/Generated/US_MMM_DD.csv
.
The SQL query result dataset is also visualized with charts and bubble maps on an interactive dashboard created with Plotly Dash. This dashboard is deployed on Google Cloud Platform (App Engine).
To learn more, please visit the repo: https://github.com/ElliotY-ML/covid-dashboard-on-gcp
✅ Add scripting to automate refresh of U.S. CDC COVID-19 data. Completed src/build_dataset.py
on Mar-17-2022
✅ Interactive analysis using Jupyter Notebook. Completed Plotly Dash dashboard in src/covid_dashboard.py
on Nov-21-2022
✅ Interactive dashboard application. Completed Deployed Plotly Dash dashboard to Heroku on Nov-21-2022. Updated Deployed Plotly Dash dashboard to GCP App Engine on Jan-31-2023.
This project is licensed under the MIT License - see the LICENSE.md