Analysis of COVID impact on US households

Lakshmi Venkatasubramanian

12/14/2020

Abstract

The goal of this analysis is to gauge the impact of the pandemic on overall household characteristics such as employment status, housing, education disruptions, and dimensions of physical and mental wellness. There is a large amount of emotionally negative stimuli related to the COVID-19 pandemic. How do people prepare themselves in difficult times like this? Analyzing and exploring people's response to pandemic can provide useful insights into people's perspective about COVID and the challenges they face.

As we all know,the impacts of the pandemic and the economic fallout have been widespread, but are particularly prevalent among Black, Latino, Indigenous, and immigrant households. There is also an impact on gender. This analysis will deep dive into some of the impacts of covid by Age group, race and ethinicity and gender. We also try to compare the impacts in Washington state versus all other states in terms of certain indicators and understand the different groups of people based on various characteristics pertaining to COVID. The research questions will target specific variables. Below are some references:

Covid Recession effects

Covid data from NCHS

Data Source

The Household Pulse Survey provides timely data to help understand the experiences of American households during the coronavirus pandemic. Data for this research comes from the Phase 1 Household Pulse Survey that began on April 23 and ended on July 21, 2020 spanning 12 weeks. The dataset is very rich and informative. It dataset has 105 variables, 1088314 observations and includes employment status, food security, housing, physical and mental health, access to health care, and educational disruption. In order to support the nation’s recovery, we need to know the ways this pandemic has affected people’s lives and livelihoods. Data from these datasets will show the widespread effects of the coronavirus pandemic on individuals, families, and communities across the country.

The survey was conducted by an internet questionnaire, with invitations to participate sent by email and text message. Housing units linked to one or more email addresses or cell phone numbers were randomly selected to participate, and one respondent from each housing unit was selected to respond.

Links to Data set and Data dictionary

The datasets are available for public use under census.gov website as weekly files.
Data dictionary is available in the census.gov website under the link Phase 1 Household Pulse Survey Technical Documentation. It has also been added to the repository. Please refer to repository structure for pulse2020_data_dictionary.xlsx

Download data

Data is directly downloaded from census website using python modules. Refer to the data_cleaning.ipynb for detailed downloading steps

Terms of use of census data

The Census Bureau is committed to open government by sharing its public data as open data. Census data continues to be a key national resource, serving as a fuel for entrepreneurship and innovation, scientific discovery, and commercial activity. We continuously identify and publish datasets and Application Programming Interface’s (API’s) to Data.gov in accordance with the Office of Management and Budget (OMB) Memorandum M-10-06, the Executive Order 13642 on open data, and the overall principles outlined in the Digital Government Strategy. In accordance with the Open Data Policy, M-13-13, the Census Bureau publishes its information in machine-readable formats while also safeguarding privacy and security.

Repository structure

├── README.md
├── LICENSE
├── assets
│   └── pictures
├── clean_data
│   ├── covid_clean_data.csv
├── data_dictionary
│   ├── pulse2020_data_dictionary.xlsx
├── sample_raw_data
│   └── pulse2020_raw_data.csv
└── src
    ├── analysis.ipynb
    ├── data_cleaning.ipynb

File	Description
LICENSE	Code license
README.md	This readme
assets/pictures/	Directory containing the various images displayed in the analysis notebook
clean_data/covid_clean_data.csv	CSV file containing a cleaned version of covid dataset. This has the data of all the 12 weekly files cleaned up. This file is used as input to analysis.ipynb file
data_dictionary/pulse2020_data_dictionary.xlsx	Data dictionary for all 12 weeks of raw data downloaded from the url
sample_raw_data/pulse2020_raw_data.csv	Sample data obtained by parsing week 1 survey data from the census url that can be used as the input to data_cleaning.ipynb
src/analysis.ipynb	Contains the report and analysis code/visualizations and takes the input covid_clean_data.csv which is the output of data_cleaning.ipynb
src/data_cleaning.ipynb	Contains the logic to clean all the 12 weeks of data downloaded directly from the url

Research Questions

Understand the impacts of COVID in terms of employment loss, income loss, food insufficiency, education interruptions, inability to meet housing expenses and how does this vary by Race/Ethnicity or gender?
What is the impact on Mental health status (Anxiety and depression)? Is there a correlation between Mental health status (Anxiety and depression) and factors such as age, number of household members, gender, income, health status, race? How does the anxiety levels vary between first and last week of survey?
How does employment loss, income loss, food insufficiency, education interruptions, inability to meet housing expenses in Washington differ as compared to national average?
How do different groups based on age, race and ethnicity differ in their behavior or attitude towards COVID. Are there any patterns observed in the population based on certain characteristics pertaining to COVID?

Methodology

For question 1, Logistic regression has been used as the response indicator variables are binary in nature, all the data points are independent and the sample size is large enough. Also, chi's square test of independence has been used to compare 2 categorical ordinal variables which is the case here. Overall likelihood ratio test has been used to verify if the full model that includes gender, race/ethnicity tell us more about the outcome (or response) variable than a model that does not include these 2 variables.

For question 2 , ordinal logistic regression has been used because the response variable is categorical and ordered in nature, all the data points are independent and the sample size is large enough. We also used Random features feature importance to identify the top 10 features impacting Anxiety/depression. Overall likelihood ratio test has been used to verify if the full model that includes the predictors in question namely gender, worry, interest, income loss, food insufficiency, Age group, number of household members, income level, health status, race/ethnicity tell us more about Anxiety/depression than a model that does not include these variables.

For question 3, Logistic regression has been used as the response indicator variables are binary in nature, all the data points are independent and the sample size is large enough. Also, chi's square test of independence has been used to compare 2 categorical variables which is the case here. Overall likelihood ratio test has been used to verify if the full model that includes state tell us more about the outcome (or response) variable than a model that does not include this variable

For question 4, Principal component Analysis and K-means clustering have been used to identify any patterns and classify groups of people based on similar characteristics

Results: The results will be presented as intepretation of coefficients, significance of hypothesis tests and comprehensive compilation of visualizations.

How to run the notebook

Install Anaconda

Using a terminal or cmd, navigate to the src folder.

Launch jupyter by running: jupyter notebook

Select the notebook of interest. (Start at data_cleaning.ipynb for the full process or analysis.ipynb for the final report.)

Schema of the files created

There is one CSV file extracted and compiled as part of this analysis which is clean_data/covid_clean_data.csv. Below is its schema

Columns	Description	Data type
WEEK	Wee k of the survey	numeric
EGENDER	The gender of the respondent. Takes a value in {'MALE', 'FEMALE'}.	string
THHLD_NUMPER	The number of members in the household. Takes a value between 1 and 10	numeric
HLTHSTATUS	Overall health status of the respondent. Takes on values in {'POOR', 'FAIR', 'EXCELLENT'	string
WORRY	Worry level of the respondent. Takes on values in {'NONE','MODERATE','VERY HIGH'}	string
INTEREST	Interest level of the respondent. Takes on values in {'NONE','MODERATE','VERY HIGH'}	string
RACE_ETHNICITY	Race or Ethnicity of the respondent. Takes on values in {'Hispanic','White alone','Black alone','Asian alone','Other races'}	string
EMP_STATUS	Employment status of the respondent. Takes on values in {0,1}	numeric
EMPLOSSCOVID	Employment loss due to covid. Takes on values in {0,1}	numeric
FOOD_INSUFF	Food insufficiency due to covid. Takes on values in {0,1}	numeric
RENT_DEBT	Inability to pay the rent due to covid. Takes on values in {0,1}	numeric
RENT_DEBT	Inability to pay the rent due to covid. Takes on values in {0,1}	numeric
INCOMELOSS	Income loss due to covid. Takes on values in {0,1}	numeric
AGE_GROUP	Age group of the respondent. Takes on values in {'18 - 24','25 - 39','40 - 54','55 - 64','65 and above'}	string
EDUC	Worry level of the respondent. Takes on values in {'Less than a high school diploma','High school diploma or GED','Some college/associate's degree','Bachelor's degree or higher'}	string
INCOME_LEV	Income level of the respondent. Takes on values in {'Less than $25,000','$25,000 - $74,999','$75,000 - $149,999','$150,000 and above'}	string

Source of Bias

Nonsampling errors can also occur and are more likely for surveys that are implemented quickly, achieve low response rates, and rely on online response. Nonsampling errors for the Household Pulse Survey may include:

Measurement error: The respondent provides incorrect information, or an unclear survey question is misunderstood by the respondent. The Household Pulse Survey schedule offered only limited time for testing questions.
Coverage error: Individuals who otherwise would have been included in the survey frame were missed. The Household Pulse Survey only recruited households for which an email address or cell phone number could be identified.
Nonresponse error: Responses are not collected from all those in the sample or the respondent is unwilling to provide information. The response rate for the Household Pulse Survey was substantially lower than most federally sponsored surveys.
Processing error: Forms may be lost, data may be incorrectly keyed, coded, or recoded. The real-time dissemination of the Household Pulse Survey provided limited time to identify and fix processing errors.

The Census Bureau employs quality control procedures to minimize these errors. However, the potential bias due to nonsampling errors has not yet been evaluated.

Resources used

This analysis was prepared using Python 3.8 running in a Jupyter Notebook environment.
Documentation for Python can be found here: https://docs.python.org/3.8/
Documentation for Jupyter Notebook can be found here: http://jupyter-notebook.readthedocs.io/en/latest/

The following Python packages were used and their documentation can be found at the accompanying links:

Unknowns and dependencies

These data are experimental and samples may not be representative of the population.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Analysis of COVID impact on US households

Abstract

Data Source

Links to Data set and Data dictionary

Download data

Terms of use of census data

Repository structure

Research Questions

Methodology

How to run the notebook

Schema of the files created

Source of Bias

Resources used

Unknowns and dependencies

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
assets/pictures		assets/pictures
clean_data		clean_data
data_dictionary		data_dictionary
sample_raw_data		sample_raw_data
src		src
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md

License

lakshmi2688/COVID_Impact_on_US_Households

Folders and files

Latest commit

History

Repository files navigation

Analysis of COVID impact on US households

Abstract

Data Source

Links to Data set and Data dictionary

Download data

Terms of use of census data

Repository structure

Research Questions

Methodology

How to run the notebook

Schema of the files created

Source of Bias

Resources used

Unknowns and dependencies

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages