WIP Update README.md #14

Open
wants to merge 7 commits into
base: main
40 changes: 23 additions & 17 deletions README.md
@@ -2,11 +2,11 @@

## 👋 Hi there

Welcome to the Data Engineering code interview! This small data challenge is designed to test out your skills in python, sql, git, and geospatial data processing. The challenge will go from easy to difficult, there's no preassure to finish all the tasks, so try your best and get as far as you can!
Welcome to the Data Engineering code interview! This small data challenge is designed to test out your skills in python, sql, git, and geospatial data processing. The challenge will go from easy to difficult. Try your best and complete as much as you can!

To start this challenge, create a new **private** repo under your github username. We would like you to include all the code, notes, visualizations, and data inside of the repo. You will have **48 hours** to complete this data challenge. Once you are done, please provide read access to your repo by inviting `@mbh329`, `@td928` and `@AmandaDoyle`
To start this challenge, create a new **private** repo under your GitHub username. We would like you to include all the code, notes, visualizations, and data inside of the repo. You will have **48 hours** to complete this data challenge. Once you are done, please provide read access to your repo by inviting `@damonmcc`, `@fvankrieken` and `@AmandaDoyle`.

> ⚠️ Note: **the repo has to be `<ins>`private`</ins>`, otherwise you will be automatically `<ins>`disqualified`</ins>`**. Also we will check your commit timestamp to only account for the first 48 hours of coding activities.
> ⚠️ Note: **the repo has to be private, otherwise you will be automatically disqualified**. Also we will check your commit timestamp to only account for the first 48 hours of coding activities.

## What we are looking for

@@ -33,11 +33,11 @@ Your code interview will be evaluated based on your repo, so make sure all files

- [Introduction](#introduction)
- [Task 1: Data Download](#task-1-data-download)
- [Task 2: Data Aggregation](#task-2-data-aggregation)
- [Task 3: Data Visualization](#task-3-data-visualization)
- [Task 4: Spatial Data Processing](#task-4-spatial-data-processing)
- [Task 5: SQL](#task-5-sql)
- [Task 6: Spatial SQL](#task-6-spatial-sql)
- [Task 2: Data Aggregation](#task-2-data-aggregation-via-python)
- [Task 3: Data Visualization](#task-3-data-visualization-via-python)
- [Task 4: Spatial Data Processing](#task-4-spatial-data-processing-via-python)
- [Task 5: SQL](#task-5-data-aggregation-via-sql)
- [Task 6: Spatial SQL](#task-6-spatial-data-processing-via-sql)
- [Resources](#resources)

## Introduction
@@ -46,29 +46,29 @@ We love the NYC 311 service and the open data products that come with it. In thi

### Task 1: Data Download

Write a python script/notebook to download all service request records created in the **last week** (7 days) and has **HPD** as the responding agency, and store the data in a csv named `raw.csv` in a folder called `data`.
Write a script to download all service request records that were created in the **last week** (7 days) and have **HPD** as the responding agency, and store the data in a csv named `raw.csv` in a folder called `data`.
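
A minimal sketch of how the download in Task 1 could be approached in Python, using only the standard library. It assumes the 311 data is available through the NYC Open Data Socrata endpoint with dataset id `erm2-nwe9`, and that the relevant columns are named `created_date` and `agency`; the endpoint, the column names, the `$limit` value, and the helper names `build_download_url` and `download` are assumptions for illustration, not part of the challenge text.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Assumption: the 311 dataset is exposed as CSV at this Socrata endpoint.
BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"


def build_download_url(agency="HPD", days=7, now=None, limit=1_000_000):
    """Build a query URL for requests handled by `agency` in the last `days` days."""
    # Simplification: the cutoff is computed in UTC; the dataset's timestamps
    # may be local NYC time, so the window can be off by a few hours.
    now = now or datetime.now(timezone.utc)
    since = (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%S")
    params = {
        # Assumed column names: created_date, agency.
        "$where": f"created_date >= '{since}' AND agency = '{agency}'",
        # Assumed to be large enough to cover one week of records.
        "$limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def download(path="data/raw.csv", **query):
    """Fetch the filtered records and store them at `path`."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)  # create data/ if missing
    urlretrieve(build_download_url(**query), str(target))
    return target
```

Calling `download()` with no arguments would write `data/raw.csv` for HPD and a 7 day window; other tools (for example `curl` against the same URL) would work equally well.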

The idea behind this change is to make the download step more agnostic? If so, I think the omission of python works well for this - is it worth adding in more specificity in our directions or do we think that the "ambiguity" is a good window into how the candidate thinks/problem solves?


### Task 2: Data Aggregation
### Task 2: Data Aggregation via Python

Create a time series table based on the `data/raw.csv` file we created from **Task 1** that has the following fields
Using a Python script/notebook, create a time series table based on the `data/raw.csv` file we created from **Task 1** that has the following fields:

- `created_date_hour`: the timestamp of request creation by date and hour
- `complaint_type`: the type of the complaint
- `count`: the count of service requests by `complaint_type` by `created_date_hour`

Store this table in a csv under the `data` folder with a csv file name of your choice.
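
The aggregation described above could be sketched as follows. This version sticks to the standard library; a pandas `groupby` would be an equally reasonable choice. It assumes the raw file has a `created_date` column holding ISO-style timestamps (an assumption about the download format), and the output file name `time_series.csv` and both function names are hypothetical.

```python
import csv
from collections import Counter
from datetime import datetime

FIELDS = ["created_date_hour", "complaint_type", "count"]


def aggregate_by_hour(rows):
    """Count rows per (created_date_hour, complaint_type)."""
    counts = Counter()
    for row in rows:
        # Assumed column name and ISO-style format for the creation timestamp.
        created = datetime.fromisoformat(row["created_date"])
        # Truncate the timestamp to the start of its hour.
        hour = created.replace(minute=0, second=0, microsecond=0)
        counts[(hour.isoformat(), row["complaint_type"])] += 1
    return [
        {"created_date_hour": hour, "complaint_type": ctype, "count": n}
        for (hour, ctype), n in sorted(counts.items())
    ]


def write_time_series(raw_path="data/raw.csv", out_path="data/time_series.csv"):
    """Read the raw csv, aggregate it, and write the time series table."""
    with open(raw_path, newline="") as src:
        table = aggregate_by_hour(csv.DictReader(src))
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(table)
```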

### Task 3: Data Visualization
### Task 3: Data Visualization via Python

Create a multi-line plot to show the total service request counts by `created_date_hour` for each `complaint_type`. Make sure you store the image of the plot in the `data` folder as a `.png` file.
Using a Python script/notebook, create a multi-line plot to show the total service request counts by `created_date_hour` for each `complaint_type`. Make sure you store the image of the plot in the `data` folder as a `.png` file.

### Task 4: Spatial data processing
### Task 4: Spatial data processing via Python

At Data Engineering, we enhance datasets with geospatial attributes, such as point locations and administrative boundaries. To help us better understand the data from **Task 1**, we would like you to join the initial raw data to the **[2020 NTA (Neighborhood Tabulation Area) boundaries](https://www1.nyc.gov/site/planning/data-maps/open-data/census-download-metadata.page)** and create a choropleth map of 7 day total count by NTA of a specific `complaint_type` of your choice.

Depending on how you generate the map, you can store the map as a `.png` or `.html` under the `data` folder.

### Task 5: SQL
### Task 5: Data Aggregation via SQL

We ❤️ SQL! At Data Engineering, we deal with databases a lot and we write a lot of fast and simple ETL pipelines using SQL. In this task, you will:

@@ -77,7 +77,7 @@ We ❤️ SQL! At Data Engineering, we deal with databases a lot and we write a

> Note: Depending on your preference, you can use [Postgres](https://www.postgresql.org/), which is preferred; however, if you are familiar with [SQLite](https://docs.python.org/3/library/sqlite3.html) (much easier to set up and use), you can use that too.
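
As a rough illustration of the SQLite route mentioned in the note, the sketch below loads a few rows into an in-memory database and aggregates them with SQL. It assumes the SQL aggregation mirrors the hourly counts from **Task 2**; the table name `service_requests`, the `created_date` column, the `request_count` alias, and the function name are all hypothetical.

```python
import sqlite3


def hourly_counts(rows):
    """Return (created_date_hour, complaint_type, request_count) tuples.

    `rows` is an iterable of (created_date, complaint_type) pairs, where
    created_date is an ISO-style timestamp string (assumed format).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE service_requests (created_date TEXT, complaint_type TEXT)"
    )
    conn.executemany("INSERT INTO service_requests VALUES (?, ?)", rows)
    query = """
        SELECT strftime('%Y-%m-%d %H:00:00', created_date) AS created_date_hour,
               complaint_type,
               COUNT(*) AS request_count
        FROM service_requests
        GROUP BY created_date_hour, complaint_type
        ORDER BY created_date_hour, complaint_type
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

In Postgres the hour truncation would more likely be written with `date_trunc('hour', created_date)` on a timestamp column.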

### Task 6: Spatial SQL
### Task 6: Spatial data processing via SQL

A lot of popular databases have geospatial extensions, which makes spatial data processing in SQL super easy to use. In this task you will:

@@ -88,9 +88,15 @@ A lot of popular databases have geospatial extensions, which makes spatial data

> Note: At this point you might notice that spatial software is not as straightforward as a simple `pip install`. If you are stuck with database installation or package installation, you might consider adopting **[docker](https://www.docker.com/)**. Docker has a steep learning curve, so don't waste too much time on it.

### Bonus
If you would like to take your work to the next level, you will receive bonus points for doing any of the following:
- Allowing different parameters to be passed to the scripts from the command line and/or writing bash scripts to take command line arguments and call the code. For example, you can pass the agency value as a parameter when downloading the 311 data.
- Demonstrating your experience working with Docker by building a Docker image, pushing an image with your setup and code to Docker Hub, and giving the Data Engineering team instructions on how to pull it down and run the code. This bonus section will be graded on how easily we can access your image and make it work on our machines.

If you do not have the time or experience to do these "bonus" tasks, that's okay! These tasks are not required and are an optional addition to the tasks described above. As a guideline, a strong submission of the core tasks will be weighed more heavily than a poorly completed data challenge with bonuses.
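
For the command line bonus, one possible shape is an `argparse` wrapper around the download script. This is only a sketch: the flag names `--agency`, `--days`, and `--output` and their defaults are assumptions chosen to match the Task 1 wording, not a required interface.

```python
import argparse


def parse_args(argv=None):
    """Parse command line options for a hypothetical download script."""
    parser = argparse.ArgumentParser(description="Download 311 service requests.")
    parser.add_argument("--agency", default="HPD", help="responding agency code")
    parser.add_argument("--days", type=int, default=7, help="look-back window in days")
    parser.add_argument("--output", default="data/raw.csv", help="destination csv path")
    # argv=None makes argparse read sys.argv[1:], the usual command line case.
    return parser.parse_args(argv)
```

With a script named, say, `download.py` (a hypothetical name), this would allow `python download.py --agency DOT --days 3`.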

## Resources

- Reach out to Te (TDu @ planning.nyc.gov) if you have any questions. We love people who ask questions.
- Reach out to Damon (DMcCullough @ planning.nyc.gov) if you have any questions. We love people who ask questions.
- [PostgreSQL Installation Guide](https://www.postgresql.org/download/)
- [Postgis Docker image](https://registry.hub.docker.com/r/postgis/postgis/)
- [Postgis Installation Guide](https://postgis.net/workshops/postgis-intro/installation.html)