You are a data engineer at a data analytics consulting company. You have been assigned to a project that aims to de-congest the national highways by analyzing the road traffic data from different toll plazas. Each highway is operated by a different toll operator with a different IT setup that uses different file formats. Your job is to collect data available in different formats and consolidate it into a single file.
- In this project you will create a shell script using bash commands to:
  - Extract data from a CSV file
  - Extract data from a TSV file
  - Extract data from a fixed-width file
  - Transform the data
  - Load the transformed data into a new CSV file
- You will then create a DAG to call the shell script.
- Start Apache Airflow.
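If you are working locally rather than in the lab environment (which may have its own start command), one hedged way to bring Airflow up is standalone mode, available in Airflow 2.x:

```bash
# Airflow 2.x only: starts the scheduler and webserver and creates an admin user
airflow standalone
```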
- Download the dataset from the source to the destination /your path directory/airflow/dags using the wget command.
Source: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Final%20Assignment/tolldata.tgz
Note: While downloading the file in the terminal, use the sudo command before the download command.
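For example, the download might look like the sketch below, where /your-path is a placeholder for your own airflow/dags directory:

```bash
# -P sets the directory prefix wget downloads into (placeholder path)
sudo wget -P /your-path/airflow/dags https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0250EN-SkillsNetwork/labs/Final%20Assignment/tolldata.tgz
```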
- Define the DAG arguments as per the following details:
| Parameter | Value |
| --- | --- |
| owner | < You may use any dummy name> |
| start_date | today |
| email | < You may use any dummy email> |
| email_on_failure | True |
| email_on_retry | True |
| retries | 1 |
| retry_delay | 5 minutes |
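A minimal sketch of these arguments in Python; the owner name and email address below are dummies, as the table permits:

```python
from datetime import timedelta
from airflow.utils.dates import days_ago

# Dummy owner and email, per the table above
default_args = {
    'owner': 'Jane Doe',
    'start_date': days_ago(0),  # today
    'email': ['jane.doe@example.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
```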
- Create a DAG as per the following details.
| Parameter | Value |
| --- | --- |
| DAG id | ETL_toll_data |
| Schedule | Daily once |
| default_args | as you have defined in the previous step |
| description | Apache Airflow Project |
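With the arguments above, the DAG object itself might be defined like this sketch, reusing the default_args dict from the previous step:

```python
from datetime import timedelta
from airflow import DAG

# Daily schedule expressed as a timedelta; default_args comes from the
# previous sketch
dag = DAG(
    dag_id='ETL_toll_data',
    default_args=default_args,
    description='Apache Airflow Project',
    schedule_interval=timedelta(days=1),
)
```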
Task 1.3 - Create a shell script `Extract_Transform_data.sh` and add the following commands to your tasks:
- Write a command to unzip the data. Use the downloaded data from the URL given in the first part of this project and uncompress it into the destination directory.
Hint: Read through the file `fileformats.txt` to understand the column details.
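Since tolldata.tgz is a gzipped tar archive, the unzip step can be a tar invocation along these lines (paths are placeholders):

```bash
# -x extract, -z gunzip, -f archive file; -C sets the output directory
tar -xzf /your-path/airflow/dags/tolldata.tgz -C /your-path/airflow/dags
```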
- You should extract the fields `Rowid`, `Timestamp`, `Anonymized Vehicle number`, and `Vehicle type` from the `vehicle-data.csv` file and save them into a file named `csv_data.csv`. (The three extraction commands are sketched together after this list.)
- You should extract the fields `Number of axles`, `Tollplaza id`, and `Tollplaza code` from the `tollplaza-data.tsv` file and save them into a file named `tsv_data.csv`.
- You should extract the fields `Type of Payment code` and `Vehicle Code` from the fixed-width file `payment-data.txt` and save them into a file named `fixed_width_data.csv`.
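The three extractions might look like the sketch below. The field numbers and the character range are assumptions about the dataset's layout; verify them against `fileformats.txt` before relying on them:

```bash
# vehicle-data.csv is comma-separated; Rowid, Timestamp, Anonymized Vehicle
# number, and Vehicle type are assumed to be the first four fields
cut -d ',' -f1-4 vehicle-data.csv > csv_data.csv

# tollplaza-data.tsv is tab-separated; Number of axles, Tollplaza id, and
# Tollplaza code are assumed to be fields 5-7. Convert tabs to commas and
# strip carriage returns so the output is a clean CSV
cut -f5-7 tollplaza-data.tsv | tr '\t' ',' | tr -d '\r' > tsv_data.csv

# payment-data.txt is fixed-width; Type of Payment code and Vehicle Code are
# assumed to occupy characters 59-67. Replace the separating space with a comma
cut -c59-67 payment-data.txt | tr ' ' ',' > fixed_width_data.csv
```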
Task 1.7 - Update the shell script to add a command to consolidate the data extracted in the previous tasks
- You should create a single CSV file named `extracted_data.csv` by combining data from the following files:
  - `csv_data.csv`
  - `tsv_data.csv`
  - `fixed_width_data.csv`
- The final CSV file should use the fields in the order given below: `Rowid`, `Timestamp`, `Anonymized Vehicle number`, `Vehicle type`, `Number of axles`, `Tollplaza id`, `Tollplaza code`, `Type of Payment code`, and `Vehicle Code`.
Hint: Use the bash `paste` command. The `paste` command merges corresponding lines of files. Example: `paste file1 file2 > newfile`
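Applied to this task, the consolidation step could look like this sketch, run from the directory holding the three files:

```bash
# paste joins the files line by line; -d ',' makes the join character a comma
# so the result stays a valid CSV
paste -d ',' csv_data.csv tsv_data.csv fixed_width_data.csv > extracted_data.csv
```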
- You should transform the `vehicle_type` field in `extracted_data.csv` into capital letters and save it into a file named `transformed_data.csv` in the staging directory.
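One hedged way to do the transformation, assuming the vehicle type is the fourth column of extracted_data.csv and /your-path/staging is a placeholder for the staging directory:

```bash
# FS/OFS keep the comma delimiter; toupper() capitalizes the assumed
# vehicle-type column (field 4); adjust the field number to your layout
awk 'BEGIN {FS=OFS=","} {$4 = toupper($4); print}' extracted_data.csv > /your-path/staging/transformed_data.csv
```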
Note: Copy the shell script to the /your path/airflow/dags folder.
- Save the DAG you defined into a file named `ETL_toll_data.py`.
- Define the task pipeline as per the details given below:

| Task | Functionality |
| --- | --- |
| First task | extract_transform_load |
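A sketch of that single task as a BashOperator; the import path matches Airflow 2.x, the `dag` object comes from the earlier sketch, and the script location is a placeholder:

```python
from airflow.operators.bash import BashOperator  # airflow.operators.bash_operator in Airflow 1.x

# Runs the shell script; the trailing space stops Airflow from trying to
# render the .sh path as a Jinja template
extract_transform_load = BashOperator(
    task_id='extract_transform_load',
    bash_command='/your-path/airflow/dags/Extract_Transform_data.sh ',
    dag=dag,
)

# With a single task there is nothing to chain; additional tasks would be
# wired with the >> operator, e.g. task1 >> task2
```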
- I have provided my solution for this project as a Python file; go check it out.
- After implementation, your results in Airflow should look like this:
Contributions are welcome! Please open an issue or pull request for any changes or improvements.