The project's objective is to use Terraform (Infrastructure as Code) to provision the AWS resources needed for an ETL job. The ETL job reads two input JSON files from an S3 bucket, processes them with PySpark in an AWS Glue Job, and writes a third JSON file back to S3 as the result.
To successfully run the Terraform configuration and create the AWS resources, you will need:
- The Terraform CLI installed - install instructions here
- The AWS CLI installed - install instructions here
- An AWS account
- Your AWS credentials (AWS Access Key ID & AWS Secret Access Key), with the AWS CLI successfully authenticated via `aws configure`
- An IAM role with the `AWSGlueServiceRole` policy attached, plus an inline policy granting read/write access to S3. The IAM role ARN must be saved in the `variables.tf` file, in the `iam_role` variable (a sketch of this variable follows the list below)
- Using the terminal, `cd` into the folder where you saved the repo files (`stations.json`, `trips.json`, `glue-spark-job.py`, `main.tf`, `variables.tf`) and run:
  - `terraform init`
  - `terraform fmt` (optional - used for formatting the configuration files)
  - `terraform validate` (optional - used for validating the configuration files)
  - `terraform apply`
  - `terraform destroy` (optional - used for deleting the AWS resources after using them)
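For reference, the `iam_role` entry in `variables.tf` can be as simple as the sketch below; the description text and the placeholder ARN are illustrative, not the repository's exact contents:

```hcl
variable "iam_role" {
  description = "ARN of the IAM role used by the Glue Crawler and Glue Job"
  type        = string

  # Placeholder only - replace with the ARN of your own role.
  default = "arn:aws:iam::111111111111:role/glue-etl-role"
}
```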
End to end, the workflow looks like this:
- Create an S3 bucket
- Create S3 `input`, `output`, `script`, `logs` and `temp` folders. I know that Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system, but for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects
- Upload the JSON input files and the `.py` PySpark script to their corresponding S3 folders
- Create the AWS Glue Crawler, referencing the input files
- Start the Glue Crawler
- Wait until the Crawler has populated the Glue Data Catalog with the input tables' metadata
- Create the Glue Job, referencing the Data Catalog created by the Glue Crawler and the `.py` PySpark script from S3
- Run the Glue Job and save the output back to the S3 bucket
The entire logic of the Glue Job is based on the `glue-spark-job.py` PySpark script. Using PySpark, I'm reading the following JSON data sets from the S3 bucket:
`stations.json`:

```json
{
"stations": {
"internal_bus_station_id": [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9
],
"public_bus_station": [
"BAutogara", "BVAutogara", "SBAutogara", "CJAutogara", "MMAutogara","ISAutogara", "CTAutogara", "TMAutogara", "BCAutogara", "MSAutogara"
]
}
}
```

`trips.json`:

```json
{
"trips": {
"origin": [
"B","B","BV","TM","CJ"
],
"destination": [
"SB","MM", "IS","CT","SB"
],
"internal_bus_station_ids": [
[0,2],[0,2,4],[1,8,3,5],[7,2,9,4,6],[3,9,5,6,7,8]
],
"triptimes": [
["2021-03-01 06:00:00", "2021-03-01 09:10:00"],
["2021-03-01 10:10:00", "2021-03-01 12:20:10", "2021-03-01 14:10:10"],
["2021-04-01 08:10:00", "2021-04-01 12:20:10", "2021-04-01 15:10:00", "2021-04-01 15:45:00"],
["2021-05-01 10:45:00", "2021-05-01 12:20:10", "2021-05-01 18:30:00", "2021-05-01 20:45:00", "2021-05-01 22:00:00"],
["2021-05-01 07:10:00", "2021-05-01 10:20:00", "2021-05-01 12:30:00", "2021-05-01 13:25:00", "2021-05-01 14:35:00", "2021-05-01 15:45:00"]
]
}
}
```
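Inside a Glue job, these two data sets are typically loaded through the catalog tables that the Crawler created. The snippet below is generic Glue boilerplate, not the actual `glue-spark-job.py`; the database and table names are assumptions, since the Crawler decides the real ones.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext())

# Load the tables the Crawler registered for the two input JSON files;
# the database and table names here are assumptions.
stations = glue_context.create_dynamic_frame.from_catalog(
    database="etl_database", table_name="stations").toDF()
trips = glue_context.create_dynamic_frame.from_catalog(
    database="etl_database", table_name="trips").toDF()
```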
After calculating each trip's duration in the `duration_min` column and replacing the `internal_bus_station_ids` column with the stations' names, I'm outputting the following JSON file to S3 as the result (a PySpark sketch of this transformation follows the sample output):
```json
{
"result": [
{
"row_num": 1,
"origin": "B",
"destination": "SB",
"pubic_bus_stops": [
"BAutogara",
"SBAutogara"
],
"duration_min": "190.0 min"
},
{
"row_num": 2,
"origin": "B",
"destination": "MM",
"pubic_bus_stops": [
"BAutogara",
"SBAutogara",
"MMAutogara"
],
"duration_min": "240.0 min"
},
{
"row_num": 3,
"origin": "BV",
"destination": "IS",
"pubic_bus_stops": [
"BVAutogara",
"BCAutogara",
"CJAutogara",
"ISAutogara"
],
"duration_min": "455.0 min"
},
{
"row_num": 4,
"origin": "CJ",
"destination": "SB",
"pubic_bus_stops": [
"CJAutogara",
"MSAutogara",
"ISAutogara",
"CTAutogara",
"TMAutogara",
"BCAutogara",
"SBAutogara"
],
"duration_min": "730.0 min"
},
{
"row_num": 5,
"origin": "TM",
"destination": "CT",
"pubic_bus_stops": [
"TMAutogara",
"SBAutogara",
"MSAutogara",
"MMAutogara",
"CTAutogara"
],
"duration_min": "675.0 min"
}
]
}
```
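The transformation itself can be sketched in plain PySpark. The snippet below is a minimal, self-contained illustration rather than the actual `glue-spark-job.py`: it builds two of the sample trips in memory instead of using the catalog tables loaded in the earlier sketch, and the real job's row ordering, duration formatting and S3 write step may differ.

```python
from pyspark.sql import Row, SparkSession, functions as F

spark = SparkSession.builder.appName("trip-duration-sketch").getOrCreate()

# Station lookup table, taken from stations.json above (id -> public name).
stations = spark.createDataFrame(
    [Row(internal_bus_station_id=i, public_bus_station=name)
     for i, name in enumerate([
         "BAutogara", "BVAutogara", "SBAutogara", "CJAutogara", "MMAutogara",
         "ISAutogara", "CTAutogara", "TMAutogara", "BCAutogara", "MSAutogara"])])

# Two of the sample trips from trips.json above.
trips = spark.createDataFrame([
    Row(row_num=1, origin="B", destination="SB",
        internal_bus_station_ids=[0, 2],
        triptimes=["2021-03-01 06:00:00", "2021-03-01 09:10:00"]),
    Row(row_num=3, origin="BV", destination="IS",
        internal_bus_station_ids=[1, 8, 3, 5],
        triptimes=["2021-04-01 08:10:00", "2021-04-01 12:20:10",
                   "2021-04-01 15:10:00", "2021-04-01 15:45:00"]),
])

# Replace internal station ids with public station names: explode the id list
# with its position, join the station table, then fold the names back into an
# ordered array per trip.
stops = (trips
         .select("row_num",
                 F.posexplode("internal_bus_station_ids")
                  .alias("pos", "internal_bus_station_id"))
         .join(stations, "internal_bus_station_id")
         .groupBy("row_num")
         .agg(F.sort_array(F.collect_list(
             F.struct("pos", "public_bus_station"))).alias("s"))
         .select("row_num",
                 F.col("s.public_bus_station").alias("public_bus_stops")))

# duration_min = last timestamp of the trip minus the first, in minutes.
result = (trips
          .withColumn("duration_min",
                      F.concat(
                          ((F.unix_timestamp(F.element_at("triptimes", -1))
                            - F.unix_timestamp(F.element_at("triptimes", 1))) / 60)
                          .cast("string"),
                          F.lit(" min")))
          .join(stops, "row_num")
          .select("row_num", "origin", "destination",
                  "public_bus_stops", "duration_min"))

result.show(truncate=False)  # row 1 -> "190.0 min", row 3 -> "455.0 min"
```

An explode-and-join is used here instead of a per-row UDF so the station lookup stays a plain DataFrame operation; the actual script may of course implement the same logic differently.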