Atac Roma open data ingestion system

Detailed paper (in Italian): Link.

A data ingestion system to analyze the status of public transport in Rome.

The open data is provided in real time by Atac Roma.

This project has been developed using ATAC-Monitor as an example of data ingestion infrastructure.

The results produced by the data ingestion and analysis include: the average waiting minutes, the longest and average waiting times by location, and waiting times broken down by Rome's districts and neighborhoods.

The queries used to process the data can be found in '/lambda functions/layers/updates_feed/python/queries.py'.

Setup:

  1. This project is meant to be run in a Standard AWS Account (a free trial standard account is fine).
  2. Clone the repository.

AWS buckets

  1. Create three S3 buckets:

    • One for storing the data ingestion feed,
    • One for storing the Athena query results,
    • One for storing static files.
  2. In the root of the static files bucket, create two folders: locations and routes:

    • Insert in the locations folder the locations.csv file found in the /static_files/locations/ folder of this project.
    • Insert in the routes folder the routes.csv file found in the /static_files/routes/ folder of this project.
  3. Adjust the read policy of the Athena query results bucket according to your preferences; this is where the final results will be stored (a boto3 sketch of these bucket steps follows this list).
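For convenience, here is a minimal boto3 sketch of the same bucket setup; the bucket names and the region are placeholders (S3 bucket names must be globally unique), not values required by the project:

```python
import boto3

REGION = "eu-south-1"  # placeholder region, use whichever region hosts your stack
s3 = boto3.client("s3", region_name=REGION)

# Placeholder bucket names, replace them with your own.
feed_bucket = "stops-feed"              # data ingestion feed
results_bucket = "stops-query-results"  # Athena query results
static_bucket = "stops-static-files"    # static files

for bucket in (feed_bucket, results_bucket, static_bucket):
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# Upload the static CSVs shipped with this repository into the locations/ and routes/ folders.
s3.upload_file("static_files/locations/locations.csv", static_bucket, "locations/locations.csv")
s3.upload_file("static_files/routes/routes.csv", static_bucket, "routes/routes.csv")
```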

AWS Athena

  1. Create an AWS Athena database.
  2. Create three tables using the three queries found in the /athena/create_table.md file of this project.
    • Modify the LOCATION line of each query by replacing the square brackets with the name of the respective bucket you created (a boto3 sketch for running these queries through the Athena API follows this list).
    • ex. "LOCATION 's3://[ NAME OF THE BUCKET FOR THE DATA INGESTION FEED ]/'" becomes "LOCATION 's3://stops-feed/'", where stops-feed is the name of the bucket.
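If you prefer to run the CREATE TABLE statements from a script rather than from the Athena console, the sketch below submits one of them through the Athena API with boto3; the database name, output location and query text are placeholders to replace with your own values:

```python
import boto3

athena = boto3.client("athena", region_name="eu-south-1")  # placeholder region

# Paste one CREATE TABLE statement from /athena/create_table.md here,
# with the LOCATION line already pointing at your own bucket, e.g. LOCATION 's3://stops-feed/'.
create_table_sql = """
-- CREATE EXTERNAL TABLE ... LOCATION 's3://stops-feed/'
"""

response = athena.start_query_execution(
    QueryString=create_table_sql,
    QueryExecutionContext={"Database": "atac_data"},  # placeholder database name
    ResultConfiguration={"OutputLocation": "s3://stops-query-results/"},  # your query results bucket
)
print(response["QueryExecutionId"])
```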

Data ingestion AWS Lambda Function

  1. Create a Lambda Function from the AWS Management Console, with Python 3.8 as Runtime.
  2. In the Lambda Function page, in the Permissions tab, click on the Execution Role:
    • Attach one policy to grant the Lambda S3 Access permissions and one policy to grant it Athena Access permissions.
  3. Delete the code in lambda_function.py shown in the Function Code window of the Lambda Function page.
  4. Copy and paste the code from the /lambda_functions/lambda_function.py file of this project into the Function Code window of the Lambda Function page.
  5. From the AWS Lambda page, click on Layers in the menu on the left.
  6. Go to the /lambda functions/layers/gtfs/ folder of the project and create a zip file containing the entire python folder found inside it (see the packaging sketch after this list).
  7. Create a layer named 'gtfs' with Python 3.8 as Runtime and upload the zip file you just created by clicking the upload button.
  8. Go to the /lambda functions/layers/updates_feed/ folder of the project and create a zip file containing the entire python folder found inside it.
  9. Create a layer named 'updates_feed' with Python 3.8 as Runtime and upload the zip file you just created by clicking the upload button.
  10. Go back to the page of the Lambda you created earlier and click on Layers.
  11. Choose Add Layer -> Custom Layer, and add the two Layers you just created.
  12. From the main page of the Lambda function, set the four environment variables that the function code expects.
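Packaging the two layers only requires zipping each python folder so that it sits at the root of the archive, which is the layout AWS Lambda expects for Python layers. A minimal sketch using the standard library, run from the repository root (the output file names are just examples):

```python
import shutil

# The archives will contain the "python" folder at their root.
shutil.make_archive("gtfs_layer", "zip",
                    root_dir="lambda functions/layers/gtfs")
shutil.make_archive("updates_feed_layer", "zip",
                    root_dir="lambda functions/layers/updates_feed")
```

The resulting gtfs_layer.zip and updates_feed_layer.zip are the files to upload in steps 7 and 9.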

Elaboration results AWS Lambda Function

  1. Create a Lambda Function from the AWS Management Console, with Python 3.8 as Runtime.
  2. In the Lambda Function page, in the Permissions tab, click on the Execution Role:
    • Attach one policy to grant the Lambda S3 Access permissions.
  3. Delete the code in lambda_function.py shown in the Function Code window of the Lambda Function page.
  4. Copy and paste the code from the /lambda_functions/results_manager.py file of this project into the Function Code window of the Lambda Function page.
  5. Set the following environment variable:
    • RESULTS_BUCKET = [name of the bucket used to store the Athena query results]
  6. From the main page of the Lambda function, click the "Add trigger" button and insert the following parameters (a generic sketch of an S3-triggered handler of this kind follows this list):
    • Select trigger = S3
    • Bucket = [name of the bucket used for data ingestion]
    • Event type = All object create events
    • Prefix = results/
    • Suffix = .csv
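For reference, this is roughly the shape of an S3-triggered handler for this step. It only illustrates how the trigger's event is read and how RESULTS_BUCKET is used; the actual logic of this project lives in /lambda_functions/results_manager.py and may differ:

```python
import os
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # One record per object created under results/*.csv in the ingestion bucket.
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Copy the new Athena result file into the dedicated results bucket.
        s3.copy_object(
            Bucket=os.environ["RESULTS_BUCKET"],
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
```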

Run:

After completing the setup steps, you only need to run the first Lambda you created and then retrieve the final results from the bucket you created for the Athena query results. If you want the system to keep running on a fixed schedule, you can use Amazon EventBridge to create a rule that invokes that Lambda periodically; while the system runs by itself, you only need to retrieve the results from the results bucket (see the sketch below).
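If you go the EventBridge route, the following boto3 sketch creates such a scheduled rule; the rule name, schedule expression, function name and ARN are all placeholders, and the add_permission call is included because the console normally grants EventBridge the invoke permission for you:

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_NAME = "ingestion-lambda"  # placeholder: the first Lambda you created
FUNCTION_ARN = "arn:aws:lambda:eu-south-1:123456789012:function:ingestion-lambda"  # placeholder ARN

# Run the ingestion Lambda every 10 minutes (example schedule).
rule = events.put_rule(Name="atac-ingestion-schedule", ScheduleExpression="rate(10 minutes)")

# Allow EventBridge to invoke the function...
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId="atac-ingestion-schedule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# ...and point the rule at it.
events.put_targets(
    Rule="atac-ingestion-schedule",
    Targets=[{"Id": "ingestion-lambda", "Arn": FUNCTION_ARN}],
)
```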

For any questions please feel free to contact me.
