This project consists of a data pipeline that extracts Spotify data using the Spotify Web API and stores it as a .csv file, which is then uploaded to Amazon S3. The most recent file is then retrieved from S3 and ingested into an RDS; for this, PostgreSQL is run locally. Airflow with Docker is used to orchestrate the data pipeline through DAGs. This project is to be extended to include PowerBI for a visual dashboard of the data.
The Spotify Web API is used for the extraction of data. It enables the creation of applications that can interact with Spotify's streaming service, such as retrieving content metadata, getting recommendations, creating and managing playlists, or controlling playback; in our case, it is used to get recently played songs.
Extracting the data requires an Access Token, which is obtained through a totally separate process; "How to Authenticate and use Spotify Web API" can be helpful for obtaining the access token. Keep in mind that the Access Token is only valid for a limited time and has to be refreshed every 3600 s.
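As a rough illustration of the refresh step, the sketch below builds the request that exchanges a refresh token for a new Access Token and checks whether the current token is close to its 3600 s limit. It assumes the Authorization Code flow, where a refresh token, client ID, and client secret are already available; the function names and the 60 s safety margin are hypothetical and not taken from this project.

```python
import base64
import json
import time
import urllib.parse
import urllib.request

# Spotify Accounts service endpoint used to request and refresh tokens.
TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build the POST request that exchanges a refresh token for a new access token."""
    credentials = f"{client_id}:{client_secret}".encode()
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode(),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urllib.parse.urlencode(
        {"grant_type": "refresh_token", "refresh_token": refresh_token}
    ).encode()
    return urllib.request.Request(TOKEN_URL, data=body, headers=headers, method="POST")


def token_expired(issued_at, expires_in=3600, margin=60, now=None):
    """Return True once the token is within `margin` seconds of expiring."""
    now = time.time() if now is None else now
    return now >= issued_at + expires_in - margin


def refresh_access_token(client_id, client_secret, refresh_token):
    """Send the refresh request and return (access_token, expires_in)."""
    request = build_refresh_request(client_id, client_secret, refresh_token)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["access_token"], payload.get("expires_in", 3600)
```

A task could call `token_expired(issued_at)` before each extraction and only call `refresh_access_token(...)` when it returns `True`.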
A .json file is extracted, which is then transformed into a Pandas DataFrame and saved as a .csv file on the local machine. This file is then loaded into S3, where all the files are saved based on their timestamp. This whole process requires an S3 access key and secret key; this post can be helpful for obtaining them.
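The transform-and-upload step could look roughly like the sketch below. The column selection, the `spotify/` key prefix, the file naming scheme, and all function names are assumptions made for illustration; the field names (`items`, `played_at`, `track`, `artists`, `album`) follow the recently played response of the Spotify Web API. `boto3` is assumed to pick up the access and secret key from the environment or the AWS credentials file.

```python
import os
from datetime import datetime, timezone


def flatten_recently_played(payload):
    """Turn the recently played JSON payload into one flat row per play."""
    rows = []
    for item in payload.get("items", []):
        track = item["track"]
        rows.append(
            {
                "played_at": item["played_at"],
                "track_name": track["name"],
                "artist_name": track["artists"][0]["name"],  # first listed artist only
                "album_name": track["album"]["name"],
            }
        )
    return rows


def timestamped_key(now=None, prefix="spotify/"):
    """Build an S3 key that embeds a UTC timestamp (hypothetical naming scheme)."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}recently_played_{now:%Y%m%dT%H%M%SZ}.csv"


def save_and_upload(payload, bucket, local_dir="."):
    """Write the rows to a local .csv via pandas, then upload that file to S3."""
    # Third-party packages are imported here so the helpers above stay usable
    # without pandas or boto3 installed.
    import boto3
    import pandas as pd

    key = timestamped_key()
    local_path = os.path.join(local_dir, os.path.basename(key))
    pd.DataFrame(flatten_recently_played(payload)).to_csv(local_path, index=False)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```

Embedding the timestamp in the key keeps every run's file in the bucket instead of overwriting the previous one.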
After this, the most recent file is downloaded from S3 and loaded into a table in the PostgreSQL database.
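One way to pick the most recent file and load it is sketched below, assuming `boto3` and `psycopg2` are installed, the target table already exists with columns matching the .csv header, and the bucket holds fewer than 1000 objects under the prefix (the limit of a single `list_objects_v2` call). The table name `recently_played` and the function names are hypothetical.

```python
def latest_key(objects):
    """Return the key of the most recently modified S3 object, or None if empty."""
    if not objects:
        return None
    return max(objects, key=lambda obj: obj["LastModified"])["Key"]


def download_latest(bucket, prefix, local_path):
    """Download the newest object under `prefix` to `local_path` and return its key."""
    import boto3  # imported here so latest_key() works without boto3 installed

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    key = latest_key(response.get("Contents", []))
    if key is None:
        raise FileNotFoundError(f"no objects under s3://{bucket}/{prefix}")
    s3.download_file(bucket, key, local_path)
    return key


def load_csv_into_postgres(local_path, dsn, table="recently_played"):
    """Bulk-load the .csv into an existing PostgreSQL table using COPY."""
    import psycopg2
    from psycopg2 import sql

    copy_stmt = sql.SQL("COPY {} FROM STDIN WITH (FORMAT csv, HEADER true)").format(
        sql.Identifier(table)
    )
    conn = psycopg2.connect(dsn)
    try:
        # The connection context manager commits on success and rolls back on error.
        with conn, conn.cursor() as cur, open(local_path, newline="") as csv_file:
            cur.copy_expert(copy_stmt, csv_file)
    finally:
        conn.close()
```

Sorting on `LastModified` rather than on the file name avoids depending on any particular naming scheme for the uploaded files.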
This whole ETL pipeline is orchestrated with Airflow, launched using Docker. Each ETL task is defined as an operator (object) in a DAG that encapsulates the ETL jobs, as seen below.
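In code, a DAG along these lines might look like the following sketch. It assumes Airflow 2.4 or later (for the `schedule` argument) and uses `PythonOperator`; the DAG ID, task IDs, hourly schedule, retry settings, and the two placeholder callables are all hypothetical and only illustrate how the tasks could be chained.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_s3():
    """Placeholder: call the Spotify Web API, write the .csv, upload it to S3."""


def load_latest_to_postgres():
    """Placeholder: download the most recent .csv from S3 and load it into PostgreSQL."""


with DAG(
    dag_id="spotify_etl",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",  # assumed cadence, not taken from the project
    catchup=False,  # do not backfill runs for past intervals
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(
        task_id="extract_to_s3",
        python_callable=extract_to_s3,
    )
    load = PythonOperator(
        task_id="load_latest_to_postgres",
        python_callable=load_latest_to_postgres,
    )

    # The load task only runs after the extract task has succeeded.
    extract >> load
```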
This project will be extended to include a visual dashboard, probably in PowerBI, to show what the data entails.