Crawl film information from Ohitv.info and incrementally load daily updates into Postgres and MongoDB for analysis and visualization.
- Docker: Manages services like Airflow, Minio, Postgres, and MongoDB.
- Airflow: Orchestrates the ETL pipeline.
- Minio: Object storage for raw and processed data.
- Postgres: Relational database to store processed data.
- MongoDB: NoSQL database for testing queries and learning.
- PowerBI: Visualizes insights from the data.
- Python: Language used for scripting, with the following libraries:
  - `BeautifulSoup`
  - `pandas`
  - `numpy`
  - `requests`
- Crawl film data from Ohitv.info using `requests` and `BeautifulSoup` (see the sketch after this list).
- Save the raw data to Minio under the bucket `ohitv-raw`.
- Use Python to clean and transform the data.
- Store the transformed data in Minio under the bucket `ohitv-processed`.
- Load transformed data into Postgres for visualization and analysis with PowerBI.
- Load data into MongoDB to test and learn NoSQL querying.
- Use PowerBI to visualize and report insights from the data stored in Postgres.
- Use Apache Airflow to manage and orchestrate the ETL pipeline, which runs daily.
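
A minimal sketch of the crawl and raw-storage steps above, assuming a local Minio API endpoint on port 9000 and illustrative CSS selectors; the site's actual markup, the object naming, and the credentials will differ in the real tasks.

```python
import json
from datetime import date
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from minio import Minio

# --- Extract: crawl a listing page from Ohitv.info ---
# The selectors below are placeholders, not the site's actual markup.
resp = requests.get("https://ohitv.info/", timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

films = []
for item in soup.select("article"):      # assumed selector
    link = item.select_one("h2 a")       # assumed selector
    if link:
        films.append({"title": link.get_text(strip=True), "url": link.get("href")})

# --- Save the raw crawl result to the ohitv-raw bucket as a dated JSON object ---
client = Minio(
    "localhost:9000",                    # assumed Minio API endpoint (the console is 9001)
    access_key="<access_key>",           # from the Minio access keys / keys.json
    secret_key="<secret_key>",
    secure=False,
)
raw = json.dumps(films, ensure_ascii=False).encode("utf-8")
client.put_object(
    "ohitv-raw",
    f"films_{date.today().isoformat()}.json",
    BytesIO(raw),
    length=len(raw),
    content_type="application/json",
)
```

The transform step would read this object back, clean it with `pandas`, and write the result to `ohitv-processed` in the same way before loading it into Postgres and MongoDB.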
- Follow the instructions to install Docker from the Docker website.
- On Windows, run the `docker_compose.bat` file.
- On Linux/macOS, run the `docker_compose.sh` file.
- Open Minio in your browser at `localhost:9001`.
- Log in:
  - Username: `admin12345`
  - Password: `admin12345`
- Create Access Keys:
  - Navigate to "Access Keys" in the left menu and generate new access and secret keys.
  - Make sure to store these keys securely.
- Create a `keys.json` file in the `plugins` directory with the following content:
```json
{
    "access_key": "replace with your access key",
    "secret_key": "replace with your secret key",
    "mongodb_user": "admin",
    "mongodb_password": "admin",
    "postgres_user": "airflow",
    "postgres_password": "airflow"
}
```
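
As a rough illustration of how code under `plugins` might consume this file, here is a sketch that builds the Minio client and database connection strings from it; the file path, hostnames, and ports are assumptions about the Docker setup, not values taken from the project.

```python
import json
from pathlib import Path

from minio import Minio

# Load the credentials created above (path assumed relative to the Airflow
# container's working directory).
keys = json.loads(Path("plugins/keys.json").read_text())

# Minio client for the ohitv-raw / ohitv-processed buckets.
minio_client = Minio(
    "minio:9000",                    # assumed service name and API port inside Docker
    access_key=keys["access_key"],
    secret_key=keys["secret_key"],
    secure=False,
)

# Connection URIs for the two target databases (hosts, ports, and database
# names are illustrative).
postgres_uri = (
    f"postgresql://{keys['postgres_user']}:{keys['postgres_password']}"
    "@postgres:5432/airflow"
)
mongo_uri = f"mongodb://{keys['mongodb_user']}:{keys['mongodb_password']}@mongodb:27017"
```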
- Open Airflow in your browser at `localhost:8080`.
- **Run the Pipeline**: Click the `Run` button on the DAG to execute the workflow.
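
For orientation, this is what a daily Airflow DAG tying the steps together could look like; the DAG id, task names, and callables are illustrative placeholders rather than the repository's actual definitions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Crawl Ohitv.info and write raw JSON to the ohitv-raw bucket."""


def transform():
    """Clean the raw data with pandas and write it to ohitv-processed."""


def load():
    """Load the processed data into Postgres and MongoDB."""


with DAG(
    dag_id="ohitv_etl",                  # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",          # the pipeline runs daily
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```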