Meet Customer Data Ingestion Pipeline, a cloud-native, full-stack data pipeline built at Barclays' Data Stellar hackathon.
Crafted with ♥ by your besties Anupama, Mayank, Harshit, and Shashank.
This is a data management project that involves collecting, processing, and storing data from various sources in a centralized location for further analysis. The goal of this project is to make data readily available for analysis by extracting it from a variety of sources and integrating it into a single repository, such as a data warehouse.
Aim:
The aim of this product is to parse and aggregate all customer data into a single database. We aim to process structured data (from an RDBMS), unstructured data (from APIs and other sources), and semi-structured data. Additionally, we aim to keep our system as cloud-native as possible, primarily utilizing AWS services with minimal manual intervention. Our objective is to create a pipeline that can handle large files, parse and process them in an event-driven infrastructure, and serve JSON data. Ultimately, we want to take various types of data, parse and process them, and store them in a designated database for easy access.
Our secondary objective is to give our fellow developers as much control as possible. To achieve this, we will implement a logging service to monitor and alert on errors and batch failures. We also plan to create a dashboard where different teams and developers at Barclays (or any other organization) can generate secure API keys and create new projects. Each new project will have a separate table and bucket to ensure data isolation. We also plan on giving the option to migrate all of the data out of our database (exporting data).
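To make the per-project isolation concrete, here is a minimal sketch of what provisioning a new project could look like, assuming the AWS SDK for JavaScript v3. The naming scheme, the `provisionProject` helper, and the table/bucket layout are illustrative assumptions, not a finalized design.

```ts
// Sketch: provisioning an isolated project (hypothetical names and layout).
import { randomBytes, createHash } from "crypto";
import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";
import { S3Client, CreateBucketCommand } from "@aws-sdk/client-s3";

const dynamo = new DynamoDBClient({});
const s3 = new S3Client({});

export async function provisionProject(projectName: string) {
  // Generate a secure API key; only its hash would be stored server-side.
  const apiKey = randomBytes(32).toString("hex");
  const apiKeyHash = createHash("sha256").update(apiKey).digest("hex");

  // One table and one bucket per project keeps each project's data isolated.
  const tableName = `ingest-${projectName}`;
  const bucketName = `ingest-${projectName}-files`;

  await dynamo.send(new CreateTableCommand({
    TableName: tableName,
    AttributeDefinitions: [{ AttributeName: "id", AttributeType: "S" }],
    KeySchema: [{ AttributeName: "id", KeyType: "HASH" }],
    BillingMode: "PAY_PER_REQUEST", // serverless, nothing to manage
  }));
  await s3.send(new CreateBucketCommand({ Bucket: bucketName }));

  // The plaintext key is shown to the developer once; only the hash is kept.
  return { apiKey, tableName, bucketName };
}
```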
Implementation:
We plan to use AWS services as much as possible, with a few exceptions. We have attached a flow diagram to make it easier to understand how our system will work. Everything starts with a Next.js application. We plan to use Cloudflare Zero Trust to ensure that only admins can add new developers. Once a developer is added, they will receive a unique API key that they can use to extract processed data, upload new files, or push data in JSON or other formats.
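As a rough sketch of what that developer-facing API could look like, the snippet below pushes JSON records with an API key. The base URL, the `/projects/.../records` path, and the `x-api-key` header are placeholders, not a finalized contract.

```ts
// Sketch of a developer pushing JSON records with their API key (Node 18+ fetch).
const API_BASE = "https://ingest.example.com"; // hypothetical API Gateway URL

export async function pushRecords(apiKey: string, projectId: string, records: unknown[]) {
  const res = await fetch(`${API_BASE}/projects/${projectId}/records`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": apiKey, // key issued via the dashboard
    },
    body: JSON.stringify({ records }),
  });
  if (!res.ok) throw new Error(`Ingestion failed with status ${res.status}`);
  return res.json();
}
```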
Since we want all access to be authenticated, we will only allow authorized requests, with IP whitelisting and rate limiting. We plan to use Node.js with TypeScript for most of the scripting since the community support is great and our team is more familiar with it, but we will use Python if we need to do something that is difficult with Node. We plan to use DynamoDB to store our structured and processed data because it is highly scalable, serverless (we do not have to manage anything), and has excellent support for event-driven infrastructure: we can easily connect a Lambda function to process data.
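A minimal sketch of how the IP whitelisting and rate limiting could sit in front of our API, assuming an Express server and the express-rate-limit package; the allow-list values and limits are example placeholders.

```ts
// Sketch: IP allow-list plus per-IP rate limiting in front of the API.
import express from "express";
import rateLimit from "express-rate-limit";

const app = express();
const ALLOWED_IPS = new Set(["203.0.113.10", "203.0.113.11"]); // example IPs only

// Reject requests from IPs that are not whitelisted.
app.use((req, res, next) => {
  if (!ALLOWED_IPS.has(req.ip ?? "")) {
    return res.status(403).json({ error: "IP not allowed" });
  }
  next();
});

// Basic rate limiting: e.g. 100 requests per IP per 15 minutes.
app.use(rateLimit({ windowMs: 15 * 60 * 1000, max: 100 }));

app.listen(3000);
```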
We will use S3 to store BLOBs, and we will connect Lambdas to it so that any file upload or update triggers a Lambda that parses, processes, and inserts the data into DynamoDB.
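Here is a rough sketch of that S3-triggered Lambda, assuming the uploaded files are JSON arrays and that the target table name comes from an environment variable; both choices (and the `TABLE_NAME` variable itself) are assumptions for illustration.

```ts
// Sketch of the S3-triggered Lambda: read the uploaded file, parse it,
// and write the records into DynamoDB.
import { S3Event } from "aws-lambda";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { randomUUID } from "crypto";

const s3 = new S3Client({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const handler = async (event: S3Event) => {
  for (const record of event.Records) {
    const Bucket = record.s3.bucket.name;
    const Key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

    // Read and parse the uploaded file (assumed to be a JSON array of records).
    const obj = await s3.send(new GetObjectCommand({ Bucket, Key }));
    const body = await obj.Body?.transformToString();
    const items: Record<string, unknown>[] = JSON.parse(body ?? "[]");

    // Insert each parsed record into the project's DynamoDB table.
    for (const item of items) {
      await ddb.send(new PutCommand({
        TableName: process.env.TABLE_NAME, // assumed env var with the table name
        Item: { id: randomUUID(), source: Key, ...item },
      }));
    }
  }
};
```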
Additionally, we plan to use OpenRefine/AWS Glue for ETL and Papertrail for error logging and monitoring. We may choose a different provider, but given the limited time and our prior experience, we will likely go with Papertrail. CloudTrail and CloudWatch are also a must for logging AWS services. We will run our APIs either on EC2 with Ubuntu or dockerize them and run them on something like Fargate with ECS so that we do not have to manage our APIs. AWS API Gateway is also in our plans because we need rate limiting and protection against DDoS and other attacks. While the end goal of the pipeline is to store data in DynamoDB, the data might take different paths: if it is a file, it goes to S3 first and then to the database; if it comes from a structured or semi-structured source, it gets cleaned and transferred directly to the database.
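To make the two paths concrete, here is a rough sketch of the routing decision, assuming the per-project bucket/table naming from the earlier provisioning sketch; the `clean()` step is a placeholder for the real ETL stage.

```ts
// Sketch of the two ingestion paths: files go to S3 (and are later picked up
// by the S3-triggered Lambda), while structured/semi-structured records are
// cleaned and written straight to DynamoDB. Names are illustrative only.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { randomUUID } from "crypto";

const s3 = new S3Client({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

type Payload =
  | { kind: "file"; name: string; bytes: Buffer }
  | { kind: "record"; data: Record<string, unknown> };

// Placeholder cleaning step; the real cleaning happens in the ETL stage.
const clean = (data: Record<string, unknown>) =>
  Object.fromEntries(Object.entries(data).filter(([, v]) => v != null));

export async function ingest(payload: Payload, project: string) {
  if (payload.kind === "file") {
    // Path 1: file -> S3 -> (Lambda) -> DynamoDB
    await s3.send(new PutObjectCommand({
      Bucket: `ingest-${project}-files`,
      Key: payload.name,
      Body: payload.bytes,
    }));
  } else {
    // Path 2: structured/semi-structured record -> clean -> DynamoDB
    await ddb.send(new PutCommand({
      TableName: `ingest-${project}`,
      Item: { id: randomUUID(), ...clean(payload.data) },
    }));
  }
}
```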
Application:
Once completed, our pipeline can be used to ingest multiple types of data, process it, and store it in a database, making it easier to query or analyze. It can be customer data, or we can modify it to support other forms of data. Some applications that come to mind are financial data or healthcare data, where there are vast numbers of PDFs and other reports that need to be parsed, stored, and analyzed.
For the hackathon, our aim is to build a working prototype and a proof of concept. Once the hackathon is over, we plan to implement changes and work on feedback from mentors and judges to apply them in real life. Some of our teammates are already working as interns at companies where a lot of healthcare or e-commerce data is processed, and we will look into the possibility of using this pipeline to solve issues there. We also plan on adding major support for migration from an RDBMS, because right now we would have to send the JSON data through the APIs or run a local script, which is not very effective. We plan on using tools like Prisma to migrate data at scale!
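As a rough idea of what that RDBMS migration could look like, the sketch below pages through an existing database with Prisma and pushes each batch to the ingestion API. The `customer` model, the batch size, and the reuse of the `pushRecords` helper from the earlier API sketch are all assumptions.

```ts
// Sketch: paging through a source RDBMS with Prisma and pushing rows into the
// ingestion API. The `customer` model and batch size are illustrative assumptions.
import { PrismaClient } from "@prisma/client";

// Reuses the pushRecords helper sketched earlier for the developer-facing API.
declare function pushRecords(apiKey: string, projectId: string, records: unknown[]): Promise<unknown>;

const prisma = new PrismaClient();
const BATCH_SIZE = 500;

export async function migrateCustomers(apiKey: string, projectId: string) {
  for (let skip = 0; ; skip += BATCH_SIZE) {
    // Pull one batch of rows from the source database.
    const batch = await prisma.customer.findMany({ skip, take: BATCH_SIZE });
    if (batch.length === 0) break;

    // Send the batch into the pipeline.
    await pushRecords(apiKey, projectId, batch);
  }
  await prisma.$disconnect();
}
```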
BARCLAYS CUSTOMER DATA INGESTION.pdf (for a later presentation at Barclays)