Data Ingestion
Data ingestion is one of the main components of the pipeline. We use PySpark, a big data processing framework, to read data described by Avro or CSV schemas. This has proved to be much more efficient than reading the data with standard Python readers. The current module supports reading data from local storage and from an SFTP server. Our next aim is to add S3 support to data ingestion as well.
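As a rough illustration (not the pipeline's actual code), the sketch below shows how PySpark can load schema-defined CSV data from local storage; the schema fields and paths are assumptions for the example only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("data-ingestion").getOrCreate()

# Illustrative schema; in practice the fields come from the Avro/CSV
# schema files that describe each data topic.
schema = StructType([
    StructField("userId", StringType(), True),
    StructField("time", DoubleType(), True),
    StructField("value", DoubleType(), True),
])

# Read all CSV files under a local data directory using the explicit schema.
# Data on an SFTP server would first need to be fetched to local storage.
df = spark.read.csv(
    "data/example_topic/*.csv",  # hypothetical path
    schema=schema,
    header=True,
)
df.show(5)
```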
We have also created a custom data-reading function that can be used outside the pipeline, allowing researchers to read data much more quickly and with minimal effort.
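A minimal sketch of what such a standalone reader could look like is shown below; the function name `read_data`, its parameters, and the example path are hypothetical and do not correspond to a documented API.

```python
from pyspark.sql import SparkSession, DataFrame

def read_data(spark: SparkSession, path: str, data_format: str = "csv") -> DataFrame:
    """Load a directory of CSV or Avro files into a Spark DataFrame."""
    if data_format == "csv":
        return spark.read.csv(path, header=True, inferSchema=True)
    # Reading Avro requires the spark-avro package on the Spark classpath.
    return spark.read.format("avro").load(path)

# Usage outside the main pipeline:
spark = SparkSession.builder.appName("standalone-reader").getOrCreate()
df = read_data(spark, "data/example_topic/")  # hypothetical path
print(df.count())
```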