This Luigi pipeline is designed to process large .tif
images generated by a FlowCam device. The pipeline breaks down these large images into smaller "vignette" images, adds metadata (e.g., latitude, longitude, date, and depth) to the resulting images, and then uploads the processed images to a specified destination (e.g., an S3 bucket or an external API).
The pipeline is structured as a series of Luigi tasks, each handling a specific step in the workflow:

- Reading Metadata: Parses `.lst` files to extract metadata.
- Decollaging: Extracts individual images from large `.tif` files.
- Uploading: Uploads processed images to a specified endpoint.
The pipeline consists of the following Luigi tasks:
**ReadMetadata**

- Purpose: Reads the `.lst` file to extract metadata for image slicing.
- Input: The `.lst` file generated by the FlowCam device.
- Output: A `.csv` file (`metadata.csv`) containing the parsed metadata.
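As a rough illustration, a minimal version of this task might look like the sketch below. The parameter names, the `.lst` parsing details (delimiter, header rows), and the file discovery are assumptions, not the repository's actual code.

```python
from pathlib import Path

import luigi
import pandas as pd


class ReadMetadata(luigi.Task):
    """Parse the FlowCam .lst file into metadata.csv (illustrative sketch)."""

    directory = luigi.Parameter()         # folder containing the .lst file
    output_directory = luigi.Parameter()  # where metadata.csv is written

    def output(self):
        return luigi.LocalTarget(f"{self.output_directory}/metadata.csv")

    def run(self):
        # Find the .lst file exported by the FlowCam device. The delimiter and
        # the number of header rows are assumptions about the .lst format.
        lst_file = next(Path(self.directory).glob("*.lst"))
        metadata = pd.read_csv(lst_file, sep="|", skiprows=2)
        with self.output().open("w") as f:
            metadata.to_csv(f, index=False)
```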
**DecollageImages**

- Purpose: Uses the metadata to slice a large `.tif` image into smaller vignette images.
- Input: The `metadata.csv` file generated by `ReadMetadata`.
- Output: Individual vignette images with EXIF metadata, saved in the specified output directory.
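A sketch of the slicing step, assuming `metadata.csv` carries per-vignette bounding boxes and that scikit-image handles the image I/O. The column names (`image_x`, `image_y`, `image_w`, `image_h`), the collage filename, and the marker-file target are all illustrative assumptions.

```python
import luigi
import pandas as pd
from skimage import io


class DecollageImages(luigi.Task):
    """Cut each vignette out of the large .tif collage (illustrative sketch)."""

    directory = luigi.Parameter()
    output_directory = luigi.Parameter()
    experiment_name = luigi.Parameter()

    def requires(self):
        return ReadMetadata(directory=self.directory,
                            output_directory=self.output_directory)

    def output(self):
        # A simple marker file; the real task's Luigi target may differ.
        return luigi.LocalTarget(f"{self.output_directory}/decollage_complete.txt")

    def run(self):
        metadata = pd.read_csv(self.input().path)
        collage = io.imread(f"{self.directory}/collage.tif")  # filename is illustrative

        for i, row in metadata.iterrows():
            x, y = int(row["image_x"]), int(row["image_y"])
            w, h = int(row["image_w"]), int(row["image_h"])
            vignette = collage[y:y + h, x:x + w]
            out_path = f"{self.output_directory}/{self.experiment_name}_{i}.tif"
            io.imsave(out_path, vignette)
            # The real task also embeds EXIF metadata (latitude, longitude,
            # date, depth) into each vignette at this point.

        with self.output().open("w") as f:
            f.write("decollage complete\n")
```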
**Upload task**

- Purpose: Uploads processed vignette images to a specified S3 bucket or an external API.
- Input: Processed vignette images generated by `DecollageImages`.
- Output: A confirmation file (`s3_upload_complete.txt`) indicating successful uploads.
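For the S3 route, the upload might look roughly like the sketch below. The class name is illustrative (the repository's actual task name is not shown here), `requires()` is omitted for brevity, and the environment variables match the `.env` setup described further down.

```python
import os
from pathlib import Path

import boto3
import luigi


class UploadDecollagedImages(luigi.Task):
    """Upload vignette images to an S3 bucket (illustrative sketch)."""

    output_directory = luigi.Parameter()
    s3_bucket = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"{self.output_directory}/s3_upload_complete.txt")

    def run(self):
        # Credentials and endpoint come from the environment / .env file,
        # as described in the setup steps below.
        s3 = boto3.client(
            "s3",
            aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
            aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
            endpoint_url=os.environ["AWS_URL_ENDPOINT"],
        )
        for image in Path(self.output_directory).glob("*.tif"):
            s3.upload_file(str(image), self.s3_bucket, image.name)

        with self.output().open("w") as f:
            f.write("upload complete\n")
```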
**FlowCamPipeline**

- Purpose: A wrapper task that runs all of the above tasks in sequence.
- Dependencies: Manages the dependencies and the order of execution of the entire pipeline.
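A sketch of how the wrapper might be wired up with `luigi.WrapperTask` (an assumption; the actual base class may differ). The parameters mirror the command-line options shown under "Run the Pipeline Script" below, and the upload task it requires is the illustrative one sketched above.

```python
import luigi


class FlowCamPipeline(luigi.WrapperTask):
    """Run the whole chain by requiring the final task (illustrative sketch)."""

    directory = luigi.Parameter()
    output_directory = luigi.Parameter()
    experiment_name = luigi.Parameter()
    s3_bucket = luigi.Parameter()

    def requires(self):
        # Requiring the upload task pulls in DecollageImages and ReadMetadata
        # through their own requires() methods, so Luigi runs them in order.
        yield UploadDecollagedImages(
            output_directory=self.output_directory,
            s3_bucket=self.s3_bucket,
        )
```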
- Python 3.7 or above
- The following Python packages:
  - `luigi`
  - `pandas`
  - `numpy`
  - `scikit-image`
  - `requests`
  - `pytest` (for testing)
  - `boto3` (for S3 interactions)
  - `aioboto3` (for async S3 interactions)
  - `fastapi` and `uvicorn` (for the external API)
- **Clone the Repository**

  ```bash
  git clone https://github.com/your_username/plankton_pipeline_luigi.git
  cd flowcam-pipeline
  ```

- **Set up JASMIN credentials**

  If using S3 for uploading, make sure your AWS credentials are set in a `.env` file in the root directory:

  ```
  AWS_ACCESS_KEY_ID=your_access_key
  AWS_SECRET_ACCESS_KEY=your_secret_key
  AWS_URL_ENDPOINT=your_endpoint_url
  ```

- **Start the Luigi Central Scheduler**

  ```bash
  luigid --background
  ```

- **Run the Pipeline Script**

  ```bash
  python -m luigi --module pipeline.pipeline_decollage FlowCamPipeline \
      --directory /path/to/flowcam/data \
      --output-directory /path/to/output \
      --experiment-name test_experiment \
      --s3-bucket your-s3-bucket-name
  ```
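If you prefer not to run the central scheduler, the same pipeline can also be invoked in-process with Luigi's local scheduler, which is handy for debugging and testing. This sketch simply mirrors the command-line options above; the placeholder paths and bucket name are the same examples.

```python
import luigi

from pipeline.pipeline_decollage import FlowCamPipeline

# Runs the whole pipeline in-process, without luigid.
luigi.build(
    [
        FlowCamPipeline(
            directory="/path/to/flowcam/data",
            output_directory="/path/to/output",
            experiment_name="test_experiment",
            s3_bucket="your-s3-bucket-name",
        )
    ],
    local_scheduler=True,
)
```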