darwin-image-preprocessing

Functions to extract all darwin cut notes based on image dimensions and discard full-page (non-cut) notes. Each image's dimensions are compared to the mean image dimensions within its folder. Written in PySpark for efficient parallel processing, since the dataset is ~350 GB and contains ~60k images.
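
The comparison against per-folder mean dimensions can be illustrated with a minimal sketch. It uses plain Python rather than PySpark so that it runs without a cluster; in PySpark the same grouping could be expressed with `groupBy` and `avg`. The function name `select_cut_notes`, the record layout, and the exact rule (an image counts as a cut note if its width or its height is below the folder mean) are assumptions made for illustration, not the repository's actual code or threshold.

```python
from collections import defaultdict
from statistics import mean


def select_cut_notes(records):
    """Return (folder, filename) pairs for images treated as cut notes.

    records: iterable of (folder, filename, width, height) tuples.
    Hypothetical helper; the rule below is an assumed reading of
    "comparing image dimensions to mean image dimensions within folder".
    """
    # Group images by the folder they live in.
    by_folder = defaultdict(list)
    for folder, name, width, height in records:
        by_folder[folder].append((name, width, height))

    kept = []
    for folder, images in by_folder.items():
        # Mean dimensions are computed per folder, not over the whole dataset.
        mean_w = mean(w for _, w, _ in images)
        mean_h = mean(h for _, _, h in images)
        for name, w, h in images:
            # Assumption: a cut note is smaller than the folder mean in at
            # least one dimension; full-page scans exceed the mean in both.
            if w < mean_w or h < mean_h:
                kept.append((folder, name))
    return kept


if __name__ == "__main__":
    sample = [
        ("folder_a", "full_page.jpg", 1000, 1500),
        ("folder_a", "cut_1.jpg", 1000, 400),
        ("folder_a", "cut_2.jpg", 500, 300),
    ]
    for folder, name in select_cut_notes(sample):
        print(folder, name)
```

Computing the mean per folder rather than globally matters when folders were scanned at different resolutions, because a single global threshold would misclassify whole folders.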