v0.1
Release Notes - v0.1
This first release of 🚜 Combine -- whoohoo! -- is an arguably fully functional, but still rough-around-the-edges, version.
Big thanks to the few individuals who have very generously given their time for testing and feedback; it has been a small group, though, and Combine would undoubtedly benefit from more eyes, testing, and perspectives.
But this release supports some actual metadata aggregation work in production, and it is a good line in the sand for future testing and development.
This release can be installed by checking out the v0.1 tag from the Combine-Playbook repository and building, which automatically pulls the v0.1 tag from this repository.
Release notes:
- baseline of the originally proposed functionality
- harvesting, transforming, merging, and analyzing metadata records
- embraced and implemented MongoDB where DB storage requires high-churn CRUD, as we were experiencing significant performance problems at scale with relatively untuned MySQL
- determined that guaranteed persistence and ACID compliance are not as important in a "pipeline" setting like Combine, where iteration of Jobs and Records is encouraged
- settled on XML2kvp, "XML to Key/Value Pairs", for mapping XML records to support indexing and cross-document analysis (a rough sketch of the idea follows this list)
- integration with DPLA API and Bulk Downloads
- primary mechanism for matching is mapping a value in the Record to the DPLA isShownAt field, which represents the record's actual URL online - not perfect, but when the limitations of the matching are understood, it can provide handy insights into the status of records and harvests (a sketch of the matching idea follows this list)
- Spark backend -- the original reason for using it being DPLA's Ingestion3 engine -- scales extremely well
- ability to handle 10, 10k, 1m, 5m records equally well
- adding more RAM and CPU cores (particularly the latter) increases throughput proportionally
- Transformations of Records provide before-and-after diffs
- Embrace "scenarios" as way to save, modify, and re-use transformations, validations, and field mapping
- would support sharing between instance of Combine
- Exporting data
- mapped fields from a Job (CSV, TSV, JSON)
- record documents from a Job, with optional file partitioning to avoid an XML file you'd need a supercomputer to open (JSON, XML)
- validation reports (CSV, TSV, JSON)
- Performant static imports of XML data as a single file, compressed archives, or a compressed directory of files, via Hadoop file reading and globbing (illustrated after this list)
- Publishing
- outgoing OAI server, with a publish_set_id possible for each published Job that will act as, and/or aggregate under, an OAI set
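
The XML2kvp mapping mentioned above flattens each XML Record into key/value pairs keyed by element path, so records can be indexed and compared across documents. The snippet below is a minimal, hypothetical sketch of that flattening idea, assuming an underscore delimiter and simple attribute handling; it is not Combine's actual XML2kvp implementation.

```python
# A minimal, hypothetical sketch of the XML -> key/value flattening idea behind
# XML2kvp; the delimiter, attribute handling, and namespace stripping here are
# assumptions, not Combine's actual implementation.
from collections import defaultdict
import xml.etree.ElementTree as ET

def flatten(record_xml, delimiter="_"):
    """Flatten an XML document into {key: [values]} pairs keyed by element path."""
    root = ET.fromstring(record_xml)
    kvp = defaultdict(list)

    def walk(node, path):
        tag = node.tag.split("}")[-1]                # strip any namespace prefix
        key = f"{path}{delimiter}{tag}" if path else tag
        if node.text and node.text.strip():
            kvp[key].append(node.text.strip())
        for attr, value in node.attrib.items():      # attributes become key@attr
            kvp[f"{key}@{attr}"].append(value)
        for child in node:
            walk(child, key)

    walk(root, "")
    return dict(kvp)

record = """<mods><titleInfo><title>Detroit, 1925</title></titleInfo>
<identifier type="oclc">12345</identifier></mods>"""
print(flatten(record))
# {'mods_titleInfo_title': ['Detroit, 1925'], 'mods_identifier': ['12345'],
#  'mods_identifier@type': ['oclc']}
```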
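
Similarly, the DPLA isShownAt matching boils down to looking up each Record's mapped URL against URLs known to DPLA. Below is a minimal, hypothetical sketch of that comparison, assuming the DPLA side has already been reduced (e.g. from a bulk download) to a set of isShownAt URLs; the field names and record shape are assumptions, not Combine's code.

```python
# Minimal sketch of matching Records to DPLA by URL; the "mapped_url" field and
# the pre-built set of isShownAt URLs are assumptions, not Combine's actual code.
def match_records_to_dpla(records, dpla_isshownat_urls):
    """Yield (record_id, matched) pairs by comparing each Record's mapped URL
    against the set of isShownAt URLs known from DPLA."""
    for record in records:
        url = record.get("mapped_url")
        yield record["id"], bool(url) and url in dpla_isshownat_urls

records = [
    {"id": "rec1", "mapped_url": "https://example.org/items/1"},
    {"id": "rec2", "mapped_url": "https://example.org/items/404"},
]
dpla_urls = {"https://example.org/items/1"}
print(dict(match_records_to_dpla(records, dpla_urls)))  # {'rec1': True, 'rec2': False}
```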
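
The static import path leans on Spark's ability to read files through Hadoop glob patterns. The PySpark fragment below is a rough illustration only: the path, glob pattern, and one-record-per-file assumption are hypothetical, and Combine's real import code additionally handles compressed archives and record-level parsing.

```python
# Rough PySpark illustration of a globbed static import; the path and the
# one-record-per-file assumption are hypothetical, not Combine's actual code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("static-xml-import").getOrCreate()

# wholeTextFiles accepts Hadoop glob patterns and yields (file_path, file_content)
raw = spark.sparkContext.wholeTextFiles("/data/static_import/*.xml")

# keep the source path alongside the raw XML document for downstream steps
records = raw.map(lambda pair: {"source_file": pair[0], "document": pair[1]})
print(records.count())
```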
Looking ahead:
- exploring integration with OpenRefine (OR) that would allow bulk actions performed in OR to be applied to Records in Combine
- relies on XML2kvp to flatten to OR-friendly fields, make alterations, then "reconstitute" mapped fields to XPath values to alter the original XML Record (equal parts dangerous and exciting)
- exploratory implementation in place, but not documented
- improve background and long-running services
- avoid requiring users to manage their own Livy sessions
- better feedback from background tasks and Spark jobs
- consider replacing Django Background-Tasks with the more robust Celery
- background tasks were originally minimal, but have grown over time
- better integration with DPLA's Ingestion 3 engine
- currently utilizing an OAI harvester, but it would be interesting to see if mapping to the DPLA profile a la Ingestion3 might be possible
- inspired by desired use cases, and by conceptual adjacency to software like Apache NiFi, looking into "chaining" Jobs such that re-running a Job high in the "chain" would trickle through and update all "downstream" Jobs (sketched after this list)
- lineage between Jobs and Records is fairly well established already in Combine
- would require capturing and saving more of the parameters for running a Job, such that it could be re-run without user intervention
- would indirectly support, and require, the ability to re-run Jobs and chain Jobs together (using one as input for another, before running)
- build out unit testing
- explore the possibility of ElasticSearch over Mongo for all Record-heavy tables
- support much better searching
- more responsive
- if using DataTables, this would require an ES/DataTables connector
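
To make the Job "chaining" idea above concrete, here is a purely hypothetical sketch: it assumes Jobs save enough parameters to be re-run without user input and know which Jobs consume their output. The Job class, its fields, and rerun() are illustrative only, not Combine's API.

```python
# Hypothetical sketch of re-running a Job and trickling updates downstream;
# the Job class, its fields, and rerun() are illustrative, not Combine's API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Job:
    name: str
    params: dict                                   # saved parameters needed to re-run unattended
    input_jobs: List["Job"] = field(default_factory=list)
    downstream: List["Job"] = field(default_factory=list)

    def add_input(self, job: "Job"):
        """Use another Job's output as input, recording lineage in both directions."""
        self.input_jobs.append(job)
        job.downstream.append(self)

    def run(self):
        print(f"running {self.name} with {self.params}")

    def rerun(self, _seen=None):
        """Re-run this Job, then every downstream Job that depends on it."""
        _seen = _seen if _seen is not None else set()
        if id(self) in _seen:
            return
        _seen.add(id(self))
        self.run()
        for job in self.downstream:
            job.rerun(_seen)

harvest = Job("harvest", {"endpoint": "https://example.org/oai"})
transform = Job("transform", {"scenario": "mods-to-dc"})
merge = Job("merge", {})
transform.add_input(harvest)
merge.add_input(transform)

harvest.rerun()  # re-running high in the chain trickles through transform and merge
```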