
Releases: MI-DPLA/combine

v0.1

31 Aug 23:12

Release Notes - v0.1

This first release of 🚜 Combine -- whoohoo! -- is an arguably fully functional, but still rough-around-the-edges, version.

Big thanks to the few individuals who have very generously given their time for testing and feedback; it has been a small group, though, and Combine would undoubtedly benefit from more eyes, testing, and perspectives.

But this release supports actual metadata aggregation work in production, and is a good line in the sand for future testing and development.

This release can be installed by checking out the v0.1 tag from the Combine-Playbook repository and building, which automatically pulls the v0.1 tag from this repository.

Release notes:

  • baseline original proposed functionality
    • harvesting, transforming, merging, and analyzing metadata records
  • embraced and implemented MongoDB where DB storage required high-churn CRUD, after experiencing significant performance problems at scale with relatively untuned MySQL
    • determined that guaranteed persistence and ACID compliance are less important in a "pipeline" setting like Combine, where iteration of Jobs and Records is encouraged
  • settled on XML2kvp ("XML to Key/Value Pairs") for mapping XML records, supporting indexing and cross-document analysis (see the XML2kvp sketch after this list)
  • integration with DPLA API and Bulk Downloads
    • primary mechanism for matching is mapping a value in the Record to the DPLA isShownAt field, which represents the record's actual URL online
    • not perfect, but when the limitations of the matching are understood, it can provide handy insights into the status of records and harvests (see the matching sketch after this list)
  • Spark backend -- originally adopted because it powers DPLA's Ingestion3 engine -- scales extremely well
    • ability to handle 10, 10k, 1m, or 5m records equally well
    • adding more RAM and CPU cores (particularly the latter) increases throughput proportionally
  • Transformations of Records provide before-and-after diffs
  • Embrace "scenarios" as a way to save, modify, and re-use transformations, validations, and field mappings
    • would support sharing between instances of Combine
  • Exporting data
    • mapped fields from a Job (CSV, TSV, JSON)
    • record documents from a Job, with optional file partitioning to avoid an XML file you'd need a supercomputer to open (JSON, XML)
    • validation reports (CSV, TSV, JSON)
  • Performant static imports of XML data as a single file, compressed archives, or a compressed directory of files, via Hadoop file reading and globbing (see the PySpark sketch after this list)
  • Publishing
    • outgoing OAI-PMH server, with an optional publish_set_id for each published Job that will act as, and/or aggregate under, an OAI set (see the harvesting example after this list)
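
The following is a minimal, illustrative sketch of the flattening idea behind XML2kvp -- not Combine's actual implementation. It assumes lxml is available, collapses element paths into delimiter-joined keys, and drops namespace URIs for readability; the real mapping is configurable well beyond this.

```python
# Minimal sketch of the XML2kvp idea: flatten an XML record into key/value
# pairs whose keys encode the element path. Illustration only, not the
# actual XML2kvp implementation shipped with Combine.
from lxml import etree

def flatten_record(xml_string, delimiter="_"):
    """Return a dict mapping path-like keys to lists of text values."""
    root = etree.fromstring(xml_string.encode("utf-8"))
    kvp = {}
    for element in root.iter():
        if element.text and element.text.strip():
            # Build a key from the tag names along the path to this element,
            # dropping namespace URIs for readability.
            path = [etree.QName(e).localname for e in element.iterancestors()]
            path.reverse()
            path.append(etree.QName(element).localname)
            key = delimiter.join(path)
            kvp.setdefault(key, []).append(element.text.strip())
    return kvp

record = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample title</dc:title>
  <dc:creator>Sample creator</dc:creator>
</record>
"""
print(flatten_record(record))
# {'record_title': ['Sample title'], 'record_creator': ['Sample creator']}
```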
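
Below is a rough sketch of the isShownAt matching idea. The bulk-download file name and its layout (newline-delimited JSON with top-level "isShownAt" and "id" fields) are assumptions for illustration, not Combine's actual code or the exact DPLA bulk-download format.

```python
# Rough sketch: index DPLA items by isShownAt, then match a Combine
# Record's mapped URL against that index. File name and field layout
# are illustrative assumptions.
import json

def build_isshownat_index(bulk_download_path):
    """Map isShownAt URL -> DPLA item id."""
    index = {}
    with open(bulk_download_path) as f:
        for line in f:
            item = json.loads(line)
            url = item.get("isShownAt")
            if url:
                index[url] = item.get("id")
    return index

def match_record(mapped_url, index):
    """Return the DPLA item id for a Record's mapped URL, or None."""
    return index.get(mapped_url)

index = build_isshownat_index("dpla_bulk_download.jsonl")
print(match_record("http://example.org/records/123", index))
```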
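
A small PySpark sketch of the static-import approach: Hadoop-style globbing lets one job pull in many XML files at once. The path is a placeholder, and this is the general pattern rather than Combine's import code.

```python
# Illustrative PySpark sketch: read many XML files in one pass using a
# Hadoop-style glob. wholeTextFiles returns (path, contents) pairs, so
# each XML file arrives as a single string per record.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("static-xml-import").getOrCreate()
files_rdd = spark.sparkContext.wholeTextFiles("/data/harvest/*.xml")
print(files_rdd.count())
```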
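
Once a Job is published under a publish_set_id, it can be harvested from Combine's outgoing OAI-PMH server like any other OAI set. Here is a short example using the Sickle client library; the endpoint URL and set name are placeholders.

```python
# Harvest a published set from Combine's outgoing OAI-PMH server using
# the Sickle client. Endpoint URL and publish_set_id are placeholders.
from sickle import Sickle

sickle = Sickle("http://combine.example.org/oai")
records = sickle.ListRecords(metadataPrefix="oai_dc", set="my_publish_set_id")

for record in records:
    # record.header.identifier is the OAI identifier; record.raw holds the
    # full metadata payload as served by Combine.
    print(record.header.identifier)
```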

Looking ahead:

  • exploring integration with OpenRefine (OR) that would allow bulk actions performed in OR to be applied to Records in Combine
    • relies on XML2kvp to flatten to OR-friendly fields, make alterations, then "reconstitute" mapped fields to XPath values to alter the original XML Record (equal parts dangerous and exciting)
    • exploratory implementation in place, but not documented
  • improve background and long-running services
    • avoid requiring users to manage their own Livy sessions
    • better feedback from background tasks and Spark jobs
    • consider replacing Django Background-Tasks with the more robust Celery
      • background tasks were originally minimal, but have grown over time
  • better integration with DPLA's Ingestion 3 engine
    • currently utilizing an OAI harvester, but it would be interesting to see if mapping to the DPLA profile à la Ingestion3 might be possible
  • inspired by desired use cases, and conceptual adjacency to software like Apache NiFi, looking into "chaining" Jobs such that re-running a Job high in the "chain" would trickle through and update all "downstream" Jobs
    • lineage between Jobs and Records is fairly well established already in Combine
    • would require capturing and saving more of the parameters for running a Job, such that it could be re-run without user intervention
    • would indirectly support and require the ability to re-run Jobs and chain Jobs together (use one as input for another, before running)
  • build out unit testing
  • explore the possibility of ElasticSearch, over Mongo, for all Record-heavy tables
    • support much better searching
    • more responsive
    • if using DataTables, would require an ES/DataTables connector