
Releases: MI-DPLA/combine

v0.1

31 Aug 23:12

Release Notes - v0.1

This first release of 🚜 Combine -- whoohoo! -- is an arguably fully functional, but still rough-around-the-edges, version.

Big thanks to the few individuals who have very generously given their time for testing and feedback; it has been a small group, though, and Combine would undoubtedly benefit from more eyes, testing, and perspectives.

But this release supports actual metadata aggregation work in production, and is a good line in the sand for future testing and development.

This release can be installed by checking out the v0.1 tag from the Combine-Playbook repository and building, which automatically pulls the v0.1 tag from this repository.

Release notes:

  • baseline original proposed functionality
    • harvesting, transforming, merging, and analyzing metadata records
  • embraced and implemented MongoDB where DB storage required high-churn CRUD, after experiencing significant performance problems at scale with relatively untuned MySQL
    • determined that guaranteed persistence and ACID compliance are less important in a "pipeline" setting like Combine, where iteration of Jobs and Records is encouraged
  • settled on XML2kvp ("XML to Key/Value Pairs") for mapping XML records, supporting indexing and cross-document analysis (see the XML2kvp sketch after this list)
  • integration with DPLA API and Bulk Downloads
    • primary mechanism for matching is mapping a value in the Record to the DPLA isShownAt field, which represents the record's actual URL online
    • not perfect, but when the limitations of the matching are understood, it can provide handy insights into the status of records and harvests (see the matching sketch after this list)
  • Spark backend -- originally adopted because it powers DPLA's Ingestion3 engine -- scales extremely well
    • ability to handle 10, 10k, 1m, or 5m records equally well
    • adding more RAM and CPU cores (particularly the latter) increases throughput proportionally
  • Transformations of Records provide before-and-after diffs
  • Embrace "scenarios" as a way to save, modify, and re-use transformations, validations, and field mappings
    • would support sharing between instances of Combine
  • Exporting data
    • mapped fields from a Job (CSV, TSV, JSON)
    • record documents from a Job, with optional file partitioning to avoid an XML file you'd need a supercomputer to open (JSON, XML)
    • validation reports (CSV, TSV, JSON)
  • Performant static imports of XML data as a single file, compressed archives, or a compressed directory of files, via Hadoop file reading and globbing (see the PySpark sketch after this list)
  • Publishing
    • outgoing OAI-PMH server, with an optional publish_set_id for each published Job that will act as, and/or aggregate under, an OAI set (see the harvesting example after this list)
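
The following is a minimal, illustrative sketch of the flattening idea behind XML2kvp -- not Combine's actual implementation. It assumes lxml is available, collapses element paths into delimiter-joined keys, and drops namespace URIs for readability; the real mapping is configurable well beyond this.

```python
# Minimal sketch of the XML2kvp idea: flatten an XML record into key/value
# pairs whose keys encode the element path. Illustration only, not the
# actual XML2kvp implementation shipped with Combine.
from lxml import etree

def flatten_record(xml_string, delimiter="_"):
    """Return a dict mapping path-like keys to lists of text values."""
    root = etree.fromstring(xml_string.encode("utf-8"))
    kvp = {}
    for element in root.iter():
        if element.text and element.text.strip():
            # Build a key from the tag names along the path to this element,
            # dropping namespace URIs for readability.
            path = [etree.QName(e).localname for e in element.iterancestors()]
            path.reverse()
            path.append(etree.QName(element).localname)
            key = delimiter.join(path)
            kvp.setdefault(key, []).append(element.text.strip())
    return kvp

record = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample title</dc:title>
  <dc:creator>Sample creator</dc:creator>
</record>
"""
print(flatten_record(record))
# {'record_title': ['Sample title'], 'record_creator': ['Sample creator']}
```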
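
Below is a rough sketch of the isShownAt matching idea. The bulk-download file name and its layout (newline-delimited JSON with top-level "isShownAt" and "id" fields) are assumptions for illustration, not Combine's actual code or the exact DPLA bulk-download format.

```python
# Rough sketch: index DPLA items by isShownAt, then match a Combine
# Record's mapped URL against that index. File name and field layout
# are illustrative assumptions.
import json

def build_isshownat_index(bulk_download_path):
    """Map isShownAt URL -> DPLA item id."""
    index = {}
    with open(bulk_download_path) as f:
        for line in f:
            item = json.loads(line)
            url = item.get("isShownAt")
            if url:
                index[url] = item.get("id")
    return index

def match_record(mapped_url, index):
    """Return the DPLA item id for a Record's mapped URL, or None."""
    return index.get(mapped_url)

index = build_isshownat_index("dpla_bulk_download.jsonl")
print(match_record("http://example.org/records/123", index))
```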
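
A small PySpark sketch of the static-import approach: Hadoop-style globbing lets one job pull in many XML files at once. The path is a placeholder, and this is the general pattern rather than Combine's import code.

```python
# Illustrative PySpark sketch: read many XML files in one pass using a
# Hadoop-style glob. wholeTextFiles returns (path, contents) pairs, so
# each XML file arrives as a single string per record.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("static-xml-import").getOrCreate()
files_rdd = spark.sparkContext.wholeTextFiles("/data/harvest/*.xml")
print(files_rdd.count())
```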
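
Once a Job is published under a publish_set_id, it can be harvested from Combine's outgoing OAI-PMH server like any other OAI set. Here is a short example using the Sickle client library; the endpoint URL and set name are placeholders.

```python
# Harvest a published set from Combine's outgoing OAI-PMH server using
# the Sickle client. Endpoint URL and publish_set_id are placeholders.
from sickle import Sickle

sickle = Sickle("http://combine.example.org/oai")
records = sickle.ListRecords(metadataPrefix="oai_dc", set="my_publish_set_id")

for record in records:
    # record.header.identifier is the OAI identifier; record.raw holds the
    # full metadata payload as served by Combine.
    print(record.header.identifier)
```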

Looking ahead:

  • exploring integration with OpenRefine (OR) that would allow bulk actions performed in OR to be applied to Records in Combine
    • relies on XML2kvp to flatten to OR-friendly fields, make alterations, then "reconstitute" mapped fields to XPath values to alter the original XML Record (equal parts dangerous and exciting)
    • exploratory implementation in place, but not documented
  • improve background and long-running services
    • avoid requiring users to manage their own Livy sessions
    • better feedback from background tasks and Spark jobs
    • consider replacing Django Background-Tasks with the more robust Celery
      • background tasks were originally minimal, but have grown over time
  • better integration with DPLA's Ingestion 3 engine
    • currently utilizing an OAI harvester, but it would be interesting to see if mapping to the DPLA profile à la Ingestion3 might be possible
  • inspired by desired use cases, and conceptual adjacency to software like Apache NiFi, looking into "chaining" Jobs such that re-running a Job high in the "chain" would trickle through and update all "downstream" Jobs
    • lineage between Jobs and Records is fairly well established already in Combine
    • would require capturing and saving more of the parameters for running a Job, such that it could be re-run without user intervention
    • would indirectly support and require the ability to re-run Jobs and chain Jobs together (use one as input for another, before running)
  • build out unit testing
  • explore the possibility of ElasticSearch, over Mongo, for all Record-heavy tables
    • support much better searching
    • more responsive
    • if using DataTables, would require an ES/DataTables connector