The architectural vision of the OpenMinVWS platform rests on three main principles:
The VWS Woo publishing platform is developed directly as an open-source project, with every aspect built on open-source foundations. All components of the platform rely on open-source services and libraries. An essential component such as the Document Storage is designed behind a flexible abstraction: it can be deployed on a local filesystem, an AWS S3 bucket, SFTP, WebDAV, or even Google Cloud Storage without requiring any changes to the code.
The platform is scalable to any number of documents, as the processing of these documents is carried out by linearly scalable workers. The storage of documents and data in the Document Storage and the PostgreSQL database is likewise linearly scalable.
The emphasis on robustness is evident in the redundancy and clustering of the databases, RabbitMQ, Elasticsearch, the Document Storage, and even the Redis cache. This redundancy means additional machines can be deployed when necessary, preventing any single point of failure. Furthermore, the deep integration of Elasticsearch raises the search capabilities to a higher level.
The Document Storage serves as a central hub for source documents and thumbnails uploaded to the platform. It exclusively accommodates released documents in .jpg, .docx, or .pdf format and their derivatives. The derivatives are used to make the presentation of a document on the website easier. This setup is housed within a conventional file storage cluster that is connected to the various processing machines and is accessed through the PHP library Flysystem as a filesystem abstraction. The functional setup can be compared to Dropbox, Amazon S3, or Google Drive.
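As a minimal sketch of how such a storage abstraction can work, the snippet below writes and reads a document through Flysystem (v3-style API assumed); the local adapter could be swapped for an S3, SFTP, or WebDAV adapter without changing the calling code. The paths and filenames are illustrative, not taken from the actual platform.

```php
<?php
// Flysystem sketch: the adapter decides where the bytes end up, the rest of
// the application only talks to the Filesystem interface.
require 'vendor/autoload.php';

use League\Flysystem\Filesystem;
use League\Flysystem\Local\LocalFilesystemAdapter;

// Local adapter for development; in production this could be an S3, SFTP,
// or WebDAV adapter instead, with no change to the code below.
$adapter = new LocalFilesystemAdapter('/var/document-storage');
$storage = new Filesystem($adapter);

// Store a released document and a derived thumbnail (illustrative paths).
$pdfBytes = file_get_contents('/tmp/upload/document-1.pdf');
$storage->write('dossier-123/document-1.pdf', $pdfBytes);
$storage->write('dossier-123/thumbnails/document-1-page-1.png', $thumbnailBytes = '...png bytes...');

// Reading works the same regardless of the underlying backend.
$contents = $storage->read('dossier-123/document-1.pdf');
```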
The PostgreSQL database serves as a repository for the coherence and structure between dossiers, documents, and files, the relations between documents themselves, and the user information within the portal. PostgreSQL is an industry standard; more information can be found here.
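As a hedged illustration of how these relations might be queried, the snippet below uses PDO to fetch the documents belonging to a dossier. The table and column names (`dossier`, `document`, `dossier_id`, `reference`) are hypothetical and only serve to show the relational structure.

```php
<?php
// Illustrative PDO query against hypothetical tables; the real schema of the
// platform may differ.
$pdo = new PDO('pgsql:host=localhost;dbname=woo', 'app_user', 'secret');

$stmt = $pdo->prepare(
    'SELECT d.id, d.title, d.file_name
       FROM document d
       JOIN dossier ds ON ds.id = d.dossier_id
      WHERE ds.reference = :reference'
);
$stmt->execute(['reference' => 'VWS-2021-01']);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $document) {
    echo $document['title'], PHP_EOL;
}
```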
RabbitMQ serves as the document processing queue for tasks assigned to the workers. Each document triggers the generation of a single task. RabbitMQ only functions as a queue server from which tasks can be taken.
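A minimal sketch of publishing one task per document, assuming the php-amqplib client; the queue name and message format are illustrative assumptions.

```php
<?php
// Publish one processing task per document (sketch; php-amqplib assumed).
require 'vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();

// Durable queue so tasks survive a broker restart (queue name is illustrative).
$channel->queue_declare('document_processing', false, true, false, false);

$task = json_encode(['document_id' => 'doc-123', 'action' => 'ingest']);
$message = new AMQPMessage($task, ['delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT]);

// Publish to the default exchange, routed to the queue by its name.
$channel->basic_publish($message, '', 'document_processing');

$channel->close();
$connection->close();
```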
The Redis cache serves as the caching layer for processing page information. This accelerates processing by avoiding the need to fetch the same information from the same documents repeatedly.
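A small sketch of that caching pattern, assuming the Predis client; the key layout, TTL, and the `extractPageText()` helper are hypothetical.

```php
<?php
// Cache extracted page information so repeated processing does not re-read
// the same document (sketch; Predis client and key layout are assumptions).
require 'vendor/autoload.php';

$redis = new Predis\Client(['host' => 'localhost', 'port' => 6379]);

function getPageText(Predis\Client $redis, string $documentId, int $page): string
{
    $key = sprintf('page-text:%s:%d', $documentId, $page);

    $cached = $redis->get($key);
    if ($cached !== null) {
        return $cached; // cache hit: no need to touch the document again
    }

    $text = extractPageText($documentId, $page); // hypothetical extraction step
    $redis->setex($key, 3600, $text);            // keep it for an hour

    return $text;
}
```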
Elasticsearch serves as a search engine, housing indexed data from every document, which includes both full-text search and metadata. It comprises two distinct indexes: one for writing and another for reading.
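One common way to realise a separate write and read index is through index aliases. The sketch below, assuming a v7-style official elasticsearch-php client, indexes a document via a write alias and searches via a read alias; the alias and field names are illustrative.

```php
<?php
// Index via a write alias, search via a read alias (sketch; client version,
// alias names, and field names are assumptions).
require 'vendor/autoload.php';

$client = Elasticsearch\ClientBuilder::create()
    ->setHosts(['localhost:9200'])
    ->build();

// Write side: index the full text and metadata of a document.
$client->index([
    'index' => 'woo-write',          // alias pointing at the index being written
    'id'    => 'doc-123',
    'body'  => [
        'title'     => 'Example document',
        'full_text' => 'Extracted text of all pages...',
        'dossier'   => 'VWS-2021-01',
    ],
]);

// Read side: full-text search against the read alias.
$results = $client->search([
    'index' => 'woo-read',
    'body'  => [
        'query' => ['match' => ['full_text' => 'vaccination']],
    ],
]);
```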
Workers are automated document processors that treat PDFs as images and extract raw text using open-source packages such as Apache Tika (text extraction) and Tesseract (OCR), thus improving document searchability. The document files are segmented into pages, and the text is extracted from these pages. During this process, we check whether any text is not delivered in a recognizable format. In cases where text extraction fails, workers resort to OCR (Optical Character Recognition), which operates by analyzing shapes, then letters, and finally words. For office documents, text extraction is relatively straightforward, eliminating the need for OCR. In our workflow, each worker is capable of handling any task. Workers start with a blank slate and receive their tasks from RabbitMQ. This decentralized approach ensures efficient task allocation and completion.
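A hedged sketch of such a worker loop, assuming php-amqplib 3.x: it consumes tasks one at a time from the queue and falls back to OCR only when plain text extraction yields nothing. The queue name and the `extractText()`, `runOcr()`, and `storeExtractedText()` helpers are hypothetical.

```php
<?php
// Worker loop sketch (php-amqplib 3.x assumed): take one task at a time,
// extract text, fall back to OCR when no machine-readable text is found.
require 'vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('document_processing', false, true, false, false);
$channel->basic_qos(null, 1, null); // at most one unacknowledged task per worker

$callback = function (AMQPMessage $message): void {
    $task = json_decode($message->getBody(), true);

    // Hypothetical helpers: extractText() performs Tika-style extraction,
    // runOcr() shells out to Tesseract for scanned pages.
    $text = extractText($task['document_id']);
    if (trim($text) === '') {
        $text = runOcr($task['document_id']);
    }

    storeExtractedText($task['document_id'], $text); // hypothetical persistence step
    $message->ack();
};

$channel->basic_consume('document_processing', '', false, false, false, false, $callback);

while ($channel->is_consuming()) {
    $channel->wait();
}
```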