DataManagementLab/lab23_wannadb_scale-up

 
 

WannaDB: Ad-hoc SQL Queries over Text Collections

Document collection and corresponding table.

WannaDB allows users to explore unstructured text collections by automatically organizing the relevant information nuggets in a table. It supports ad-hoc SQL queries over text collections using a novel two-phased approach. First, a superset of information nuggets is extracted from the texts using existing extractors such as named entity recognizers. The extractions are then interactively matched to a structured table definition as requested by the user.

Usage

Run main.py to start the WannaDB GUI.

There are also various auxiliary scripts in scripts/ and the experimentation repository (coming soon).

Installation

This project requires Python 3.9.

1. Create a virtual environment.
python -m venv venv
source venv/bin/activate
export PYTHONPATH="."
2. Install the dependencies.
pip install --upgrade pip
pip install --use-pep517 -r requirements.txt
pip install --use-pep517 pytest

You may have to install torch by hand if you want to use CUDA:

https://pytorch.org/get-started/locally/

3. Run the tests.
pytest

Citing WannaDB

The code in this repository is the result of several scientific publications. If you build upon WannaDB, please cite:

@inproceedings{wannadb@BTW23,
author = {Hättasch, Benjamin AND Bodensohn, Jan-Micha AND Vogel, Liane AND Urban, Matthias AND Binnig, Carsten},
title = {WannaDB: Ad-hoc SQL Queries over Text Collections},
booktitle = {BTW 2023},
year = {2023},
editor = {König-Ries, Birgitta AND Scherzinger, Stefanie AND Lehner, Wolfgang AND Vossen, Gottfried},
doi = {10.18420/BTW2023-08},
publisher = {Gesellschaft für Informatik e.V.},
address = {}
}

If you want to reference specific features/parts, our further publications might be relevant:

@inproceedings{aset@SIGMOD22,
author = {H\"{a}ttasch, Benjamin and Bodensohn, Jan-Micha and Binnig, Carsten},
title = {Demonstrating ASET: Ad-Hoc Structured Exploration of Text Collections},
year = {2022},
isbn = {9781450392495},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3514221.3520174},
doi = {10.1145/3514221.3520174},
abstract = {In this demo, we present ASET, a novel tool to explore the contents of unstructured data (text) by automatically transforming relevant parts into tabular form. ASET works in an ad-hoc manner without the need to curate extraction pipelines for the (unseen) text collection or to annotate large amounts of training data. The main idea is to use a new two-phased approach that first extracts a superset of information nuggets from the texts using existing extractors such as named entity recognizers. In a second step, it leverages embeddings and a novel matching strategy to match the extractions to a structured table definition as requested by the user. This demo features the ASET system with a graphical user interface that allows people without machine learning or programming expertise to explore text collections efficiently. This can be done in a self-directed and flexible manner, and ASET provides an intuitive impression of the result quality.},
booktitle = {Proceedings of the 2022 International Conference on Management of Data},
pages = {2393–2396},
numpages = {4},
keywords = {matching embeddings, text to table, interactive text exploration},
location = {Philadelphia, PA, USA},
series = {SIGMOD '22}
}
@inproceedings{aset@AIDB21,
    author = {H{\"a}ttasch, Benjamin and Bodensohn, Jan-Micha and Binnig, Carsten},
    year = "2021",
    title = "ASET: Ad-hoc Structured Exploration of Text Collections",
    eventdate = "16.-20.08.2021",
    language = "en",
    booktitle = "3rd International Workshop on Applied AI for Database Systems and Applications (AIDB21). In conjunction with the 47th International Conference on Very Large Data Bases, Copenhagen, Denmark, August 16 - 20, 2021.",
    location = "Copenhagen, Denmark"
}
@inproceedings{wannadb@DESIRES21,
    author = {H{\"{a}}ttasch, Benjamin},
    title = "WannaDB: Ad-hoc Structured Exploration of Text Collections Using Queries",
    booktitle = "Proceedings of the Second International Conference on Design of Experimental Search Information REtrieval Systems, Padova, Italy, September 15-18, 2021",
    series = "{CEUR} Workshop Proceedings",
    volume = "2950",
    pages = "179--180",
    publisher = "CEUR-WS.org",
    year = "2021",
    url = "http://ceur-ws.org/Vol-2950/paper-23.pdf",
    timestamp = "Mon, 25 Oct 2021 15:03:55 +0200",
    biburl = "https://dblp.org/rec/conf/desires/Hattasch21.bib",
    bibsource = "dblp computer science bibliography, https://dblp.org"
}

License

WannaDB is dual-licensed: under AGPLv3 for free use by end users and for embedding in open-source projects, and under a commercial license for integration into industrial projects and closed-source toolchains. More details can be found in our license agreement.

Availability of Code & Datasets

We publish the source code for our system, as discussed in the papers, here. Additionally, we publish code to reproduce our experiments in a separate repository (coming soon).

Unfortunately, we cannot publish the datasets online due to copyright issues. We will send them via email on request to everyone interested and hope they can be of benefit for other research, too.

Implementation details

The core of WannaDB (extraction and matching) was previously developed by us under the name ASET (Ad-hoc Structured Exploration of Text Collections). To better reflect the vision of the whole application cycle that we present in this paper, we switched the name to WannaDB.

Repository structure

This repository is structured as follows:

  • wannadb, wannadb_parsql, and wannadb_ui contain the implementation of ASET and the GUI.
  • scripts contains helpers, like a stand-alone preprocessing script.
  • tests contains pytest tests.

Architecture: Core

The core implementation of WannaDB is in the wannadb package and implemented as a library. The implementation allows you to construct pipelines of different data processors that work with the data model and may involve user feedback.

Data model

data contains WannaDB's data model. The entities are InformationNuggets, Attributes, Documents, and the DocumentBase.

A nugget is an information piece obtained from a document. An attribute is a table column that gets populated with information from the documents. A document is a textual document, and the document base is a collection of documents and provides facilities for BSON serialization, consistency checks, and data access.
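
To make the relationships between these entities concrete, here is a minimal sketch using plain dataclasses. The class names follow the entities described above, but the fields (`start_char`, `end_char`, `nuggets`) and the `validate` method are assumptions made for illustration and are not the actual classes from the data package.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A textual document (simplified stand-in for illustration)."""
    name: str
    text: str
    # Excluded from repr/eq because nuggets point back to their document.
    nuggets: list["InformationNugget"] = field(
        default_factory=list, repr=False, compare=False
    )


@dataclass
class InformationNugget:
    """A piece of information, modelled here as a character span."""
    document: Document
    start_char: int
    end_char: int

    @property
    def text(self) -> str:
        return self.document.text[self.start_char:self.end_char]


@dataclass
class Attribute:
    """A table column that gets populated from the documents."""
    name: str


@dataclass
class DocumentBase:
    """A collection of documents plus the attributes of the target table."""
    documents: list[Document]
    attributes: list[Attribute]

    def validate(self) -> bool:
        """Hypothetical consistency check: unique attribute names and
        nugget spans that lie inside their own document."""
        names = [attribute.name for attribute in self.attributes]
        if len(names) != len(set(names)):
            return False
        for document in self.documents:
            for nugget in document.nuggets:
                if nugget.document is not document:
                    return False
                if not 0 <= nugget.start_char < nugget.end_char <= len(document.text):
                    return False
        return True
```

In this sketch a nugget stores only offsets, so its text is always derived from the document it belongs to.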

InformationNuggets, Attributes, and Documents can have BaseSignals, which provide a way to easily store additional information with them. Each signal is identified with a unique identifier and implements the serialization and deserialization. Furthermore, some signals may not be serialized. There are base implementations for different data types like floats or numpy arrays.
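
The signal idea described above can be sketched as follows. Only the name `BaseSignal` comes from the text; `FloatSignal`, `RuntimeOnlySignal`, the `do_serialize` flag, and the two helper functions are hypothetical, and a plain dictionary stands in for the BSON document that the document base would actually produce.

```python
from abc import ABC, abstractmethod
from typing import Any, ClassVar


class BaseSignal(ABC):
    """Extra information attached to a nugget, attribute, or document."""

    identifier: ClassVar[str] = "BaseSignal"
    do_serialize: ClassVar[bool] = True  # assumption: opt-out flag

    def __init__(self, value: Any) -> None:
        self.value = value

    @abstractmethod
    def to_serializable(self) -> Any:
        """Convert the value into something a serializer can store."""

    @classmethod
    @abstractmethod
    def from_serializable(cls, data: Any) -> "BaseSignal":
        """Rebuild the signal from its stored form."""


class FloatSignal(BaseSignal):
    identifier = "FloatSignal"

    def to_serializable(self) -> float:
        return float(self.value)

    @classmethod
    def from_serializable(cls, data: Any) -> "FloatSignal":
        return cls(float(data))


class RuntimeOnlySignal(BaseSignal):
    """Holds a runtime-only object, so it is skipped during serialization."""

    identifier = "RuntimeOnlySignal"
    do_serialize = False

    def to_serializable(self) -> Any:
        raise TypeError("RuntimeOnlySignal is not serialized")

    @classmethod
    def from_serializable(cls, data: Any) -> "RuntimeOnlySignal":
        raise TypeError("RuntimeOnlySignal is not serialized")


# Lookup table from unique identifier to signal class.
SIGNAL_TYPES = {cls.identifier: cls for cls in (FloatSignal, RuntimeOnlySignal)}


def serialize_signals(signals: dict[str, BaseSignal]) -> dict[str, Any]:
    """Store every signal that opts into serialization under its identifier."""
    return {
        identifier: signal.to_serializable()
        for identifier, signal in signals.items()
        if signal.do_serialize
    }


def deserialize_signals(data: dict[str, Any]) -> dict[str, BaseSignal]:
    """Use the identifier to pick the class that restores each signal."""
    return {
        identifier: SIGNAL_TYPES[identifier].from_serializable(value)
        for identifier, value in data.items()
    }
```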

Configurations

configuration.py contains the abstract pipeline code. A Pipeline allows you to execute multiple BasePipelineElements one after the other. These pipeline elements work on a DocumentBase and receive a BaseInteractionCallback and a BaseStatusCallback to facilitate user interactions and convey status updates. Furthermore, they receive a Statistics object that allows them to record information during runtime.

Both BasePipelineElements and the Pipeline are BaseConfigurableElements. This means that they come with a unique identifier and provide methods to instantiate them from a given configuration dictionary and to serialize their configuration as a dictionary.
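
A rough sketch of this configuration round trip is shown below. The method names `to_config` and `from_config`, the `ELEMENTS` registry, and the two example elements are assumptions chosen for illustration; the actual interface in configuration.py may differ.

```python
from typing import Any, Iterable


class BaseConfigurableElement:
    """Something with a unique identifier that can be (de)serialized as a dict."""

    identifier = "BaseConfigurableElement"

    def to_config(self) -> dict[str, Any]:
        return {"identifier": self.identifier}

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "BaseConfigurableElement":
        return cls()


class Lowercaser(BaseConfigurableElement):
    """Hypothetical element without parameters."""

    identifier = "Lowercaser"


class Truncator(BaseConfigurableElement):
    """Hypothetical element with one parameter."""

    identifier = "Truncator"

    def __init__(self, max_chars: int) -> None:
        self.max_chars = max_chars

    def to_config(self) -> dict[str, Any]:
        return {"identifier": self.identifier, "max_chars": self.max_chars}

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "Truncator":
        return cls(config["max_chars"])


# Registry used to find the right class for a stored identifier.
ELEMENTS = {cls.identifier: cls for cls in (Lowercaser, Truncator)}


class Pipeline(BaseConfigurableElement):
    """A pipeline is itself configurable and nests its elements' configs."""

    identifier = "Pipeline"

    def __init__(self, elements: Iterable[BaseConfigurableElement]) -> None:
        self.elements = list(elements)

    def to_config(self) -> dict[str, Any]:
        return {
            "identifier": self.identifier,
            "elements": [element.to_config() for element in self.elements],
        }

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "Pipeline":
        return cls(
            ELEMENTS[element_config["identifier"]].from_config(element_config)
            for element_config in config["elements"]
        )
```

Because the configuration is a plain dictionary, a pipeline described this way could be stored next to experiment results and rebuilt later.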

Each BasePipelineElement specifies which BaseSignals it requires and generates for the nuggets, attributes, and documents. This ensures the consistency of the pipeline. In other words, when a pipeline element is executed, all signals it requires must be set.
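
The consistency rule can be illustrated with a small check that walks through the elements in order and tracks which signals are available. This is a simplified sketch: `ElementSpec` and `check_pipeline` are hypothetical names, and a single set of signal identifiers is used instead of separate sets for nuggets, attributes, and documents.

```python
from dataclasses import dataclass, field


@dataclass
class ElementSpec:
    """Declares which signals an element needs and which it produces."""
    identifier: str
    required_signals: set[str] = field(default_factory=set)
    generated_signals: set[str] = field(default_factory=set)


def check_pipeline(elements: list[ElementSpec], initial_signals: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the order is consistent."""
    available = set(initial_signals)
    problems = []
    for element in elements:
        missing = element.required_signals - available
        if missing:
            problems.append(f"{element.identifier} is missing {sorted(missing)}")
        # Signals generated here are available to all later elements.
        available |= element.generated_signals
    return problems
```

For example, an element that requires an embedding signal is only valid if an earlier element generates that signal or the document base already carries it.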

Callbacks

interaction.py and status.py contain BaseInteractionCallback and BaseStatusCallback, which allow the pipeline elements to request user interactions and convey status updates. They come with default implementations InteractionCallback and StatusCallback that receive a callback function when initialized, and EmptyInteractionCallback and EmptyStatusCallback that simply do nothing.
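
The following sketch shows how such callbacks might be wired together. The class names match the ones mentioned above, but their signatures (a dictionary in and out for interactions, a message plus progress value for status updates) and the `confirm_matches` function are assumptions for illustration only.

```python
from typing import Any, Callable


class BaseInteractionCallback:
    def __call__(self, data: dict[str, Any]) -> dict[str, Any]:
        raise NotImplementedError


class InteractionCallback(BaseInteractionCallback):
    """Forwards interaction requests to a function given at initialization."""

    def __init__(self, callback_fn: Callable[[dict[str, Any]], dict[str, Any]]) -> None:
        self._callback_fn = callback_fn

    def __call__(self, data: dict[str, Any]) -> dict[str, Any]:
        return self._callback_fn(data)


class EmptyInteractionCallback(BaseInteractionCallback):
    """Never asks the user anything and returns an empty answer."""

    def __call__(self, data: dict[str, Any]) -> dict[str, Any]:
        return {}


class BaseStatusCallback:
    def __call__(self, message: str, progress: float) -> None:
        raise NotImplementedError


class StatusCallback(BaseStatusCallback):
    """Forwards status updates to a function given at initialization."""

    def __init__(self, callback_fn: Callable[[str, float], None]) -> None:
        self._callback_fn = callback_fn

    def __call__(self, message: str, progress: float) -> None:
        self._callback_fn(message, progress)


class EmptyStatusCallback(BaseStatusCallback):
    """Silently drops all status updates."""

    def __call__(self, message: str, progress: float) -> None:
        pass


def confirm_matches(
    candidates: list[str],
    interaction_callback: BaseInteractionCallback,
    status_callback: BaseStatusCallback,
) -> list[str]:
    """Hypothetical pipeline step that asks for feedback on each candidate."""
    confirmed = []
    for index, candidate in enumerate(candidates):
        status_callback(f"Checking {candidate!r}", index / len(candidates))
        answer = interaction_callback({"candidate": candidate})
        if answer.get("accept"):
            confirmed.append(candidate)
    status_callback("Done", 1.0)
    return confirmed
```

With the empty variants, the same step can run unattended, for example in a test or a batch script, without any GUI attached.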

Resources

resources.py contains a resource manager that allows different parts of WannaDB to share resources like embeddings or transformer models. The module implements the singleton pattern, so there is always only one ResourceManager accessed via resources.MANAGER, which handles the loading, access, and unloading of BaseResources. You should use a context manager (with ResourceManager() as resource_manager:) to ensure that all resources are properly closed when the program stops/crashes.

Each BaseResource comes with a unique identifier and implements methods for loading, unloading, and access.
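
A minimal sketch of this pattern, assuming a module-level `MANAGER` variable that is set while the context manager is active, could look as follows. `ToyEmbeddings` and the exact method names are hypothetical; only `ResourceManager`, `BaseResource`, and `MANAGER` are taken from the description above.

```python
from typing import Any, Optional


class BaseResource:
    """A shareable resource with a unique identifier."""

    identifier = "BaseResource"

    @classmethod
    def load(cls) -> "BaseResource":
        raise NotImplementedError

    def unload(self) -> None:
        pass

    @property
    def resource(self) -> Any:
        raise NotImplementedError


class ToyEmbeddings(BaseResource):
    """Hypothetical resource: a tiny in-memory embedding table."""

    identifier = "ToyEmbeddings"

    def __init__(self, vectors: dict[str, list[float]]) -> None:
        self._vectors = vectors
        self.loaded = True

    @classmethod
    def load(cls) -> "ToyEmbeddings":
        return cls({"berlin": [1.0, 0.0], "paris": [0.0, 1.0]})

    def unload(self) -> None:
        self._vectors = {}
        self.loaded = False

    @property
    def resource(self) -> dict[str, list[float]]:
        return self._vectors


MANAGER: Optional["ResourceManager"] = None  # the single active manager, if any


class ResourceManager:
    def __init__(self) -> None:
        self._resources: dict[str, BaseResource] = {}

    def __enter__(self) -> "ResourceManager":
        global MANAGER
        if MANAGER is not None:
            raise RuntimeError("Only one ResourceManager may be active.")
        MANAGER = self
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Runs on normal exit and on exceptions, so resources are always released.
        global MANAGER
        for identifier in list(self._resources):
            self.unload(identifier)
        MANAGER = None

    def load(self, resource_type: type) -> None:
        if resource_type.identifier not in self._resources:
            self._resources[resource_type.identifier] = resource_type.load()

    def unload(self, identifier: str) -> None:
        self._resources.pop(identifier).unload()

    def __getitem__(self, identifier: str) -> Any:
        return self._resources[identifier].resource
```

Loading is idempotent in this sketch, so several components can request the same resource without loading it twice.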

Statistics

The Statistics object allows you to easily record information during runtime. It is handed from the Pipeline to the BasePipelineElements, and from the BasePipelineElements to other components like distance functions.
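
One way to picture this hand-over is a nested, dictionary-like object where each component receives its own sub-section. The sketch below is an assumption about how such an object could behave, not the actual Statistics class; `increment`, `to_serializable`, `absolute_distance`, and `run_element` are hypothetical names.

```python
from typing import Any


class Statistics:
    """Nested key-value store for information recorded at runtime."""

    def __init__(self) -> None:
        self._entries: dict[str, Any] = {}

    def __getitem__(self, key: str) -> Any:
        # Accessing an unknown key creates a nested section on the fly.
        if key not in self._entries:
            self._entries[key] = Statistics()
        return self._entries[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self._entries[key] = value

    def increment(self, key: str, amount: int = 1) -> None:
        self._entries[key] = self._entries.get(key, 0) + amount

    def to_serializable(self) -> dict[str, Any]:
        return {
            key: value.to_serializable() if isinstance(value, Statistics) else value
            for key, value in self._entries.items()
        }


def absolute_distance(a: float, b: float, statistics: Statistics) -> float:
    """Hypothetical distance function that counts how often it is called."""
    statistics.increment("num_calls")
    return abs(a - b)


def run_element(pairs: list[tuple[float, float]], statistics: Statistics) -> list[float]:
    """Hypothetical pipeline element that passes a sub-section onwards."""
    statistics["num_pairs"] = len(pairs)
    return [absolute_distance(a, b, statistics["distance"]) for a, b in pairs]
```

A pipeline would then call something like `run_element(pairs, statistics["matcher"])`, so that each element and each helper writes into its own section of the same object.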

Architecture: GUI

The GUI implementation can be found in the wannadb_ui package. wannadb_api.py provides an asynchronous API for the wannadb library using PyQt's slots and signals mechanism. main_window.py, document_base.py, and interactive_window.py contain different parts of the user interface, and common.py contains base classes for some recurring user interface elements.

Reproducing the Multiprocessing Cluster Experiments

Two new files have been added to benchmark the preprocessing step: test_baseline.py in the main branch and test_multiprocessing.py in the multiprocessing branch.

The current multiprocessing implementation assumes that all models have already been downloaded. Therefore, you first need to start on the main branch and run test_baseline.py with the following steps:

  1. Set up and activate a Python venv:
    1. python3.9 -m venv <path_to_venv>
    2. source <path_to_venv>/bin/activate
  2. Comment out the three pyqt6 dependencies in requirements.txt
  3. python3.9 -m pip install -r requirements.txt
  4. Edit the input_path variable in the test_baseline.py file to point to the location of your raw documents (e.g., ./data/corona/raw-documents)
  5. touch results.txt
  6. python3.9 ./test_baseline.py

To start the multiprocessing benchmark, first check out the abagabe-multiprocessing branch and apply steps 4 and 5 above to the test_multiprocessing.py file instead of test_baseline.py.
