
Analytics module #1569

Open · 12 of 25 tasks
winged opened this issue Sep 28, 2021 · 0 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed), needs discussion


winged commented Sep 28, 2021

Requirements

We need an analytics module that allows us to run queries, get summaries, and find insights into the data stored in Caluma.

Requirements are as follows:

  • Analytics over multiple entity types must be possible
  • Possibility to run aggregate functions over any part of a case - work item attributes, document answers, and so on
  • Respect permission and visibility rules

Implementation - Specification

Architecture overview

The analytics module's functionality will be guided by the "Pivot Table" feature found in Microsoft Excel and LibreOffice Calc. There are two steps for a full analysis view:

  1. Generate a table of data to be analyzed
  2. Apply the pivot table mechanism on selected fields

The queries shall be stored as objects in the database, and be queried via GraphQL. This way, the module splits into a design time part and a runtime part, the same as the rest of Caluma.

Table building

The first step in building an analysis is selecting the relevant data. A starting point is required to identify each row. From there, the data needed for the analysis can be selected.

For example, let's start with Cases. From there, all answers in the case's document are available for selection. Answers from work item (and sub-work-item!) documents should be available for selection as well.

from django.utils.translation import gettext as _

from caluma.caluma_workflow.models import Case


# BaseStartingPoint is part of the proposed analytics module (not yet implemented).
class CaseStartingPoint(BaseStartingPoint):
    def get_fields(self):
        """Return a list of (slug, label) tuples that can be selected as fields."""
        fields = set([
            ("document.form", _("Case doc: form slug")),
            ("workflow", _("Workflow slug")),
        ])
        for case in Case.objects.all():
            # naive impl!
            for metakey in case.meta:
                fields.add((f"meta.{metakey}", _("Case: Meta key {}").format(metakey)))
            for question in case.document.form.questions.all():
                fields.add(
                    (f"document.{question.slug}", _("Case Doc: Question {}").format(question.label))
                )
                # Descend into subforms etc as needed
            # descend into work items, repeat
        return sorted(fields)

    def join_field(self, field):
        """Generate SQL to JOIN the given field."""
        # ...

For every such field, there must be a method that tells the system how to JOIN the value, such that an efficient SQL query can be built.

The above example shows the basic structure of how this could look.
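
To make the join_field idea a bit more concrete, here is a minimal sketch of a field source for a document answer, assuming Django's Subquery/OuterRef and Caluma's Answer model; the AnswerFieldSource name and its annotate interface are invented for this illustration, not part of any existing API:

from django.db.models import OuterRef, Subquery

from caluma.caluma_form.models import Answer


class AnswerFieldSource:
    def __init__(self, question_slug):
        self.question_slug = question_slug

    def annotate(self, case_queryset, alias):
        # Correlated subquery: join the answer to the given question from the
        # case's document, keeping exactly one output row per case.
        answer_value = Answer.objects.filter(
            document_id=OuterRef("document_id"),
            question_id=self.question_slug,
        ).values("value")[:1]
        return case_queryset.annotate(**{alias: Subquery(answer_value)})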

To improve performance in the design phase, we must consider returning a partial list of available fields, filtered by prefix: at first, only the top-level fields are returned, and subsequent queries can then drill down to get more fields with a given prefix. The returned fields should be typed such that the frontend knows whether a field is only a path segment or an actually selectable field.

Users should then be able to filter the resulting table. As every field contains a manageable number of distinct values, a simple list of values to show for each column should work fine: for example, one could request that only cases with a given workflow slug are shown.

Users should also be able to rename the selected fields (define aliases), so that they make more sense in the context in which they are used. Some fields may also be hidden when the query is run, as they may be used for filtering only.

Thus, a "table" specification model could look roughly like the following in the database:

[image: diagram of the table specification model]
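
Since the diagram is not reproduced here, the following is only a rough guess at what such a specification could look like as Django models; all names and field choices are illustrative and mirror the example table below, not an actual implementation:

from django.db import models


class AnalyticsTable(models.Model):
    # One table definition per analysis input; the starting point determines
    # which fields are selectable.
    slug = models.SlugField(primary_key=True)
    starting_object = models.CharField(max_length=250)  # e.g. "case"


class AnalyticsField(models.Model):
    # One selected field per row, mirroring the example specification below.
    table = models.ForeignKey(
        AnalyticsTable, related_name="fields", on_delete=models.CASCADE
    )
    field_source = models.CharField(max_length=250)  # e.g. "case.created_at"
    alias = models.CharField(max_length=250)
    show = models.BooleanField(default=True)
    filters = models.JSONField(default=list)  # values to keep; empty = no filter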

An example table specification could look as follows. We assume some procurement workflow.

field_source                          alias        show   filters
case.document.form                    form         false  ["main-form"]
case.created_at                       created_at   true
case.document.answers.total_price     total_price  true
case.workitems[ordering].status                    false  ["completed"]
case.document.answers.supplier_name   supplier     true

This would generate a table with three columns (the two columns used for filtering are hidden via the show flag).

Important: For every given starting point, we only allow one output row per item. In the case of the "Case" starting point, we allow only one row per case in the output. The field sources must ensure that this is always the case. If an analysis needs to check multiple work items, we will need a corresponding "work item" starting point (which may then go on and join related cases for filtering, for example). This also implies that field sources (such as the workitems[ordering] above) will need to provide variants for the case where there are multiple work items. For example, they could provide a workitems[task=ordering,first-by-created] and workitems[task=ordering,last-by-created] so the user can select the relevant one.
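
For illustration, a first-by-created / last-by-created variant could be built the same way as the answer join sketched above, again keeping one row per case; the function name and signature are made up here, while WorkItem is the existing Caluma model:

from django.db.models import OuterRef, Subquery

from caluma.caluma_workflow.models import WorkItem


def annotate_workitem_status(case_queryset, task_slug, alias, last=False):
    # Pick exactly one work item per case for the given task - either the
    # first or the last one by creation date - so the one-row-per-case rule
    # holds even when a task ran multiple times.
    ordering = "-created_at" if last else "created_at"
    status = (
        WorkItem.objects.filter(case_id=OuterRef("pk"), task_id=task_slug)
        .order_by(ordering)
        .values("status")[:1]
    )
    return case_queryset.annotate(**{alias: Subquery(status)})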

A possible frontend mockup is shown below:
[image: frontend mockup "Fromarte Auswertung - Auswertung (neue Auswertung erstellen)" (create a new analysis)]

Alternative idea: We should check if we could provide a GraphQL query as input instead, but this would probably mean we need some massive optimisation in the query structure, as the returned data could become rather large. Graphene currently isn't doing a good enough job optimizing the queries if we want more than a handful of data (i.e. via pagination).

UNIONs

There are cases where multiple data sources need to be queried to build a single analytics view. For example, multiple workflows could contain the same approval form, but possibly in different locations in the work-item structure.

Thus, we should support UNIONs of tables. The condition for allowing tables to be combined in this way is that they have the same set of aliases.
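
In Django terms this corresponds to a queryset union over identical column sets. A minimal sketch (the helper name is made up; both inputs stand for querysets produced by two table specifications):

def union_tables(table_a, table_b, aliases):
    # Both inputs must expose exactly the same set of aliases, otherwise the
    # UNION's columns wouldn't line up.
    return table_a.values(*aliases).union(table_b.values(*aliases))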

Pivot tables

A pivot table specification consists of a few sets of fields:

  • row fields - each combination of row field values will be unique in the output table. For example, if you had "year" and "quarter" in the row fields, you would get four rows for each year. If you only had "quarter", there would be a total of four rows.
  • column fields - Each combination of values in this list will generate a custom column. (This may be empty, and probably won't be implemented at first)
  • data fields - Each data field will become a column in the pivot table. With each data field comes an aggregate function like sum, average, count, etc.

[image: pivot table illustration]
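
Without column fields, a pivot specification maps quite directly onto a Django aggregation. A minimal sketch, assuming the table step already produced a queryset exposing the aliased columns; run_pivot and its interface are invented for this illustration:

from django.db.models import Count, Sum


def run_pivot(table_queryset, row_fields, data_fields):
    # Row fields become the GROUP BY, data fields become aggregate columns.
    aggregates = {
        f"{field}_{fn.__name__.lower()}": fn(field) for field, fn in data_fields
    }
    return table_queryset.values(*row_fields).annotate(**aggregates)


# e.g. sum and count of total_price per supplier:
# run_pivot(table, ["supplier"], [("total_price", Sum), ("total_price", Count)])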

Faster SELECTs

The above structure is quite heavy on the database, because a Caluma case can be arbitrarily deeply nested using work items and child cases. Querying such a structure naively would imply an iterative/recursive approach just to get the possible fields, and the generated SQL would also be very inefficient.

We need an efficient way to access all relevant data for a case. For this, we need to de-normalize some aspects of the case structure. We already do this within documents, via the family attribute, which always points to the root document, for example within a table row.

The same concept now needs to be extended to case structures as well. We need to have not only the root object, but a full "path" on every case, work item, and document.

A new path attribute is an array field that denotes identifiers of the parent objects. For example, in a work item's document, the path should look like this: [case_id, workitem_id, document_id]. This should allow for faster querying of hierarchical data.

To keep the database consistent, the value shall be updated in a pre_save event handler, and a migration needs to be written that backfills the data of existing installations.
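
A hedged sketch of how this could look, assuming the existing PostgreSQL backend and UUID primary keys; PathMixin, get_parent and the signal wiring are placeholders, and the actual implementation is tracked in #1570:

from django.contrib.postgres.fields import ArrayField
from django.db import models
from django.db.models.signals import pre_save
from django.dispatch import receiver


class PathMixin(models.Model):
    # Identifiers of the object and all its ancestors, root first, e.g.
    # [case_id, workitem_id, document_id] on a work item's document.
    path = ArrayField(models.UUIDField(), default=list)

    class Meta:
        abstract = True

    def get_parent(self):
        # Placeholder: each concrete model knows its structural parent
        # (document -> work item, work item -> case, ...).
        raise NotImplementedError


@receiver(pre_save)
def update_path(sender, instance, **kwargs):
    # Keep the path consistent on every save; a data migration backfills
    # existing rows.
    if not isinstance(instance, PathMixin):
        return
    parent = instance.get_parent()
    instance.path = [*parent.path, instance.pk] if parent else [instance.pk]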

Notes on performance

When building analytics tools, or business intelligence (BI) in general, there is the choice of doing on-line or off-line analytics. Off-line in this context means periodically copying and de-normalizing data and storing it in a separate analytics database, while on-line implies querying the live production database.

We chose to build the analytics module as an on-line implementation. For one, we assume relatively low usage of the feature, and second, we want to apply the same visibilities as for regular data access.

There are a few low-hanging fruit regarding analytics performance that can be applied when needed:

  • Create indexes on data being queried (such as date answers), even functional indexes (e.g. extracting the quarter from a date)
  • Implement a secondary read-only database mirror for analytic processing
  • Cache results on the application level

Generally, we assume that, despite analytics queries being rather slow, the expected volume won't impact the transactional load too much.

Visibilities and permissions

All analytics should also be subject to the visibility/permission framework:

  • Analyses and their input tables shall be filtered via their own visibility rule. This allows developers to define who can see any analysis (a sketch follows after this list).

  • Access to analyses and tables should be governed by the permissions as well, so that creation and management of analysis views can be allowed and disallowed as needed.

  • All data that is run through the analysis must be subject to the visibility rules that are defined for the corresponding models. In other words: if a user cannot see a given set of cases, those cases won't be counted when the user looks at an analysis that queries cases. However, we may add an option to disable this at design time, such that certain analyses can be run across all data, even if the user viewing the analysis wouldn't have access to the underlying entities.
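
A hedged sketch of the first rule above, assuming an AnalyticsTable schema node gets added; BaseVisibility and filter_queryset_for are Caluma's existing extension points, while the node's import path and the created_by_group filter are illustrative only:

from caluma.analytics.schema import AnalyticsTable  # hypothetical node
from caluma.caluma_core.visibilities import BaseVisibility, filter_queryset_for


class AnalyticsVisibility(BaseVisibility):
    @filter_queryset_for(AnalyticsTable)
    def filter_queryset_for_analytics_table(self, node, queryset, info):
        # Example rule: users only see analyses created by one of their groups.
        return queryset.filter(created_by_group__in=info.context.user.groups)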

GraphQL interface

# Rough sketch only - the exact typing of cell values is an open question (see
# the discussion points below). Note that graphene has no generic "Number"
# scalar, and GraphQL unions cannot contain scalar types directly.
class TableCell(FormDjangoObjectType):
    value = graphene.types.Union(
        graphene.types.Number,
        graphene.types.String,
        graphene.types.DateTime,
    )
    label = graphene.types.String()

class Table(FormDjangoObjectType):
    rows = ConnectionField(TableCell ...)

Discussion points

  • We could probably use the same entity for "regular" tables and pivot tables. In theory, this could give us the ability to use the output of a pivot table as another input... However, I'm not sure this is a good idea performance-wise. It would reduce interface clutter, though.
  • We need to discuss the GraphQL interface of table output. They are not typed in the classical way, where the GraphQL client could query a selection of fields.

WBS / Tasks

  • Ground work
    • Store "path" information on every case, work item, document (feat(analysis): add path attribute / mixin to structural models #1570):
      The path attribute is an array field that denotes identifiers of the parent objects. For example, in a work item's document, the path should look like this: [case_id, workitem_id, document_id]
      This should allow for faster querying of hierarchical data
  • Minimum viable implementation
    • Model, GraphQL interface for analytics table (Including support for visibility / permissions) (feat(analytics): introduce analytics table #1572)
    • Simple "Case" base object for analytics
      • Generate a list of identifiers / values that are selectable for any given case
        • Basic field types (direct on primary object (case))
        • Support fields on case's document (Answers).
          At first, we probably won't support table questions, as they would cause cartesian products.
        • Support descending through work item structure
          Note: Same thing applies as for table questions, if we have multiple work items with the same task. We probably won't support that in the first version, without some explicit ruling as to how to deal with that
      • Build SQL query to output a table of the selected values
    • GraphQL view to query the tables
    • Integrate visibilities on generated Queryset/SQL: Users should not see things in analytics that they wouldn't see otherwise (Might be switchable on a per-table setting)
    • CHECKPOINT: Check performance of SQL generated and see if we need to optimize here, or abort and redirect to an offline solution
  • Implement Pivot Table
    • Model, GraphQL interface for pivot table table (Including support for visibility / permissions)
    • Implement SQL/Query generator to actually query the database
    • GraphQL view to query the pivot tables
  • Extensions
    • Support UNIONs as input of pivot tables
    • Support filters that are already in use in the rest of Caluma
    • Support for table questions: Potentially, we'd already aggregate the values there, but per-parent-document instead of globally. Would require an additional "aggregate function" field
    • Same for values from multiple work items
    • Support custom lookups for users from internal user ids/user names. This would potentially be interesting for other use cases as well. However this could just as well be done exclusively in the frontend, provided the frontend has a way of knowing which fields will be user IDs or group IDs etc.
  • Possible optimisations / Further investigations
    • check if we could provide a GraphQL query as input instead:
      This would probably mean we need some massive optimisation in the query structure, as the returned data could become rather large. Graphene currently isn't doing a good enough job optimizing the queries if we want more than a handful of data (i.e. via pagination).