
Analytics module #1569

Open · 12 of 25 tasks
winged opened this issue Sep 28, 2021 · 0 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed), needs discussion


winged commented Sep 28, 2021

Requirements

We need an analytics module that allows us to run queries, get summaries, and find insights into the data stored in Caluma.

Requirements are as follows:

  • Analytics over multiple entity types must be possible
  • Possibility to run aggregate functions over any part of a case - work item attributes, document answers, and so on
  • Respect permission and visibility rules

Implementation - Specification

Architecture overview

The analytics module's functionality will be guided by the "Pivot Table" feature found in Microsoft Excel and LibreOffice Calc. There are two steps for a full analysis view:

  1. Generate a table of data to be analyzed
  2. Apply the pivot table mechanism on selected fields

The queries shall be stored as objects in the database, and be queried via GraphQL. This way, the module splits into a design time part and a runtime part, the same as the rest of Caluma.

Table building

The first step in building an analysis is selecting the relevant data. A starting point is required to identify each row. From there, the data needed for the analysis can be selected.

For example, let's start with Cases. From there, all answers in the case's document are available for selection. Answers from work item (and sub-work-item!) documents should be available for selection as well.

from django.utils.translation import gettext as _

from caluma.caluma_workflow.models import Case


# BaseStartingPoint is part of the proposed analytics module (not yet implemented).
class CaseStartingPoint(BaseStartingPoint):
    def get_fields(self):
        """Return a list of (slug, label) tuples that can be selected as fields."""
        fields = set([
            ("document.form", _("Case doc: form slug")),
            ("workflow", _("Workflow slug")),
        ])
        for case in Case.objects.all():
            # naive impl!
            for metakey in case.meta:
                fields.add((f"meta.{metakey}", _("Case: Meta key {}").format(metakey)))
            for question in case.document.form.questions.all():
                fields.add(
                    (f"document.{question.slug}", _("Case Doc: Question {}").format(question.label))
                )
                # Descend into subforms etc as needed
            # descend into work items, repeat
        return sorted(fields)

    def join_field(self, field):
        """Generate SQL to JOIN the given field."""
        # ...

For every such field, there must be a method that tells the system how to JOIN the value, such that an efficient SQL query can be built.

The above example shows the basic structure of how this could look.
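
To make the join_field idea a bit more concrete, here is a minimal sketch of a field source for a document answer, assuming Django's Subquery/OuterRef and Caluma's Answer model; the AnswerFieldSource name and its annotate interface are invented for this illustration, not part of any existing API:

from django.db.models import OuterRef, Subquery

from caluma.caluma_form.models import Answer


class AnswerFieldSource:
    def __init__(self, question_slug):
        self.question_slug = question_slug

    def annotate(self, case_queryset, alias):
        # Correlated subquery: join the answer to the given question from the
        # case's document, keeping exactly one output row per case.
        answer_value = Answer.objects.filter(
            document_id=OuterRef("document_id"),
            question_id=self.question_slug,
        ).values("value")[:1]
        return case_queryset.annotate(**{alias: Subquery(answer_value)})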

To improve performance in the design phase, we must consider returning a partial list of available fields, filtered by prefix: at first, only the top-level fields are returned, and subsequent queries can then drill down to get more fields with a given prefix. The returned fields should be typed such that the frontend knows whether a field is only a path segment or an actually selectable field.

Users should then be able to filter the resulting table. As every field contains a manageable number of distinct values, a simple list of values to show for each column should work fine: for example, one could request that only cases with a given workflow slug are shown.

Users should also be able to rename the selected fields (define aliases), so that they make more sense in the context in which they are used. Some fields may also be hidden when the query is run, as they may be used for filtering only.

Thus, a "table" specification model could look roughly like the following in the database:

[image: diagram of the table specification model]
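
Since the diagram is not reproduced here, the following is only a rough guess at what such a specification could look like as Django models; all names and field choices are illustrative and mirror the example table below, not an actual implementation:

from django.db import models


class AnalyticsTable(models.Model):
    # One table definition per analysis input; the starting point determines
    # which fields are selectable.
    slug = models.SlugField(primary_key=True)
    starting_object = models.CharField(max_length=250)  # e.g. "case"


class AnalyticsField(models.Model):
    # One selected field per row, mirroring the example specification below.
    table = models.ForeignKey(
        AnalyticsTable, related_name="fields", on_delete=models.CASCADE
    )
    field_source = models.CharField(max_length=250)  # e.g. "case.created_at"
    alias = models.CharField(max_length=250)
    show = models.BooleanField(default=True)
    filters = models.JSONField(default=list)  # values to keep; empty = no filter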

An example table specification could look as follows. We assume some procurement workflow.

field_source                          alias        show   filters
case.document.form                    form         false  ["main-form"]
case.created_at                       created_at   true
case.document.answers.total_price     total_price  true
case.workitems[ordering].status                    false  ["completed"]
case.document.answers.supplier_name   supplier     true

This would generate a table with three columns (the two columns used for filtering are hidden via the show flag).

Important: For every given starting point, we only allow one output row per item. In the case of the "Case" starting point, we allow only one row per case in the output. The field sources must ensure that this is always the case. If an analysis needs to check multiple work items, we will need a corresponding "work item" starting point (which may then go on and join related cases for filtering, for example). This also implies that field sources (such as the workitems[ordering] above) will need to provide variants for the case where there are multiple work items. For example, they could provide a workitems[task=ordering,first-by-created] and workitems[task=ordering,last-by-created] so the user can select the relevant one.
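
For illustration, a first-by-created / last-by-created variant could be built the same way as the answer join sketched above, again keeping one row per case; the function name and signature are made up here, while WorkItem is the existing Caluma model:

from django.db.models import OuterRef, Subquery

from caluma.caluma_workflow.models import WorkItem


def annotate_workitem_status(case_queryset, task_slug, alias, last=False):
    # Pick exactly one work item per case for the given task - either the
    # first or the last one by creation date - so the one-row-per-case rule
    # holds even when a task ran multiple times.
    ordering = "-created_at" if last else "created_at"
    status = (
        WorkItem.objects.filter(case_id=OuterRef("pk"), task_id=task_slug)
        .order_by(ordering)
        .values("status")[:1]
    )
    return case_queryset.annotate(**{alias: Subquery(status)})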

A possible frontend mockup is shown below:
[image: frontend mockup "Fromarte Auswertung - Auswertung (neue Auswertung erstellen)" (create a new analysis)]

Alternative idea: We should check if we could provide a GraphQL query as input instead, but this would probably mean we need some massive optimisation in the query structure, as the returned data could become rather large. Graphene currently isn't doing a good enough job optimizing the queries if we want more than a handful of data (i.e. via pagination).

UNIONs

There are cases where multiple data sources need to be queried to build a single analytics view. For example, multiple workflows could contain the same approval form, but possibly in different locations in the work-item structure.

Thus, we should support UNIONs of tables. The condition for allowing tables to be combined in this way is that they have the same set of aliases.
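
In Django terms this corresponds to a queryset union over identical column sets. A minimal sketch (the helper name is made up; both inputs stand for querysets produced by two table specifications):

def union_tables(table_a, table_b, aliases):
    # Both inputs must expose exactly the same set of aliases, otherwise the
    # UNION's columns wouldn't line up.
    return table_a.values(*aliases).union(table_b.values(*aliases))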

Pivot tables

A pivot table specification consists of a few sets of fields:

  • row fields - each combination of row field values will be unique in the output table. For example, if you had "year" and "quarter" in the row fields, you would get four rows for each year. If you only had "quarter", there would be a total of four rows.
  • column fields - Each combination of values in this list will generate a custom column. (This may be empty, and probably won't be implemented at first)
  • data fields - Each data field will become a column in the pivot table. With each data field comes an aggregate function like sum, average, count, etc.

[image: pivot table illustration]
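
Without column fields, a pivot specification maps quite directly onto a Django aggregation. A minimal sketch, assuming the table step already produced a queryset exposing the aliased columns; run_pivot and its interface are invented for this illustration:

from django.db.models import Count, Sum


def run_pivot(table_queryset, row_fields, data_fields):
    # Row fields become the GROUP BY, data fields become aggregate columns.
    aggregates = {
        f"{field}_{fn.__name__.lower()}": fn(field) for field, fn in data_fields
    }
    return table_queryset.values(*row_fields).annotate(**aggregates)


# e.g. sum and count of total_price per supplier:
# run_pivot(table, ["supplier"], [("total_price", Sum), ("total_price", Count)])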

Faster SELECTs

The above structure is quite heavy on the database, because a Caluma case can be arbitrarily deeply nested using work items and child cases. Querying such a structure naively would imply an iterative/recursive approach just to get the possible fields, and the generated SQL would also be very inefficient.

We need an efficient way to access all relevant data for a case. For this, we need to de-normalize some aspects of the case structure. We already do this within documents, via the family attribute, which always points to the root document, for example within a table row.

The same concept now needs to be extended to case structures as well. We need to have not only the root object, but a full "path" on every case, work item, and document.

A new path attribute is an array field that denotes identifiers of the parent objects. For example, in a work item's document, the path should look like this: [case_id, workitem_id, document_id]. This should allow for faster querying of hierarchical data.

To keep the database consistent, the value shall be updated in a pre_save event handler, and a migration needs to be written that backfills the data of existing installations.
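
A hedged sketch of how this could look, assuming the existing PostgreSQL backend and UUID primary keys; PathMixin, get_parent and the signal wiring are placeholders, and the actual implementation is tracked in #1570:

from django.contrib.postgres.fields import ArrayField
from django.db import models
from django.db.models.signals import pre_save
from django.dispatch import receiver


class PathMixin(models.Model):
    # Identifiers of the object and all its ancestors, root first, e.g.
    # [case_id, workitem_id, document_id] on a work item's document.
    path = ArrayField(models.UUIDField(), default=list)

    class Meta:
        abstract = True

    def get_parent(self):
        # Placeholder: each concrete model knows its structural parent
        # (document -> work item, work item -> case, ...).
        raise NotImplementedError


@receiver(pre_save)
def update_path(sender, instance, **kwargs):
    # Keep the path consistent on every save; a data migration backfills
    # existing rows.
    if not isinstance(instance, PathMixin):
        return
    parent = instance.get_parent()
    instance.path = [*parent.path, instance.pk] if parent else [instance.pk]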

Notes on performance

When building analytics tools, or business intelligence (BI) in general, there is the choice of doing on-line or off-line analytics. Off-line in this context means periodically copying and de-normalizing data and storing it in a separate analytics database, while on-line implies querying the live production database.

We chose to build the analytics module as an on-line implementation. For one, we assume relatively low usage of the feature, and second, we want to apply the same visibilities as for regular data access.

There are a few low-hanging fruit regarding analytics performance that can be applied when needed:

  • Create indexes on data being queried (such as date answers), even functional indexes (e.g. extracting the quarter from a date)
  • Implement a secondary read-only database mirror for analytic processing
  • Cache results on the application level

Generally, we assume that, despite analytics queries being rather slow, the expected volume won't impact the transactional load too much.

Visibilities and permissions

All analytics should also be subject to the visibility/permission framework:

  • Analyses and their input tables shall be filtered via their own visibility rule. This allows developers to define who can see any analysis (a sketch follows after this list).

  • Access to analyses and tables should be governed by the permissions as well, so that creation and management of analysis views can be allowed and disallowed as needed.

  • All data that is run through the analysis must be subject to the visibility rules that are defined for the corresponding models. In other words: if a user cannot see a given set of cases, those cases won't be counted when the user looks at an analysis that queries cases. However, we may add an option to disable this at design time, such that certain analyses can be run across all data, even if the user viewing the analysis wouldn't have access to the underlying entities.
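
A hedged sketch of the first rule above, assuming an AnalyticsTable schema node gets added; BaseVisibility and filter_queryset_for are Caluma's existing extension points, while the node's import path and the created_by_group filter are illustrative only:

from caluma.analytics.schema import AnalyticsTable  # hypothetical node
from caluma.caluma_core.visibilities import BaseVisibility, filter_queryset_for


class AnalyticsVisibility(BaseVisibility):
    @filter_queryset_for(AnalyticsTable)
    def filter_queryset_for_analytics_table(self, node, queryset, info):
        # Example rule: users only see analyses created by one of their groups.
        return queryset.filter(created_by_group__in=info.context.user.groups)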

GraphQL interface

# Rough sketch only - the exact typing of cell values is an open question (see
# the discussion points below). Note that graphene has no generic "Number"
# scalar, and GraphQL unions cannot contain scalar types directly.
class TableCell(FormDjangoObjectType):
    value = graphene.types.Union(
        graphene.types.Number,
        graphene.types.String,
        graphene.types.DateTime,
    )
    label = graphene.types.String()

class Table(FormDjangoObjectType):
    rows = ConnectionField(TableCell ...)

Discussion points

  • We could probably use the same entity for "regular" tables and pivot tables. In theory, this could give us the ability to use the output of a pivot table as another input... However, I'm not sure this is a good idea performance-wise. It would reduce interface clutter, though.
  • We need to discuss the GraphQL interface of table output. They are not typed in the classical way, where the GraphQL client could query a selection of fields.

WBS / Tasks

  • Ground work
    • Store "path" information on every case, work item, document (feat(analysis): add path attribute / mixin to structural models #1570):
      The path attribute is an array field that denotes identifiers of the parent objects. For example, in a work item's document, the path should look like this: [case_id, workitem_id, document_id]
      This should allow for faster querying of hierarchical data
  • Minimum viable implementation
    • Model, GraphQL interface for analytics table (Including support for visibility / permissions) (feat(analytics): introduce analytics table #1572)
    • Simple "Case" base object for analytics
      • Generate a list of identifiers / values that are selectable for any given case
        • Basic field types (direct on primary object (case))
        • Support fields on case's document (Answers).
          At first, we probably won't support table questions, as they would cause cartesian products.
        • Support descending through work item structure
          Note: Same thing applies as for table questions, if we have multiple work items with the same task. We probably won't support that in the first version, without some explicit ruling as to how to deal with that
      • Build SQL query to output a table of the selected values
    • GraphQL view to query the tables
    • Integrate visibilities on generated Queryset/SQL: Users should not see things in analytics that they wouldn't see otherwise (Might be switchable on a per-table setting)
    • CHECKPOINT: Check performance of SQL generated and see if we need to optimize here, or abort and redirect to an offline solution
  • Implement Pivot Table
    • Model, GraphQL interface for pivot table table (Including support for visibility / permissions)
    • Implement SQL/Query generator to actually query the database
    • GraphQL view to query the pivot tables
  • Extensions
    • Support UNIONs as input of pivot tables
    • Support filters that are already in use in the rest of Caluma
    • Support for table questions: Potentially, we'd already aggregate the values there, but per-parent-document instead of globally. Would require an additional "aggregate function" field
    • Same for values from multiple work items
    • Support custom lookups for users from internal user ids/user names. This would potentially be interesting for other use cases as well. However this could just as well be done exclusively in the frontend, provided the frontend has a way of knowing which fields will be user IDs or group IDs etc.
  • Possible optimisations / Further investigations
    • check if we could provide a GraphQL query as input instead:
      This would probably mean we need some massive optimisation in the query structure, as the returned data could become rather large. Graphene currently isn't doing a good enough job optimizing the queries if we want more than a handful of data (i.e. via pagination).