Relax the concepts of records and data transformations #1154

dmos62 · 2022-03-10T19:32:03Z

dmos62
Mar 10, 2022
Collaborator

Prelude

This is a discussion that started in this PR thread with this comment by @kgodey (slightly redacted):

@dmos62 As-built, this feature breaks our API standards, specifically the Responses section.

We have a REST API, which takes an object-oriented approach. That means that each endpoint represents a resource (which is a single type of object). The records list endpoint should always return a list of records, regardless of what query parameters have been applied to it. In this PR, applying certain query parameters completely transforms the response.

Please find an alternate way to solve this that either is RESTful or introduces an explicit RPC-based API (functions/run/), as we discussed on Matrix.

Initially, before being reposted here, the text below was a reply to the above comment.

Proposal

The records list endpoint should always return a list of records, regardless of what query parameters have been applied to it.

@kgodey This has been a pain point that I wrestled with. Thing is that everything on the backend is set up to think of this endpoint (and its foundational infrastructure) not in terms of records (I'd like to not rehash the discussion on what that is precisely), but in terms of persistence-agnostic rows. On the backend we're essentially building SQL pipelines and SQL is not aware of the difference between rows and records.

I advocate, with determination, that we embrace that and rename the endpoint from /db/tables/1/records to the more general /db/tables/1/rows. Its definition is then broadened to any tabular output that can be arrived at by any sequence of transformations made available by this endpoint, the starting point being that table, and the ending point being anything tabular. Or, in other terms, the rows endpoint response would be the result of a pipeline whose starting point is the table, whose intermediate steps are transformations like order_by or filter or db_function, and whose output is tabular.

For illustration of what I mean when I say that the backend is building pipelines, this is the workhorse behind the records endpoint:

def get_query(
    table,
    limit=None,
    offset=None,
    order_by=None,
    filter=None,
    group_by=None,
    duplicate_only=None,
    db_function=None,
    deduplicate=False,
):
    if duplicate_only:
        select_target = _get_duplicate_only_cte(table, duplicate_only)
    else:
        select_target = table

    if isinstance(group_by, group.GroupBy):
        selectable = group.get_group_augmented_records_query(select_target, group_by)
    else:
        selectable = select(select_target)

    if order_by:
        selectable = apply_sort(selectable, order_by)

    if filter:
        db_function_instance = get_db_function_from_ma_function_spec(filter)
        selectable = apply_db_function_as_filter(selectable, db_function_instance)

    if db_function:
        db_function_instance = get_db_function_from_ma_function_spec(db_function)
        selectable = apply_db_function_as_function(selectable, db_function_instance)

    if deduplicate:
        selectable = selectable.distinct()

    if limit:
        selectable = selectable.limit(limit)

    if offset:
        selectable = selectable.offset(offset)

    return selectable

Notice its pipish nature. Here, db_function is just another step, same as offset, deduplicate, group_by, etc.
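To make the pipeline framing concrete, here is a minimal sketch that is independent of SQLAlchemy and of the Mathesar codebase. It treats every transformation as a function from rows to rows and folds a list of them over a starting table. The names run_pipeline, order_by, keep_if and limit are hypothetical, and in-memory lists of dicts stand in for SQL selectables.

```python
from functools import reduce
from typing import Callable, Iterable

Row = dict
Rows = list[Row]
Step = Callable[[Rows], Rows]


def order_by(column: str) -> Step:
    """Step that sorts rows by one column."""
    return lambda rows: sorted(rows, key=lambda r: r[column])


def keep_if(predicate: Callable[[Row], bool]) -> Step:
    """Step that keeps only the rows matching the predicate."""
    return lambda rows: [r for r in rows if predicate(r)]


def limit(n: int) -> Step:
    """Step that keeps the first n rows."""
    return lambda rows: rows[:n]


def run_pipeline(table: Rows, steps: Iterable[Step]) -> Rows:
    """Apply each step to the output of the previous one, in order."""
    return reduce(lambda rows, step: step(rows), steps, table)


table = [
    {"id": 3, "name": "c"},
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
]
result = run_pipeline(
    table,
    [keep_if(lambda r: r["id"] > 1), order_by("id"), limit(1)],
)
```

Unlike get_query, the order of steps here is chosen by the caller rather than fixed by the function body, which is the kind of flexibility the proposal is after.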

You mentioned that our API is based on objects: I don't think that my suggestion is in conflict with that. I'm saying let's take another step in that direction and say that our API is based on SQL objects: schemas, tables, rows and columns. But, let's not constrain the definition of those objects to only things that are currently persisted on the disk. I mentioned this in another thread today: I don't think that data transience should play a role in the definitions of these basic objects.

I also don't think that there's a resource/RPC conflict here. Let's consider a resource to be whatever a given pipeline outputs when applied to a given resource's path. Path, as in an address like database -> schema -> table. Let's consider all transformations as fundamentally equivalent, whatever concrete transformations they apply, whether it's the redefinition of the order of rows (order_by) or the passing of the entire table through a function (e.g. db_function).

I think that this approach would scale great. For example, summarizing transformations, just like db_function, would not require a separate endpoint, or deviate into RPC territory.

Also, it's very flexible, since it's basically using a single pipeline builder, the get_query function above. Currently it's pretty primitive (e.g. fixed order of transformations), but who knows what it will grow into. This also mirrors the fact that any SQL query/view is a single pipeline.

Advantages briefly

To address the probable future comment about me prioritizing architecture over product, this is not it! My intent here is to manipulate architecture to reduce the workload on developers, make hard problems easier, make annoying problems less common, and to get the product out better and faster. I'll try to summarize the benefits briefly, since this has been a long post:

in this case a single endpoint for all table transformations is easier to maintain and implement than multiple endpoints: less duplicated logic and bloat on the service-layer;
the semantics would closer mirror SQL and thus reduce cognitive load, and make leveraging SQL generally easier;
- I think this would have noteworthy, positive knock-on effects, especially when it comes to implementing the data explorer.

Next steps

To be clear, implied next steps are:

rename /api/db/v0/tables/0/records to [...]/rows, to underline that we're dropping any transience/persistence assertions about the data returned;
allow the rows (previously records) endpoint to accept and apply any transformation, even if it "changes" the data;
stop categorising data transformations into those that "change" and "don't change" data: consider that all transformations are fundamentally equivalent, whether it's sorting or applying a function.

kgodey · 2022-03-10T19:57:40Z

kgodey
Mar 10, 2022
Maintainer

@dmos62 I'd like to get on the same page about the goals of the API first before discussing any changes to the API or rows vs. records.

Current API goals

As I see it, the main goal of our API is to transform SQL operations into the REST paradigm. The API provides an interface of objects, clients can rely on the structure of those objects and use the usual HTTP verbs with those objects to update them. The backend does the (difficult and complicated) work of translating those HTTP verbs into the appropriate SQL. This does involve significant cognitive load for the backend engineers, but it makes the API idempotent and very easy to work with without any SQL knowledge. It also transfers the burden of state management to the backend and works well with HTTP request/response cycles, since those are stateless.

This is the philosophy behind our API standards. I think if we follow these, we'll also offer an API experience that no other product does (because it's hard to do).

Thoughts on proposed API goals

What you're suggesting is that the goal of the API be to mirror the structure of SQL queries. This transfers the burden of state management and cognitive load from the backend to the API consumer. The API consumer will need to know a lot about how our API works and what parameters lead to response structure changes.

Also, the work of frontend clients is made much easier when they can associate an endpoint with a standard JSON structure – MVC frontend frameworks build this assumption into their models. Frontend code will be much more complicated if we vary the JSON structure.

If we do want a "query-like" API, the best course of action would be using something like GraphQL (which is designed for query-like APIs) instead of trying to adapt our current structure so that it's not REST and not GraphQL either. I'm open to having an additional GraphQL API, but not right now.

9 replies

mathemancer Mar 14, 2022
Maintainer

I'm trying to resist jumping too deeply into this extremely interesting conversation. That said, I have a couple thoughts about what y'all have said so far.

The backend does the (difficult and complicated) work of translating those HTTP verbs into the appropriate SQL. This does involve significant cognitive load for the backend engineers, but it makes the API idempotent and very easy to work with without any SQL knowledge.

One pragmatic problem with this (that we're already experiencing) is that more and more of the actual API gets pushed off into query string parameters. This is less structured than (for example) a well setup RPC API. As @kgodey may recall, I strongly favor a good RPC API setup at least for DML operations. It might eventually supplant the RESTful API for many clients (since retrieving records is just a very simple transformation after all).

Given that @dmos62 has already gone to the trouble of creating an endpoint that lets us describe (essentially) possible function calls, I think we should use that to let the client make function calls, and not through query string parameters but through a request body. Some options:

GraphQL: mentioned already by @kgodey; I think this will be an improvement, but the query language is more focused on joins than on other transformations (e.g., aggregations).
gRPC: This is easy to develop and performant. However, it couples whatever part of the server and client are using the protocol pretty closely. It also doesn't use JSON for transport, so it feels somehow less "explorable". All that said, this is my favorite option. They have a python server side library and apparently a svelte hookup for the frontend. The python side is officially maintained, but not sure how great the svelte hookup is.
JSON-RPC: This would be pretty much just as ad-hoc as the query string parameters, but at least it would be easier to return sensible and useful error messages. On the plus side, it seems more explorable than gRPC since it uses JSON.

To re-emphasize: I think the biggest problem with our API is already how much logic is pushed off into query string parameters, and I think most of the logic we develop from here on will involve more and more data transformation (or at least retrieval queries that are cumbersome to describe ad-hoc). I would be really happy if whatever we settle on for the problem at hand starts moving in the direction of structuring more of the logic of requests in the request bodies.
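As a rough illustration of the JSON-RPC option, the sketch below dispatches a function call described in a request body and returns structured errors. The envelope fields and the error codes -32601 (method not found) and -32602 (invalid params) come from the JSON-RPC 2.0 specification. The METHODS registry, the upper function and the handle helper are hypothetical stand-ins, not part of Mathesar, and the sketch assumes params are passed by name.

```python
import json

# Hypothetical registry of callable functions; a stand-in for whatever
# the function-describing endpoint would expose.
METHODS = {
    "upper": lambda value: value.upper(),
}


def error(code: int, message: str, request_id, data=None) -> dict:
    """Build a JSON-RPC 2.0 error response."""
    body = {"code": code, "message": message}
    if data is not None:
        body["data"] = data
    return {"jsonrpc": "2.0", "error": body, "id": request_id}


def handle(raw_request: str) -> dict:
    """Dispatch one JSON-RPC 2.0 request whose params are a JSON object."""
    request = json.loads(raw_request)
    request_id = request.get("id")
    method = METHODS.get(request.get("method"))
    if method is None:
        return error(-32601, "Method not found", request_id)
    try:
        result = method(**request.get("params", {}))
    except TypeError as exc:
        # Wrong or missing argument names surface as a structured error.
        return error(-32602, "Invalid params", request_id, data=str(exc))
    return {"jsonrpc": "2.0", "result": result, "id": request_id}


ok = handle(json.dumps(
    {"jsonrpc": "2.0", "method": "upper", "params": {"value": "ab"}, "id": 1}
))
```

The point of the sketch is only that the call and its arguments live in the request body, and that a malformed call yields a machine-readable error instead of a silently ignored query parameter.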

kgodey Mar 14, 2022
Maintainer

One pragmatic problem with this (that we're already experiencing) is that more and more of the actual API gets pushed off into query string parameters.

Agreed. I was originally envisioning that query parameters passed in for filters would be the same parameters as the resource (e.g. in my above examples, you'd use id, name, and recipes as filter parameters to the ingredients columns to filter those columns). This is what we would've gotten if we were able to use DRF's native features (as we do for non-records APIs like schemas, etc.)

Instead, our filtering API is currently fairly complex (I don't think there's any way around that). I do think we should have a more general "query" API and perhaps move filtering there instead of being on the existing records API. I can see both REST and RPC approaches to that but a REST approach would depend on modeling a Query resource well.

mathemancer Mar 14, 2022
Maintainer

a REST approach would depend on modeling a Query resource well.

That would be a really cool approach, and would also make it possible to keep queries around for later re-use. Of course, we'd need to think hard about whether there's a value add given that we'll also have views. I can think of ways to use them differently, but I'm not sure they're that useful. For example, maybe query resources could represent something more general, with placeholders for table and column names.

kgodey Mar 14, 2022
Maintainer

Of course, we'd need to think hard about whether there's a value add given that we'll also have views.

It will definitely be useful for our query builder, we need to be able to allow the user to build up a query incrementally and show previews as they go. This implies some backend persistence and an API, even if they're not shown in the UI.

dmos62 Mar 14, 2022
Collaborator Author

Messaging protocol is a secondary priority, I think. That said,

I like the idea of using JSON-RPC;
gRPC is optimized for safety and messaging throughput,
- I don't think that's useful in our case;
I think that Brent is right about GraphQL,
- like REST, it's too opinionated about what kinds of queries we'll want to make.

dmos62 · 2022-03-14T19:06:45Z

dmos62
Mar 14, 2022
Collaborator Author

Records abstraction

@kgodey The pepperoni/ingredients use case you outlined is a common task and it's nice to have a simple interface for that. The records endpoint and its RESTful properties that I'm attacking suit this use case well.

It's important, however, that records not become our fundamental abstraction for making queries. More interesting queries should be based on the concept of rows and columns, not records and fields. Otherwise you can only use transformations that output records and fields, which is not enough for us. db_function in my latest PR is an example of that, but it's just the tip of the iceberg.

REST vs RPC

Our records endpoint query parameters are all verbs describing transformations: order by, filter, group by, limit, offset, [apply] db function, deduplicate, show duplicates only. That's a deviation from (pure) RESTfulness, but I think that's ok.

The relevant aspect of REST is largely based around the file metaphor, and it is only applicable to the most basic of the problems we're trying to solve. For example, as soon as you want to filter some records, you must leave pure REST behind, because FILTER isn't an HTTP verb. Further, there's no way to compose verbs in REST, so you can't do FILTER | GROUP |..., and we need even more than that. We need something closer to FILTER(x,y,z) | GROUP(x,y,z) | ....
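The FILTER(x,y,z) | GROUP(x,y,z) notation can be mimicked directly in Python by overloading the | operator. This is only a sketch of the composition idea: the Transform class and the FILTER and GROUP helpers are hypothetical, and they operate on in-memory rows rather than on SQL.

```python
from dataclasses import dataclass
from typing import Callable

Rows = list[dict]


@dataclass(frozen=True)
class Transform:
    """A rows-to-rows step; `a | b` means 'apply a, then b'."""

    apply: Callable[[Rows], Rows]

    def __or__(self, other: "Transform") -> "Transform":
        return Transform(lambda rows: other.apply(self.apply(rows)))


def FILTER(column: str, value) -> Transform:
    """Keep rows whose column equals value."""
    return Transform(lambda rows: [r for r in rows if r[column] == value])


def GROUP(column: str) -> Transform:
    """Collapse rows into one row per distinct value, with a count."""

    def group(rows: Rows) -> Rows:
        counts: dict = {}
        for r in rows:
            counts[r[column]] = counts.get(r[column], 0) + 1
        return [{column: key, "count": n} for key, n in counts.items()]

    return Transform(group)


pipeline = FILTER("kind", "pizza") | GROUP("topping")
rows = [
    {"kind": "pizza", "topping": "pepperoni"},
    {"kind": "pizza", "topping": "pepperoni"},
    {"kind": "pasta", "topping": "basil"},
]
grouped = pipeline.apply(rows)
```

Note that the output of GROUP no longer has the columns of the input table, which is exactly the kind of structure-changing transformation the thread is debating.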

Changes to queries

I agree with @mathemancer that we'll need to support increasingly more "transformations" and it's becoming apparent that we don't have the structure to manipulate them well.

For example, how would we specify the order of transformations when they are declared via a query parameter? All query parameter based solutions are awkward in this case, since a request's query parameters are an unordered dict (unordered by strong convention).
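The ordering problem can be seen with the standard library alone. In the sketch below, the query string and its parameter values are made up for illustration. Parsing into a list of pairs keeps the order in which parameters appeared, but the dict-style view groups values by key, so the interleaving between different parameters is lost.

```python
from urllib.parse import parse_qs, parse_qsl

# Hypothetical request: filter, then sort, then filter again.
query = "filter=a&order_by=b&filter=c"

# List of (key, value) pairs: the original order is still visible.
as_pairs = parse_qsl(query)

# Dict keyed by parameter name: both filters are grouped together,
# so "filter c came after order_by b" can no longer be recovered.
as_dict = parse_qs(query)
```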

We essentially need a graph-based approach, like {{inputs: [table1], id: stage1, ...}, {inputs: [stage1], id: stage2, ...}, ...}. I'll be referring to this "approach" below in a few places.
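A minimal sketch of how such a specification could be ordered for execution, assuming Python 3.9+ for graphlib. The inputs and id keys follow the notation above; the op key, the stage contents and the execution_order helper are hypothetical. Inputs that are not stage ids (such as table1) are treated as external sources.

```python
from graphlib import TopologicalSorter

# Stages may arrive in any order; dependencies are explicit in "inputs".
stages = [
    {"id": "stage2", "inputs": ["stage1"], "op": "group_by"},
    {"id": "stage1", "inputs": ["table1"], "op": "filter"},
]


def execution_order(stages: list) -> list:
    """Return stage ids so that every stage comes after its inputs."""
    stage_ids = {s["id"] for s in stages}
    # Map each stage to the stages it depends on; external sources such
    # as tables are not nodes in the graph, so they are dropped here.
    graph = {
        s["id"]: [i for i in s["inputs"] if i in stage_ids]
        for s in stages
    }
    # static_order() raises graphlib.CycleError if the spec is cyclic.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
```

Because the order is derived from the inputs fields rather than from the position of parameters in a URL, the specification stays unambiguous however it is serialized.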

Suggested refactors

Since the records endpoint is based on the records/fields abstraction, which is an abstraction around a subset of manipulations on rows/columns:

Maybe move records endpoint to UI namespace,
- so as to explicitly treat it as a secondary abstraction for DML;
Create a DB namespace endpoint for making queries by specifying a directed graph (pipeline),
- this would be the primary abstraction for DML;
- would use the graph-based approach:
  - {{inputs: [table1], id: stage1, ...}, {inputs: [stage1], id: stage2, ...}, ...};
- it would be used to implement the records endpoint;
- it would be part of functionality necessary for query builder.

That way, API clients get the comfort of the records abstraction, while we can offer other features without being hindered by it.

Connection to query builder

I think that refactor 2 above, and its structure, is part of how query builder will be implemented on the service and data layers.

For query builder, we need to be able to go from an SQL query to an internal representation, and from an internal representation to an SQL query. The suggested refactor would be the "internal representation to SQL query" part of the query builder.

1 reply

kgodey Mar 14, 2022
Maintainer

It's important, however, that records not become our fundamental abstraction for making queries.

I think we're generally all in agreement on this point. We need a separate API that's focused on making queries.

Our records endpoint query parameters are all verbs describing transformations: order by, filter, group by, limit, offset, [apply] db function, deduplicate, show duplicates only. That's a deviation from (pure) RESTfulness, but I think that's ok.

I would argue that anything that only transforms the content of the response, i.e. not the structure of the response, is RESTful. Also, "order", "filter", "group", "limit", "offset" are all also nouns that can be used to describe the property of the output (the output can have an order, but it can't have a deduplicate and it doesn't have a db function).

Anyway, that's mainly semantics and doesn't affect the general point of "we need a new API for queries".

Next steps

These are next steps as I see them, they also serve as a response to @dmos62's suggested refactors.

New Query API

I think we should have a new read-only API for making complex SELECT queries. This will be the basis for the query builder. This will also be the basis for deduplicating, showing duplicates only, and so on.

I don't think this API should be "the primary abstraction for DML". We do not need to support complex INSERT, UPDATE, or DELETE queries for any functionality in our alpha release and we can extend this API to support those queries when we actually need them. It can be the primary abstraction for SELECT queries, however.

I'd like to wait to design the structure for this until we have UX for the query builder.

Next steps here would be:

Close API for getting all valid parameters for an equality filter #1148
Move Suggest popular/standard URI schemes as values for appropriate filter parameters #1097 to the Views milestone and mark it as blocked until the query builder APIs are done.
Wait for the query builder design spec to be finalized
Come up with an API spec that works for the query builder, running functions, deduplicating, showing duplicates only, etc. It can use a directed graph or whatever else is needed.
Implement the API spec.
Complete work on Suggest popular/standard URI schemes as values for appropriate filter parameters #1097
Remove showing duplicates only from the records API and move frontend to using the query API

Current Records API

I think we should leave the records endpoint in the DB namespace. The UI namespace is for abstractions that the Mathesar team has come up with, we did not come up with the concept of tables having records.

The records API will be the only API used to insert, update, and delete records (also part of DML). It will be the primary API used to read records from a single table when there's no complex queries involved.

For now, I think we should leave filtering, sorting, and grouping as-is (in the records API). In the long run, we will probably want to move these to the new query API as well, but since they are already implemented, I don't think it's a good use of our time right now to prioritize the refactor over building features necessary for our alpha.

dmos62 · 2022-03-15T01:13:14Z

dmos62
Mar 15, 2022
Collaborator Author

@kgodey

I think we should leave the records endpoint in the DB namespace. The UI namespace is for abstractions that the Mathesar team has come up with, we did not come up with the concept of tables having records.

I'd like to see the records endpoint in the UI namespace, because I consider it an abstraction on top of a more fundamental abstraction.

"Who came up with this" is not a useful criterion for choosing a namespace for an abstraction, I'd say.

There's a void in the DB namespace for where the new, more fundamental query abstraction will be. We can put the records endpoint there, or we can move it to the UI namespace, thus making the mentioned "void" more explicit. That was my thinking. I think the latter is the better API structure, but the difference is of negligible pragmatic value.

Discussions about what is REST don't seem to be leading anywhere, but I can't resist.

I would argue that anything that only transforms the content of the response, i.e. not the structure of the response, is RESTful.

The distinction between content and structure is arbitrary. We can make it whatever fits the resource we've devised, resource itself being an arbitrary, noun-ish abstraction on top of "real" entities.

But, that's only if we disregard the REST architectural requirement that having a resource's representation should mean that you know how to update or delete it. In that sense, you're right that some representations (like applying an arbitrary function) are not RESTful. Deduplicate is though.

Also, "order", "filter", "group", "limit", "offset" are all also nouns that can be used to describe the property of the output (the output can have an order, but it can't have a deduplicate and it doesn't have a db function).

Any transformation can be seen as a property of the output. It doesn't matter how the transformation is phrased in English.

I'd like to wait to design the structure for [new query API] until we have UX for the query builder.

Sure. Though, I don't feel that it would be cheaper (even in the short term) to start off with a limited query API and add significant features as we go along, if that's what you have in mind.

I'm ok with closing #1148 and delaying #1097.

2 replies

kgodey Mar 15, 2022
Maintainer

It looks like we're agreed on next steps. Some responses to points above:

I'd like to see the records endpoint in the UI namespace, because I consider it an abstraction on top of a more fundamental abstraction.

It is an "abstraction on top of a more fundamental abstraction" if you consider the fundamental abstraction of a database to be rows and columns. I don't think that's the only way to look at it, though - databases fundamentally store data, and I think there's a strong case to be made that where the data is stored is a fundamental abstraction.

"Who came up with this" is not a useful criterion for choosing a namespace for an abstraction, I'd say.

As I understand it, the definition of what goes in the UI namespace is "things that aren't actual DB concepts that we came up with for the UI". If you disagree with this, we need to figure that out (and regardless, we should document our definition on the API standards page). I think we all need to be on the same page about how we're structuring our API so that future work is consistent.

The distinction between content and structure is arbitrary.

I strongly disagree with this. When you're rendering a page with the same kind of content, you rely on the structure of the API to be consistent so that you know what variables are available for you to use. The content of those variables can be different.

I don't feel that it would be cheaper (even in the short term) to start off with a limited query API and add significant features as we go along, if that's what you have in mind.

I'm not sure what you mean in this sentence.

The current milestone has been unblocked and we're currently just waiting for the query builder UX to be completed, so I think we can close this discussion.

We do have some outstanding items to discuss, specifically:

UI vs. DB namespace definition
General REST vs. RPC definitions (maybe?)

Since neither of these is blocking current work, I think we should defer the discussions to when we're working on related issues. @dmos62 I'll leave it to you to start a separate discussion or add an agenda item to the weekly meeting when it is timely. It might be better to add it to the weekly meeting, since it seems like the kind of conversation that would go faster synchronously.

dmos62 Mar 16, 2022
Collaborator Author

I don't feel that it would be cheaper (even in the short term) to start off with a limited query API and add significant features as we go along, if that's what you have in mind.

I'm not sure what you mean in this sentence.

When you suggested waiting for the query builder UX to be finalized, it occurred to me that you'd like to use it to model the database query. My concern is that that might give us tunnel vision and it might make more sense to model the query API around the goal of ultimately supporting any DML query. We'd still prioritize implementing the features needed for query builder's current UX, but we'd do what we can to make sure that it can be extended to support arbitrary DML use cases. I'm not saying that this is always a good approach, but that my intuition is saying that it is in this case.

databases fundamentally store data, and I think there's a strong case to be made that where the data is stored is a fundamental abstraction.

I've not considered that viewpoint. I think, in our case, it's less useful to think that Postgres is about storing objects/records, than that Postgres is about doing whatever with tabular data (alternative viewpoint I'm advocating). I might be biased because I've been working on sculpting SELECT queries for some months now. We can agree to disagree. I'm content with having a clearer picture of your thinking in this context.

As I understand it, the definition of what goes in the UI namespace is "things that aren't actual DB concepts that we came up with for the UI". If you disagree with this, we need to figure that out

I do think that the definition we need is a different one. I agree that we need to figure that out.

General REST vs. RPC definitions (maybe?)

I think it would be best to come up with REST/RPC conventions and definitions after we've implemented our first real RPC interface. The discussion would be abstract and misinformed if we did it earlier, imo.

I'll leave it to you to start a separate discussion or add an agenda item to the weekly meeting when it is timely.

I'll add it to my backlog.
