Relax the concepts of records and data transformations #1154
Replies: 3 comments 12 replies
-
@dmos62 I'd like to get on the same page about the goals of the API first before discussing any changes to the API or rows vs. records. Current API goalsAs I see it, the main goal of our API is to transform SQL operations into the REST paradigm. The API provides an interface of objects, clients can rely on the structure of those objects and use the usual HTTP verbs with those objects to update them. The backend does the (difficult and complicated) work of translating those HTTP verbs into the appropriate SQL. This does involve significant cognitive load for the backend engineers, but it makes the API idempotent and very easy to work with without any SQL knowledge. It also transfers the burden of state management to the backend and works well with HTTP request/response cycles, since those are stateless. This is the philosophy behind our API standards. I think if we follow these, we'll also offer an API experience that no other product does (because it's hard to do). Thoughts on proposed API goalsWhat you're suggesting is that the goal of the API be to mirror the structure of SQL queries. This transfers the burden of state management and cognitive load from the backend to the API consumer. The API consumer will need to know a lot about how our API works and what parameters lead to response structure changes. Also, the work of frontend clients is made much easier when they can associate an endpoint with a standard JSON structure – MVC frontend frameworks build this assumption into their modals. Frontend code will be much more complicated if we vary the JSON structure. If we do want a "query-like" API, the best course of action would be using something like GraphQL (which is designed for query-like APIs) instead of trying to adapt our current structure so that it's not REST and not GraphQL either. I'm open to having an additional GraphQL API, but not right now. |
Beta Was this translation helpful? Give feedback.
-
Records abstraction@kgodey The pepperoni/ingredients use case you outlined is a common task and it's nice to have a simple interface for that. The records endpoint and its RESTful properties that I'm attacking suit this use case well. It's important, however, that records not become our fundamental abstraction for making queries. More interesting queries should be based on the concept of rows and columns, not records and fields. Otherwise you can only use transformations that output records and fields, which is not enough for us. REST vs RPCOur records endpoint query parameters are all verbs describing transformations: order by, filter, group by, limit, offset, [apply] db function, deduplicate, show duplicates only. That's a deviation from (pure) RESTfulness, but I think that's ok. The relevant aspect of REST is largely based around the file metaphore, and it is only applicable to the most basic of the problems we're trying to solve. For example, as soon as you want to filter some records, you must leave pure REST behind, because Changes to queriesI agree with @mathemancer that we'll need to support increasingly more "transformations" and it's becoming apparent that we don't have the structure to manipulate them well. For example, how would we specify the order of transformations when they are declared via a query parameter? All query parameter based solutions are awkward in this case, since a request's query parameters are an unordered dict (unordered by strong convention). We essentially need a graph-based approach, like Suggested refactorsSince the records endpoint is based on the records/fields abstraction, which is an abstraction around a subset of manipulations on rows/columns:
That way, API clients get the comfort of the records abstraction, while we can offer other features without being hindered by it. Connection to query builderI think that refactor 2 above, and its structure, is part of how query builder will be implemented on the service and data layers. For query builder, we need to be able to go from an SQL query to an internal representation, and from an internal representation to an SQL query. The suggested refactor would be the "internal representation to SQL query" part of the query builder. |
Beta Was this translation helpful? Give feedback.
-
I'd like to see the records endpoint in the UI namespace, because I consider it an abstraction on top of a more fundamental abstraction. "Who came up with this" is not a useful criteria for choosing a namespace for an abstraction, I'd say. There's a void in the DB namespace for where the new, more fundamental query abstraction will be. We can put the records endpoint there, or we can move it to the UI namespace, thus making the mentioned "void" more explicit. That was my thinking. I think latter is better API structure, but the difference is of negligible pragmatic value. Discussions about what is REST don't seem to be leading anywhere, but I can't resist.
The distinction between content and structure is arbitrary. We can make it whatever fits the resource we've devised, resource itself being an arbitrary, noun-ish abstraction on top of "real" entities. But, that's only if we disregard the REST architectural requirement that having a resource's representation should mean that you know how to update or delete it. In that sense, you're right that some representations (like applying an arbitrary function) are not RESTful. Deduplicate is though.
Any transformation can be seen as a property of the output. It doesn't matter how the transformation is phrased in English.
Sure. Though, I don't feel that it would be cheaper (even in the short term) to start off with a limited query API and add significant features as we go along, if that's what you have in mind. |
Beta Was this translation helpful? Give feedback.
-
Prelude
This is a discussion that started in this PR thread with this comment by @kgodey (slightly redacted):
Initially, before reposting here, below text was a reply to the above comment.
Proposal
@kgodey This has been a pain point that I wrestled with. Thing is that everything on the backend is set up to think of this endpoint (and its foundational infrastructure) not in terms of records (I'd like to not rehash the discussion on what that is precisely), but in terms of persistence-agnostic rows. On the backend we're essentially building SQL pipelines and SQL is not aware of the difference between rows and records.
I advocate, with determination, that we embrace that and rename the endpoint from
/db/tables/1/records
to the more general/db/tables/1/rows
. Its definition is then broadened to whatever tabular output that can be arrived at by any sequence of transformations made available by this endpoint, the starting point being that table, and the ending point being anything tabular. Or, in other terms, therows
endpoint response would be the result of a pipeline, whose starting point is the table, intermediate steps are the transformations likeorder_by
orfilter
ordb_function
, and the output is tabular.For illustration of what I mean when I say that the backend is building pipelines, this is the workhorse behind the
records
endpoint:Notice its pipish nature. Here,
db_function
is just another step, same asoffset
,deduplicate
,group_by
, etc.You mentioned that our API is based on objects: I don't think that my suggestion is in conflict with that. I'm saying let's take another step in that direction and say that our API is based on SQL objects: schemas, tables, rows and columns. But, let's not constrain the definition of those objects to only things that are currently persisted on the disk. I mentioned this in another thread today: I don't think that data transience should play a role in the definitions of these basic objects.
I also don't think that there's a resource/RPC conflict here. Let's consider a resource to be whatever a given pipeline outputs when applied to a given resource's path. Path, as in an address like
database -> schema -> table
. Let's consider all transformations as fundamentally equivalent, whatever concrete transformations they apply, wheter it's the redefinition of the order of rows (order_by
) or the passing of the entire table through a function (e.g.db_function
).I think that this approach would scale great. For example, summarizing transformations, just like
db_function
, would not require a separate endpoint, or deviate into RPC territory.Also, it's very flexible, since it's basically using a single pipeline builder, the
get_query
function above. Currently it's pretty primitive (e.g. fixed order of transformations), but who knows what it will grow into. This also mirrors the fact that any SQL query/view is a single pipeline.Advantages briefly
To address the probable future comment about me prioritizing architecture over product, this is not it! My intent here is to manipulate architecture to reduce the workload on developers, make hard problems easier, make annoying problems less common, and to get the product out better and faster. I'll try to summarize the benefits briefly, since this has been a long post:
Next steps
To be clear, implied next steps are:
/api/db/v0/tables/0/records
to[...]/rows
, to underline that we're dropping any transience/persistance assertions about the data returned;rows
(previouslyrecords
) endpoint to accept and apply any transformation, even if it "changes" the data;Beta Was this translation helpful? Give feedback.
All reactions