Which column identifier should we use for DML operation #1147

silentninja · 2022-03-07T22:08:49Z

silentninja
Mar 7, 2022
Collaborator

We have decided to use attnums for DDL operations based on our previous discussion, but there are a few issues that come up when we use attnums for our DML operations.

Certain functions use aliased columns which won't be backed by an attnum or an id. Relevant Matrix discussion

Brent and I had a sync call about this, and these are the notes from the discussion.

Brent also feels this is a good time to raise the topic, since ignoring non-persisted columns could limit what features such as filters and the query builder can do.

So how should we go about dealing with such issues that need a temporary aliased column?

mathemancer · 2022-03-08T09:20:31Z

mathemancer
Mar 8, 2022
Maintainer

I strongly support adding a feature that allows the front end to give an alias to non-persisted columns. Without an alias, we'd consider them "anonymous", so they couldn't be referred to by steps further down a pipeline. Adding this feature would allow things like "filter by col1 + col2 > 10" to be built up even if we don't have them pre-rolled. Currently, we aren't implementing building upon preexisting views. We also have no way to refer to a "column" of data resulting from a function application, since it won't have an id and can't have an attnum. It seems like one or the other of those two features would be essential for any but the most basic or most curated tasks.
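To make the aliasing idea concrete, here is a minimal sketch of "filter by col1 + col2 > 10" in which the computed column gets an alias so that a later step can refer to it. It uses an in-memory SQLite database from the Python standard library purely for illustration (the discussion is about Postgres); the table name `t`, the alias `col_sum`, and the function name are assumptions made up for this example.

```python
import sqlite3


def filter_by_aliased_sum(rows, threshold):
    """Return rows whose aliased computed column exceeds the threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    # The inner query gives the non-persisted column an alias (col_sum);
    # the outer query is a later "pipeline step" that refers to it by name.
    query = """
        SELECT col1, col2, col_sum
        FROM (SELECT col1, col2, col1 + col2 AS col_sum FROM t) AS step_1
        WHERE col_sum > ?
        ORDER BY col_sum
    """
    result = conn.execute(query, (threshold,)).fetchall()
    conn.close()
    return result
```

Without the alias, the outer step would have no stable name for the computed column, which is the "anonymous" situation described above.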

0 replies

kgodey · 2022-03-08T16:44:27Z

kgodey
Mar 8, 2022
Maintainer

I think that by default, the word "column" in the API/frontend should refer to something that is persisted and has an ID. We should always use IDs to work with these objects in the API. I don't have a strong opinion on whether to use IDs, attnums, or names internally for DML operations; I'll defer to @mathemancer here.

For non-persisted columns, we should come up with a different word to describe the concept in the API. I am fine with using names instead of IDs to refer to these. Perhaps we could call these "temporary columns" or "virtual columns".

3 replies

silentninja Mar 8, 2022
Collaborator Author

For non-persisted columns, we should come up with a different word to describe the concept in the API. I am fine with using names instead of IDs to refer to these. Perhaps we could call these "temporary columns" or "virtual columns".

How would we reference them? It would be inconsistent to use ids for certain columns and names for others. Most of the requirement comes from having to build a chain of filters (a common use case when using SQL). Initially, the solution for that was to use views, but even that would cause some UX inconvenience and result in too many unnecessary views.

kgodey Mar 8, 2022
Maintainer

They're not the same type of object, so I don't see how it's inconsistent to use id for columns and name for virtual columns. Internally, we can convert id to name if needed.

In general, I'd like to try to avoid figuring out the architecture for features we haven't yet built. We don't have any way for the frontend to use virtual columns yet. When we do, we can then figure out how to represent them at the API level. It's too easy to get sidetracked thinking about UX or architectural concerns for features that don't exist yet.

If I'm missing something about how it affects a current feature, please let me know.

dmos62 Mar 10, 2022
Collaborator

I don't think it's a good idea to consider columns to be only data that is persisted. I think it's much more powerful to think of tables, columns and rows as tabular data formats that have nothing to do with storage (or transience) of that data. As I see it, one query returns "persisted" data, another returns "temporary" data, another returns a mix of both; but, all of them return it in tabular format and we manipulate it in a tabular format too.

silentninja · 2022-03-08T23:51:01Z

silentninja
Mar 8, 2022
Collaborator Author

They're not the same type of object

That is what makes it inconsistent when referring to them along with a persisted column in the same API. For example, when chaining, a persisted column would be referenced by a column id and a virtual column would be referenced by a name:

db_function(col1+col2, col1)

If I'm missing something about how it affects a current feature, please let me know.

@mathemancer or @dmos62 would be better equipped to answer this question.

Most of the concern is about how to chain/nest queries, since we won't be using views for it.

Edit: Looking at the View spec and the comments, it seems like the alpha release won't be supporting pipelining queries either. So it feels too early at this point. I am leaning towards using column id as it is much more consistent on the API level with the current set of features.

2 replies

kgodey Mar 9, 2022
Maintainer

Edit: Looking at the View spec and the comments, it seems like the alpha release won't be supporting pipelining queries either. So it feels too early at this point. I am leaning towards using column id as it is much more consistent on the API level with the current set of features.

Agreed. As I mentioned in my previous comment, we'll figure out the architecture for temporary columns and chaining queries when we need support for them in the frontend.

dmos62 Mar 10, 2022
Collaborator

I've submitted a PR that returns a non-persisted column that's derived by applying a DB function to a table, so that column doesn't have a Django ID. What I did is I just returned its Postgres-given name, which is something like either anon_1 or some_db_function_1. It feels a bit hacky, but it's not obvious that it's not robust: the frontend can just check if the column identifier is an integer or a string and know whether it's dynamic or a reference to a persisted column.
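The integer-or-string check described above could look roughly like the sketch below. It is written in Python for consistency with the backend, although the real check would live in the frontend; the function name `is_persisted_column` is hypothetical.

```python
from typing import Union


def is_persisted_column(identifier: Union[int, str]) -> bool:
    """Return True for a persisted column's ID, False for a dynamic column's name."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(identifier, bool):
        raise TypeError("column identifier must be an int or a str")
    if isinstance(identifier, int):
        return True  # integer: reference to a persisted column
    if isinstance(identifier, str):
        return False  # string: Postgres-given name such as "anon_1"
    raise TypeError("column identifier must be an int or a str")
```

One thing this relies on is that IDs are never serialized as strings; if they were (for example, `"12"`), the check would classify them as dynamic.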

mathemancer · 2022-03-09T09:34:33Z

mathemancer
Mar 9, 2022
Maintainer

Currently, I think our only DML operations (for the user DB) are under db/records/operations/. We should:

Double-check that this is correct.
Enforce this as-yet unspoken convention.
Enforce a further convention that would disallow any DDL-DML combination operations from the API, and disallow any such combination functions from being public in the db library.
Use names for referring to columns internally in the db.records.operations module.

This would make things a bunch simpler for reasons we've discussed many times, and make DML operations more performant (performance matters much more for DML than DDL in general). If we enforce the conventions separating DDL and DML, and keep an eye on where DDL operations aren't (i.e., they can't be in the db.records.operations module), then it is safe to use names for DML requests, IMO.

5 replies

kgodey Mar 9, 2022
Maintainer

Sounds good to me.

silentninja Mar 10, 2022
Collaborator Author

Should we use column names on the service layer too? I would prefer that we use ids on the service layer and convert them to names when passing data to the data layer.
How can we ensure that the operations used/imported by the records module are not DDL operations?

dmos62 Mar 10, 2022
Collaborator

I like the idea of enforcing a clear separation of DDL and DML operations.

kgodey Mar 10, 2022
Maintainer

Should we use column names on the service layer too? I would prefer that we use ids on the service layer and convert them to names when passing data to the data layer.

@silentninja The service layer should always use IDs, the interface exposed by the API is RESTful and we should follow REST conventions.
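As a rough illustration of "IDs on the service layer, names on the data layer", the sketch below translates a record keyed by column ID into one keyed by column name before it would be handed to a DML operation. The function name and the `name_by_id` mapping are assumptions for this example, not existing Mathesar code.

```python
def translate_ids_to_names(record_by_id, name_by_id):
    """Re-key a record from column IDs to column names.

    record_by_id: {column_id: value} as received by the service layer.
    name_by_id:   {column_id: column_name}, looked up once per request.
    """
    missing = [cid for cid in record_by_id if cid not in name_by_id]
    if missing:
        # Fail loudly instead of silently dropping unknown columns.
        raise KeyError(f"Unknown column ids: {missing}")
    return {name_by_id[cid]: value for cid, value in record_by_id.items()}
```

Usage would be along the lines of `translate_ids_to_names({1: "Ada", 2: 36}, {1: "name", 2: "age"})`, with the resulting name-keyed dict passed on to the records operations.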

How can we ensure that the operations used/imported by the records module are not DDL operations?

We could do one/more of the following:

Add copious code comments that say "Do not import DDL operations here"
Write a test that checks for imports from any known DDL files. We'd need to keep this updated, but even if we forget to update it, it's better than nothing.
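The import-checking test could be sketched with the standard library's `ast` module, as below. The module paths in `KNOWN_DDL_MODULES` are placeholders chosen for illustration (the real list would have to be maintained by hand, as noted above), and relative imports are not resolved in this sketch.

```python
import ast

# Hypothetical list of modules that contain DDL operations.
KNOWN_DDL_MODULES = ("db.columns.operations", "db.tables.operations")


def find_ddl_imports(source, ddl_modules=KNOWN_DDL_MODULES):
    """Return the sorted DDL modules that the given source code imports."""
    imported = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import a.b.c
            imported.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # from a.b import c  ->  record both "a.b" and "a.b.c"
            imported.add(node.module)
            imported.update(f"{node.module}.{alias.name}" for alias in node.names)
    return sorted(
        ddl
        for ddl in ddl_modules
        if any(name == ddl or name.startswith(ddl + ".") for name in imported)
    )
```

A test could read every file under db/records/operations/ and assert that `find_ddl_imports(source) == []` for each of them.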

mathemancer Mar 14, 2022
Maintainer

Personally, I have way more faith in the testing idea than in commenting.

dmos62 · 2022-03-10T12:34:02Z

dmos62
Mar 10, 2022
Collaborator

Is it conceivable to use attnums instead of Django IDs? That would free us from having to do input/output translation to translate between Django IDs and names, since attnums (and other robust database-level identifiers) are allowed on the database layer.

The translation between column ids and names has been costing a lot of time for me, and for others as well, I think.

6 replies

dmos62 Mar 10, 2022
Collaborator

@silentninja wouldn't an attnum+table_oid-based service-layer lookup be preferable to an identifier that can't be used beyond the service layer (the Django ID)? A Django ID necessitates rewriting input/output JSON in the service layer; attnum+table_oid would avoid that.

What I'm proposing is something like GET db/tables/{oid}/columns/{attnum}/. I.e. using database-level identifiers for columns, tables, and databases too (if databases have something more robust than names).

silentninja Mar 10, 2022
Collaborator Author

While using attnum as an identifier is certainly possible on the service layer, and most of the problems can be solved by restructuring the API, I am not sure how it would solve the problem of computed (transient) columns, which is the major concern.

The problem associated with using a database-level identifier is that it always has to be a tuple of (database, schema oid, table oid, column attnum).

Which is not ideal if we want a flat structure
It would be hard to use them in a request body, especially when cross-referencing across tables. For example, if I want to join two columns, I would need to send the complete database identifier (database, schema oid, table oid, column attnum) for each column.
Linking objects will be a problem as Django does not support composite foreign keys.
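To illustrate the request-body concern, here is a small sketch of what a join request might carry if every column were addressed by the full database-level tuple. The `ColumnRef` class, the `join_request_body` function, and the shape of the request are all hypothetical; they only show how much identifying data each column reference needs compared with a single integer ID.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ColumnRef:
    """Database-level column identifier: (database, schema oid, table oid, attnum)."""
    database: str
    schema_oid: int
    table_oid: int
    attnum: int


def join_request_body(left: ColumnRef, right: ColumnRef) -> dict:
    """Build a (hypothetical) join request in which each column is a full tuple."""
    return {"join": {"left": asdict(left), "right": asdict(right)}}
```

With flat IDs, the equivalent body would be something like `{"join": {"left": 12, "right": 34}}`; with database-level identifiers, each side becomes a four-field object.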

dmos62 Mar 10, 2022
Collaborator

I am not sure how it would solve the problem of computed (transient) columns which is the major concern

Yes, this is slightly off topic. Apologies for hijacking.

Which is not ideal if we want a flat structure

Having a non-flat structure should be fine, since database objects are non-flat. If we flatten the API (what we're doing now), we're shifting that complexity over to an implicit translation sub-layer in the service-layer, which is worse, I think.

It would be hard to use them in a request body, especially when cross-referencing across tables. For example, if I want to join two columns, I would need to send the complete database identifier (database, schema oid, table oid, column attnum) for each column.

I think using a more elaborate JSON structure is preferable to the translation/parsing that we currently have to do in the service-layer to switch from/to a flat structure.

Linking objects will be a problem as Django does not support composite foreign keys.

That's an interesting point, but it's solvable one way or another. It might take more queries, but I think right now it's preferable to prioritize developer time over query efficiency. Also, using database identifiers throughout the stack might improve CPU/IO efficiency in other places.

silentninja Mar 10, 2022
Collaborator Author

That's an interesting point, but it's solvable one way or another. It might take more queries, but I think right now it's preferable to prioritize developer time over query efficiency. Also, using database identifiers throughout the stack might improve CPU/IO efficiency in other places.

By linking I mean referring to an object using a Foreign key. For example, let's say we want to connect multiple display options to a column, we would need to store the complete database reference tuple as individual columns and would need all the information when fetching the related display option.

display_options | database_name | schema_oid | table_oid | column_attnum

Having denormalised data is a problem of its own and I don't see a good reason to do that.

It might take more queries

I am not sure how it would take more queries.

Having a non-flat structure should be fine, since database objects are non-flat. If we flatten the API (what we're doing now), we're shifting that complexity over to an implicit translation sub-layer in the service layer, which is worse, I think.

We already had a discussion about this in #974 and the current discussion does not have enough reasons to circumvent that.

kgodey Mar 10, 2022
Maintainer

Let's hold this discussion for now – we're not going to be rewriting the API structure anytime soon because the API works and our current goal is getting the release out. The features we have planned for the alpha release would not be improved by this rewrite.

Also, our API structure is built on the REST paradigm, which involves unique IDs for resources. We're using REST because it's familiar, makes the API easy to work with, and gives us enough guidelines to help us design our APIs. While we can talk about whether it's the best paradigm for Mathesar, now is not the time for that conversation.

dmos62 · 2022-03-10T16:10:02Z

dmos62
Mar 10, 2022
Collaborator

The way I see this working is having different kinds of identifiers work in tandem. How DB functions currently address this is that column references are represented by two functions, one for each type of identifier supported: column_name and column_id. There could be a third column reference function (maybe column_dynamic) which would take some string identifier and it would just output a reference to a column like this: (pseudocode) column_dynamic("XYZ") == column_name("DYNAMIC_PREFIX-" + "XYZ"). As long as that DYNAMIC_PREFIX- is agreed upon, we can let the API user just use that wherever in his request and we'll know what he means. Am I missing any edge cases that would invalidate this approach?
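The pseudocode above can be turned into a small runnable sketch. Here column references are modeled as plain `(kind, value)` tuples, which is an assumption made for illustration and not the actual representation used by the DB functions; `column_dynamic` and `is_dynamic` are likewise hypothetical.

```python
# Agreed-upon marker for dynamic (non-persisted) columns; the value is a placeholder.
DYNAMIC_PREFIX = "DYNAMIC_PREFIX-"


def column_name(name):
    """Reference a column by its name."""
    return ("column_name", name)


def column_id(id_):
    """Reference a persisted column by its ID."""
    return ("column_id", id_)


def column_dynamic(identifier):
    """Reference a dynamic column; this is just a prefixed name reference."""
    return column_name(DYNAMIC_PREFIX + identifier)


def is_dynamic(ref):
    """Tell whether a column reference points at a dynamic column."""
    kind, value = ref
    return kind == "column_name" and value.startswith(DYNAMIC_PREFIX)
```

One edge case to keep in mind: a persisted column whose actual name happens to start with the agreed prefix would be indistinguishable from a dynamic one when referenced by name, so the prefix would need to be something users cannot or are unlikely to choose.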

Whatever solution we choose, it has to be based on names, since that's the bread and butter of Postgres.

0 replies
