-
Notifications
You must be signed in to change notification settings - Fork 27
RFC for supporting mixed case identifiers #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,289 @@ | ||
# **RFC-0010 for Presto** | ||
|
||
## Mixed case identifiers | ||
|
||
Proposers | ||
|
||
* Reetika Agrawal | ||
|
||
## Summary | ||
|
||
Improve Presto's identifier (schema, table & column names) handling to align with SQL standards, ensuring better interoperability with case-sensitive | ||
and case-normalizing databases while minimizing SPI-breaking changes. | ||
|
||
## Background | ||
|
||
Presto treats all identifiers as case-insensitive, normalizing them to lowercase. This creates issues when | ||
querying databases that are case-sensitive (e.g., MySQL, PostgreSQL) or case-normalizing to uppercase (e.g., Oracle, | ||
DB2). Without a standard approach, identifiers might not match the actual names in the underlying data sources, leading | ||
to unexpected query failures or incorrect results. | ||
|
||
The goal here is to improve interoperability with storage engines by aligning identifier handling with SQL standards | ||
while ensuring a seamless user experience. Ideally, the change should be implemented in a way that minimizes | ||
breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant impact. | ||
|
||
### Goals | ||
|
||
- Align Presto’s identifier handling with SQL standards to improve interoperability with case-sensitive and | ||
case-normalizing databases. | ||
- Minimize SPI-breaking changes to maintain backward compatibility for existing connectors. | ||
- Introduce a mechanism for connectors to define their own identifier normalization behavior. | ||
- Allow identifiers to retain their original case where necessary, preventing unexpected query failures. | ||
- Ensure Access Control SPI can correctly normalize identifiers. | ||
- Preserve a seamless user experience while making these changes. | ||
|
||
### Proposed Plan | ||
|
||
Presto's default behavior is - | ||
|
||
- Identifiers are converted to lowercase by default unless a connector enforces a specific behavior. | ||
- Identifiers are normalized when: | ||
- Resolving schemas, tables, columns, views. | ||
- Retrieving metadata from connectors. | ||
- Displaying entity names in metadata introspection commands like SHOW TABLES and DESCRIBE. | ||
|
||
Presto uses identifiers in several ways: | ||
|
||
- Matching identifiers to locate entities such as catalogs, schemas, tables, views. | ||
- Resolving column names based on table metadata provided by connectors. | ||
- Passing identifiers to connectors when creating new entities. | ||
- Processing and displaying entity names retrieved from connectors, including column resolution and metadata introspection commands like SHOW and DESCRIBE. | ||
|
||
## Proposed Implementation | ||
|
||
#### Core Changes | ||
|
||
* In the presto-spi, add new API to pass original identifier (Schema, table and column names) | ||
* Introduce new API in Metadata for preserving lower case identifier by default to preserve backward compatibility | ||
* Introduce new Connector specific API in ConnectorMetadata | ||
|
||
Metadata.java | ||
|
||
```java | ||
String normalizeIdentifier(Session session, String catalogName, String identifier); | ||
``` | ||
|
||
MetadataManager.java | ||
|
||
```java | ||
@Override | ||
public String normalizeIdentifier(Session session, String catalogName, String identifier) | ||
{ | ||
Optional<CatalogMetadata> catalogMetadata = getOptionalCatalogMetadata(session, transactionManager, catalogName); | ||
if (catalogMetadata.isPresent()) { | ||
ConnectorId connectorId = catalogMetadata.get().getConnectorId(); | ||
ConnectorMetadata metadata = catalogMetadata.get().getMetadataFor(connectorId); | ||
return metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier); | ||
} | ||
return identifier.toLowerCase(ENGLISH); | ||
} | ||
``` | ||
|
||
ConnectorMetadata.java | ||
```java | ||
|
||
/** | ||
* Normalize the provided SQL identifier according to connector-specific rules | ||
*/ | ||
default String normalizeIdentifier(ConnectorSession session, String identifier) | ||
{ | ||
return identifier.toLowerCase(ENGLISH); | ||
} | ||
``` | ||
|
||
JDBC Connector specific implementation | ||
|
||
JdbcMetadata.java | ||
|
||
```java | ||
@Override | ||
public String normalizeIdentifier(ConnectorSession session, String identifier) | ||
{ | ||
return jdbcClient.normalizeIdentifier(session, identifier); | ||
} | ||
``` | ||
|
||
JdbcClient.java | ||
```java | ||
String normalizeIdentifier(ConnectorSession session, String identifier); | ||
``` | ||
|
||
BaseJdbcClient.java | ||
```java | ||
@Override | ||
public String normalizeIdentifier(ConnectorSession session, String identifier) | ||
{ | ||
return identifier.toLowerCase(ENGLISH); | ||
} | ||
``` | ||
|
||
Example - Connector specific implementation - | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Consider adding an example for the use of the |
||
MySqlClient.java | ||
|
||
```java | ||
@Override | ||
public String normalizeIdentifier(ConnectorSession session, String identifier) | ||
{ | ||
return identifier; | ||
} | ||
``` | ||
|
||
#### Example Queries | ||
|
||
#### MySQL Table Handling | ||
|
||
``` | ||
presto> show schemas from mysql; | ||
Schema | ||
-------------------- | ||
Test | ||
TestDb | ||
information_schema | ||
performance_schema | ||
sys | ||
testdb | ||
(6 rows) | ||
|
||
presto> show tables from mysql.TestDb; | ||
Table | ||
----------- | ||
Test | ||
TestTable | ||
testtable | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds like, we are making presto case sensitive while remaining backwards compatible. To me the term "Mixed case" across the doc sounds less specific, not sure what others think, but |
||
(3 rows) | ||
|
||
presto> SHOW CREATE TABLE mysql.TestDb.Test; | ||
Create Table | ||
-------------------------------------- | ||
CREATE TABLE mysql."TestDb"."Test" ( | ||
"id" integer, | ||
"Name" char(10) | ||
) | ||
(1 row) | ||
|
||
presto> select * from mysql.TestDb.Test; | ||
id | Name | ||
----+------------ | ||
2 | Tom | ||
(1 row) | ||
``` | ||
|
||
## Behavioral Examples with `case-sensitive-name-matching` Flag | ||
|
||
Presto will allow the connector to handle identifier normalization if the `case-sensitive-name-matching` configuration flag is | ||
supported by the connector. For example, if the Postgres connector does not normalize identifiers to lowercase, the | ||
original case from the Presto DDL is preserved — including for unquoted identifiers. | ||
|
||
**When case-sensitive-name-matching = false (default behavior)** | ||
This is the default behavior in Presto. Identifiers are normalized to lowercase, regardless of quoting. | ||
|
||
Presto DDL: | ||
|
||
```sql | ||
CREATE TABLE TeSt1 ( | ||
ID INT, | ||
"Name" VARCHAR | ||
); | ||
``` | ||
|
||
Underlying DDL sent to Postgres, since Postgres identifierQuote is double quotes: | ||
|
||
```sql | ||
CREATE TABLE "test1" ( | ||
"id" INT, | ||
"name" VARCHAR | ||
); | ||
``` | ||
|
||
* Table and column names are normalized to lowercase. | ||
* Quoting is added as needed by the connector, but the case is not preserved. | ||
|
||
**When case-sensitive-name-matching = true** | ||
|
||
Connector is responsible for identifier normalization, allowing case preservation or other casing. | ||
|
||
```sql | ||
CREATE TABLE test1 ( | ||
UPR INTEGER, | ||
lwr INTEGER, | ||
"Mixed" INTEGER | ||
); | ||
``` | ||
|
||
Resulting Postgres DDL: | ||
|
||
```sql | ||
CREATE TABLE "test1" ( | ||
"UPR" INTEGER, | ||
"lwr" INTEGER, | ||
"Mixed" INTEGER | ||
); | ||
|
||
``` | ||
|
||
* Table name is preserved as "test1" (unquoted input becomes quoted). | ||
* Column names retain their original case — whether quoted or unquoted. | ||
* This behavior aligns with SQL standard semantics and matches user intent. | ||
|
||
If users prefer lowercase identifiers, they can write: | ||
|
||
Presto DDL: | ||
|
||
```sql | ||
CREATE TABLE test1 ( | ||
upr INTEGER, | ||
lwr INTEGER, | ||
mixed INTEGER | ||
); | ||
``` | ||
|
||
Underlying DDL sent to Postgres: | ||
|
||
```sql | ||
CREATE TABLE "test1" ( | ||
"upr" INTEGER, | ||
"lwr" INTEGER, | ||
"mixed" INTEGER | ||
); | ||
``` | ||
### Rationale | ||
|
||
This behavior gives users full control over identifier casing, matching SQL standard semantics and improving | ||
compatibility with case-sensitive engines like Postgres. It also ensures a smooth migration path by defaulting to | ||
existing behavior (case-sensitive-name-matching = false), avoiding surprises for current users. | ||
|
||
## Backward Compatibility Considerations | ||
|
||
* Existing connectors that do not implement normalizeIdentifier will default to lowercase normalization. | ||
* Any connectors requiring case preservation can override the default behavior. | ||
* A configuration flag could be introduced to allow backward-compatible identifier handling at the catalog level. | ||
|
||
## Test Plan | ||
|
||
* Ensure that existing CI tests pass for connectors where no specific implementation is added. | ||
* Add unit tests for testing mixed-case identifiers support in a JDBC connector (e.g., MySQL, PostgreSQL). | ||
* Cover cases such as: | ||
- Queries with mixed-case identifiers. | ||
- Metadata retrieval commands (SHOW SCHEMAS, SHOW TABLES, DESCRIBE). | ||
- Joins, subqueries, and alias usage with mixed-case identifiers. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can also add |
||
|
||
To ensure backward-compatibility current connectors where connector specific implementation is not added, existing CI tests should pass. | ||
Add support for mixed case for a JDBC connector (ex. mysql, postgresql etc) and add relevant Unit tests for same. | ||
|
||
## Modules involved | ||
- `presto-main` | ||
- `presto-common` | ||
- `presto-spi` | ||
- `presto-parser` | ||
- `presto-base-jdbc` | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Consider adding the link to the POC implementation, if you have one. |
||
## Final Thoughts | ||
|
||
This RFC enhances Presto's identifier handling for improved cross-engine compatibility. The proposed changes ensure | ||
better adherence to SQL standards while maintaining backward compatibility. Implementing connector-specific identifier | ||
normalization will help prevent unexpected query failures and improve user experience when working with different | ||
databases. | ||
Would appreciate feedback on any additional cases or edge scenarios that should be covered! | ||
|
||
## WIP - Draft PR Changes | ||
https://github.com/prestodb/presto/pull/24551 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agrawalreetika thanks for this very helpful proposal. Bring a few questions for discussing. Please let me know if there is anything I didn't understand correctly.
It seems that when setting
case-sensitive-name-matching=true
, the effect of delimited identifier oncatalog/schema/table
is the same as that of undelimited identifier. This means that unlimited identifiers also retain the case, which seems a little different from the SQL spec. So should we add some explanations about the behavior of delimited and undelimited identifiers?Besides, for databases like PostgreSQL, it's case sensitive for delimited column names, which means that the table can contain both column "ABCol" and column "AbCol" at the same time. Have we considered this situation? If not, should we clearly state that this situation is not supported in presto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hantangwangd Thank you for the thoughtful review!
You're absolutely right about the behavior around delimited identifiers. As part of the initial proposal, the goal was to remove the default lowercasing of identifiers and delegate normalization to the connector level — this was introduced in PR #24551.
The current PR focuses specifically on handling schema and table names. Column names aren't included yet, as they are still lowercased at the SPI level (ColumnMetadata.java#L45). Supporting case-sensitive column names will require removing this default behavior and updating each connector to handle normalization through the metadata API. This work can be planned for a follow-up PR. LMK, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agrawalreetika I saw the PR first, and then came to read the proposal 😄. The column issue is OK as long as we're explicitly aware of the current limitations.
I'm a little concerned about how we support
Postgresql
. Basically, when it comes to case sensitivity, the default behavior ofPostgresql
follows the SQL specification most faithfully, that is, identifiers in quotation marks are treated as case-sensitive in SQL statement, whereas unquoted identifiers are automatically converted to lowercase.So, do you think it makes sense to pass the state of
delimited
as a parameter when define the SPI methods? For example:Metadata.java
ConnectorMetadata.java
As I understand, if some connectors (such as Postgresql) is unaware of this
delimited
flag, it cannot make the right case-handling logic. Other connectors (such as Mysql) may optionally ignore this flag.Haven't dug deep into the code, so not very sure about the implementation complexity. Please let me know if this is a feasible way.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for the insightful follow-up, @hantangwangd!
You're right about the importance of the delimited flag, especially for connectors like PostgreSQL that handle identifier casing differently based on whether the identifier is quoted.
In fact, during the initial proposal, I had a similar thought and included the delimited flag as a parameter in the proposed SPI method for exactly this reason—to give connectors enough context to apply the correct case-handling logic based on whether the identifier was quoted or not.
However, in the current state of Presto, the information about whether a schema or table name was quoted in the original SQL isn't preserved by the time it reaches the SPI layer. Adding support for retaining that quoting information would require changes. So I scoped that out for now and focused on getting the base support in place, with the intention to handle quoting and delimited identifiers more fully in a follow-up iteration.
You're absolutely right that for PostgreSQL and similar connectors, this context is important.
Let me know if that sounds reasonable, or if you'd prefer we try to tackle quoting preservation sooner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agrawalreetika Thanks for your explanation.
As you said, the delimited flag is very important for handling the case sensitivity of identifier, so IMO we should at least take it into account in the proposal, since the solution designed in the proposal should be as comprehensive as possible.
For actual implementation, I agree with you that we can accomplish the whole proposal progressively though several PRs ---- especially if retaining the quoting information for schema/table name need a lot of changes. Do you think this makes sense?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hantangwangd The SQL standard is interepreted differently by different vendors, e.g DB2 and Oracle choose to UPPER-case by default 1, Postgres chooses to lower-case
IMO, whatever way we slice it, given that we federate to connectors for storage, there is going to be ambiguity and disagreement on the 'right' behavior for DDL statements run via Presto
We could call this out in the RFC and point to these examples as what behavior to expect
Do you see any gotchas for SELECT / DML statements though ?
I think this RFC is primarliy enabling reading case-sensitive schemas/tables/columns and we should shake out those issues first
Footnotes
https://dba.stackexchange.com/questions/321413/why-are-unquoted-identifiers-upper-cased-per-sql-92 ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaneja 100% agree with this, it's always helpful to demonstrate the specific expected behavior through concrete examples.
Seems that the core of our discussion revolves around how we intend to handle unquoted identifiers. If the goal is to quickly enable reading case-sensitive schemas/tables/columns, I agree that — with
case-sensitive-name-matching
flag be enabled — treating unquoted identifiers as strictly case-sensitive is a reasonable way, and the approach is OK for now since the parsing and maintaining of the delimited flag for schema/table identifiers would likely require a lot of efforts.In future, if we implement maintaining and handling of the delimited flag for all identifiers, it seems that we no longer need this dedicated
case-sensitive-name-matching
configuration. At that point, it could be more align with SQL standard behavior: when users use unquoted identifiers, it implies case insensitivity (with specific case conversion handled by each connector); whereas when users use quoted identifiers, they meant to enforce strict case sensitivity. FYI, this behavior has been discussed for the case sensitivity ofRowType
field names andJSON
column names, see here and here.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your pointers and review @aaneja & @hantangwangd. I will add the example as well in the RFC for clarity.
My understanding here is, once we start sending
delimiter
flag as well fornormalizeIdentifier
API then in case of,delimiter = false
: it could be legacy (converting it to lower case) anddelimiter = true
: it would be converted to connector specific case.We don't have to have
case-sensitive-name-matching
configuration check in thenormalizeIdentifier
API.Current behaviour is -
If
case-sensitive-name-matching = false
case-sensitive-name-matching = true
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agrawalreetika, thanks for your summarizing, great understanding overall, just on little point to clarify when we start sending delimiter flag as well for
normalizeIdentifier
API.As I understand, we send the original-case identifiers along with their delimiter flags to the
normalizeIdentifier
API, and then in case of,delimiter = false
: the identifiers need to be normalized. Presto's default behavior is lowercase conversion, but connectors can customize their own normalization logic ---- like Oracle uppercasing them, PostgreSQL lowercasing them, and MySQL keep the case as-is.delimiter = true
: the identifiers always need to preserve their original case.Do you think this behavior makes sense? If there's anything I misunderstood, please let me know.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds fair to me.
We will also have flexibility in here, since we have
normalizeIdentifier
API layout already. In case we still want to have a flag to keep the current legacy behavior (lower-case), we would have that option as well.@aaneja @hantangwangd I’ve added detailed Behavioral Examples here to the RFC based on our earlier discussion. Please take a look at your convenience and share any feedback.
Also, when you get a chance, kindly drop a review on the corresponding PR: prestodb/presto#24551. Thanks a lot!