Columns passed in RAG seem ignored in generated SQL #194
Replies: 14 comments 3 replies
-
Try describing your table using create table syntax as a ddl?
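A minimal sketch of that suggestion: build a `CREATE TABLE` statement from the `INFORMATION_SCHEMA.COLUMNS` dataframe you already have, then feed it to `vn.train(ddl=...)`. This assumes the dataframe has the standard `table_name`, `column_name`, and `data_type` columns; the table and column names below are hypothetical placeholders.

```python
import pandas as pd

def ddl_from_information_schema(df: pd.DataFrame) -> str:
    """Build a CREATE TABLE statement from INFORMATION_SCHEMA.COLUMNS rows."""
    cols = ",\n  ".join(
        f"{row.column_name} {row.data_type}" for row in df.itertuples()
    )
    return f"CREATE TABLE {df.table_name.iloc[0]} (\n  {cols}\n)"

# Hypothetical example with two columns:
df = pd.DataFrame({
    "table_name": ["my_table", "my_table"],
    "column_name": ["author_name", "residence"],
    "data_type": ["STRING", "STRING"],
})
ddl = ddl_from_information_schema(df)
# vn.train(ddl=ddl)  # requires a configured Vanna instance
```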
…On Thu, 25 Jan 2024, 21:24 Pierre Oberholzer wrote:
Hi,
I'm trying to generate a SQL query that uses the valid column names
passed to the RAG via INFORMATION_SCHEMA.COLUMNS. However, the obtained
query does not mention any real column, and therefore fails on the DB.
```python
import vanna
from vanna.remote import VannaDefault
from vanna.openai.openai_chat import OpenAI_Chat
from vanna.chromadb.chromadb_vector import ChromaDB_VectorStore

# Globals
PROJECT_ID = "my_gcp_project"
DATASET_ID = "my_dataset"
TABLE_NAME = "my_table"
OPENAI_API_KEY = "sk-xxxxx"

# Class instantiation
class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={'api_key': OPENAI_API_KEY, 'model': 'gpt-4'})
vn.connect_to_bigquery(project_id=PROJECT_ID)

# The query below works in the BigQuery console.
METADATA_QUERY = f"""
SELECT * FROM {PROJECT_ID}.{DATASET_ID}.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = '{TABLE_NAME}'
"""
df_information_schema = vn.run_sql(METADATA_QUERY)

# This breaks the information schema into bite-sized chunks that can
# be referenced by the LLM
plan = vn.get_training_plan_generic(df_information_schema)

# If you like the plan, then uncomment this and run it to train
vn.train(plan=plan)

vn.ask(question="Where does John Deere live ?")
```

Obtained query:

```sql
SELECT residence
FROM Authors
WHERE name = 'John Deere';
```
Neither the fields `residence` and `name` nor the table `Authors` exist in my schema.
I also get the following error, by the way:
`Couldn't run sql: exceptions must derive from BaseException`
Thanks for your help!
-
I tried the DDL as you suggested: same issues as in the first trial above.
-
@pierreoberholzer did the training data make it in? Do you get results when you do
-
Good idea. It seems it at least received meaningful info (just showing the first 6 columns here).
-
@pierreoberholzer are those the actual names of your columns? If so, there's no way the LLM would be able to associate a column name with what it means semantically. With ambiguous or nonexistent column names, the best method is going to be training on example SQL statements, because the database schema alone doesn't carry enough information. If a human wouldn't be able to figure out how to translate "Where does John Deere live ?" into SQL based on the information in that image, then the LLM wouldn't be able to either. Try training on 3-4 sample SQL queries that you know work, then ask questions related to those queries. If that works, then since you're using BigQuery, you can extract your query history and loop over it to train on each statement.
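A minimal sketch of that advice, using Vanna's documented `vn.train(question=..., sql=...)` form. The questions and SQL below are hypothetical examples; replace them with pairs that actually run against your dataset.

```python
# Hypothetical known-good question/SQL pairs for seeding the vector store.
examples = [
    ("How many rows are in my_table?",
     "SELECT COUNT(*) FROM my_dataset.my_table"),
    ("Which authors live in Zurich?",
     "SELECT author_name FROM my_dataset.my_table WHERE residence = 'Zurich'"),
]

# With a configured Vanna instance:
# for question, sql in examples:
#     vn.train(question=question, sql=sql)
#
# And for a BigQuery query-history dataframe, something like:
# for sql in history_df["query"]:
#     vn.train(sql=sql)
```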
-
Those are dummy column names; the real ones have some semantic meaning.
Beta Was this translation helpful? Give feedback.
-
Still investigating, but it seems that I am reaching some limit.
Indeed, the query itself is very short:
-
@pierreoberholzer it picks the 10 most relevant pieces of each type of training data (DDL, documentation, question/SQL pairs) and adds them to the context. It uses a simple heuristic to estimate tokens so as not to overfill the context window. However, it seems like your individual pieces of training data may be quite large? What's the string length of the content? Which OpenAI model are you using, by the way? You likely won't reach context token limits if you use
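To check whether individual training items are unusually large, a rough character-based token estimate could look like the sketch below. This is a hypothetical heuristic for inspection, not Vanna's actual accounting; the `get_training_data()` call in the comment is part of Vanna's documented interface.

```python
def approx_tokens(text: str) -> int:
    """Rough estimate: roughly 4 characters per token for English/SQL text."""
    return max(1, len(text) // 4)

# With a configured instance, inspect the stored items' sizes, e.g.:
# training_df = vn.get_training_data()
# print(training_df["content"].str.len().describe())
print(approx_tokens("SELECT residence FROM Authors WHERE name = 'John Deere';"))
```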
-
Thanks - Using
-
@pierreoberholzer you can either delete the sqlite database that Chroma creates and start again or you can go
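Resetting through the public API instead of deleting Chroma's sqlite file could be sketched as follows, assuming `get_training_data()` returns a DataFrame with an `id` column and `remove_training_data(id=...)` is available (both are in Vanna's documented interface):

```python
def reset_training_data(vn) -> int:
    """Remove every stored training item; returns how many were removed."""
    ids = list(vn.get_training_data()["id"])
    for item_id in ids:
        vn.remove_training_data(id=item_id)
    return len(ids)

# With a configured instance:
# reset_training_data(vn)   # then re-run vn.train(...) from a clean slate
```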
-
Cool. This helps define the experiment better.
-
@pierreoberholzer if you're doing a formal test, would you be able to kindly share the results?
-
Sure, if I get to that point.
-
I did some testing and must say that the tool seems very promising, even with the little context given (only one SQL query, as discussed above), and it offers a friendly API. Great work! Still, many SQL queries fail.
Looking forward to hearing if/how those points can or will be addressed.