Columns passed in RAG seem ignored in generated SQL #194
Replies: 14 comments 3 replies
-
Try describing your table using create table syntax as a ddl?
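A minimal sketch of that suggestion: build a `CREATE TABLE` statement from the `INFORMATION_SCHEMA.COLUMNS` dataframe you already have, then feed it to `vn.train(ddl=...)`. This assumes the dataframe has the standard `table_name`, `column_name`, and `data_type` columns; the table and column names below are hypothetical placeholders.

```python
import pandas as pd

def ddl_from_information_schema(df: pd.DataFrame) -> str:
    """Build a CREATE TABLE statement from INFORMATION_SCHEMA.COLUMNS rows."""
    cols = ",\n  ".join(
        f"{row.column_name} {row.data_type}" for row in df.itertuples()
    )
    return f"CREATE TABLE {df.table_name.iloc[0]} (\n  {cols}\n)"

# Hypothetical example with two columns:
df = pd.DataFrame({
    "table_name": ["my_table", "my_table"],
    "column_name": ["author_name", "residence"],
    "data_type": ["STRING", "STRING"],
})
ddl = ddl_from_information_schema(df)
# vn.train(ddl=ddl)  # requires a configured Vanna instance
```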
…On Thu, 25 Jan 2024, 21:24 Pierre Oberholzer wrote:
Hi,
I'm trying to generate a SQL query that uses the valid column names
passed to the RAG via INFORMATION_SCHEMA.COLUMNS. However, the obtained
query does not mention any real column, and therefore fails on the DB.
```python
import vanna
from vanna.remote import VannaDefault
from vanna.openai.openai_chat import OpenAI_Chat
from vanna.chromadb.chromadb_vector import ChromaDB_VectorStore

# Globals
PROJECT_ID = "my_gcp_project"
DATASET_ID = "my_dataset"
TABLE_NAME = "my_table"
OPENAI_API_KEY = "sk-xxxxx"

# Class instantiation
class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={'api_key': OPENAI_API_KEY, 'model': 'gpt-4'})
vn.connect_to_bigquery(project_id=PROJECT_ID)

# The query below works in the BigQuery console.
METADATA_QUERY = f"""
SELECT * FROM {PROJECT_ID}.{DATASET_ID}.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = '{TABLE_NAME}'
"""
df_information_schema = vn.run_sql(METADATA_QUERY)

# This breaks the information schema into bite-sized chunks that can
# be referenced by the LLM
plan = vn.get_training_plan_generic(df_information_schema)

# If you like the plan, then uncomment this and run it to train
vn.train(plan=plan)

vn.ask(question="Where does John Deere live ?")
```

Obtained query:

```sql
SELECT residence
FROM Authors
WHERE name = 'John Deere';
```
Neither the fields `residence` and `name` nor the table `Authors` exist in my schema.
I also get the following error, by the way:
`Couldn't run sql: exceptions must derive from BaseException`
Thanks for your help!
-
I tried the DDL as you suggested: same issues as in the first trial above.
-
@pierreoberholzer did the training data make it in? Do you get results when you do
-
Good idea. It seems it at least received meaningful info (just showing the first 6 columns here).
-
@pierreoberholzer are those the actual names of your columns? If so, there's no way the LLM would be able to associate a column name with what it means semantically. With ambiguous or nonexistent column names, the best method is going to be training on example SQL statements, because the database schema alone doesn't carry enough information. If a human wouldn't be able to figure out how to translate "Where does John Deere live ?" into SQL based on the information in that image, then the LLM wouldn't be able to either. Try training on 3-4 sample SQL queries that you know work, then ask questions related to those queries. If that works, then since you're using BigQuery, you can extract your query history and loop over it to train on each statement.
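A minimal sketch of that advice, using Vanna's documented `vn.train(question=..., sql=...)` form. The questions and SQL below are hypothetical examples; replace them with pairs that actually run against your dataset.

```python
# Hypothetical known-good question/SQL pairs for seeding the vector store.
examples = [
    ("How many rows are in my_table?",
     "SELECT COUNT(*) FROM my_dataset.my_table"),
    ("Which authors live in Zurich?",
     "SELECT author_name FROM my_dataset.my_table WHERE residence = 'Zurich'"),
]

# With a configured Vanna instance:
# for question, sql in examples:
#     vn.train(question=question, sql=sql)
#
# And for a BigQuery query-history dataframe, something like:
# for sql in history_df["query"]:
#     vn.train(sql=sql)
```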
-
Those are dummy column names; the real ones have some semantic meaning.
Beta Was this translation helpful? Give feedback.
-
Still investigating, but it seems that I am reaching some limit.
Indeed, the query itself is very short:
-
@pierreoberholzer it picks the 10 most relevant pieces of each type of training data (DDL, documentation, question/SQL pairs) and adds them to the context. It uses a simple heuristic to estimate tokens so as not to overfill the context window. However, it seems like your individual pieces of training data may be quite large? What's the string length of the content? Which OpenAI model are you using, by the way? You likely won't reach context token limits if you use
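To check whether individual training items are unusually large, a rough character-based token estimate could look like the sketch below. This is a hypothetical heuristic for inspection, not Vanna's actual accounting; the `get_training_data()` call in the comment is part of Vanna's documented interface.

```python
def approx_tokens(text: str) -> int:
    """Rough estimate: roughly 4 characters per token for English/SQL text."""
    return max(1, len(text) // 4)

# With a configured instance, inspect the stored items' sizes, e.g.:
# training_df = vn.get_training_data()
# print(training_df["content"].str.len().describe())
print(approx_tokens("SELECT residence FROM Authors WHERE name = 'John Deere';"))
```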
-
Thanks - Using
-
@pierreoberholzer you can either delete the sqlite database that Chroma creates and start again or you can go
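Resetting through the public API instead of deleting Chroma's sqlite file could be sketched as follows, assuming `get_training_data()` returns a DataFrame with an `id` column and `remove_training_data(id=...)` is available (both are in Vanna's documented interface):

```python
def reset_training_data(vn) -> int:
    """Remove every stored training item; returns how many were removed."""
    ids = list(vn.get_training_data()["id"])
    for item_id in ids:
        vn.remove_training_data(id=item_id)
    return len(ids)

# With a configured instance:
# reset_training_data(vn)   # then re-run vn.train(...) from a clean slate
```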
-
Cool. This helps define the experiment better.
-
@pierreoberholzer if you're doing a formal test, would you be able to kindly share the results?
-
Sure, if I get to that point.
-
I did some testing and must say that the tool seems very promising, even with the little context given (only one SQL query, as discussed above), and it offers a friendly API. Great work! Still, many SQL queries fail.
Looking forward to hearing if/how those points can or will be addressed.