Conversation

@dnwpark (Contributor) commented Jul 7, 2025

Related #8740

Allows defining AI embedding indexes using a URI:

  type Astronomy {
    content: str;
    deferred index ext::ai::index(embedding_model := 'openai:text-embedding-3-small')
      on (.content);
  }

When parsing a URI, the current schema is checked for a matching model type. If none exists, a reference JSON is consulted and a new type is created if necessary.

For example, if text-embedding-3-small did not exist in the schema, the type __ext_generated_types__::ai_embedding_openai_text-embedding-3-small with the appropriate annotations would be generated. This type is automatically cleaned up when the index is deleted.
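The URI-to-generated-type naming scheme described above can be sketched as follows. This is a simplified illustration only; the helper names and the `GENERATED_MODULE` constant are assumptions, not the PR's actual code:

```python
# Hypothetical sketch of mapping 'provider:model' URIs to generated
# schema type names; not the PR's implementation.
GENERATED_MODULE = '__ext_generated_types__'


def parse_model_uri(uri: str) -> tuple[str, str]:
    """Split a 'provider:model' URI into its two parts."""
    provider, sep, model = uri.partition(':')
    if not sep:
        raise ValueError(f'invalid embedding model URI: {uri!r}')
    return provider, model


def generated_type_name(uri: str) -> str:
    """Derive the generated schema type name for a model URI."""
    provider, model = parse_model_uri(uri)
    return f'{GENERATED_MODULE}::ai_embedding_{provider}_{model}'
```

For example, `generated_type_name('openai:text-embedding-3-small')` yields the `__ext_generated_types__::ai_embedding_openai_text-embedding-3-small` name mentioned above.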

@dnwpark dnwpark force-pushed the ai-json-reference branch 6 times, most recently from 4b9df4c to 5f8e00b Compare July 8, 2025 21:33
@dnwpark dnwpark marked this pull request as ready for review July 8, 2025 22:56
@dnwpark dnwpark requested review from aljazerzen, anbuzin and elprans July 8, 2025 22:56
)

# If this is a generated AI index, track the created type
generated_ai_model_type = so.SchemaField(
Member:

Is this just to keep the reference? I wonder if you can formulate the generated ref injection via find_extra_refs callback in compile_expr_field instead?

Contributor Author:

Ah, I am not actually compiling anything into an expression, so I'm not sure that would work.

First, I manually create a qlast.CreateObjectType for the new model type:

model_ast = qlast.CreateObjectType(

I compile it:

gel/edb/schema/indexes.py

Lines 1335 to 1339 in 5f8e00b

model_cmd = sd.compile_ddl(
schema,
model_ast,
)
model_cmd.set_attribute_value('is_schema_generated', True)

Then, back in IndexCommand._handle_ai_index_op, I make it a prerequisite of the CreateIndex/AlterIndex command:

self.add_prerequisite(model_cmd)

However, to get the typeshell of the type produced by the command, I compile the command a second time and actually apply it to the schema:

gel/edb/schema/indexes.py

Lines 1341 to 1355 in 5f8e00b

# Doing this since using model_cmd or a copy doesn't work
dummy_model_cmd = sd.compile_ddl(
schema,
model_ast,
)
dummy_model_cmd.set_attribute_value('is_schema_generated', True)
new_schema = dummy_model_cmd.apply(schema, context)
model_type = new_schema.get(model_typename, None, type=s_types.Type)
assert model_type is not None
model_typeshell = so.ObjectShell(
name=model_typename,
schemaclass=type(model_type),
)
return model_cmd, model_typeshell

Comment on lines +105 to +109
is_schema_generated = so.SchemaField(
bool,
default=False,
compcoef=0.0,
)
Contributor:

We have similar fields from_alias and from_global. Maybe they can be reused?

Contributor Author:

from_alias has a lot of associated logic around links and views, so I thought it best to avoid overloading it with more meanings. @msullivan will hopefully have more insight.

@aljazerzen (Contributor) left a comment:

I understand about half of this, but I also don't really understand what it does, even after reading the tests. From what I gather, it allows using AI models from the JSON file in the user schema?

I can spend more time here, but I'd need an explanation.

@dnwpark (Contributor Author) commented Jul 11, 2025

@aljazerzen The issue (#8740) should explain it. If there is anything in particular you don't understand, please ask! I will also add some more comments.

@anbuzin (Contributor) commented Jul 11, 2025

URI selection seems to work great interface-wise. The LSP is not happy though:

undefined embedding model: no subtype of ext::ai::EmbeddingModel is annotated as 'openai:text-embedding-3-small'

There were some issues with the Ollama implementation, but that's not the PR to talk about them, so I'll do more research and file them separately.

@dnwpark (Contributor Author) commented Jul 11, 2025

@anbuzin What sort of issues were you having with Ollama? Right now, not all models are in the reference JSON yet, so that might be it.

@anbuzin (Contributor) commented Jul 11, 2025

@dnwpark It was more that you have to insert a config with an empty string for a secret. Also, I couldn't get the RAG to work with llama3.2, but I haven't had a chance to figure out exactly why.

@dnwpark (Contributor Author) commented Jul 11, 2025

@anbuzin Ah yeah, you need to configure the provider no matter what. In theory, you could override api_url to have Ollama run on a separate server.
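For reference, that provider configuration looks roughly like this. This is a hedged sketch: the exact config type name (`ext::ai::OllamaProviderConfig`) and field names should be checked against the extension's documentation:

```edgeql
configure current branch
insert ext::ai::OllamaProviderConfig {
    # Ollama does not need a real secret, but the field must be set
    secret := '',
    # override the default to point at a remote Ollama server
    api_url := 'http://ollama-host:11434/api',
};
```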

)


class TestExtAIDDL(tb_server.DDLTestCase):
Member:

We need tests where we add new entries to the manifest, after having done the initial creation.

We should probably only check the file once per server startup (maybe? or, eventually, maybe once per 24 hours or something).

Since this version doesn't support HTTP loading anyway, I think the testing approach should be to support passing the file name via an environment variable. Then we can create a temp test dir, start the server in that directory with start_edgedb_server, do some work, stop it, populate a new updated file, start the server pointing at that instead, and then do the rest of the testing.

Does that make sense?
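The test flow proposed above can be sketched like this. The environment variable name (`GEL_AI_MODEL_MANIFEST`) and the manifest shape are assumptions for illustration, and the server start/stop steps are elided:

```python
# Minimal sketch of env-var-driven manifest loading for tests;
# GEL_AI_MODEL_MANIFEST is a hypothetical variable name.
import json
import os
import pathlib
import tempfile


def load_manifest() -> dict:
    """Load the model reference JSON from the path in the env var."""
    path = pathlib.Path(os.environ['GEL_AI_MODEL_MANIFEST'])
    return json.loads(path.read_text())


# In a test: write an initial manifest, "start" the server against it,
# later write an updated file and "restart" to pick up the new entries.
with tempfile.TemporaryDirectory() as tmp:
    manifest = pathlib.Path(tmp) / 'models.json'
    manifest.write_text(json.dumps(
        {'text-embedding-3-small': {'model_provider': 'openai'}}))
    os.environ['GEL_AI_MODEL_MANIFEST'] = str(manifest)
    assert 'text-embedding-3-small' in load_manifest()
```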

@dnwpark dnwpark force-pushed the ai-json-reference branch 3 times, most recently from fb65361 to 12717db Compare July 23, 2025 02:29
@dnwpark dnwpark force-pushed the ai-json-reference branch from 12717db to ffbe647 Compare July 23, 2025 02:37
@msullivan (Member) left a comment:

I've concluded that I don't like this approach. I think that this will be cleaner if we take a "JSON-forward" approach, where instead of driving the extension worker based on querying the introspection schema, we drive it based on an in-memory data structure driven by the JSON. It should be pretty easy to produce a data structure for the db-configured stuff with an introspection query; probably we could produce the JSON directly using json_object_pack, if we wanted...

One thing that we will need to do either in that new approach or if we wanted to continue this approach, is to cache the last downloaded version of the file. Otherwise, if the fetch fails on server startup or whatever, there will be a degradation of service. In the JSON-forward approach it would be worse, but even in this schema-based approach it would prevent running migration create.

Mapping[tuple[sn.Name, Optional[str]], uuid.UUID]
]=None,
compat_ver: Optional[verutils.Version] = None,
reference_paths: Optional[Mapping[str, pathlib.Path]] = None
Member:

I think we should pass this in as a mapping to the data. Or if we want to generalize things further, to some abstract interface type for retrieving data?

@msullivan (Member) commented Jul 25, 2025

Hm... let me dig into something a bit more. I had forgotten that we actually copy the annotations onto the indexes themselves... but I can't remember why?

Edit: Ah, because it's used to make the data more easily accessible in delta_ext_ai.py and in the introspection queries?

gel/edb/schema/indexes.py

Lines 1468 to 1470 in fe28a9f

# Copy ext::ai:: annotations declared on the model specified
# by the `embedding_model` kwarg. This is necessary to avoid
# expensive lookups later where the index is used.

I don't really believe that the later lookups would be particularly expensive without it, though.

@msullivan (Member):
Here's a query that I think will generate the exact same JSON from introspection:

WITH ai_objs := (
  SELECT schema::ObjectType FILTER 'ext::ai::EmbeddingModel' IN .ancestors.name
)
SELECT json_object_pack(
  FOR obj IN ai_objs
  FOR j IN json_object_pack(
    FOR a IN obj.annotations
    SELECT (
        re_replace('^embedding_model_', '',
          re_replace('^ext::ai::', '', a.name),
        ),
        if a.name like 'ext::ai::model_%'
        then <json>a@value else to_json(a@value)
    )
    FILTER a.name LIKE 'ext::ai::%'
  )
  SELECT (<str>j['model_name'], j)
);

Though maybe you'd be better off just doing

SELECT schema::ObjectType {
    annotations: {name, @value},
}
FILTER 'ext::ai::EmbeddingModel' IN .ancestors.name;

and postprocessing in python :)
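That Python postprocessing step could look roughly like this. The input shape mirrors the introspection query above (objects with `annotations: {name, @value}`), and the string/JSON split follows the `model_%` branch of the query; nothing here is the PR's actual code:

```python
# Sketch: turn introspected annotations into the manifest-style dict
# keyed by model name; an assumption-level illustration only.
import json


def annotations_to_manifest(objects: list[dict]) -> dict:
    manifest = {}
    for obj in objects:
        entry = {}
        for ann in obj['annotations']:
            name = ann['name']
            if not name.startswith('ext::ai::'):
                continue
            key = (name.removeprefix('ext::ai::')
                       .removeprefix('embedding_model_'))
            # model_* annotations hold plain strings; the rest are
            # JSON-encoded scalars (e.g. token limits).
            entry[key] = (ann['value'] if key.startswith('model_')
                          else json.loads(ann['value']))
        manifest[entry['model_name']] = entry
    return manifest
```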

6 participants