feat: Support Importing Precomputed Embeddings #199

Merged
merged 19 commits into from
Mar 15, 2025

Conversation

Ayushjhawar8
Contributor

/claim #149
closes #149

@Ayushjhawar8
Contributor Author

Ayushjhawar8 commented Feb 13, 2025

Hey @ChuckHend! I'm a first-time contributor to Tembo. Could you please review my implementation for this issue?
Thanks! 😊

@ChuckHend
Owner

ChuckHend commented Feb 13, 2025

Hey @Ayushjhawar8, thank you for the PR. The title of this PR says "Add document chunking capability to vectorize.table()", but the changes and bounty look more closely related to "Bring your own embeddings?"

@Ayushjhawar8 Ayushjhawar8 changed the title feat: Add document chunking capability to vectorize.table() feat: Support Importing Precomputed Embeddings Feb 14, 2025
@Ayushjhawar8
Contributor Author

Hi @ChuckHend. Sorry for the bad PR title; I've changed it. Here is the idea behind the changes: I implemented the import_embeddings feature to let users bring their precomputed embeddings directly into pg_vectorize. I designed it to work with both the join and append table methods to maintain compatibility with existing workflows. I built it with users in mind who already use Sentence-Transformers or similar models and want to take advantage of pg_vectorize's management capabilities without redoing their embedding work. I also documented it in the API docs and added comprehensive tests to verify it works as intended across different scenarios.
Please review the changes and tell me if this is helpful :)

@ChuckHend
Owner

ChuckHend commented Feb 25, 2025

Don't worry about any tests failing because of missing API keys or permissions, such as the two below.

    test_cohere
    test_private_hf_model

Although I think this one, and some others, need a fix?

@Ayushjhawar8
Contributor Author

okay I will be working to fix that

@Ayushjhawar8
Contributor Author

Ayushjhawar8 commented Feb 26, 2025

Hey @ChuckHend, could you please check? I tried to fix some of them.

Hmm, fewer errors now =)

@ChuckHend
Owner

@Ayushjhawar8 , I left a few comments. This is coming along nicely!

@Ayushjhawar8
Contributor Author

Ayushjhawar8 commented Mar 4, 2025

Hey @ChuckHend, thank you for your review. I tried to implement all your suggestions:

  1. For the borrow issue, I now clone the columns vector when passing it to init_table.
  2. For the Spi::select issue, I've completely rewritten the approach to use bulk SQL operations.

Hopefully this fixes the test. Please review it further if needed.
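
The borrow fix described in point 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `init_table` signature here is a hypothetical stand-in that takes ownership of the column list, and `setup_job` is an invented caller.

```rust
/// Hypothetical stand-in for an `init_table`-style function that takes
/// ownership of the column list. The real signature is not shown in this thread.
fn init_table(job_name: &str, columns: Vec<String>) -> String {
    format!("{}: {}", job_name, columns.join(", "))
}

/// Hypothetical caller that still needs `columns` after calling `init_table`.
fn setup_job(job_name: &str, columns: Vec<String>) -> (String, usize) {
    // Without `.clone()`, `columns` would be moved into `init_table` and the
    // later `columns.len()` would fail to compile ("borrow of moved value").
    let summary = init_table(job_name, columns.clone());
    (summary, columns.len())
}
```

Cloning costs one extra allocation of the column names. If `init_table` only needs to read the list, changing it to accept `&[String]` would avoid the copy, but that depends on how the real function uses its argument.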

Thanks :)

@Ayushjhawar8
Contributor Author

Ayushjhawar8 commented Mar 7, 2025

Hmm, why is the install dependency step failing now?

Also, it's only the format issue and the integration test left now? 😃

@ChuckHend
Owner

Looks like some things in our CI environment need to be updated. I'll fix that today.

@ChuckHend
Owner

Getting closer! Remove those unused variable declarations and resolve this test error:

thread 'test_import_embeddings' panicked at tests/integration_tests.rs:1144:6:
failed to insert test data: Database(PgDatabaseError { severity: Error, code: "22P02", message: "invalid input syntax for type vector: \"[0.1, 0.2, 0.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.0]\"", detail: None, hint: None, position: Some(Original(98)), where: None, schema: None, table: None, column: None, data_type: None, constraint: None, file: Some("vector.c"), line: Some(243), routine: Some("vector_in") })
FAILED
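
The rejected literal above runs its zero padding together as `0.00.00.0...`, which suggests the test concatenated values without a separator. A minimal sketch of building a well-formed vector literal is shown below; both helper names are hypothetical and are not taken from the PR.

```rust
/// Formats a slice of floats as a bracketed, comma-separated vector literal.
fn to_vector_literal(values: &[f32]) -> String {
    // Convert each value on its own, then join with commas so that adjacent
    // numbers can never run together.
    let parts: Vec<String> = values.iter().map(|v| v.to_string()).collect();
    format!("[{}]", parts.join(","))
}

/// Hypothetical test helper: a few leading values followed by zero padding
/// up to `dim` entries (never truncating the leading values).
fn padded_embedding(head: &[f32], dim: usize) -> Vec<f32> {
    let mut v = head.to_vec();
    v.resize(dim.max(head.len()), 0.0);
    v
}
```

Building the padded `Vec<f32>` first and formatting it in one place keeps the separator logic out of the test body, which is where the original string appears to have gone wrong.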

@Ayushjhawar8
Contributor Author

Nice, sounds good, I will try to fix both 👍🏻

@Ayushjhawar8
Contributor Author

Ayushjhawar8 commented Mar 11, 2025

Hmm, that test is still failing. Let me see how I can work around it 🤔

@ChuckHend
Owner

ChuckHend commented Mar 12, 2025

@Ayushjhawar8 , I pushed a few changes in 53c2ace

  • removed unused variables
  • refactored handling of that count variable so the compiler would know how we're using it
  • added quotes around the "table" parameter in vectorize.table(). This was a bad choice on the API design on our part, but we'll be correcting it soon. Sorry for having to deal with that.
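
The quoting in the last bullet is needed because `table` is a reserved word in PostgreSQL, so it has to be double-quoted when used as a named argument. A rough sketch of how a test might build such a call is below. The helper is hypothetical, and the parameter names other than `"table"` (`job_name`, `primary_key`, `columns`) are assumptions about the `vectorize.table()` signature rather than something this thread confirms.

```rust
/// Hypothetical helper that builds a `vectorize.table()` call using
/// named-argument notation. Only the quoted "table" parameter is confirmed by
/// this thread; the other parameter names are assumptions.
fn vectorize_table_sql(job_name: &str, table: &str, primary_key: &str, columns: &[&str]) -> String {
    // Render the column list as a SQL array literal: ARRAY['a', 'b'].
    let cols: Vec<String> = columns.iter().map(|c| format!("'{}'", c)).collect();
    format!(
        "SELECT vectorize.table(job_name => '{}', \"table\" => '{}', primary_key => '{}', columns => ARRAY[{}]);",
        job_name,
        table,
        primary_key,
        cols.join(", ")
    )
}
```

String interpolation like this is only reasonable for fixed test values; anything user-supplied should go through bound query parameters instead.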

Owner

@ChuckHend ChuckHend left a comment

I think all that remains is a test to cover vectorize.table_from()? Looks like import_embeddings() is functioning as intended now.

@Ayushjhawar8
Contributor Author

wow thanks for the improvements! That makes things much clearer. Appreciate the help! 💖

@Ayushjhawar8
Contributor Author

Ayushjhawar8 commented Mar 12, 2025

Love to hear that, I'm on the test 🫡

@Ayushjhawar8
Contributor Author

Ayushjhawar8 commented Mar 12, 2025

thread 'test_table_from' panicked at tests/integration_tests.rs:1390:6:
failed to create table from embeddings: Database(PgDatabaseError { severity: Error, code: "22023", message: "invalid schedule: manual", detail: None, hint: Some("Use cron format (e.g. 5 4 * * *), or interval format '[1-59] seconds'"), position: None, where: Some("SQL statement "\n SELECT cron.schedule(\n 'table_from_test_68743',\n 'manual',\n $$select vectorize.job_execute('table_from_test_68743')$$\n )\n ;""), schema: None, table: None, column: None, data_type: None, constraint: None, file: Some("job_metadata.c"), line: Some(225), routine: Some("ScheduleCronJob") })

I'm not sure what this error is about. We are using the realtime schedule, though?

@ChuckHend
Owner

I think it's happening during the init_table() call earlier in from_table(). I think init_table() will need to be modified so that it can accept a value of "manual", in which case it will not init the cron job or table triggers.
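
One way to picture the suggested change is a small decision function that maps the schedule value to the setup steps to run. This is only a sketch under assumptions: the `Schedule` enum, `InitPlan` struct, and `plan_init` function are hypothetical names, and the behavior shown for "realtime" (triggers, no cron job) is an assumption rather than something stated in this thread.

```rust
/// Hypothetical model of the schedule values mentioned in this thread.
#[derive(Debug, PartialEq)]
enum Schedule {
    Manual,
    Realtime,
    Cron(String),
}

impl Schedule {
    /// Anything that is not "manual" or "realtime" is treated as a cron expression.
    fn parse(s: &str) -> Schedule {
        match s {
            "manual" => Schedule::Manual,
            "realtime" => Schedule::Realtime,
            other => Schedule::Cron(other.to_string()),
        }
    }
}

/// Which setup steps an `init_table`-style function should perform.
#[derive(Debug, PartialEq)]
struct InitPlan {
    create_cron_job: bool,
    create_triggers: bool,
}

fn plan_init(schedule: &Schedule) -> InitPlan {
    match schedule {
        // "manual": skip both, so "manual" is never passed to cron.schedule().
        Schedule::Manual => InitPlan { create_cron_job: false, create_triggers: false },
        // Assumption: realtime jobs rely on table triggers instead of cron.
        Schedule::Realtime => InitPlan { create_cron_job: false, create_triggers: true },
        Schedule::Cron(_) => InitPlan { create_cron_job: true, create_triggers: false },
    }
}
```

With a guard like this, the "invalid schedule: manual" error from `cron.schedule()` shown earlier would not be reachable, because the cron call is skipped whenever the schedule is "manual".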

Owner

@ChuckHend ChuckHend left a comment

awesome, nice work, thank you!

@ChuckHend ChuckHend merged commit 534afcd into ChuckHend:main Mar 15, 2025
4 of 9 checks passed
Successfully merging this pull request may close these issues.

bring your own embeddings
2 participants