Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support prepared statements and parameters #21

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vhiairrassary
Copy link
Collaborator

@vhiairrassary vhiairrassary commented Dec 15, 2024

Hello,

I need to support untrusted inputs so I have created this PR as a starting point to see if you would be interested to have this feature merged upstream, and if yes, to discuss about the details.

How it works

  • If there is no parameter then execution did not change
  • If there is at least one parameter then I prepare a statement, extract the named values (this PoC does not support positional parameters, but it can be easily done. As a personal note I find them confusing, and even saw they are a syntactic sugar for named parameters under the hood) and execute the prepared statement

How to test it

It can be tested using:

make

DUCKDB_HTTPSERVER_DEBUG=1 \
DUCKDB_HTTPSERVER_FOREGROUND=1 \
duckdb -unsigned \
  -c "FORCE INSTALL httpserver FROM './build/release/repository';" \
  -c "LOAD httpserver;" \
  -c "SELECT 890;" \
  -c "SELECT httpserve_start('0.0.0.0', 4000, '');"

and

curl -X POST -d 'SELECT typeof($ABC), $abc, typeof($DEF), $def' -g 'http://localhost:4000?parameters={"abc":{"type":"TEXT","value":"7"},"def":{"type":"BOOLEAN","value":true}}'
# {"typeof($ABC)":"VARCHAR","$abc":"7","typeof($DEF)":"BOOLEAN","$def":"true"}

Questions/notes

  • I am relying on exceptions to split the code in separated functions and make it easier to read (see the refactored CheckAuthentication and ExtractFormat functions for example). They are not on the happy path and should not impact performances (assuming the database is not publicly available, which sounds reasonable)
  • I tried to follow what is done by Snowflake
  • For the PoC I expect the query's parameters to be a JSON string in the HTTP parameter parameter. This sounds weird and I would be happy to move all the parameters (format, query/q and parameters) inside a single JSON body. Wdyt? We could either:
    • keep the GET (with format and query/q), the POST (with format and query/q) and POST with a JSON body (with format and query/q and parameters)
    • or keep the GET as above and unify both POST with a JSON body (with format and query/q and parameters), but it would be a breaking change
  • I am not sure if I need to do something to drop the prepared statement (In SQL there is an explicit DEALLOCATE operation)

@lmangani
Copy link
Collaborator

lmangani commented Dec 15, 2024

Wow thanks for taking the time to assemble this. It seems very cool and I'm all for it! I need to go through the changes and try it out but as long as existing API functionality is not broken (we have a few existing integrations) and once you're happy with it being feature complete, its a go and I'm happy to assist with anything useful. I'm curious what is the use-case for this?

@lmangani
Copy link
Collaborator

lmangani commented Dec 15, 2024

Well, I like this a lot and it I think its great work! If you're willing to co-maintain this code and what it might bring best effort, I'm happy to add you as co-maintainer and merge the feature once you think its ready! Welcome to #quackscience 🎉 @vhiairrassary

@vhiairrassary
Copy link
Collaborator Author

I'm curious what is the use-case for this?

Preventing SQL injections, for example when running a SQL query filtered on a specific value sent by a frontend application.

Thank you for your feedbacks 🙇 I will take to polish and prepare the PR for a review

@vhiairrassary vhiairrassary force-pushed the vhiairrassary/support-bind-variables branch from ac93207 to 43152cb Compare December 16, 2024 20:10
@vhiairrassary vhiairrassary marked this pull request as draft December 16, 2024 20:37
@vhiairrassary vhiairrassary force-pushed the vhiairrassary/support-bind-variables branch 5 times, most recently from 8310c77 to f803e12 Compare December 17, 2024 20:49
@vhiairrassary
Copy link
Collaborator Author

vhiairrassary commented Dec 17, 2024

@lmangani 👋 I am continuing to work on this PR, and I have modularised the code in several C++ units (and started to added unit tests to avoid regressions and streamline futures changes). Would you be Ok to merge this kind of changes? I would like to avoid working on something that would not be acceptable for the maintainers.

@lmangani
Copy link
Collaborator

lmangani commented Dec 17, 2024

@vhiairrassary by merging this PR you officially become a co-maintainer so I guess it depends on your will to look after it 😄 this said, your work is much appreciated and benefits future contributors too, so it's surely welcome without objections

@vhiairrassary vhiairrassary force-pushed the vhiairrassary/support-bind-variables branch from f803e12 to c3d2472 Compare December 17, 2024 22:38
@vhiairrassary vhiairrassary force-pushed the vhiairrassary/support-bind-variables branch from 3c1f8f2 to 88d25a3 Compare December 17, 2024 23:08
@mskyttner
Copy link

When I saw that you have this feature in the works, I was thinking that it would be nice to be able to POST some update or insert along with some change dataset. Consider for example a transaction to update a table with some fresh data that is sent to the server. It could be attempting to do something like below

begin transaction;
create temp table staging as from read_parquet('s3://updates/latest_delta.parquet');
insert or replace into my_main_table by name select distinct(*) from staging order by identifier;
drop table staging;
commit;

and imagine that the latest delta could be posted to the server as a "parameter".

I guess I might be blurring the lines here as this is not exactly a single "prepared statement" and therefore perhaps not relevant here, but the API accepting such a set of statements representing a "parameterized web transaction" would appear to be quite powerful...

The ws4sql fork which reminds me both of the quackpipe and the httpserver (you can find it at github on proofrock/ws4sqlite/tree/fork/ws4sql) has a similar feature where a "transaction" possibly containing several statements can be defined and bound to variables. There is an example of how the JSON posted looks like there and such a "web transaction" could also contain a final query (I guess returning results of relevance after the transactional statements have succeeded).

auto bodyJsonCStr = bodyJson.c_str();
bodyDoc = yyjson_read(bodyJsonCStr, strlen(bodyJsonCStr), 0);

return ExtractQueryApiParametersComplexImpl(bodyDoc);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bodyDoc is never freed

Copy link
Collaborator

@lmangani lmangani Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong but possibly json_line as well?

diff --git a/src/http_handler/handler.cpp b/src/http_handler/handler.cpp
index 88d25a3..e5f7a8b 100644
--- a/src/http_handler/handler.cpp
+++ b/src/http_handler/handler.cpp
@@ -179,6 +179,7 @@ QueryApiParameters ExtractQueryApiParametersComplex(const duckdb_httplib_openssl
         return ExtractQueryApiParametersComplexImpl(bodyDoc);
     }
     catch (const std::exception& exception) {
+        yyjson_doc_free(bodyDoc);
         throw;
     }
 }
@@ -217,6 +218,7 @@ QueryApiParameters ExtractQueryApiParametersComplexImpl(yyjson_doc* bodyDoc) {
 
 static std::string ConvertResultToNDJSON(MaterializedQueryResult &result) {
     std::string ndjson_output;
+    yyjson_mut_doc *doc = nullptr;

     for (idx_t row = 0; row < result.RowCount(); ++row) {
         // Create a new JSON document for each row
@@ -226,6 +228,8 @@ static std::string ConvertResultToNDJSON(MaterializedQueryResult &result) {
         yyjson_mut_doc_set_root(doc, root);
 
         for (idx_t col = 0; col < result.ColumnCount(); ++col) {
+            // Ensure doc is freed on each iteration
+            yyjson_mut_doc_free(doc);
             Value value = result.GetValue(col, row);
             const char* column_name = result.ColumnName(col).c_str();
 
@@ -246,6 +250,7 @@ static std::string ConvertResultToNDJSON(MaterializedQueryResult &result) {
 
         ndjson_output += json_line;
         ndjson_output += "\n";
+        free(json_line);
     }
     return ndjson_output;
 }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vhiairrassary shall I apply the patches or you prefer doing so?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 @vhiairrassary just checking if still interested in this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants