pyoso autocomplete #3249

ryscheng · 2025-03-12T02:45:54Z

What is it?

@Jabolol mind brain-dumping what you had thought about in Denver?

linear · 2025-03-12T02:45:56Z

Jabolol · 2025-03-13T20:01:11Z

Carl and I went through some different API designs when we were at the Gitcoin house in Denver, and we came to the following conclusions:

The best outcome is a fully typed pyoso package, though that has some limitations. IMO, the best API (call it option A) would be as follows:

from pyoso import Client

client = Client(API_KEY)

# example:
#
# +-----------+-----------+
# | column_0  | column_1  |
# +-----------+-----------+
# | value     | data_1    |
# | value     | data_2    |
# | value     | data_3    |
# +-----------+-----------+
#

# create a virtual df, lazily evaluated
vdf = client.table("example").select(
    "column_0", # autocomplete based on "example" columns
    "column_1", # autocomplete ...
).where(lambda row: row["column_0"] == "value") # automatically typed to str

# modify columns with the vdf, do df logic with it

df = vdf.dispatch() # this builds the SQL and executes the query

print(df.head())

But this has a big implementation risk, since it is extremely complex. We need to handle all SQL keywords and give them semantic meaning, which is a really big task, even when just allowing a small subset of keywords.
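To make the risk concrete, here is a minimal sketch of how the lazy part of option A could work for a single comparison. All names in it (`Condition`, `ColumnRef`, `RowProxy`, `VirtualFrame`, `to_sql`) are hypothetical and are not part of pyoso. It only builds the SQL string, and does not execute anything the way `dispatch()` would.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """A recorded comparison such as column_0 = 'value' (hypothetical helper)."""

    column: str
    op: str
    value: object

    def to_sql(self) -> str:
        # Naive literal rendering, for illustration only.
        if isinstance(self.value, str):
            literal = "'" + self.value.replace("'", "''") + "'"
        else:
            literal = str(self.value)
        return f"{self.column} {self.op} {literal}"


class ColumnRef:
    """Stands in for row["name"]; comparing it records a Condition instead of a bool."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: object) -> Condition:  # type: ignore[override]
        return Condition(self.name, "=", other)


class RowProxy:
    """Passed to the user's lambda so the predicate is captured, not evaluated."""

    def __getitem__(self, name: str) -> ColumnRef:
        return ColumnRef(name)


class VirtualFrame:
    """Hypothetical lazy frame: it only accumulates the parts of a query."""

    def __init__(self, table: str, columns: tuple = (), conditions: tuple = ()) -> None:
        self.table = table
        self.columns = columns
        self.conditions = conditions

    def select(self, *columns: str) -> "VirtualFrame":
        return VirtualFrame(self.table, columns, self.conditions)

    def where(self, predicate) -> "VirtualFrame":
        # Call the lambda once with a proxy row to capture the comparison.
        condition = predicate(RowProxy())
        return VirtualFrame(self.table, self.columns, self.conditions + (condition,))

    def to_sql(self) -> str:
        cols = ", ".join(self.columns) or "*"
        sql = f"SELECT {cols} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(c.to_sql() for c in self.conditions)
        return sql


vdf = (
    VirtualFrame("example")
    .select("column_0", "column_1")
    .where(lambda row: row["column_0"] == "value")
)
print(vdf.to_sql())
```

Even this toy version shows where the work piles up: every operator needs its own overload, Python's `and`/`or` cannot be overloaded at all, and joins, aggregations and functions would each need a semantic model of their own.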

An intermediate approach (option B), which is still helpful, would be as follows:

from pyoso import Client

client = Client(API_KEY)

ctx, handle = client.table("example") # ctx has all the columns with type information

vdf = handle.query(
    f"""
        select {ctx.rows.column_0} from {ctx.table}
    """
)

# modify columns with the vdf, do df logic with it

df = vdf.dispatch() # this builds the SQL and executes the query

print(df.head())

This implementation is simpler. It provides typed access to all the columns, and possibly extra metadata, through the context (ctx) object. The issue is that string interpolation is not a silver bullet: there's no semantic check on the query, and things can get messy real quick. But it is a viable option, and it would save people from having to check the schema on BigQuery.
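As a rough illustration of what the context object could look like, here is a sketch with hypothetical names (`Column`, `ExampleColumns`, `ExampleContext`). None of these exist in pyoso, and the column types are assumed to be `str` as in the example table above. The point is that the editor can autocomplete `ctx.rows.` and a typo fails early, while the SQL around it stays an unchecked string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical column descriptor: a name plus the Python type of its values."""

    name: str
    python_type: type

    def __str__(self) -> str:
        # f-strings call format(), which falls back to str(), so the
        # column interpolates as its bare name.
        return self.name


class ExampleColumns:
    """Columns of the "example" table; in practice this would be generated."""

    column_0 = Column("column_0", str)
    column_1 = Column("column_1", str)


class ExampleContext:
    """Hypothetical ctx object for the "example" table."""

    table = "example"
    rows = ExampleColumns


ctx = ExampleContext()
query = f"select {ctx.rows.column_0} from {ctx.table}"
print(query)

# A misspelled column is caught by the type checker, and at runtime too.
try:
    ctx.rows.column_9  # type: ignore[attr-defined]
except AttributeError as err:
    print(f"caught: {err}")
```

Nothing here validates the query itself: `f"selec {ctx.rows.column_0} form {ctx.table}"` would type-check just as well, which is the "no semantic check" limitation.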

These two approaches have something in common: how do we actually add the typing support? In Python, there are files called stub files (.pyi), which are written in a subset of Python and give information to the type checker. An example hello.pyi file could be:

def add(a: int, b: int) -> int: ...
def greet(name: str) -> str: ...

This is really powerful, because by writing a small subset of python-ish code, we can add typing support.

I would divide the library into two parts: the types and the core functionality. The core functionality would be private (not exposed to the user, except for entry points such as the query method), stable, and in charge of all the virtual df handling, streams, authentication, etc. The typing part would possibly be generated by another module, or even by another programming language. It would fetch the schemas from our databases and build the .pyi files, which would be bundled with our library. One caveat, though, is that we'd need to re-publish our library every time the schemas change.
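The generation step could be as small as the sketch below. It assumes the schemas have already been fetched into a plain dict (the `SCHEMAS` value is made up to match the example table), and the names `class_name` and `render_stub` are hypothetical. It only renders the stub text; writing it next to the package and publishing it is left out.

```python
# Hypothetical schema snapshot: table name -> {column name -> Python type name}.
# In the real pipeline this would come from querying the databases.
SCHEMAS = {
    "example": {"column_0": "str", "column_1": "str"},
}


def class_name(table: str) -> str:
    """Turn a table name like "my_table" into "MyTableColumns"."""
    return "".join(part.capitalize() for part in table.split("_")) + "Columns"


def render_stub(schemas: dict) -> str:
    """Render the contents of a .pyi file with one class per table."""
    lines = []
    for table, columns in sorted(schemas.items()):
        lines.append(f"class {class_name(table)}:")
        if not columns:
            lines.append("    ...")  # keep the class body valid for empty tables
        for column, type_name in columns.items():
            lines.append(f"    {column}: {type_name}")
        lines.append("")
    return "\n".join(lines)


stub = render_stub(SCHEMAS)
print(stub)

# Stub files must be syntactically valid Python, so compile() is a cheap check.
compile(stub, "generated.pyi", "exec")
```

Because the output depends only on the schema snapshot, the same script could run in CI and trigger a new release whenever the rendered stubs differ from the published ones.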

In conclusion, in the long run I would like to see option A (the fully typed API) being the one we end up doing, but it is a big investment: we are a small team and it's a big project. Option B (the ctx-based approach) seems more feasible in the short to medium term, and would give our library acceptable typing.

ryscheng added this to the [c] pyoso fetches data milestone Mar 12, 2025

github-project-automation bot added this to OSO Mar 12, 2025

github-project-automation bot moved this to Backlog in OSO Mar 12, 2025
