create df in a loop ? #3109

djouallah · 2024-10-23T20:40:30Z

djouallah
Oct 23, 2024

There must be a better way than repeating the same call for every table?

lineitem = daft.read_parquet(f'{path}/{sf}/lineitem/*.parquet')
nation   = daft.read_parquet(f'{path}/{sf}/nation/*.parquet')
region   = daft.read_parquet(f'{path}/{sf}/region/*.parquet')
customer = daft.read_parquet(f'{path}/{sf}/customer/*.parquet')
supplier = daft.read_parquet(f'{path}/{sf}/supplier/*.parquet')
orders   = daft.read_parquet(f'{path}/{sf}/orders/*.parquet')
partsupp = daft.read_parquet(f'{path}/{sf}/partsupp/*.parquet')
part     = daft.read_parquet(f'{path}/{sf}/part/*.parquet')


jaychia · 2024-10-23T20:42:29Z

jaychia
Oct 23, 2024
Maintainer

Perhaps this?

TABLE_NAMES = ["lineitem", "nation", "region", "customer", "supplier", "orders", "partsupp", "part"]
table_dfs = {
    name: daft.read_parquet(f"{path}/{sf}/{name}/*.parquet") for name in TABLE_NAMES
}

0 replies

djouallah · 2024-10-23T20:48:06Z

djouallah
Oct 23, 2024
Author

that does not help much

import time

import daft

start = time.time()
TABLE_NAMES = ["lineitem", "nation", "region", "customer", "supplier", "orders", "partsupp", "part"]
table_dfs = {
    name: daft.read_parquet(f"{path}/{sf}/{name}/*.parquet") for name in TABLE_NAMES
}
stop = time.time()
external_table_duration = stop - start
# execute_query, sql, path and sf are defined elsewhere
df = execute_query(daft, sql)

I am getting Table not found: lineitem
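
One plausible reading of the error, stated as an assumption rather than confirmed Daft behaviour: if `daft.sql` (called without a `catalog`) resolves table names against DataFrame variables visible in the calling scope, then the dict comprehension leaves only `table_dfs` visible and there is no variable called `lineitem`. The toy resolver below is not Daft code; `resolve_table` and the placeholder strings are hypothetical and only illustrate the difference in name lookup.

```python
def resolve_table(name, scope):
    """Hypothetical stand-in for an engine looking up a table by name."""
    if name not in scope:
        raise LookupError(f"Table not found: {name}")
    return scope[name]


# Original style: one variable per table, so each name is visible directly.
scope_with_variables = {"lineitem": "<lineitem df>", "nation": "<nation df>"}

# Dict style: the only visible name is the dict itself.
scope_with_dict = {"table_dfs": {"lineitem": "<lineitem df>", "nation": "<nation df>"}}

found = resolve_table("lineitem", scope_with_variables)

try:
    resolve_table("lineitem", scope_with_dict)
    error_message = None
except LookupError as exc:
    error_message = str(exc)

# Handing the mapping over explicitly makes the name resolvable again.
found_explicit = resolve_table("lineitem", scope_with_dict["table_dfs"])
```

Under that assumption, the fix is to pass the name-to-DataFrame mapping to the engine explicitly instead of relying on variable names.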

0 replies

djouallah · 2024-10-23T20:50:02Z

djouallah
Oct 23, 2024
Author

In Polars, as an example, I can just do this:

import polars as pl
ctx = pl.SQLContext()
for tbl in ['lineitem', 'orders', 'nation', 'part', 'customer', 'partsupp', 'region', 'supplier']:
    ctx.register(tbl, pl.scan_parquet(f'./{sf}/{tbl}/*.parquet'))

0 replies

jaychia · 2024-10-23T20:52:39Z

jaychia
Oct 23, 2024
Maintainer

Oh! I see what you're trying to do now.

@universalmind303 can probably advise better, but you might want to use our SQLCatalog abstraction here.

from daft.sql import SQLCatalog

TABLE_NAMES = ["lineitem", "nation", "region", "customer", "supplier", "orders", "partsupp", "part"]
table_dfs = {
    name: daft.read_parquet(f"{path}/{sf}/{name}/*.parquet") for name in TABLE_NAMES
}
catalog = SQLCatalog(table_dfs)
daft.sql("SELECT * FROM lineitem", catalog=catalog)

Do bear with us as we work on a better catalog API. We have some proposals in the works here for unifying the story around iceberg/delta/HMS etc.
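
The SQLCatalog snippet above can be wrapped in a small helper so the table list and path layout live in one place. This is a minimal sketch, not an official recipe: `table_globs` and `build_catalog` are hypothetical names, the `/data/tpch` path and `sf10` scale factor are made up, and the `SQLCatalog(dict)` call is taken as quoted in this reply (it may differ in other Daft versions).

```python
TABLE_NAMES = ["lineitem", "nation", "region", "customer",
               "supplier", "orders", "partsupp", "part"]


def table_globs(path, sf, names=TABLE_NAMES):
    """Map each table name to its parquet glob, using the layout from this thread."""
    return {name: f"{path}/{sf}/{name}/*.parquet" for name in names}


def build_catalog(path, sf):
    """Read every table and wrap the DataFrames in a SQLCatalog.

    Daft is imported locally so table_globs works without Daft installed.
    """
    import daft
    from daft.sql import SQLCatalog  # API as quoted in this reply

    dfs = {name: daft.read_parquet(glob)
           for name, glob in table_globs(path, sf).items()}
    return SQLCatalog(dfs)


# Hypothetical path and scale factor, only to show the resulting patterns.
globs = table_globs("/data/tpch", "sf10")
```

A query would then look like `daft.sql("SELECT * FROM lineitem", catalog=build_catalog(path, sf))`.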

0 replies

djouallah · 2024-10-23T20:56:19Z

djouallah
Oct 23, 2024
Author

please can we have this

catalog = SQLCatalog(table_dfs)
daft.sql("SELECT * FROM lineitem")

or

catalog = SQLCatalog(table_dfs)
catalog.sql("SELECT * FROM lineitem")

3 replies

jaychia Oct 23, 2024
Maintainer

I'm trying to move towards #3036 which is more like:

daft.register_dataframe("lineitem", daft.read_parquet(...))

daft.sql("SELECT * FROM lineitem")

This also opens up opportunities for something like

daft.register_aws_glue(...)
df = daft.read_table("x.y.z")
daft.sql("SELECT * FROM x.y.z")

This lets us register entire catalog services and read tables directly from them.

Answer selected by djouallah

jaychia Oct 23, 2024
Maintainer

Let me know what you think!

jaychia Oct 23, 2024
Maintainer

Writing is a little more tricky to design an API around, so I think I'll design the read API first.

djouallah · 2024-10-23T21:01:04Z

djouallah
Oct 23, 2024
Author

that's perfect !!!

0 replies

djouallah · 2024-10-23T21:05:06Z

djouallah
Oct 23, 2024
Author

Can we register an Iceberg catalog too, please? Right now I load each table one by one:

scada       = daft.read_iceberg(catalog.load_table(db+".scada"))
price       = daft.read_iceberg(catalog.load_table(db+".price"))
duid        = daft.read_iceberg(catalog.load_table(db+".duid"))
calendar    = daft.read_iceberg(catalog.load_table(db+".calendar"))
scada.show()

2 replies

jaychia Oct 23, 2024
Maintainer

Yeah. The tricky part is that some catalogs can support many different types of tables. Here's an illustrative example.

daft.register_iceberg_rest(..., name="my_iceberg_rest_catalog")
daft.register_aws_glue(..., name="my_aws_glue_catalog")

df = daft.read_table("another.iceberg.table", catalog_name="my_iceberg_rest_catalog")

df = daft.read_table("an.iceberg.table", catalog_name="my_aws_glue_catalog")
df = daft.read_table("thisisa.hive.table", catalog_name="my_aws_glue_catalog")

In the AWS Glue case, we need to query AWS Glue, work out that the table is in fact an Iceberg table, then initialize the PyIceberg Glue catalog client and read the table 🤮

At least that's my current plan... Hopefully it doesn't get too complicated.
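
To make the dispatch problem concrete, here is a toy registry in plain Python. None of it is Daft API: `CatalogRegistry`, `register_reader`, `register_catalog`, and the lambda readers are all hypothetical, and the per-catalog dict of table formats stands in for the metadata lookup a real service such as AWS Glue would need.

```python
class CatalogRegistry:
    """Toy registry: named catalogs plus one reader per table format."""

    def __init__(self):
        self._catalogs = {}  # catalog name -> {table identifier: format}
        self._readers = {}   # format -> callable(identifier)

    def register_reader(self, table_format, reader):
        self._readers[table_format] = reader

    def register_catalog(self, name, table_formats):
        # The dict stands in for asking the catalog service about each table.
        self._catalogs[name] = dict(table_formats)

    def read_table(self, identifier, catalog_name):
        tables = self._catalogs[catalog_name]
        if identifier not in tables:
            raise LookupError(f"Table not found: {identifier} in {catalog_name}")
        # Dispatch on the table's format, not on the catalog it lives in.
        return self._readers[tables[identifier]](identifier)


registry = CatalogRegistry()
registry.register_reader("iceberg", lambda ident: f"iceberg scan of {ident}")
registry.register_reader("hive", lambda ident: f"hive scan of {ident}")

registry.register_catalog("my_iceberg_rest_catalog",
                          {"another.iceberg.table": "iceberg"})
registry.register_catalog("my_aws_glue_catalog",
                          {"an.iceberg.table": "iceberg",
                           "thisisa.hive.table": "hive"})

glue_iceberg = registry.read_table("an.iceberg.table",
                                   catalog_name="my_aws_glue_catalog")
glue_hive = registry.read_table("thisisa.hive.table",
                                catalog_name="my_aws_glue_catalog")
```

The point of the sketch is that one catalog can hand back tables of several formats, so the reader has to be chosen per table after the lookup.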

jaychia Oct 23, 2024
Maintainer

I'm actually most worried about HMS, since I have the least familiarity there (daft.register_hadoop_metastore(...))

djouallah · 2024-11-03T04:42:21Z

djouallah
Nov 3, 2024
Author

@jaychia maybe just start with register_iceberg_rest, then figure out HMS later. The Daft/Polaris combo is actually very interesting.

1 reply

jaychia Nov 4, 2024
Maintainer

Were you able to get PyIceberg working with Polaris already?

I can prioritize Iceberg REST first, which is easier than HMS for sure.

djouallah · 2024-11-04T03:10:27Z

djouallah
Nov 4, 2024
Author

It is !! I am in the next stage 😛

0 replies
