Can we enable adaptive clustering? #1790

Question

Does pyiceberg allow us to enable adaptive clustering when creating a table, or enable it on an existing table? The relevant SQL would be something like:
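As a purely hypothetical illustration (assuming Snowflake-style automatic clustering; the table and column names are made up):

```sql
-- Hypothetical illustration only: Snowflake-style automatic clustering.
ALTER TABLE my_table CLUSTER BY (sid, gene);
```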
---

@myz540 That's not in there today. However, if you pre-cluster the table before writing, it should maintain order.
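A minimal sketch of that pre-clustering step, assuming `tbl` is an already-loaded pyiceberg `Table` and the column names are hypothetical: sort the Arrow table on the clustering columns before appending, so the written files keep that order.

```python
import pyarrow as pa

# Hypothetical data; `tbl` is assumed to be a pyiceberg Table already
# loaded from a catalog. Sorting before append is the "pre-cluster" step.
batch = pa.table(
    {"sid": ["b", "a", "a"], "gene": ["g2", "g3", "g1"], "prediction": [0.1, 0.2, 0.3]}
)
clustered = batch.sort_by([("sid", "ascending"), ("gene", "ascending")])
tbl.append(clustered)
```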
---

Thanks for your reply. On another matter, I am trying to write to a table that has

```python
def create_partitions_truncate():
    return PartitionSpec(
        PartitionField(
            source_id=1, field_id=1000, transform=TruncateTransform(width=7), name="sid_truncate"
        ),
        PartitionField(
            source_id=2, field_id=2000, transform=TruncateTransform(width=1), name="gene_truncate"
        ),
    )
```

However, when I try writing to it

```python
table = catalog.load_table((DATABASE, table_name))
smol_table = pa.Table.from_pandas(_df, schema=create_pa_schema())
with table.transaction() as transaction:
    transaction.append(smol_table)
```

I am hit with the following error. Is this simply not supported by `pyiceberg`?
---

@myz540 can you double check if you're using the latest version?
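One quick way to double-check the installed versions (a generic sketch, not from the thread):

```python
import pyarrow
import pyiceberg

# Print installed versions to compare against the release notes.
print("pyiceberg:", pyiceberg.__version__)
print("pyarrow:", pyarrow.__version__)
```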
---

I upgraded them to their latest versions and now have a new, weird error. Rolling back to 0.8.1 resolves the issue, but I am still left with the original write error.
---

Truncate transform with pyarrow was added in 0.9.0; you'd need to install the extra (Line 307 in 1c0e2b0).
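Assuming the extra referred to here is the `pyarrow` extra (an assumption, not confirmed in the thread), installing it would look like `pip install "pyiceberg[pyarrow]>=0.9.0"`; the project's `pyproject.toml` lists the exact extras available.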
---

Thanks Kevin, I wasn't sure if the error message I was receiving was because of the version I was on.
---

@kevinjqliu Would you be able to provide an example for the `BucketTransform`? Here is the schema:

```python
def create_lems_schema() -> Schema:
    """
    Create and return the Iceberg schema for the lems table.
    """
    return Schema(
        NestedField(field_id=1, name="sid", field_type=StringType(), required=True),
        NestedField(field_id=2, name="gene", field_type=StringType(), required=True),
        NestedField(
            field_id=3, name="prediction", field_type=FloatType(), required=True
        ),
    )
```

I have tried two things and gotten errors for both:

```python
def create_lems_partitions_bucket():
    return PartitionSpec(
        PartitionField(
            source_id=1,
            field_id=1,
            transform=BucketTransform(num_buckets=1000),
            name="sid",
        ),
        PartitionField(
            source_id=2,
            field_id=2,
            transform=BucketTransform(num_buckets=100),
            name="gene",
        ),
    )
```

yields an error, and

```python
def create_lems_partitions_bucket():
    return PartitionSpec(
        PartitionField(
            source_id=1,
            field_id=1000,
            transform=BucketTransform(num_buckets=1000),
            name="sid_bucket",
        ),
        PartitionField(
            source_id=2,
            field_id=2000,
            transform=BucketTransform(num_buckets=100),
            name="gene_bucket",
        ),
    )
```

also yields an error.
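Not an answer from the thread, but a minimal sketch of attaching a bucket spec like the second attempt at table-creation time, reusing the poster's `get_rest_catalog` helper and `DATABASE` constant (both assumptions taken from their code; the table name is a guess):

```python
# Sketch only: create the table with the second bucket spec above.
# `get_rest_catalog` and DATABASE come from the poster's code; "lems"
# as the table name is hypothetical.
catalog = get_rest_catalog()
table = catalog.create_table(
    (DATABASE, "lems"),
    schema=create_lems_schema(),
    partition_spec=create_lems_partitions_bucket(),
)
```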
---

I am able to create the partition spec now. I then append the data in chunks:

```python
for i, _df in tqdm(enumerate(chunk_dataframe(df)), desc="Processing chunk"):
    catalog = get_rest_catalog()
    table = catalog.load_table((DATABASE, table_name))
    smol_table = pa.Table.from_pandas(_df, schema=create_lems_pa_schema())
    with table.transaction() as transaction:
        transaction.append(smol_table)
        print(f"✅ Successfully appended data for {i}")
    print(f"✅ Successfully committed data for {i}")
print("✅ Successfully committed all data")
```

I eventually encounter this error. I've hit this error any time I need to write lots of chunks, and it usually happens about an hour and a half in. I am refreshing my catalog connection on each iteration, so I'm not sure what the problem is. Any help would be appreciated.