Deleting parquet for a partition silently fails if path is encoded #398

Open
@graza-io

Context

@ParthaI noticed he was getting duplicate data when running a collect command twice with the same --from parameter.

Initially I was unable to reproduce this using CloudTrail logs.

Image

However, when I tried again with the WAF logs config that Partha was using, I was able to reproduce the issue.

Image

On investigation, the crux of the issue is that the local file paths on disk are encoded, but we call os.RemoveAll with an unencoded path. Because os.RemoveAll returns nil when the path does not exist, we receive no error, yet the data remains.
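The silent part of this failure can be reproduced in isolation. The following is a minimal sketch, not Tailpipe code: the helper name `removeMissingDir` and the directory name are hypothetical. It only shows that os.RemoveAll reports no error when the target does not exist.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeMissingDir (hypothetical helper) calls os.RemoveAll on a directory
// that was never created and returns whatever error RemoveAll reports.
func removeMissingDir() error {
	base, err := os.MkdirTemp("", "removeall-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(base)

	// This path does not exist under base; the name is illustrative only.
	missing := filepath.Join(base, "tp_index=does-not-exist")
	return os.RemoveAll(missing)
}

func main() {
	// RemoveAll treats a missing path as success, so err is nil here.
	err := removeMissingDir()
	fmt.Println("error from RemoveAll on a missing path:", err)
}
```

This is why a mismatch between the encoded directory name and the unencoded path passed to os.RemoveAll goes unnoticed at the call site.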

For example:

The directory on my local machine:
/Users/graza/.tailpipe/data/default/tp_table=aws_waf_traffic_log/tp_partition=partha/tp_index=arn%3Aaws%3Awafv2%3Aus-east-1%3A632902152528%3Aregional%2Fwebacl%2Ftestp-new%2Fa1bf19cb-9ae4-44fd-8e07-586eb688fa6e/tp_date=2025-03-07

The directory we attempt to remove:
/Users/graza/.tailpipe/data/default/tp_table=aws_waf_traffic_log/tp_partition=partha/tp_index=arn:aws:wafv2:us-east-1:632902152528:regional/webacl/testp-new/a1bf19cb-9ae4-44fd-8e07-586eb688fa6e/tp_date=2025-03-07

After running the deletePartitionFrom function I still have data:

Image

Potential Solution(s)

  • When removing partitions, apply the same encoding to the folder path before passing it to os.RemoveAll
  • When writing the parquet initially, do not encode the path

Either of the above should resolve this issue.
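The first option could look roughly like the sketch below. It rests on assumptions: `encodedIndexDir` and `removeExisting` are hypothetical helpers, not existing Tailpipe functions, and `url.QueryEscape` is used only because it reproduces the encoding seen in the example directory above; whatever writes the parquet files may encode differently. The os.Stat guard is an extra suggestion so that a path mismatch surfaces as an error instead of passing silently.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"path/filepath"
)

// encodedIndexDir (hypothetical) builds the tp_index directory path, escaping
// the index value so it matches the encoded name on disk.
// Assumption: url.QueryEscape matches the writer's encoding.
func encodedIndexDir(partitionDir, index string) string {
	return filepath.Join(partitionDir, "tp_index="+url.QueryEscape(index))
}

// removeExisting (hypothetical) deletes dir, but returns an error when dir
// does not exist instead of silently succeeding like os.RemoveAll alone.
func removeExisting(dir string) error {
	if _, err := os.Stat(dir); err != nil {
		return fmt.Errorf("cannot remove partition dir: %w", err)
	}
	return os.RemoveAll(dir)
}

func main() {
	base, err := os.MkdirTemp("", "tailpipe-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(base)

	index := "arn:aws:wafv2:us-east-1:632902152528:regional/webacl/testp-new/a1bf19cb-9ae4-44fd-8e07-586eb688fa6e"

	// Simulate the encoded directory layout that exists on disk.
	onDisk := encodedIndexDir(base, index)
	if err := os.MkdirAll(filepath.Join(onDisk, "tp_date=2025-03-07"), 0o755); err != nil {
		panic(err)
	}

	// The unencoded path does not exist, so the guard reports an error.
	fmt.Println("unencoded:", removeExisting(filepath.Join(base, "tp_index="+index)))

	// The encoded path matches the directory on disk and is removed.
	fmt.Println("encoded:", removeExisting(onDisk))
}
```

The second option avoids the encoding step entirely, but note that the unencoded index value contains `/`, so it would become nested directories instead of a single tp_index folder.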

Labels

bug (Something isn't working)
