Description
Context
@ParthaI noticed he was getting duplicate data when running a collect command twice with the same --from parameter.
Initially I was unable to reproduce this using CloudTrail logs.
However, when I tried the same WAF logs config that Partha was using, I was able to reproduce the issue.
On investigation, the crux of the issue is that local file paths are encoded on disk, but we call os.RemoveAll with the unencoded path. That path does not exist, and os.RemoveAll returns nil for a path that does not exist, so we receive no error and the data remains.
For example:
The directory on my local machine:
/Users/graza/.tailpipe/data/default/tp_table=aws_waf_traffic_log/tp_partition=partha/tp_index=arn%3Aaws%3Awafv2%3Aus-east-1%3A632902152528%3Aregional%2Fwebacl%2Ftestp-new%2Fa1bf19cb-9ae4-44fd-8e07-586eb688fa6e/tp_date=2025-03-07
The directory we attempt to remove:
/Users/graza/.tailpipe/data/default/tp_table=aws_waf_traffic_log/tp_partition=partha/tp_index=arn:aws:wafv2:us-east-1:632902152528:regional/webacl/testp-new/a1bf19cb-9ae4-44fd-8e07-586eb688fa6e/tp_date=2025-03-07
After running the deletePartitionFrom function, I still have data.
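The behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the tailpipe code: removeUnencoded and runRepro are hypothetical helpers, the index value is shortened from the one above, and url.QueryEscape is only assumed to stand in for whatever encoder produced the on-disk path (it happens to produce the same %3A and %2F escapes for this value).

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"path/filepath"
)

// removeUnencoded (hypothetical helper) creates root/tp_index=<encoded index>,
// then calls os.RemoveAll with the unencoded form of the same path.
// It reports whether the encoded directory still exists afterwards.
func removeUnencoded(root, index string) (bool, error) {
	// Assumption: url.QueryEscape mimics the encoding used when the data was written.
	encoded := filepath.Join(root, "tp_index="+url.QueryEscape(index))
	if err := os.MkdirAll(encoded, 0o755); err != nil {
		return false, err
	}

	// The unencoded path does not exist on disk, so RemoveAll returns nil.
	unencoded := filepath.Join(root, "tp_index="+index)
	if err := os.RemoveAll(unencoded); err != nil {
		return false, err
	}

	_, statErr := os.Stat(encoded)
	return statErr == nil, nil
}

// runRepro runs the reproduction inside a throwaway temp directory.
func runRepro() (bool, error) {
	root, err := os.MkdirTemp("", "tailpipe-repro")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(root)
	return removeUnencoded(root, "arn:aws:wafv2:us-east-1:632902152528:regional/webacl/testp-new")
}

func main() {
	stillExists, err := runRepro()
	fmt.Println("error:", err)
	fmt.Println("encoded directory still exists:", stillExists)
}
```

The point of the sketch is that the nil return from os.RemoveAll cannot be used to confirm that anything was deleted.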
Potential Solution(s)
- When removing partitions, use the same encoding for the folder prior to passing it to os.RemoveAll
- When writing the parquet initially, do not encode the path
Either of the above should resolve this issue.
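A rough sketch of the first option, under the same assumptions as before: encodeIndex, encodedPartitionDir, removePartition and runFix are hypothetical names, and url.QueryEscape is only a stand-in for the real write-time encoder. Whatever encoder is actually used when writing the parquet files would need to be the one applied here, since the two can differ on other characters (for example, url.QueryEscape turns a space into "+").

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"path/filepath"
)

// encodeIndex (hypothetical) encodes a partition value the way it is assumed
// to appear on disk. Assumption: url.QueryEscape matches the write-time encoding.
func encodeIndex(index string) string {
	return url.QueryEscape(index)
}

// encodedPartitionDir (hypothetical) builds the on-disk directory for a tp_index value.
func encodedPartitionDir(root, index string) string {
	return filepath.Join(root, "tp_index="+encodeIndex(index))
}

// removePartition (hypothetical) removes the encoded directory rather than the raw one.
func removePartition(root, index string) error {
	return os.RemoveAll(encodedPartitionDir(root, index))
}

// runFix creates an encoded partition directory in a temp root, removes it via
// removePartition, and reports whether the directory is gone.
func runFix() (bool, error) {
	root, err := os.MkdirTemp("", "tailpipe-fix")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(root)

	index := "arn:aws:wafv2:us-east-1:632902152528:regional/webacl/testp-new"
	dir := encodedPartitionDir(root, index)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return false, err
	}
	if err := removePartition(root, index); err != nil {
		return false, err
	}

	_, statErr := os.Stat(dir)
	return os.IsNotExist(statErr), nil
}

func main() {
	removed, err := runFix()
	fmt.Println("error:", err)
	fmt.Println("encoded directory removed:", removed)
}
```

Keeping the encoding in one shared helper used by both the write path and the delete path would reduce the chance of the two drifting apart again.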