I need to delete a Delta Lake partition together with its associated AWS S3 files, and then make sure AWS Athena reflects the change. I need this so that I can rerun some code to repopulate the data.
I tried this:
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, path)
# extract_date is the partition column
deltaTable.delete("extract_date = '2022-03-01'")
It completed with no errors, but the files on S3 still exist and Athena still shows the data, even after running MSCK REPAIR TABLE after the delete. Can someone advise on the best way to delete partitions and update Athena?
Answer
Although you performed the delete operation, the data is still there because Delta tables keep history. The underlying files are physically removed only when you run the VACUUM operation, and only once they are older than the retention period (7 days by default). If you want to remove the data sooner, you can run the VACUUM command with the RETAIN XXX HOURS parameter, but this may require setting some additional properties to enforce it; refer to the documentation for more details.
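As a rough illustration of the VACUUM step described above, here is a minimal sketch. It assumes an existing SparkSession with Delta Lake configured is passed in as spark, and that the table is addressed by its path. The helper names build_vacuum_sql and vacuum_with_short_retention are made up for this example. The "additional property" is assumed to be spark.databricks.delta.retentionDurationCheck.enabled, which has to be turned off before Delta accepts a retention shorter than the default.

```python
def build_vacuum_sql(table_path: str, retain_hours: int) -> str:
    """Build a Delta Lake VACUUM statement for a path-based table.

    Hypothetical helper for illustration; table_path is whatever path
    you passed to DeltaTable.forPath.
    """
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    return f"VACUUM delta.`{table_path}` RETAIN {retain_hours} HOURS"


def vacuum_with_short_retention(spark, table_path: str, retain_hours: int = 0) -> None:
    """Run VACUUM with a retention shorter than the 7-day default.

    Assumption: `spark` is an active SparkSession with Delta Lake enabled.
    """
    check_key = "spark.databricks.delta.retentionDurationCheck.enabled"
    # Delta rejects a retention below the default unless this check is disabled.
    spark.conf.set(check_key, "false")
    try:
        spark.sql(build_vacuum_sql(table_path, retain_hours))
    finally:
        # Turn the safety check back on (assumes it was enabled before).
        spark.conf.set(check_key, "true")
```

You would call it after the delete, for example vacuum_with_short_retention(spark, path, 0). Keep in mind that a very short retention removes the files needed for time travel to older versions and can break readers or writers that are still using them, so it is best done when nothing else is touching the table.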