Correct Method to Delete a Delta Lake Partition on AWS S3

I need to delete a Delta Lake partition, along with its associated AWS S3 files, and then make sure AWS Athena reflects the change. I need to do this so that I can rerun some code to repopulate the data.

I tried this:

from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, path)
deltaTable.delete("extract_date = '2022-03-01'")  # extract_date is the partition column

It completed with no errors, but the files on S3 still exist, and Athena still shows the data even after I ran MSCK REPAIR TABLE following the delete. Can someone advise on the best way to delete partitions and update Athena?

Answer

Although you performed the delete operation, the data is still there because Delta tables keep history. The delete only marks the files as removed in the transaction log; they are physically deleted only when you execute the VACUUM operation, and only once the delete is older than the retention period (7 days by default). If you want to remove the data sooner, you can run the VACUUM command with the parameter RETAIN XXX HOURS, but a retention shorter than the default may require setting some additional properties to override Delta's retention safety check. Refer to the documentation for more details.
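
As a rough sketch of the VACUUM step described above, the helper below issues the Delta SQL command with a short retention window. The function name vacuum_delta_path and its parameters are made up for illustration; it assumes a Spark session that already has the Delta Lake SQL extensions configured and a table that uses the default 7-day (168-hour) retention. Note that a retention of 0 hours also removes the files needed for time travel and is unsafe while other jobs are reading or writing the table.

```python
# Session setting that controls Delta's retention safety check.
RETENTION_CHECK_KEY = "spark.databricks.delta.retentionDurationCheck.enabled"


def vacuum_delta_path(spark, path, retain_hours=0):
    """Run VACUUM on the Delta table at `path` (hypothetical helper).

    `spark` is assumed to be a SparkSession with Delta Lake enabled.
    Returns the SQL statement that was executed.
    """
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")

    statement = f"VACUUM delta.`{path}` RETAIN {retain_hours} HOURS"

    # Assumption: the table uses the default 168-hour retention, so any
    # shorter window needs the safety check switched off first.
    needs_override = retain_hours < 168
    if needs_override:
        spark.conf.set(RETENTION_CHECK_KEY, "false")
    try:
        spark.sql(statement)
    finally:
        # Turn the safety check back on so later jobs stay protected.
        if needs_override:
            spark.conf.set(RETENTION_CHECK_KEY, "true")
    return statement
```

With the session and path from the question, this would be called as vacuum_delta_path(spark, path, retain_hours=0) after the delete has completed. The S3 objects for the deleted partition should then be gone, provided no current version of the table still references them.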

User contributions licensed under: CC BY-SA