
How to write data to Delta Lake from Kubernetes

Our organisation runs Databricks on Azure, used primarily by data scientists and analysts for notebook-based ad-hoc analysis and exploration.

We also run Kubernetes clusters for ETL workflows that don't require Spark.

We would like to use Delta Lake as our storage layer, with both Databricks and Kubernetes able to read and write as first-class citizens.
Currently our Kubernetes jobs write Parquet files directly to blob storage, and an additional job spins up a Databricks cluster to load the Parquet data into Databricks' table format. This is slow and expensive.

What I would like to do is write to Delta Lake directly from Python on Kubernetes, rather than first dumping a Parquet file to blob storage and then triggering an additional Databricks job to load it into Delta Lake format.
Conversely, I'd also like to query Delta Lake from Kubernetes.


In short, how do I set up my Python environment on Kubernetes so that it has equal access to the existing Databricks Delta Lake for writes and queries?
Code would be appreciated.


Answer

You can usually write into the Delta table using the Delta connector for Spark. Just start a Spark job with the necessary packages and configuration options:

spark-submit --packages io.delta:delta-core_2.12:1.0.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  ...

and write the same way as on Databricks:

df.write.format("delta").mode("append").save("some_location")

But by using the OSS version of Delta you may lose some of the optimizations that are available only on Databricks, such as Data Skipping, so performance for the data written from Kubernetes could be lower (it really depends on how you access the data).

There could also be cases where you can't write into a Delta table created by Databricks – when the table was written with a writer protocol version higher than the one supported by the OSS Delta connector (see the Delta Protocol documentation). For example, this happens when you enable Change Data Feed on a Delta table, which performs additional actions when writing data.
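
If you are unsure whether a table created by Databricks is still writable from the OSS connector, you can inspect its protocol versions with DESCRIBE DETAIL and compare them against what your connector supports. A short sketch (the table path is again just a placeholder):

# DESCRIBE DETAIL returns, among other things, the table's minReaderVersion
# and minWriterVersion (see the Delta Protocol documentation)
detail = spark.sql("DESCRIBE DETAIL delta.`some_location`")
detail.select("minReaderVersion", "minWriterVersion").show()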

Outside of Spark, there are plans to implement a so-called Standalone Writer for JVM-based languages (in addition to the existing Standalone Reader). And there is the delta-rs project implemented in Rust (with bindings for Python & Ruby) that should be able to write into a Delta table (but I haven't tested that myself).
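
For illustration only, a Spark-free write via the delta-rs Python bindings (the deltalake package) might look roughly like this – the API shown assumes a recent release of the package and, as noted above, I haven't verified it myself:

import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Append to (or create) a Delta table without any Spark cluster
# ("some_location" is again just a placeholder path)
write_deltalake("some_location", df, mode="append")

# Reading back via delta-rs
dt = DeltaTable("some_location")
print(dt.to_pandas())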

Update 14.04.2022: Data Skipping is also available in OSS Delta, starting with version 1.2.0
