
Spark: How to transform data from multiple nested XML files with attributes into a DataFrame

How can I transform the values below from multiple XML files into a Spark DataFrame:

  • the Id0 attribute from Level_0
  • Date/Value from Level_4

Required output:

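Roughly, the target is one row per Level_4 entry, with the Id0 attribute repeated alongside each Date/Value pair. A hypothetical illustration of that shape (the values are placeholders matching the hypothetical file sketches further down, not data from the post):

    +----------+----------+-----+
    |       Id0|      Date|Value|
    +----------+----------+-----+
    |Id0_file_1|2021-01-01|  v_1|
    |Id0_file_1|2021-01-02|  v_2|
    |Id0_file_2|       ...|  ...|
    +----------+----------+-----+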

file_1.xml:

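As a minimal sketch, assuming the nesting implied by the element names in this question (Level_0 carrying the Id0 attribute, a Level_2_A element, and repeated Level_4 entries holding Date/Value), file_1.xml might look something like this; the intermediate Level_1/Level_2/Level_3 layers and all concrete values are assumptions:

    <?xml version="1.0"?>
    <Level_0 Id0="Id0_file_1">
      <Level_1>
        <Level_2_A>A</Level_2_A>
        <Level_2>
          <Level_3>
            <Level_4>
              <Date>2021-01-01</Date>
              <Value>v_1</Value>
            </Level_4>
            <Level_4>
              <Date>2021-01-02</Date>
              <Value>v_2</Value>
            </Level_4>
          </Level_3>
        </Level_2>
      </Level_1>
    </Level_0>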

file_2.xml:

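file_2.xml would then have the same structure, differing only in its Id0 attribute and its Date/Value entries, e.g.:

    <?xml version="1.0"?>
    <Level_0 Id0="Id0_file_2">
      <!-- same nesting as in the file_1.xml sketch, with its own Level_4 Date/Value entries -->
    </Level_0>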

Current Code Example:


Current Output (Id0 column with attributes missing):

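One way to end up with this symptom (Date and Value present, Id0 missing) is a rowTag below Level_0: spark-xml derives each row from the row-tag element's subtree, so attributes on ancestor elements such as Level_0 are not carried along. A sketch of that failure mode, assuming an active SparkSession named spark with the spark-xml package available (the path is illustrative, and this is not the code from the question):

    # Illustrative only: a row tag below Level_0 drops ancestor attributes,
    # which reproduces the "Id0 missing" symptom.
    df_wrong = (spark.read
                .format("com.databricks.spark.xml")
                .option("rowTag", "Level_4")
                .load("/path/to/file_*.xml"))
    df_wrong.printSchema()   # only Date and Value, no Id0 column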

There are some examples, but none of them solve the problem:

  • I’m using the Databricks spark-xml package – https://github.com/databricks/spark-xml
  • There are examples, but not with attribute reading: Read XML in spark, Extracting tag attributes from xml using sparkxml.

EDIT: As @mck correctly pointed out, <Level_2>A</Level_2> is not valid XML. There was a mistake in my example (the XML file is now corrected); it should be <Level_2_A>A</Level_2_A>. After that, the proposed solution works even on multiple files.

NOTE: To speed up loading of a large number of XML files, define a schema. If no schema is defined, Spark reads each file while creating the DataFrame in order to infer the schema… for more info: https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/

STEP 1):

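A minimal sketch of such a schema in PySpark, assuming the nesting used in the file sketches above (the field layout is an assumption; spark-xml exposes XML attributes with its default "_" prefix, hence _Id0):

    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    # Explicit schema so spark-xml does not have to scan every file to infer it.
    schema = StructType([
        StructField("_Id0", StringType()),
        StructField("Level_1", StructType([
            StructField("Level_2_A", StringType()),
            StructField("Level_2", StructType([
                StructField("Level_3", StructType([
                    StructField("Level_4", ArrayType(StructType([
                        StructField("Date", StringType()),
                        StructField("Value", StringType()),
                    ]))),
                ])),
            ])),
        ])),
    ])
    # Pass it to the reader via .schema(schema) in the read shown under STEP 2.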

STEP 2) see @mck’s solution below:


Answer

You can use Level_0 as the rowTag, and explode the relevant arrays/structs:

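A sketch of that approach in PySpark, reusing the schema from STEP 1 and assuming the nesting path Level_1.Level_2.Level_3.Level_4 from the file sketches above (the path and exact column names are assumptions, not the answer's original code):

    from pyspark.sql import functions as F

    # Each Level_0 element becomes one row; its Id0 attribute surfaces as "_Id0".
    df = (spark.read
          .format("com.databricks.spark.xml")
          .option("rowTag", "Level_0")
          .schema(schema)                 # optional, see STEP 1
          .load("/path/to/file_*.xml"))   # illustrative path

    # Explode the nested Level_4 array and pull Date/Value up next to Id0.
    result = (df
              .select(F.col("_Id0").alias("Id0"),
                      F.explode("Level_1.Level_2.Level_3.Level_4").alias("Level_4"))
              .select("Id0", "Level_4.Date", "Level_4.Value"))

    result.show()

If any of the intermediate levels is itself repeated (an array rather than a single struct), an extra explode per array level is needed.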
User contributions licensed under: CC BY-SA