I need to create a dictionary from a Spark dataframe's schema of type pyspark.sql.types.StructType.

The code needs to go through the entire StructType, find only those StructField elements which are of type StructType and, when extracting into the dictionary, use the name of the parent StructField as the key, while the value is the name of only the first nested/child StructField.
Example schema (StructType):

root
 |-- field_1: int
 |-- field_2: int
 |-- field_3: struct
 |    |-- date: date
 |    |-- timestamp: timestamp
 |-- field_4: int
Desired result:
{"field_3": "date"}
Answer
You can use a dictionary comprehension that navigates the schema: iterating over df.schema yields the top-level StructField objects, and dataType.typeName() identifies the struct-typed ones.
{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
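If the integer index into x.dataType looks surprising: StructType supports positional indexing, so x.dataType[0] returns the first nested StructField. A more explicit equivalent, as a minimal sketch using isinstance and the fields attribute (not part of the original answer, same result):

from pyspark.sql.types import StructType

# Same logic, spelled out: keep only struct-typed columns and map each
# parent field name to the name of its first nested field.
{f.name: f.dataType.fields[0].name
 for f in df.schema.fields
 if isinstance(f.dataType, StructType)}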
Test #1
df = spark.createDataFrame([], 'field_1 int, field_2 int, field_3 struct<date:date,timestamp:timestamp>, field_4 int')
df.printSchema()
# root
#  |-- field_1: integer (nullable = true)
#  |-- field_2: integer (nullable = true)
#  |-- field_3: struct (nullable = true)
#  |    |-- date: date (nullable = true)
#  |    |-- timestamp: timestamp (nullable = true)
#  |-- field_4: integer (nullable = true)

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
# {'field_3': 'date'}
Test #2
df = spark.createDataFrame([], 'field_1 int, field_2 struct<col_int:int,col_long:long>, field_3 struct<date:date,timestamp:timestamp>')
df.printSchema()
# root
#  |-- field_1: integer (nullable = true)
#  |-- field_2: struct (nullable = true)
#  |    |-- col_int: integer (nullable = true)
#  |    |-- col_long: long (nullable = true)
#  |-- field_3: struct (nullable = true)
#  |    |-- date: date (nullable = true)
#  |    |-- timestamp: timestamp (nullable = true)

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
# {'field_2': 'col_int', 'field_3': 'date'}
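Both tests assume an active SparkSession bound to the name spark. If you are running them standalone, a minimal setup sketch (the local master is chosen here only for illustration):

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession; `spark` is the name
# the snippets above expect.
spark = SparkSession.builder.master('local[*]').getOrCreate()

One caveat: x.dataType[0] raises an IndexError for a struct with no nested fields, so add a len(x.dataType.fields) > 0 guard to the condition if your schemas can contain empty structs.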