pyspark – How to define MapType for when/otherwise

I have a pyspark DataFrame with a MapType column that holds either a map<string, int> value or None.

I need to perform some calculations using collect_list, but collect_list excludes None values. I am trying to work around this by transforming None into a string, similar to "Include null values in collect_list in pyspark".

However, in my case I can't use df.na.fill with subset, since fillna only supports the int, float, string, and bool data types; columns with other data types are ignored. The solution proposed in "pyspark fillna is not working on column of ArrayType" also doesn't work for my case, since I get a type mismatch error.
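
For illustration, here is a minimal sketch (the DataFrame and column names are made up for the example) showing that fillna silently skips the MapType column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy DataFrame with a map<string,int> column that contains a null
df = spark.createDataFrame(
    [(1, {'abc': 12}), (2, None)],
    'id int, map_values map<string,int>'
)

# fillna ignores columns whose data type it does not support,
# so the null in the MapType column survives
df.na.fill('none', subset=['map_values']).show()
# +---+-----------+
# | id| map_values|
# +---+-----------+
# |  1|[abc -> 12]|
# |  2|       null|
# +---+-----------+

My attempt with when/otherwise: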

import pyspark.sql.functions as F

# replace null maps with a placeholder literal
default_value = F.lit('none')
fill_rule = F.when(F.col('map_values').isNull(), default_value).otherwise(F.col('map_values'))
# this fails with a data type mismatch, since 'none' is a string
# while map_values is a map<string,int>
df = df.withColumn('map_new', fill_rule)

I understand I should change the default_value format with .cast('map<string,int>'), but I don't know how to define it. I would like to end up with something like {'none': 1}.

I am new to pyspark, so maybe there is a more efficient way to achieve what I need.

Answer

We can create a MapType value using create_map with the required literals.

df = spark.createDataFrame([(1,2,{'abc':12}),(3,4,None)],['c1','c2','c3'])
df.show()
+---+---+-----------+
| c1| c2|         c3|
+---+---+-----------+
|  1|  2|[abc -> 12]|
|  3|  4|       null|
+---+---+-----------+
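
For reference, printSchema on this DataFrame (a sketch, assuming the DataFrame above; Python ints are inferred as longs) confirms that c3 is a map column:

df.printSchema()
# root
#  |-- c1: long (nullable = true)
#  |-- c2: long (nullable = true)
#  |-- c3: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: long (valueContainsNull = true)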

from pyspark.sql import functions as f

# map literal to substitute for nulls, i.e. {'None': 1}
defaultval = f.create_map(f.lit('None'), f.lit(1))

# replace null maps with the default so collect_list keeps every row
df = df.withColumn('c3', f.when(f.col('c3').isNull(), defaultval).otherwise(f.col('c3')))
df.select(f.collect_list('c3')).show(truncate=False)
+--------------------------+
|collect_list(c3)          |
+--------------------------+
|[[abc -> 12], [None -> 1]]|
+--------------------------+
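
As a side note, the same null replacement can be written more compactly with coalesce, which returns its first non-null argument; this is equivalent to the when/otherwise expression above:

df = df.withColumn('c3', f.coalesce(f.col('c3'), defaultval))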
User contributions licensed under: CC BY-SA