
PySpark: Get the first non-null value of each column in a DataFrame

I’m dealing with different Spark DataFrames, which have a lot of null values in many columns. I want to get any one non-null value from each column to see if that value can be converted to a datetime.

I tried df.na.drop().first() in the hope that it would drop all rows containing any null value, so that I could just take the first row of the remaining DataFrame, which would have all non-null values. But many of the DataFrames have so many columns with a lot of null values that df.na.drop() returns an empty DataFrame.

I also tried checking whether any columns contain only null values, so that I could simply drop those columns before trying the above approach, but that still did not solve the problem. Any idea how I can accomplish this efficiently, given that this code will be run many times on huge DataFrames?
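
For illustration, one way to find and drop columns that are entirely null looks roughly like this (a sketch; count ignores nulls, so a zero count means the column is all null):

from pyspark.sql.functions import col, count

# Count non-null values per column in a single pass.
non_null_counts = df.select(
    [count(col(c)).alias(c) for c in df.columns]
).first()

# Drop the columns whose non-null count is zero.
all_null_cols = [c for c in df.columns if non_null_counts[c] == 0]
df = df.drop(*all_null_cols)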


Answer

You can use the first function with ignorenulls. Let’s say the data looks like this:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField

# Assumes a SparkSession; create one if it doesn't already exist.
spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("x{}".format(i), StringType(), True) for i in range(3)
])

df = spark.createDataFrame(
    [(None, "foo", "bar"), ("foo", None, "bar"), ("foo", "bar", None)],
    schema
)

Then you can take the first non-null value of each column:

from pyspark.sql.functions import first

df.select([first(x, ignorenulls=True).alias(x) for x in df.columns]).first()
# Row(x0='foo', x1='foo', x2='bar')
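
Since the original goal was to see whether those sample values can be parsed as datetimes, you could follow up with something along these lines (a rough sketch; it assumes python-dateutil is available in your environment):

from dateutil.parser import parse
from pyspark.sql.functions import first

# Collect one non-null sample value per column as a plain dict.
sample = df.select(
    [first(x, ignorenulls=True).alias(x) for x in df.columns]
).first().asDict()

def looks_like_datetime(value):
    # Treat parse failures (and None for all-null columns) as "not a datetime".
    try:
        parse(value)
        return True
    except (ValueError, TypeError, OverflowError):
        return False

datetime_like = {c: looks_like_datetime(v) for c, v in sample.items()}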