Spread a List of Lists to a Spark DataFrame with PySpark?

I’m currently struggling with the following issue.

Let’s take the following list of lists:

[[1, 2, 3], [4, 5], [6, 7]]

How can I create the following Spark DataFrame from it, with one row per element of each sublist:

| min_value | value |
|-----------|-------|
|         1 |     1 |
|         1 |     2 |
|         1 |     3 |
|         4 |     4 |
|         4 |     5 |
|         6 |     6 |
|         6 |     7 |

The only way I’ve managed to do this is by processing the list with for-loops into a second list that already represents all rows of the DataFrame, which is probably not the best way to solve this.
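For reference, the for-loop approach described above can be sketched in plain Python like this (the flattened rows could then be handed to `spark.createDataFrame`):

```python
# Flatten the list of lists into (min_value, value) rows on the driver.
# Each sublist contributes one row per element, paired with the sublist minimum.
data = [[1, 2, 3], [4, 5], [6, 7]]

rows = [(min(sublist), value) for sublist in data for value in sublist]
print(rows)
# [(1, 1), (1, 2), (1, 3), (4, 4), (4, 5), (6, 6), (6, 7)]
```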

THX & BR IntoNumbers


Answer

You can create a DataFrame and use explode and array_min to get the desired output:

import pyspark.sql.functions as F

l = [[1, 2, 3], [4, 5], [6, 7]]

df = spark.createDataFrame(
    [[l]],                   # a single row whose 'col' holds the array of arrays
    ['col']
).select(
    F.explode('col').alias('value')   # one row per sublist
).withColumn(
    'min_value',
    F.array_min('value')              # minimum of each sublist
).withColumn(
    'value',
    F.explode('value')                # one row per element of the sublist
)

df.show()
+-----+---------+
|value|min_value|
+-----+---------+
|    1|        1|
|    2|        1|
|    3|        1|
|    4|        4|
|    5|        4|
|    6|        6|
|    7|        6|
+-----+---------+
User contributions licensed under: CC BY-SA