Get value from Spark dataframe when rows are dictionaries

I have a PySpark dataframe that looks like this:

Values                Column
{[0.0, 54.04, 48...   Sector A
{[0.0, 55.4800000...  Sector A

If I show the first element of the column ‘Values’ without truncating the data, it looks like this:

{[0.0, 54.04, 48.19, 68.59, 61.81, 54.730000000000004, 48.51, 57.03, 59.49, 55.44, 60.56, 52.52, 51.44, 55.06, 55.27, 54.61, 55.89, 56.5, 45.4, 68.63, 63.88, 48.25, 57.19, 50.69, 48.75, 48.69, 53.17, 54.84, 49.04, 57.53, 56.14, 61.13, 57.26, 62.050000000000004, 53.96, 63.28, 57.99, 53.95, 58.69, 53.94, 46.86, 54.31, 59.15, 53.57, 50.22, 60.02, 58.04, 47.39, 53.64, 63.95, 66.87, 63.49, 57.120000000000005, 58.09, 53.870000000000005, 59.84, 50.910000000000004, 53.75, 56.730000000000004, 55.89, 49.43, 57.47, 53.25, 63.99, 50.85, 51.25, 52.83, 52.36, 63.4, 48.78, 50.85, 55.18, 51.28, 52.11, 53.72, 62.14, 55.75, 53.17, 62.07, 55.11, 57.26, 54.76, 53.19, 58.32, 55.4, 62.5, 61.04, 47.730000000000004, 63.46, 53.24, 55.6, 59.34, 56.870000000000005, 48.120000000000005, 52.0, 51.870000000000005, 55.08, 47.45, 55.01, 54.29, 55.1, 54.660000000000004, 55.85, 63.22, 47.89, 58.910000000000004, 57.88, 61.52, 60.72, 59.46, 50.120000000000005, 57.99, 64.64, 56.28, 49.82, 50.79, 50.44, 47.550000000000004, 59.35, 50.03, 63.42, 56.7, 62.550000000000004, 60.17, 55.19, 53.64, 53.65, 57.78, 53.120000000000005, 47.31, 57.39, 56.18, 62.76, 57.39, 51.93, 57.53, 59.71, 51.06, 58.31, 55.81, 56.18, 62.33, 71.91, 53.28, 60.4, 61.050000000000004, 66.49, 60.72, 57.42, 61.29, 61.1, 57.52, 52.480000000000004, 52.17, 60.49, 54.77, 50.13, 51.4, 48.27, 45.24, 57.27, 61.02, 51.24, 53.980000000000004, 59.46, 52.82, 62.01, 52.65, 60.03, 56.27, 60.2, 58.45, 60.64, 53.56, 51.78, 57.22, 60.61, 54.82, 50.17, 55.0, 61.19, 57.06, 58.33, 54.61, 56.14, 52.03, 60.42, 53.85, 53.32, 47.18, 49.38, 61.03, 65.72, 55.08, 55.46, 58.5, 62.42, 55.910000000000004, 61.120000000000005, 52.980000000000004, 60.08, 54.45, 57.29, 57.89, 56.95, 55.58, 50.27, 50.620000000000005, 57.83, 59.120000000000005, 56.92, 58.95, 52.81, 55.58, 50.7, 56.75, 67.28, 55.0, 52.46, 64.34, 48.910000000000004, 51.57, 48.64, 46.57, 52.74, 58.410000000000004, 55.99, 55.68, 59.43, 64.32000000000001, 55.01, 46.18, 56.88, 50.83, 54.49, 63.18, 55.99, 58.34, 60.870000000000005, 54.94, 56.29, 61.54, 54.72, 65.28, 59.71, 55.44, 68.29, 47.95, 60.49, 54.27, 54.28, 51.31, 59.800000000000004, 57.07, 58.34, 53.93, 54.84, 59.84, 47.69, 45.94, 61.93, 59.17, 52.15, 54.28, 60.13, 51.44, 53.03, 65.43, 57.300000000000004, 60.29, 60.11, 53.34, 54.39, 57.870000000000005, 58.45, 57.86, 58.2, 60.2, 62.22, 52.81, 41.34, 55.01, 57.26, 62.09, 55.15, 56.11, 50.24, 47.93, 57.17, 54.14, 54.6, 51.84, 54.42, 56.27, 57.32, 61.11, 60.730000000000004, 56.370000000000005, 56.72, 71.25, 56.35, 55.6, 58.410000000000004, 48.89, 66.65, 64.17, 56.17, 62.980000000000004, 54.800000000000004, 51.49, 59.660000000000004, 63.67, 53.17, 63.49, 56.97, 53.56, 51.64, 53.74, 51.35, 54.730000000000004, 53.160000000000004, 60.46, 49.370000000000005, 54.63, 54.07, 51.2, 54.31, 48.54, 56.18, 47.94, 50.11, 58.88, 56.76, 55.870000000000005, 54.01, 61.79, 54.300000000000004, 66.12, 55.15, 51.2, 52.56, 54.980000000000004, 56.800000000000004, 45.82, 52.54, 56.77, 55.44, 52.050000000000004, 57.660000000000004, 52.11, 59.24, 52.120000000000005, 56.76, 54.11, 52.25, 58.28, 71.75, 58.51, 62.910000000000004, 60.72, 51.18, 56.49, 56.76, 55.59, 52.86, 49.26, 46.94, 57.99, 62.47, 55.550000000000004, 56.49, 59.14, 59.7, 58.74, 54.78, 74.86, 49.84, 56.44, 51.6, 57.31, 58.35, 53.21, 57.82, 49.1, 57.96, 52.160000000000004, 59.95, 49.9, 64.45, 55.69, 44.7, 60.480000000000004, 64.53, 53.120000000000005, 50.02, 60.24, 59.92, 60.71, 49.93, 56.94, 44.72, 59.910000000000004, 54.050000000000004, 55.730000000000004, 50.6, 60.160000000000004, 56.81, 57.08, 
56.06, 64.78, 48.21, 48.43, 50.58, 56.08, 58.21, 50.2, 50.160000000000004, 55.870000000000005, 57.870000000000005, 57.46, 55.93, 52.25, 53.4, 48.0, 56.88, 50.21, 50.730000000000004, 52.38, 55.68, 63.11, 47.18, 66.24, 63.89, 56.43, 59.15, 53.56, 55.6, 60.86, 53.39, 57.95, 54.45, 53.370000000000005, 57.54, 56.2, 52.69, 53.43, 57.300000000000004, 50.85, 55.53, 51.410000000000004, 55.92, 56.15, 59.97, 54.86, 62.15, 56.07, 58.17, 53.04, 58.46, 57.74, 56.45, 64.84, 56.160000000000004, 49.050000000000004, 53.76, 58.36, 60.480000000000004, 59.15, 61.25, 53.58, 64.2, 69.15, 58.93, 52.86, 51.1, 63.35, 55.44, 63.77, 52.730000000000004, 49.21, 56.800000000000004, 50.53, 53.34, 53.44, 54.07, 60.54, 47.13, 52.34, 63.97, 55.97, 60.5, 62.38, 52.61, 49.97, 58.29, 58.06, 49.75, 49.160000000000004, 57.21, 54.24, 55.75, 56.13, 59.7, 49.050000000000004, 60.39, 57.53, 53.38, 55.410000000000004, 54.97, 63.58, 54.76, 59.910000000000004, 51.14, 49.35, 56.44, 53.550000000000004, 53.120000000000005, 57.72, 58.68, 59.74, 59.86, 55.39, 47.39, 52.47, 63.64, 59.76, 53.15, 54.980000000000004, 53.480000000000004, 59.79, 48.6], [1, 531]}

The schema of the dataframe is as follows:

root
 |-- Values: struct (nullable = true)
 |    |-- elements: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- dimensions: array (nullable = true)
 |    |    |-- element: integer (containsNull = true)
 |-- Column: long (nullable = true)

And I’m not sure how I should work with this data (is it a dictionary, but without keys?).

In the end, I would like the column ‘Values’ to contain only the value at position 1 of the first array. So my dataframe should look like this:

Values        Column
54.04         Sector A
55.4800000    Sector A

Could someone please give me a hand with this?

Answer

And I’m not sure how I should work with this data (is it a dictionary, but without keys?).

Since this column is of struct type, you should work with it as a struct. It’s not a dictionary (in Spark terminology, the map type is the closest thing to a dictionary, since it has both keys and values). But a struct does have similarities to a dictionary: instead of keys, a struct has fields, and a field can be of any Spark data type (including struct itself).
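
To illustrate the difference, here is a minimal sketch contrasting the two types (the column names m and s are made up for this example):

from pyspark.sql import functions as F

# 'm' is a map (keys + values), 's' is a struct (named fields)
df2 = spark.createDataFrame(
    [({'a': 1.0, 'b': 2.0}, (1.0, 2.0))],
    'm:map<string,double>, s:struct<x:double,y:double>')

df2.select(
    F.col('m')['a'],  # map: look up a value by its key
    F.col('s.x')      # struct: access a field by its name
).show()  # both expressions evaluate to 1.0 for this row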

To access a field inside a struct, you chain the field name with a dot, e.g. Values.elements. Then you do whatever the accessed data type requires. Values.elements is of array type, so you access its elements by index. Put into code, it could look like this: F.col('Values.elements')[1]

Example dataframe:

from pyspark.sql import functions as F

# Recreate a small dataframe with the same schema, given as a DDL string
df = spark.createDataFrame(
    [(([0.0, 54.04   ], [1, 5]), 'Sector A'),
     (([0.0, 55.48000], [1, 4]), 'Sector A')],
    'Values:struct<elements:array<double>,dimensions:array<int>>, Column:string')
df.printSchema()
# root
#  |-- Values: struct (nullable = true)
#  |    |-- elements: array (nullable = true)
#  |    |    |-- element: double (containsNull = true)
#  |    |-- dimensions: array (nullable = true)
#  |    |    |-- element: integer (containsNull = true)
#  |-- Column: string (nullable = true)

Script:

df = df.select(
    # index 1 = second element of the nested 'elements' array
    F.col('Values.elements')[1].alias('Values'),
    'Column'
)

df.show()
# +------+--------+
# |Values|  Column|
# +------+--------+
# | 54.04|Sector A|
# | 55.48|Sector A|
# +------+--------+
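
For completeness, the same element can be reached with a few equivalent accessors; a small sketch, assuming the same df as above:

# All three expressions return the same values (54.04 and 55.48):
df.select(
    F.col('Values.elements')[1],                      # bracket indexing (0-based)
    F.col('Values').getField('elements').getItem(1),  # explicit getField / getItem
    F.element_at('Values.elements', 2)                # element_at uses 1-based indexing
).show()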
