json explode – return filtered array of records

Question

I have some JSON I have exploded however I need to filter the return based on where the "locale" is en_GB and I only wish to return that data in the dataframe. I currently have However this obviously does as it says it returns me the rows where en_GB is in locale but I actually only want it to return

Accepted Answer

My initial sample data:+--------------------+--------------------+--------------------+|            optionId|               label|              locale|+--------------------+--------------------+--------------------+|      [45024, 45024]|[Theft the compan...|      [en_GB, en_US]||[45624, 45624, 45...|[Back to home tow...|[th_TH, en_US, en...||      [45048, 45048]|[Education, Educa...|      [en_TH, en_US]|+--------------------+--------------------+--------------------+Since d.optionId, d.label and d.locale have values as arrays and the values of d.optionId and d.label are an array of same values. So, you can use explode() funtion again on d.optionId and d.label while using distinct to convert the array with same elements to just the value.jsonDF2 = jsonDF2.select(explode(jsonDF2.optionId),jsonDF2.label,jsonDF2.locale).withColumnRenamed("col","optionId").distinct()jsonDF2 = jsonDF2.select(jsonDF2.optionId,explode(jsonDF2.label),jsonDF2.locale).withColumnRenamed("col","label").distinct()jsonDF2.show(100,truncate=False)+--------+-----------------+---------------------+|optionId|label            |locale               |+--------+-----------------+---------------------+|45624   |Back to home town|[th_TH, en_US, en_GB]||45048   |Education        |[en_TH, en_US]       ||45024   |Theft the company|[en_GB, en_US]       |+--------+-----------------+---------------------+Now your requirement is to get d.locale to also be a value using filter() function based on the condition that en_GB is in d.locale. Since you already know the value you are using to filter the dataframe, you can select only d.optionId and d.lable from jsonDF2 and insert the d.locale with &#8216;en_GB&#8217; as its value. The following code will help:solution_df = jsonDF2.select(jsonDF2.optionId, jsonDF2.label).filter(array_contains(col("locale"), "en_GB")).withColumn("locale", lit("en_GB"))solution_df.show(truncate=False)+--------+-----------------+------+|optionId|label            |locale|+--------+-----------------+------+|45624   |Back to home town|en_GB ||45024   |Theft the company|en_GB |+--------+-----------------+------+Updated Code:jsonDF2 = jsonDF2.select(explode(col('optionId')),'`d.label`','`d.locale`').withColumnRenamed("col","optionId").distinct()jsonDF2 = jsonDF2.select('optionId',explode(jsonDF2['`d.label`']),'`d.locale`').withColumnRenamed("col","d.label").distinct()solution_df = jsonDF2.select('optionId', '`d.label`').filter(array_contains(col("`d.locale`"), "en_GB")).withColumn("d.locale", lit("en_GB"))solution_df.show(truncate=False)Update #2:The better way for you to achieve your requirement is to convert the pyspark dataframe to pandas dataframe and perform operations as shown below.#same as your codejsonDF2 = jsonDF.withColumn('d', explode(col('d.picklistOptions.results.picklistLabels.results')))jsonDF2.printSchema()#selecting the columnsjsonDF3 = jsonDF2.select(jsonDF2.d.optionId, jsonDF2.d.label,jsonDF2.d.locale)#printing its schemajsonDF3.show(truncate=False)+---------------------+------------------------------------------------------------------------------------+---------------------+|d.optionId           |d.label                                                                             |d.locale             |+---------------------+------------------------------------------------------------------------------------+---------------------+|[45024, 45024, 45024]|[Theft the company property, Theft the company property, Theft the company property]|[en_GB, en_US, th_TH]||[45995, 45995]       |[ลาออก โอนย้ายระหว่างภาค, Resignation and Transfer across Region]                   |[th_TH, en_GB]       |+---------------------+------------------------------------------------------------------------------------+---------------------+Now convert the Pyspark dataframe to Pandas dataframe. Use iterrows() function of pandas dataframe to extract data as per the requirement.jsonDF_pd= jsonDF3.toPandas()final=[]for i in jsonDF_pd.iterrows():    list1=i[1][0]    list2=i[1][1]    list3=i[1][2]    for i in range(len(list1)):        final. append ([list1[i],list2[i],str(list3[i])])#print(final)#You can use this 'final' array as input data to create dataframe as belowcols = ['optionId','label','locale']finalDF = spark.createDataFrame(final,cols)finalDF.show()+--------+--------------------+------+|optionId|               label|locale|+--------+--------------------+------+|   45024|Theft the company...| en_GB||   45024|Theft the company...| en_US||   45024|Theft the company...| th_TH||   45995|ลาออก โอนย้ายระหว...| th_TH||   45995|Resignation and T...| en_GB|+--------+--------------------+------+Using filter:finalDF.select('optionId','label','locale').filter(col('locale')=='en_GB').show(truncate=False)+--------+--------------------------------------+------+|optionId|label                                 |locale|+--------+--------------------------------------+------+|45024   |Theft the company property            |en_GB ||45995   |Resignation and Transfer across Region|en_GB |+--------+--------------------------------------+------+

Advertisement

Answer