Skip to content
Advertisement

Read nested JSON in a cell with Pandas

I recovered a list of ISO 3166-2 countries and regions in this Github repository. I managed to have a first look of the regions using the following code:

import pandas as pd
import json
data = "/content/data.json"
df = pd.read_json(data)
df = df.T

Which gives the following output:

name divisions
Afghanistan {‘AF-BDS’: ‘Badakhshān’, ‘AF-BDG’: ‘Bādghīs’, ‘AF-BGL’: ‘Baghlān’, ‘AF-BAL’: ‘Balkh’, ‘AF-BAM’: ‘Bāmīān’, ‘AF-FRA’: ‘Farāh’, ‘AF-FYB’: ‘Fāryāb’, ‘AF-GHA’: ‘Ghaznī’, ‘AF-GHO’: ‘Ghowr’, ‘AF-HEL’: ‘Helmand’, ‘AF-HER’: ‘Herāt’, ‘AF-JOW’: ‘Jowzjān’, ‘AF-KAB’: ‘Kabul (Kābol)’, ‘AF-KAN’: ‘Kandahār’, ‘AF-KAP’: ‘Kāpīsā’, ‘AF-KNR’: ‘Konar (Kunar)’, ‘AF-KDZ’: ‘Kondoz (Kunduz)’, ‘AF-LAG’: ‘Laghmān’, ‘AF-LOW’: ‘Lowgar’, ‘AF-NAN’: ‘Nangrahār (Nangarhār)’, ‘AF-NIM’: ‘Nīmrūz’, ‘AF-ORU’: ‘Orūzgān (Urūzgā’, ‘AF-PIA’: ‘Paktīā’, ‘AF-PKA’: ‘Paktīkā’, ‘AF-PAR’: ‘Parwān’, ‘AF-SAM’: ‘Samangān’, ‘AF-SAR’: ‘Sar-e Pol’, ‘AF-TAK’: ‘Takhār’, ‘AF-WAR’: ‘Wardak (Wardag)’, ‘AF-ZAB’: ‘Zābol (Zābul)’}
Albania {‘AL-BR’: ‘Berat’, ‘AL-BU’: ‘Bulqizë’, ‘AL-DL’: ‘Delvinë’, ‘AL-DV’: ‘Devoll’, ‘AL-DI’: ‘Dibër’, ‘AL-DR’: ‘Durrës’, ‘AL-EL’: ‘Elbasan’, ‘AL-FR’: ‘Fier’, ‘AL-GR’: ‘Gramsh’, ‘AL-GJ’: ‘Gjirokastër’, ‘AL-HA’: ‘Has’, ‘AL-KA’: ‘Kavajë’, ‘AL-ER’: ‘Kolonjë’, ‘AL-KO’: ‘Korcë’, ‘AL-KR’: ‘Krujë’, ‘AL-KC’: ‘Kucovë’, ‘AL-KU’: ‘Kukës’, ‘AL-LA’: ‘Laç’, ‘AL-LE’: ‘Lezhë’, ‘AL-LB’: ‘Librazhd’, ‘AL-LU’: ‘Lushnjë’, ‘AL-MM’: ‘Malësia e Madhe’, ‘AL-MK’: ‘Mallakastër’, ‘AL-MT’: ‘Mat’, ‘AL-MR’: ‘Mirditë’, ‘AL-PQ’: ‘Peqin’, ‘AL-PR’: ‘Përmet’, ‘AL-PG’: ‘Pogradec’, ‘AL-PU’: ‘Pukë’, ‘AL-SR’: ‘Sarandë’, ‘AL-SK’: ‘Skrapar’, ‘AL-SH’: ‘Shkodër’, ‘AL-TE’: ‘Tepelenë’, ‘AL-TR’: ‘Tiranë’, ‘AL-TP’: ‘Tropojë’, ‘AL-VL’: ‘Vlorë’}

But I can’t manage to achieve the following output because of the nested JSON.

country code country name region code region name
AF Afghanistan AF-BDS Badakhshān
AF Afghanistan AF-BDG Bādghīs

I tried to loop inside the DataFrame with :

df = json_normalize(df['divisions']).unstack().apply(pd.Series)

But I’m not getting any satisfying result.

Advertisement

Answer

This should work:

df1 = (
  pd.DataFrame(data)
  .transpose()
  .reset_index(names="country code")
  .rename(columns={"name": "country name"})
  )

divisions = [(k1, v1) for k, v in df1["divisions"].to_dict().items() for k1, v1 in v.items()]
df2 = pd.DataFrame(divisions, columns=["region code", "region name"])

final_df = (
  pd
  .merge(df1.explode("divisions"), df2, left_on="divisions", right_on="region code")
  .drop(columns="divisions")
)

print(final_df.head(10))

  country code country name region code region name
0           AF  Afghanistan      AF-BDS  Badakhshān
1           AF  Afghanistan      AF-BDG     Bādghīs
2           AF  Afghanistan      AF-BGL     Baghlān
3           AF  Afghanistan      AF-BAL       Balkh
4           AF  Afghanistan      AF-BAM      Bāmīān
5           AF  Afghanistan      AF-FRA       Farāh
6           AF  Afghanistan      AF-FYB      Fāryāb
7           AF  Afghanistan      AF-GHA      Ghaznī
8           AF  Afghanistan      AF-GHO       Ghowr
9           AF  Afghanistan      AF-HEL     Helmand
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement