I want to read a .xlsx file, do some things with the data, and convert it to a dict so I can save it in a .json file. To do that I use Python 3 and pandas.
This is the code:
import pandas as pd
import json

xls = pd.read_excel(
    io="20codmun.xlsx",
    converters={"CODAUTO": str, "CPRO": str, "CMUN": str, "DC": str}
)
print(xls)
#print(xls.columns.values)

outDict = {}
print(len(xls["NOMBRE"]))  # 8131 rows

for i in range(len(xls.index)):
    codauto = xls["CODAUTO"][i]
    cpro = xls["CPRO"][i]
    cmun = xls["CMUN"][i]
    dc = xls["DC"][i]
    aemetId = cpro + cmun
    outDict[xls["NOMBRE"][i]] = {
        "CODAUTO": codauto,
        "CPRO": cpro,
        "CMUN": cmun,
        "DC": dc,
        "AEMET_ID": aemetId
    }

print(i)  # 8130
print(len(outDict))  # 8114 entries, SOME ENTRIES ARE LOST!!!!!
#print(outDict["Petrer"])

with open("data.json", "w") as outFile:
    json.dump(outDict, outFile)
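As a side note, the row-by-row loop above can also be expressed with pandas' built-in `DataFrame.to_dict`. This is only a minimal sketch with a tiny made-up DataFrame standing in for the real spreadsheet (the sample rows and values are invented, not taken from the government file); the column names are assumed to match the ones used above. Note that `to_dict(orient="index")` raises a `ValueError` when the index contains duplicates, which would surface repeated names immediately:

```python
import json
import pandas as pd

# Hypothetical sample data mimicking the spreadsheet's columns.
df = pd.DataFrame({
    "NOMBRE": ["Aljucén", "Petrer"],
    "CODAUTO": ["11", "10"],
    "CPRO": ["06", "03"],
    "CMUN": ["011", "104"],
    "DC": ["4", "1"],
})

# Same derived column as in the loop version.
df["AEMET_ID"] = df["CPRO"] + df["CMUN"]

# Build {NOMBRE: {column: value, ...}} in one call.
# Raises ValueError if "NOMBRE" values are not unique.
out = df.set_index("NOMBRE").to_dict(orient="index")

print(json.dumps(out, ensure_ascii=False, indent=2))
```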
I add here the source of the .xlsx file (Spanish government). Select “Fichero con todas las provincias”. You have to delete the first row.

As you can see, the pandas.DataFrame has 8131 rows and the for index at the end is 8130, but the length of the final dict is 8114, so some data is lost! You can check that “Aljucén” is in the .xlsx file, but not in the .json one.

Edit: Solved using json.dump(outDict, outFile, ensure_ascii=False)
Answer
I have analyzed the file, and it seems that some “NOMBRE” values are duplicated. Try executing xls["NOMBRE"].value_counts() and you will see that, for example, “Sada” appears twice. You will also see that there are exactly 8114 unique values.

Since you are using the city name as the dictionary key, each duplicated key overwrites the previous value in the dict.