I have a DataFrame and want to convert it into a dictionary of sets.
Specifically, my DataFrame looks like this:
  month  date
0   JAN     1
1   JAN     1
2   JAN     1
3   FEB     2
4   FEB     2
5   FEB     3
6   MAR     1
7   MAR     2
8   MAR     3
My goal:
dict = {'JAN' : {1}, 'FEB' : {2,3}, 'MAR' : {1,2,3}}
I wrote the code below, but I am not sure it is a suitable approach. The real data is large, so I would like to know of any tips or a more efficient (faster) way to do this.
import pandas as pd
df = pd.DataFrame({'month' : ['JAN','JAN','JAN','FEB','FEB','FEB','MAR','MAR','MAR'],
                    'date'  : [1, 1, 1, 2, 2, 3, 1, 2, 3]})
df_list = df.values.tolist()
monthSet = ['JAN','FEB','MAR']
inst_id_dict = {}
for i in df_list:
    monStr = i[0]
    if monStr in monthSet:
        inst_id = i[1]
        inst_id_dict.setdefault(monStr, set()).add(inst_id)
Answer
Let’s try grouping on the “month” column, then aggregating with GroupBy.unique and converting each result to a list:
df.groupby('month', sort=False)['date'].unique().map(list).to_dict()
# {'JAN': [1], 'FEB': [2, 3], 'MAR': [1, 2, 3]}
Or, if you’d prefer a dictionary of sets, use GroupBy.agg:
df.groupby('month', sort=False)['date'].agg(set).to_dict()
# {'JAN': {1}, 'FEB': {2, 3}, 'MAR': {1, 2, 3}}
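The loop in the question also skips any month that is not listed in monthSet, which the groupby one-liners do not do. If that filter matters, a minimal sketch is to select the rows with Series.isin before grouping. The extra 'APR' row here is made up purely to show the filter taking effect:

```python
import pandas as pd

# Sample data from the question, plus one hypothetical 'APR' row
# that should be excluded by the filter.
df = pd.DataFrame({
    'month': ['JAN', 'JAN', 'JAN', 'FEB', 'FEB', 'FEB', 'MAR', 'MAR', 'MAR', 'APR'],
    'date': [1, 1, 1, 2, 2, 3, 1, 2, 3, 5],
})
monthSet = ['JAN', 'FEB', 'MAR']

# Keep only rows whose month is in monthSet, then build the dict of sets.
filtered = df[df['month'].isin(monthSet)]
result = filtered.groupby('month', sort=False)['date'].agg(set).to_dict()
print(result)
```

Filtering first also means the groupby has fewer rows to process when many months are excluded.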
Another idea is to build the dict iteratively (despite the explicit loop, this is likely to be faster than the groupby option):
out = {}
for m, d in df.drop_duplicates(['month', 'date']).to_numpy():
    out.setdefault(m, set()).add(d)
out
# {'JAN': {1}, 'FEB': {2, 3}, 'MAR': {1, 2, 3}}
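Which option is faster depends on the data, so it is worth timing both on something close to your real size. Below is a rough sketch using timeit. The synthetic frame (100,000 rows, three months, dates 1 to 31) and the helper names by_groupby and by_loop are assumptions for illustration, not part of the original question:

```python
import timeit

import numpy as np
import pandas as pd


def by_groupby(df):
    # Dictionary of sets via GroupBy.agg.
    return df.groupby('month', sort=False)['date'].agg(set).to_dict()


def by_loop(df):
    # Dictionary of sets via drop_duplicates plus a plain Python loop.
    out = {}
    for m, d in df.drop_duplicates(['month', 'date']).to_numpy():
        out.setdefault(m, set()).add(d)
    return out


# Synthetic data; size and value ranges are arbitrary assumptions.
rng = np.random.default_rng(0)
n = 100_000
big = pd.DataFrame({
    'month': rng.choice(['JAN', 'FEB', 'MAR'], size=n),
    'date': rng.integers(1, 32, size=n),
})

# Both approaches should agree before their speed is compared.
assert by_groupby(big) == by_loop(big)

for fn in (by_groupby, by_loop):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(f'{fn.__name__}: {seconds / 5:.4f} s per call')
```

The loop only iterates over the distinct (month, date) pairs left after drop_duplicates, so it tends to do well when there are few unique pairs relative to the number of rows.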
