So I create an empty pandas df, where I initialize all the cell values to empty lists, except the diagonals, which are set to math.inf
The indexes are the start position, and the column headers are the end position
I want to get the start and end positions, and the difference between the days to get from start to end, and put that value in df.loc[start, end] by using append. But for some reason, every single cell in the df is getting updated, and i dont know why
My code is shown below
self.status_dict = {'nvc': 'At NVC', 'issued': 'Issued', 'ready': 'Ready', 'ar_ds260': 'Action required: Complete Form DS-260', 'transit': 'In Transit', 'refused': 'Refused', 'ar_doc': 'Action required: Submit requested documents', 'admin_process': 'Administrative Processing', 'expire_soon': 'Expiring Soon', 'app_received': 'Application Received', 'ar_pay': 'Action required: Pay fees', 'return_nvc': 'Returned to NVC', 'transfer': 'Transfer in Progress', 'expired': 'Expired', 'ar_pay_miss': 'Action required: Pay missing fees', 'no_action': 'No action required: Review in process', 'no_status': 'No Status', 'ar_choose': 'Action required: Choose an agent' } self.status_dict_lookup = {'At NVC': 'nvc', 'Issued': 'issued', 'Ready': 'ready', 'Action required: Complete Form DS-260': 'ar_ds260', 'In Transit': 'transit', 'Refused': 'refused', 'Action required: Submit requested documents': 'ar_doc', 'Administrative Processing': 'admin_process', 'Expiring Soon': 'expire_soon', 'Application Received': 'app_received', 'Action required: Pay fees': 'ar_pay', 'Returned to NVC': 'return_nvc', 'Transfer in Progress': 'transfer', 'Expired': 'expired', 'Action required: Pay missing fees': 'ar_pay_miss', 'No action required: Review in process': 'no_action', 'No Status': 'no_status', 'Action required: Choose an agent': 'ar_choose' } shape = len(self.status_dict_lookup) const_arr = [[]] * shape keys = self.status_dict.keys() df_dict = dict() for key in keys: df_dict[key] = const_arr df = pd.DataFrame(df_dict) df = df.set_index(pd.Index(keys)) for key in keys: df.loc[key, key] = math.inf # cases = self.cases cases = {1044: [['Action required: Submit requested documents', '2021-12-18'], ['At NVC', '2022-02-03'], ['In Transit', '2022-02-14'], ['Ready', '2022-02-15'], ['Refused', '2022-03-10'], ['Administrative Processing', '2022-03-12'], ['Issued', '2022-03-14']]} for _, val in cases.items(): print(val[0], val[1]) print(val[0][1], val[1][1]) for i in range(len(val) - 1): temp = [] start = val[i][0] end = val[i + 1][0] start_time = datetime.strptime(val[i][1], '%Y-%m-%d') end_time = datetime.strptime(val[i + 1][1], '%Y-%m-%d') diff = end_time - start_time temp = df[self.status_dict_lookup[start]][self.status_dict_lookup[end]] print(temp) temp.append(diff.days) df.loc[self.status_dict_lookup[start], self.status_dict_lookup[end]] = temp
part of the output of the df is shown below:
nvc issued nvc inf [47, 11, 1, 23, 2, 2] issued [47, 11, 1, 23, 2, 2] inf ready [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] ar_ds260 [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] transit [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] refused [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] ar_doc [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] admin_process [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] expire_soon [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] app_received [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] ar_pay [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] return_nvc [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] transfer [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] expired [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] ar_pay_miss [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] no_action [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] no_status [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2] ar_choose [47, 11, 1, 23, 2, 2] [47, 11, 1, 23, 2, 2]
So for the first example,
start = Action required: Submit requested documents
end = At NVC
diff = 47
So i want it to store just 47 as a list in df[ar_doc][nvc]. But it is storing the difference of all days in all the cells
Why does this happen and how to fix it?
Advertisement
Answer
All your pandas data are referencing the same list. You should change how you initialize the DataFrame. You should create a new list in each cell.
Try:
df = pd.DataFrame({k: [list() for _ in range(len(status_dict))] for k in status_dict}, index=status_dict.keys()) for key in keys: df.at[key, key] = math.inf
Separately, since you’re already using pandas
, you don’t need to use datetime
to parse dates. You can reduce your loop to the following:
for _, val in cases.items(): for i in range(len(val)-1): diff = pd.to_datetime(val[i][1], format='%Y-%m-%d') - pd.to_datetime(val[i + 1][1], format='%Y-%m-%d') df.at[status_dict_lookup[val[i][0]], status_dict_lookup[val[i+1][0]]] += [diff.days]