CSV comparison with Python MultiIndex

I need to compare two CSV files and write the changed, unchanged, and deleted rows to a third CSV file. The first CSV file looks like this:

location locationid sitename siteid country price
zoo         1         xxx      490     US     5
hosp        2         yyy      590     CA     7
rose        3         ccc      389     UK     5
lily        4         bbb      255     UK     3

The second CSV file:

location locationid sitename siteid country price
zoo         1         xxx      490     US     4
hosp        2         yyy      590     CA     7
rose        3         ccc      389     ZW     2
zoo         1         sss      344     ME     3 
fol         9                          RU     11

This is the result I want to get at the end:

location locationid sitename siteid country price status
zoo         1         xxx      490     US     4     changed
hosp        2         yyy      590     CA     7     same
rose        3         ccc      389     UK     5     new
lily        4         bbb      255     UK     3     deleted
zoo         9         sss      344     ME     3     new
fol         9                          RU     11    new

If a new country is added for a siteid, the row should get the status "new". A location can have multiple siteids. I want to catch a new country being added for a specific location and siteid together, i.e. both must match as a condition, not just one of them. In the dataset some siteids are NA, which is why I also included location, so in some cases the status has to be determined from the location.

Here is my code, but it is not working as I wanted. If you can help me, that would be really great :)

import pandas as pd

df1 = pd.read_csv(file1).fillna(0)
df2 = pd.read_csv(file2).fillna(0)
df1.set_index(['location','locationid','sitename','siteid','country'])
df2.set_index(['location','locationid','sitename','siteid','country'])
df3 = pd.concat([df1,df2],sort=False)
df3=df3.set_index(['location','locationid','sitename','siteid','country'])

df3.drop_duplicates()

df3a = df3.stack(dropna=False).groupby(level=[0,1]).unique().unstack().copy()

df3a.loc[~df3a.index.isin(df2.index),'status'] = 'deleted' # if not in df2 index then deleted
df3a.loc[~df3a.index.isin(df1.index),'status'] = 'new'     # if not in df1 index then new
idx = df3.stack().groupby(level=[0,1]).nunique() # get modified cells. 
df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0),'status'] = 'modified'
df3a['status'] = df3a['status'].fillna('same') # assume that anything not fulfilled by above rules is the same.


Answer

I’m not yet convinced this can be done exclusively with pandas operators. You do have several problems in your code. DataFrame.set_index returns a new data frame; it does not modify in place. So, you need

df1 = df1.set_index(['location','locationid','sitename','siteid','country'])
df2 = df2.set_index(['location','locationid','sitename','siteid','country'])

Once you do that, you don’t have to set_index on df3. You really want to add the “status” value to df3, not df3a; after the grouping, df3a no longer looks like what you need. I’m not sure the grouping is really the answer; I’m afraid you’re going to have to iterate over the rows that appear in both files and compare their “price” value to df1. You can find out which rows those are with

indf1 = df3.index.isin(df1.index)   # rows that also exist in the first file
indf2 = df3.index.isin(df2.index)   # rows that also exist in the second file
inboth = indf1 & indf2              # rows present in both files
df3.loc[~indf2,'status'] = 'deleted'
df3.loc[~indf1,'status'] = 'new'

but after that, I think you’ll need to iterate.
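Something along these lines could serve as a starting point. It is only a sketch, and it assumes that “price” is the only column that can differ once the other columns are used as the index; file1, file2 and file3 below are placeholders for your actual paths:

import pandas as pd

key = ['location','locationid','sitename','siteid','country']
df1 = pd.read_csv(file1).fillna(0).set_index(key)
df2 = pd.read_csv(file2).fillna(0).set_index(key)

# keep every row from the second file, plus the first-file rows that
# disappeared, so deleted rows still show up in the result
df3 = pd.concat([df2, df1.loc[~df1.index.isin(df2.index)]], sort=False)

indf1 = df3.index.isin(df1.index)
indf2 = df3.index.isin(df2.index)

df3.loc[~indf2,'status'] = 'deleted'   # only in the first file
df3.loc[~indf1,'status'] = 'new'       # only in the second file

# for rows present in both files, compare the price column by hand
for idx in df3.index[indf1 & indf2]:
    same = df1.loc[idx,'price'] == df3.loc[idx,'price']
    df3.loc[idx,'status'] = 'same' if same else 'changed'

df3.reset_index().to_csv(file3, index=False)

Note that with “country” in the index, a changed country shows up as one “deleted” row plus one “new” row rather than a single “changed” row, so your “new country for an existing location and siteid” rule would still need extra handling on top of this.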
