I need to compare two CSV files and print out changed, remained same or deleted rows in a third CSV file. First csv file is like this: Second csv file: at the end this is the result i want to get: if a there is a new country added to a siteid then it has status of new. Location can

CSV comparison with python multipleindex

I need to compare two CSV files and print out changed, remained same or deleted rows in a third CSV file. First csv file is like this:

location locationid sitename siteid country price
zoo         1         xxx      490     US     5
hosp        2         yyy      590     CA     7
rose        3         ccc      389     UK     5
lily        4         bbb      255     UK     3

JavaScript
​x
 
location locationid sitename siteid country price
zoo         1         xxx      490     US     5
hosp        2         yyy      590     CA     7
rose        3         ccc      389     UK     5
lily        4         bbb      255     UK     3
​

Second csv file:

location locationid sitename siteid country price
zoo         1         xxx      490     US     4
hosp        2         yyy      590     CA     7
rose        3         ccc      389     ZW     2
zoo         1         sss      344     ME     3 
fol         9                          RU     11

JavaScript
 
location locationid sitename siteid country price
zoo         1         xxx      490     US     4
hosp        2         yyy      590     CA     7
rose        3         ccc      389     ZW     2
zoo         1         sss      344     ME     3 
fol         9                          RU     11
​

at the end this is the result i want to get:

location locationid sitename siteid country price status
zoo         1         xxx      490     US     4     changed
hosp        2         yyy      590     CA     7     same
rose        3         ccc      389     UK     5     new
lily        4         bbb      255     UK     3     deleted
zoo         9         sss      344     ME     3     new
fol         9                          RU     11    new

JavaScript
 
location locationid sitename siteid country price status
zoo         1         xxx      490     US     4     changed
hosp        2         yyy      590     CA     7     same
rose        3         ccc      389     UK     5     new
lily        4         bbb      255     UK     3     deleted
zoo         9         sss      344     ME     3     new
fol         9                          RU     11    new
​

if a there is a new country added to a siteid then it has status of new. Location can have multiple siteids. I want to catch if there is a new country added for a specific location and siteid not just one of them but for both of them as a condition. In the dataset some siteids are NA thats why i added location here. so in some cases from the location the file should understand the status.

Here is my code but it is not working as i wanted. If can you help me that will be really great :)

df1 = pd.read_csv(file1).fillna(0)
df2 = pd.read_csv(file2).fillna(0)

JavaScript
 
df1 = pd.read_csv(file1).fillna(0)
df2 = pd.read_csv(file2).fillna(0)
​

df1.set_index(['location','locationid','sitename','siteid','country'])
df2.set_index(['location','locationid','sitename','siteid','country'])

JavaScript
 
df1.set_index(['location','locationid','sitename','siteid','country'])
df2.set_index(['location','locationid','sitename','siteid','country'])
​

df3 = pd.concat([df1,df2],sort=False)
df3=df3.set_index(['location','locationid','sitename','siteid','country'])

df3.drop_duplicates()

df3a = df3.stack(dropna=False).groupby(level=[0,1]).unique().unstack().copy()

df3a.loc[~df3a.index.isin(df2.index),'status'] = 'deleted' # if not in df2 index then deleted
df3a.loc[~df3a.index.isin(df1.index),'status'] = 'new'     # if not in df1 index then new
idx = df3.stack().groupby(level=[0,1]).nunique() # get modified cells. 
df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0),'status'] = 'modified'
df3a['status'] = df3a['status'].fillna('same') # assume that anything not fulfilled by above rules is the same.

JavaScript
 
df3 = pd.concat([df1,df2],sort=False)
df3=df3.set_index(['location','locationid','sitename','siteid','country'])
​
df3.drop_duplicates()
​
df3a = df3.stack(dropna=False).groupby(level=[0,1]).unique().unstack().copy()
​
df3a.loc[~df3a.index.isin(df2.index),'status'] = 'deleted' # if not in df2 index then deleted
df3a.loc[~df3a.index.isin(df1.index),'status'] = 'new'     # if not in df1 index then new
idx = df3.stack().groupby(level=[0,1]).nunique() # get modified cells. 
df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0),'status'] = 'modified'
df3a['status'] = df3a['status'].fillna('same') # assume that anything not fulfilled by above rules is the same.
​

Answer

I’m not yet convinced this can be done exclusively with pandas operators. You do have several problems in your code. xxx.set_index returns a new data frame — it doesn’t modify in place. So, you need

df1 = df1.set_index(['location,'locationid','sitename','siteid','country'])
df2 = df2.set_index(['location,'locationid','sitename','siteid','country'])

JavaScript
 
df1 = df1.set_index(['location,'locationid','sitename','siteid','country'])
df2 = df2.set_index(['location,'locationid','sitename','siteid','country'])
​

Once you do that, you don’t have to set_index on df3. You really want to add the “status” value to df3, not df3a; after the grouping, df3a doesn’t look like what you need any more. I’m not sure the grouping is really the answer; I’m afraid you’re going to have to iterate the rows that are in both and compare the “price” value to df1. You can find out which rows with

indf1 = df3.index.isin(df1.index)
indf2 = df3.index.isin(df2.index)
inboth = indf1 & indf2
df3.loc[~indf2,'status'] = 'deleted'
df3.loc[~indf1,'status'] = 'new'

JavaScript
 
indf1 = df3.index.isin(df1.index)
indf2 = df3.index.isin(df2.index)
inboth = indf1 & indf2
df3.loc[~indf2,'status'] = 'deleted'
df3.loc[~indf1,'status'] = 'new'
​

but after that, I think you’ll need to iterate.

Advertisement

Answer