I have two pandas DataFrames. The first one, `df1`, contains a column of file paths and a column of lists of the users who have read access to each path. The second DataFrame, `df2`, contains a list of all possible users. I've created an example below:
```python
df1 = pd.DataFrame()
df1['path'] = ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD']
df1['read'] = [['userA', 'userC', 'userD'],
               ['userA', 'userB'],
               ['userB', 'userD'],
               ['userA', 'userB', 'userC', 'userD']]
print(df1)

       path                          read
0  C:/pathA         [userA, userC, userD]
1  C:/pathB                [userA, userB]
2  C:/pathC                [userB, userD]
3  C:/pathD  [userA, userB, userC, userD]
```
```python
df2 = pd.DataFrame(data=['userA', 'userB', 'userC', 'userD'], columns=['user'])
print(df2)

    user
0  userA
1  userB
2  userC
3  userD
```
The end goal is to create a new column, `df2['read_count']`, which should take each user string from `df2['user']` and count its total number of matches in the column `df1['read']`.
The expected output would be exactly that: a count of matches of each user string in the column of lists in `df1['read']`. Here is what I am expecting based on the example:
```python
print(df2)

    user  read_count
0  userA           3
1  userB           3
2  userC           2
3  userD           3
```
I tried putting something together using another question and a list comprehension, but no luck. Here is what I currently have:
```python
df2['read_count'] = [sum(all(val in cell for val in row) for cell in df1['read'])
                     for row in df2['user']]
print(df2)

    user  read_count
0  userA           0
1  userB           0
2  userC           0
3  userD           0
```
What is wrong with the code I currently have? I've tried stepping through the loops by hand and everything seemed right, but my code can't detect the matches I want.
Answer
You can use:
```python
df2.merge(df1['read'].explode().value_counts(), left_on='user', right_index=True)
```
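For reference, here is that pipeline as a runnable sketch. The name of the Series produced by `value_counts` differs between pandas versions, so the count column is renamed explicitly to `read_count` here; the variable names are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    'path': ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD'],
    'read': [['userA', 'userC', 'userD'],
             ['userA', 'userB'],
             ['userB', 'userD'],
             ['userA', 'userB', 'userC', 'userD']],
})
df2 = pd.DataFrame({'user': ['userA', 'userB', 'userC', 'userD']})

# explode() turns each list element into its own row,
# then value_counts() tallies how often each user appears
counts = df1['read'].explode().value_counts().rename('read_count')

# merge on the user name against the counts' index
out = df2.merge(counts, left_on='user', right_index=True)
print(out)
```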
Or, if you really need to use a different kind of aggregation that depends on “path”:
```python
df2.merge(df1.explode('read').groupby('read')['path'].count(),
          left_on='user', right_index=True)
```
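A runnable sketch of this variant with the same example frames (note the counted column keeps the name `path` here, which you may want to rename afterwards):

```python
import pandas as pd

df1 = pd.DataFrame({
    'path': ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD'],
    'read': [['userA', 'userC', 'userD'],
             ['userA', 'userB'],
             ['userB', 'userD'],
             ['userA', 'userB', 'userC', 'userD']],
})
df2 = pd.DataFrame({'user': ['userA', 'userB', 'userC', 'userD']})

# explode('read') gives one row per (path, user) pair,
# then we count rows per user
per_user = df1.explode('read').groupby('read')['path'].count()

merged = df2.merge(per_user, left_on='user', right_index=True)
print(merged)
```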
output:
```python
    user  path
0  userA     3
1  userB     3
2  userC     2
3  userD     3
```
Without `df2`:
```python
df1['read'].explode().value_counts().reset_index()
# or
df1.explode('read').groupby('read', as_index=False)['path'].count()
```
output:
```python
   index  path
0  userA     3
1  userB     3
2  userC     2
3  userD     3
```
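As for why the original list comprehension returns all zeros: `row` is a single user string, so the inner `for val in row` iterates over its *characters*, and `val in cell` then asks whether a character like `'u'` is an element of a list such as `['userA', 'userC', 'userD']`, which is never true. Dropping the inner loop and testing the whole string against each list fixes the comprehension:

```python
import pandas as pd

df1 = pd.DataFrame({
    'path': ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD'],
    'read': [['userA', 'userC', 'userD'],
             ['userA', 'userB'],
             ['userB', 'userD'],
             ['userA', 'userB', 'userC', 'userD']],
})
df2 = pd.DataFrame({'user': ['userA', 'userB', 'userC', 'userD']})

# `user in cell` checks membership of the whole string in each list;
# True/False sum as 1/0, giving the match count per user
df2['read_count'] = [sum(user in cell for cell in df1['read'])
                     for user in df2['user']]
print(df2)
```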