I have two pandas DataFrames. The first one, df1, contains a column of file paths and a column of lists containing what users have read access to these file paths. The second DataFrame, df2, contains a list of all possible users. I’ve created an example below:
df1 = pd.DataFrame()
df1['path'] = ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD']
df1['read'] = [['userA', 'userC', 'userD'], 
               ['userA', 'userB'], 
               ['userB', 'userD'], 
               ['userA', 'userB', 'userC', 'userD']]
print(df1)
       path                          read
0  C:/pathA         [userA, userC, userD]
1  C:/pathB                [userA, userB]
2  C:/pathC                [userB, userD]
3  C:/pathD  [userA, userB, userC, userD]
df2 = pd.DataFrame(data=['userA', 'userB', 'userC', 'userD'], columns=['user'])
print(df2)
    user
0  userA
1  userB
2  userC
3  userD
The end goal is to create a new column df2['read_count'], which should take each user string from df2['user'] and find the total number of matches in the column df1['read'].
The expected output would be exactly that – a count of matches of each user string in the column of lists in df1['read']. Here is what I am expecting based on the example:
df2
    user  read_count
0  userA           3
1  userB           3
2  userC           2
3  userD           3
I tried putting something together using another question and list comprehension, but no luck. Here is what I currently have:
df2['read_count'] = [sum(all(val in cell for val in row)
                     for cell in df1['read'])
                     for row in df2['user']]
print(df2)
    user  read_count
0  userA           0
1  userB           0
2  userC           0
3  userD           0
What is wrong with the code I currently have? I’ve tried actually following through the loops but it all seemed right, but it seems like my code can’t detect the matches I want.
Advertisement
Answer
You can use:
df2.merge(df1['read'].explode().value_counts(),
          left_on='user', right_index=True)
Or, if you really need to use a different kind of aggregation that depends on “path”:
df2.merge(df1.explode('read').groupby('read')['path'].count(),
          left_on='user', right_index=True)
output:
user path 0 userA 3 1 userB 3 2 userC 2 3 userD 3
Without df2:
df1['read'].explode().value_counts().reset_index()
# or
# df1.explode('read').groupby('read', as_index=False)['path'].count()
output:
index path 0 userA 3 1 userB 3 2 userC 2 3 userD 3