I have a two-column pandas DataFrame containing user IDs and the URLs each user has visited. It looks like this:
   users  urls
0  user1  url1
1  user1  url3
2  user1  url5
3  user2  url2
4  user2  url4
5  user2  url5
6  user3  url1
7  user3  url4
8  user3  url5
I want to turn it into a per-user vector representation, with one column per URL, like this:
       url1  url2  url3  url4  url5
user1   1.0   NaN   1.0   NaN   1.0
user2   NaN   1.0   NaN   1.0   1.0
user3   1.0   NaN   NaN   1.0   1.0
I’ve tried different things, but keep hitting a wall. Any ideas?
Answer
What you’re describing is a pivot on the urls column:
import pandas as pd

# Make data
df = pd.DataFrame([
               ['user1', 'url1'], 
               ['user1', 'url3'], 
               ['user1', 'url5'],
               ['user2', 'url2'],
               ['user2', 'url4'],
               ['user2', 'url5'],
               ['user3', 'url1'],
               ['user3', 'url4'],
               ['user3', 'url5']
               ], columns=['users', 'urls'])
# add column to fill pivoted values
df['count'] = 1
new_df = df.pivot(index='users', columns='urls', values='count').fillna(0)
new_df
# urls   url1  url2  url3  url4  url5
# users                              
# user1   1.0   0.0   1.0   0.0   1.0
# user2   0.0   1.0   0.0   1.0   1.0
# user3   1.0   0.0   0.0   1.0   1.0
This puts the users column in the index, but you can use reset_index to make it a regular column again.
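As a rough sketch of that last step, the snippet below rebuilds the same data, pivots it, and calls reset_index. The names wide, flat and counts are just illustrative. It also shows pd.crosstab as one possible alternative, on the assumption that your real data might contain repeated (user, url) pairs, which pivot rejects with a ValueError.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["user1", "url1"], ["user1", "url3"], ["user1", "url5"],
        ["user2", "url2"], ["user2", "url4"], ["user2", "url5"],
        ["user3", "url1"], ["user3", "url4"], ["user3", "url5"],
    ],
    columns=["users", "urls"],
)

# Same pivot as in the answer, then move 'users' from the index back to a column.
df["count"] = 1
wide = df.pivot(index="users", columns="urls", values="count").fillna(0)
flat = wide.reset_index()
flat.columns.name = None  # drop the leftover 'urls' label on the column axis

# Alternative: crosstab counts each (user, url) pair directly, so no helper
# column is needed and repeated visits are counted instead of raising an error.
counts = pd.crosstab(df["users"], df["urls"])
```

Note that crosstab returns integer counts rather than floats, so a URL visited twice by the same user would show up as 2. If you only want a 0/1 indicator, you could clip the result, for example with counts.clip(upper=1).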
 
						