I have a two-column pandas DataFrame containing user IDs and the URLs each user has visited. It looks like this:
   users  urls
0  user1  url1
1  user1  url3
2  user1  url5
3  user2  url2
4  user2  url4
5  user2  url5
6  user3  url1
7  user3  url4
8  user3  url5
I want to create a vector representation of it, with one row per user and one column per URL, like this:
       url1  url2  url3  url4  url5
user1   1.0   NaN   1.0   NaN   1.0
user2   NaN   1.0   NaN   1.0   1.0
user3   1.0   NaN   NaN   1.0   1.0
I’ve tried different things, but keep hitting a wall. Any ideas?
Answer
What you’re describing is a pivot of the urls column:
import pandas as pd

# Make data
df = pd.DataFrame([
    ['user1', 'url1'],
    ['user1', 'url3'],
    ['user1', 'url5'],
    ['user2', 'url2'],
    ['user2', 'url4'],
    ['user2', 'url5'],
    ['user3', 'url1'],
    ['user3', 'url4'],
    ['user3', 'url5']
], columns=['users', 'urls'])

# Add a column of ones to use as the pivoted values
df['count'] = 1

# fillna(0) replaces the NaN left wherever a user never visited a URL
new_df = df.pivot(index='users', columns='urls', values='count').fillna(0)
new_df
# urls   url1  url2  url3  url4  url5
# users
# user1   1.0   0.0   1.0   0.0   1.0
# user2   0.0   1.0   0.0   1.0   1.0
# user3   1.0   0.0   0.0   1.0   1.0
This puts the users column in the index, but you can use reset_index to make it a regular column again.
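As a rough sketch of that last step, the snippet below rebuilds a smaller version of the frame and calls reset_index on the pivoted result. The shortened sample data and the name `flat` are assumptions made for illustration, and clearing the columns name is optional tidying rather than something the pivot requires.

```python
import pandas as pd

# Shortened, made-up sample in the same shape as the question's data
df = pd.DataFrame(
    [["user1", "url1"], ["user1", "url3"], ["user2", "url2"]],
    columns=["users", "urls"],
)
df["count"] = 1

new_df = df.pivot(index="users", columns="urls", values="count").fillna(0)

# Move 'users' out of the index and back into an ordinary column
flat = new_df.reset_index()

# The columns axis is still named 'urls'; clear it for a plainer header (optional)
flat.columns.name = None

print(flat)
```

Note that pivot raises an error if the same (user, url) pair appears more than once, so this sketch assumes each pair is unique, as it is in the question's data.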