I have a two-column pandas DataFrame containing user IDs and the URLs they have visited. It looks like this:
   users  urls
0  user1  url1
1  user1  url3
2  user1  url5
3  user2  url2
4  user2  url4
5  user2  url5
6  user3  url1
7  user3  url4
8  user3  url5
I want to turn it into a vector representation, with one row per user and one column per URL, like this:
       url1  url2  url3  url4  url5
user1   1.0   NaN   1.0   NaN   1.0
user2   NaN   1.0   NaN   1.0   1.0
user3   1.0   NaN   NaN   1.0   1.0
I’ve tried different things, but keep hitting a wall. Any ideas?
Answer
What you’re describing is a pivot on the urls column:
import pandas as pd

# Make data
df = pd.DataFrame([
['user1', 'url1'],
['user1', 'url3'],
['user1', 'url5'],
['user2', 'url2'],
['user2', 'url4'],
['user2', 'url5'],
['user3', 'url1'],
['user3', 'url4'],
['user3', 'url5']
], columns=['users', 'urls'])
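# Add a helper column of ones to serve as the pivoted values
df['count'] = 1
# Pivot to one row per user and one column per URL; missing pairs become 0
new_df = df.pivot(index='users', columns='urls', values='count').fillna(0)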
new_df
# urls url1 url2 url3 url4 url5
# users
# user1 1.0 0.0 1.0 0.0 1.0
# user2 0.0 1.0 0.0 1.0 1.0
# user3 1.0 0.0 0.0 1.0 1.0
This puts the users column in the index, but you can use reset_index to make it a regular column again.
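As a minimal sketch of that last step, the snippet below rebuilds the same frame, pivots it, and then calls reset_index. The names wide and flat are made up for illustration, and clearing the column-axis label is an optional extra, not something the answer requires.

```python
import pandas as pd

# Same sample data as in the answer
df = pd.DataFrame(
    [
        ["user1", "url1"],
        ["user1", "url3"],
        ["user1", "url5"],
        ["user2", "url2"],
        ["user2", "url4"],
        ["user2", "url5"],
        ["user3", "url1"],
        ["user3", "url4"],
        ["user3", "url5"],
    ],
    columns=["users", "urls"],
)
df["count"] = 1

# "wide" and "flat" are illustrative names, not part of the original answer
wide = df.pivot(index="users", columns="urls", values="count").fillna(0)

# Move "users" out of the index and back into an ordinary column
flat = wide.reset_index()

# Optional: drop the leftover "urls" label on the column axis
flat.columns.name = None

print(flat)
```

After this, flat has a default integer index, with users as its first column followed by one column per URL.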