I have a very large dataset that looks like this:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan],
                       'C': ['indiana jones', 'duck mc duck', 'batman', 'duck mc duck', np.nan]})
    df
    Out[173]:
                B              C
    0  john smith  indiana jones
    1    john doe   duck mc duck
    2  adam smith         batman
    3    john doe   duck mc duck
    4         NaN            NaN
I need to create an ID variable that is unique for every B-C combination. That is, the output should be:
                B              C  ID
    0  john smith  indiana jones   1
    1    john doe   duck mc duck   2
    2  adam smith         batman   3
    3    john doe   duck mc duck   2
    4         NaN            NaN   0
I actually don't care whether the IDs start at zero or not, or whether the value for rows with missing data is 0 or any other number. I just want something fast that does not take a lot of memory and can be sorted quickly. I use:
    df['combined_id'] = (df.B + df.C).rank(method='dense')
but the output is float64 and takes a lot of memory. Can we do better?
Thanks!
Answer
I think you can use factorize:
    df['combined_id'] = pd.factorize(df.B + df.C)[0]
    print(df)
                B              C  combined_id
    0  john smith  indiana jones            0
    1    john doe   duck mc duck            1
    2  adam smith         batman            2
    3    john doe   duck mc duck            1
    4         NaN            NaN           -1
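As a possible refinement of the factorize approach, here is a minimal sketch that guards against two things the one-liner leaves open. Joining B and C with no separator can merge distinct pairs ('ab' + 'c' and 'a' + 'bc' give the same string), and the codes come back as platform-sized integers. The '|' separator and the int32 downcast are assumptions of this sketch, not part of the original answer: it assumes '|' never appears in the data and that there are fewer than 2**31 distinct combinations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan],
    'C': ['indiana jones', 'duck mc duck', 'batman', 'duck mc duck', np.nan],
})

# Assumption: '|' never occurs in B or C, so joined keys cannot collide
# (plain B + C would treat 'ab' + 'c' and 'a' + 'bc' as the same key).
key = df['B'] + '|' + df['C']

# factorize returns (codes, uniques); codes are integers in order of first
# appearance, and -1 marks rows whose key is missing.
codes, uniques = pd.factorize(key)

# Assumption: fewer than 2**31 distinct combinations, so int32 is wide enough.
df['combined_id'] = codes.astype(np.int32)

print(df)
print(df['combined_id'].dtype)
```

On this sample frame the IDs match the ones shown above (0, 1, 2, 1, -1), but stored as int32 rather than float64, which should roughly halve the memory used by that column.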