I have a very large dataset that looks like this:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan],
                       'C': ['indiana jones', 'duck mc duck', 'batman', 'duck mc duck', np.nan]})
    df
    Out[173]:
                B              C
    0  john smith  indiana jones
    1    john doe   duck mc duck
    2  adam smith         batman
    3    john doe   duck mc duck
    4         NaN            NaN
I need to create an ID variable that is unique for every B-C combination. That is, the output should be:
                B              C  ID
    0  john smith  indiana jones   1
    1    john doe   duck mc duck   2
    2  adam smith         batman   3
    3    john doe   duck mc duck   2
    4         NaN            NaN   0
I don't actually care whether the IDs start at zero, or whether rows with missing values get 0 or some other number. I just want something fast that does not take a lot of memory and can be sorted quickly. I use:
    df['combined_id'] = (df.B + df.C).rank(method='dense')

but the output is float64 and takes a lot of memory. Can we do better?
Thanks!
Answer
I think you can use factorize:
    df['combined_id'] = pd.factorize(df.B + df.C)[0]
    print(df)
                B              C  combined_id
    0  john smith  indiana jones            0
    1    john doe   duck mc duck            1
    2  adam smith         batman            2
    3    john doe   duck mc duck            1
    4         NaN            NaN           -1
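One caveat with plain concatenation is that distinct pairs can collide: 'ab' + 'c' and 'a' + 'bc' produce the same key. The sketch below is one possible variant, not the only way to do it. It assumes that a separator character absent from the data is available (the "\x1f" used here is an arbitrary choice for illustration) and that the number of distinct pairs fits in int32, so the codes can be downcast to save memory.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan],
    'C': ['indiana jones', 'duck mc duck', 'batman', 'duck mc duck', np.nan],
})

# Assumption: "\x1f" never occurs in B or C, so the joined key is unambiguous.
SEP = "\x1f"
key = df['B'] + SEP + df['C']  # a NaN in either column makes the key NaN

# factorize returns integer codes (-1 for missing keys) and the unique keys.
codes, uniques = pd.factorize(key)

# Assumption: fewer than 2**31 distinct pairs, so int32 is wide enough.
df['combined_id'] = codes.astype(np.int32)

print(df)
print(df['combined_id'].dtype)
```

The codes are assigned in order of first appearance, so they are stable for a given row order but are not sorted by key. If sorted codes matter, `pd.factorize` accepts `sort=True`.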