I’d like to create a third column as a result of a cross join between my Columns A
and B
:
import pandas as pd import numpy as np df = pd.read_csv("data.csv", sep=",") df # A B # 0 0 Yes # 1 8 No # 2 2 Yes # 3 4 Maybe # 4 6 NA
They have the following unique values:
>>> df['A'].drop_duplicates() 0 0 2 8 41 4 119 2 1246 3 1808 1 Name: A, dtype: int64 >>> df['B'].drop_duplicates() 0 NA 2 Maybe 320 No 5575 Yes Name: B, dtype: object
I’d like to have a df[‘C’] with the combination of all cross joins, thus we should have 6 * 4 = 24 unique values in it:
#Column C should have 6 * 4 classes: (1,Yes)=1 (1,No)=6 (1, Maybe)=12 (1, NA)=18 (2,Yes)=2 (2,No)=7 (2, maybe)=13 ... (3,Yes)=3 (3,No)=8 ... (4,Yes)=4 (4,No)=9 (8,Yes)=5 ... (0,Yes)=0
Thus we should have the following:
Newdf # A B C # 0 0 Yes 0 # 1 8 No 9 # 2 2 Yes 2 # 3 4 Maybe 15 # 4 8 NA 22
Using this method, I have the following error:
out = df.merge(df[['B']].drop_duplicates().merge(df['A'].drop_duplicates(),how='cross').assign(C=lambda x : x.index+1))
Throws:
"No common columns to perform merge on. " pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
Any help would be appreciated.
Advertisement
Answer
Why don’t you use the good old itertools:
from itertools import product cats = list(product(df['A'].unique(), df['B'].unique())) # merge with this pd.DataFrame(cats, columns=['A','B']).assign(C=range(len(cats)))
Output for the example data:
A B C 0 0 Yes 0 1 0 No 1 2 0 Maybe 2 3 0 NaN 3 4 8 Yes 4 5 8 No 5 6 8 Maybe 6 7 8 NaN 7 8 2 Yes 8 9 2 No 9 10 2 Maybe 10 11 2 NaN 11 12 4 Yes 12 13 4 No 13 14 4 Maybe 14 15 4 NaN 15 16 6 Yes 16 17 6 No 17 18 6 Maybe 18 19 6 NaN 19