I have an input pandas DataFrame with two columns: the first is a sequence and the second is an ID (a number between 1 and 1000). I want to get all possible pairwise combinations of the sequences that share the same ID.
Input:
sequence          ID
CASSSTGVLLYEQCF   1
CASSSTGVLLYEQYF   1
CAFNAGGTSHGKLTF   2
CAFNAGGTSYGKLTF   2
CAINAGGTSYGKLTF   2
CANSPSPVAGTDTQYF  3
CASSPSPVAGTDTQYF  3
Desired output:
CASSSTGVLLYEQCF   CASSSTGVLLYEQYF
CAFNAGGTSHGKLTF   CAFNAGGTSYGKLTF
CAFNAGGTSYGKLTF   CAINAGGTSYGKLTF
CAINAGGTSYGKLTF   CAFNAGGTSHGKLTF
CANSPSPVAGTDTQYF  CASSPSPVAGTDTQYF
I have been reading about itertools, but it only gives me all possible combinations across the whole column, without taking the ID into account. Does anyone know how this can be done in Python, or have any tips on where I can look?
Answer
Use a custom lambda function with itertools.combinations applied per group in GroupBy.apply:
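from itertools import combinations
import pandas as pd

# For each ID group, build a DataFrame of all 2-element combinations of its sequences
df1 = df.groupby('ID')['sequence'].apply(lambda x: pd.DataFrame(combinations(x, 2),
                                                               columns=['a','b']))
print(df1)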
                     a                 b
ID                                      
1  0   CASSSTGVLLYEQCF   CASSSTGVLLYEQYF
2  0   CAFNAGGTSHGKLTF   CAFNAGGTSYGKLTF
   1   CAFNAGGTSHGKLTF   CAINAGGTSYGKLTF
   2   CAFNAGGTSYGKLTF   CAINAGGTSYGKLTF
3  0  CANSPSPVAGTDTQYF  CASSPSPVAGTDTQYF
# Drop the inner per-group row counter and turn ID back into a regular column
df1 = df1.droplevel(1).reset_index()
print(df1)
   ID                 a                 b
0   1   CASSSTGVLLYEQCF   CASSSTGVLLYEQYF
1   2   CAFNAGGTSHGKLTF   CAFNAGGTSYGKLTF
2   2   CAFNAGGTSHGKLTF   CAINAGGTSYGKLTF
3   2   CAFNAGGTSYGKLTF   CAINAGGTSYGKLTF
4   3  CANSPSPVAGTDTQYF  CASSPSPVAGTDTQYF
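For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and produces the same pairs. It is an alternative formulation rather than the answer's exact code: it iterates over the groups directly instead of going through GroupBy.apply, and the helper name pairs_within_groups and its key/value parameters are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "sequence": [
        "CASSSTGVLLYEQCF", "CASSSTGVLLYEQYF",
        "CAFNAGGTSHGKLTF", "CAFNAGGTSYGKLTF", "CAINAGGTSYGKLTF",
        "CANSPSPVAGTDTQYF", "CASSPSPVAGTDTQYF",
    ],
    "ID": [1, 1, 2, 2, 2, 3, 3],
})


def pairs_within_groups(frame, key="ID", value="sequence"):
    """Hypothetical helper: all unordered pairs of `value` sharing the same `key`."""
    rows = [
        (group_id, a, b)
        # Iterating a grouped column yields (group key, Series of values).
        for group_id, seqs in frame.groupby(key)[value]
        # combinations(..., 2) yields each unordered pair exactly once.
        for a, b in combinations(seqs, 2)
    ]
    return pd.DataFrame(rows, columns=[key, "a", "b"])


pairs = pairs_within_groups(df)
print(pairs)
```

Note that combinations returns unordered pairs, so a group of n sequences contributes n*(n-1)/2 rows and a group with a single sequence contributes none. If both orderings (a, b) and (b, a) are needed, itertools.permutations(seqs, 2) could be swapped in instead.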