I have an input pandas DataFrame with two columns: one is the sequence and the other is an ID (a number between 1 and 1000). I want to get all possible combinations of the sequences that share the same ID.
Input:
sequence            ID
CASSSTGVLLYEQCF     1
CASSSTGVLLYEQYF     1
CAFNAGGTSHGKLTF     2
CAFNAGGTSYGKLTF     2
CAINAGGTSYGKLTF     2
CANSPSPVAGTDTQYF    3
CASSPSPVAGTDTQYF    3
Desired output:
CASSSTGVLLYEQCF     CASSSTGVLLYEQYF
CAFNAGGTSHGKLTF     CAFNAGGTSYGKLTF
CAFNAGGTSYGKLTF     CAINAGGTSYGKLTF
CAINAGGTSYGKLTF     CAFNAGGTSHGKLTF
CANSPSPVAGTDTQYF    CASSPSPVAGTDTQYF
I have been reading about itertools, but it only gives me all possible combinations without taking the ID into account. Does anyone know how this can be done in Python, or have any tips on where to look?
Answer
Use a custom lambda function with itertools.combinations per group in GroupBy.apply:
from itertools import combinations
import pandas as pd

# Build every 2-element combination of sequences within each ID group
df1 = df.groupby('ID')['sequence'].apply(lambda x: pd.DataFrame(combinations(x, 2), columns=['a','b']))
print (df1)
                     a                 b
ID
1  0   CASSSTGVLLYEQCF   CASSSTGVLLYEQYF
2  0   CAFNAGGTSHGKLTF   CAFNAGGTSYGKLTF
   1   CAFNAGGTSHGKLTF   CAINAGGTSYGKLTF
   2   CAFNAGGTSYGKLTF   CAINAGGTSYGKLTF
3  0  CANSPSPVAGTDTQYF  CASSPSPVAGTDTQYF

# Drop the inner per-group counter and turn ID back into a column
df1 = df1.droplevel(1).reset_index()
print (df1)
   ID                 a                 b
0   1   CASSSTGVLLYEQCF   CASSSTGVLLYEQYF
1   2   CAFNAGGTSHGKLTF   CAFNAGGTSYGKLTF
2   2   CAFNAGGTSHGKLTF   CAINAGGTSYGKLTF
3   2   CAFNAGGTSYGKLTF   CAINAGGTSYGKLTF
4   3  CANSPSPVAGTDTQYF  CASSPSPVAGTDTQYF
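For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the input from the question and collects the pairs with a list comprehension over the groups instead of GroupBy.apply, which is one possible variant and not the only way to do it. The helper name pairs_within_groups and its key/value parameters are made up for illustration and are not part of pandas.

```python
from itertools import combinations

import pandas as pd

# Input data copied from the question
df = pd.DataFrame({
    "sequence": [
        "CASSSTGVLLYEQCF", "CASSSTGVLLYEQYF",
        "CAFNAGGTSHGKLTF", "CAFNAGGTSYGKLTF", "CAINAGGTSYGKLTF",
        "CANSPSPVAGTDTQYF", "CASSPSPVAGTDTQYF",
    ],
    "ID": [1, 1, 2, 2, 2, 3, 3],
})


def pairs_within_groups(frame, key="ID", value="sequence"):
    """Hypothetical helper: all 2-element combinations of `value` per `key` group."""
    rows = [
        (group_id, a, b)
        # Iterating a grouped column yields (group label, Series of values)
        for group_id, seqs in frame.groupby(key)[value]
        # combinations() never pairs a value with itself and ignores order
        for a, b in combinations(seqs, 2)
    ]
    return pd.DataFrame(rows, columns=[key, "a", "b"])


pairs = pairs_within_groups(df)
print(pairs)
```

Note that combinations treats (x, y) and (y, x) as the same pair, so each pair appears once. If both orders are needed, itertools.permutations(seqs, 2) can be swapped in.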