I have a problem with pandas and replacing values. I have a table with animals and their alleles looks that:
Name User A1_Top A2_Top stefan1 721 A C stefan2 721 A G stefan3 331 T T stefan4 331 C G stefan5 331 A A stefan6 721 G G
And I need to change values of Top1 and Top2 by a specific Key for each row.
For example: if values in same rows will be = C & A, I will replace it to A & B, if row == TT will be BB
etc. (key is in if/else below).
I got an answer in another post how to do it by dictionary but I can’t handle with that double condition (if it will be one condition for ex. if A in first row, replace to B, it will be ok). So I just put it into if/else loop and it works… I mean worked, it worked until files were large. Now it soo slow. 300mb file can be processing 30 min on a standard desktop.
That’s how my code look now:
def ATCG_to_AB(df): x = 0 for i in range(lenFor): if df['A1_TOP'].iloc[i] == 'A' and df['A2_TOP'].iloc[i] =='C': df['A1_TOP'].iloc[i] = 'A' df['A2_TOP'].iloc[i] ='B' elif df['A1_TOP'].iloc[i] == 'A' and df['A2_TOP'].iloc[i] =='G': df['A1_TOP'].iloc[i] = 'A' df['A2_TOP'].iloc[i] ='B' elif df['A1_TOP'].iloc[i] == 'A' and df['A2_TOP'].iloc[i] =='T': df['A1_TOP'].iloc[i] = 'A' df['A2_TOP'].iloc[i] ='B' elif df['A1_TOP'].iloc[i] == 'C' and df['A2_TOP'].iloc[i] =='G': df['A1_TOP'].iloc[i] = 'A' df['A2_TOP'].iloc[i] ='B' elif df['A1_TOP'].iloc[i] == 'A' and df['A2_TOP'].iloc[i] =='A': df['A1_TOP'].iloc[i] = 'A' df['A2_TOP'].iloc[i] ='A' elif df['A1_TOP'].iloc[i] == 'C' and df['A2_TOP'].iloc[i] =='C': df['A1_TOP'].iloc[i] = 'B' df['A2_TOP'].iloc[i] ='B' elif df['A1_TOP'].iloc[i] == 'G' and df['A2_TOP'].iloc[i] =='G': df['A1_TOP'].iloc[i] = 'B' df['A2_TOP'].iloc[i] ='B' elif df['A1_TOP'].iloc[i] == 'T' and df['A2_TOP'].iloc[i] =='T': df['A1_TOP'].iloc[i] = 'B' df['A2_TOP'].iloc[i] ='B' else: print(x,". Something is wrong in line: ", i) x+=1
X – counting errors. And I know, that code is ugly, so I tried to put the dictionary method. My attempt:
L = [('A', 'C', 'A', 'B'),('A', 'G', 'A', 'B'),('A', 'T', 'A', 'B'), ('C', 'G', 'A', 'B'),('A', 'A', 'A', 'A'),('C', 'C', 'B', 'B'), ('G', 'G', 'B', 'B'),('T', 'T', 'B', 'B')] for x in L: a.loc[(df[2] == x[0]) & (df[3] == x[1]), [2,3]] = [x[2], x[3]]
But I got a bad output. The only A1_top is changed, and usually it a bad symbol. Can someone help me translate my ugly code to the dictionary and explain it? And do I think properly that will be a faster solution?
For sure, expecting output (there are no headers in output, below for clarity)
name User A1_Top A2_Top stefan1 721 A B stefan2 721 A B stefan3 331 B B stefan4 331 A B stefan5 331 A A stefan6 721 B B
Advertisement
Answer
A simple trick, not to say the best, but it works:
[Create a dummy column to do mapping or df[col].apply
]
df['combined'] = df['A1_Top']+"|"+df['A2_Top'] A1_Top A2_Top combined 0 A C A|C 1 A G A|G 2 T T T|T 3 C G C|G 4 A A A|A 5 G G G|G
Create dictionary, mapping all your requirements: I give 1 here
map_dict = {} map_dict['A|C'] = 'B|C' . . . df['new_values'] = df['combined'].apply(lambda x:map_dict[x] if x in map_dict.keys() else x) A1_Top A2_Top combined new_values 0 A C A|C B|C 1 A G A|G A|G 2 T T T|T T|T 3 C G C|G C|G 4 A A A|A A|A 5 G G G|G G|G df['new_a1_top'] = df['new_values'].apply(lambda x: x.split('|')[0]) df['new_a2_top'] = df['new_values'].apply(lambda x: x.split('|')[1]) A1_Top A2_Top combined new_values new_a1_top new_a2_top 0 A C A|C B|C B C 1 A G A|G A|G A G 2 T T T|T T|T T T 3 C G C|G C|G C G 4 A A A|A A|A A A 5 G G G|G G|G G G