OneHotEncoding Protein Sequences

Question

I have a dataframe of protein sequences and am trying to one-hot encode them and store the result in a new dataframe. My code runs, but I am not able to store the result because of the output I get afterwards.

Accepted Answer

You get that strange array because the encoder treats every whole sequence as a single entry and one-hot encodes that. A small example shows this:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    df = pd.DataFrame({'sequence': ['AQAVPW', 'AMAVLT', 'LDTGIN']})

    enc = OneHotEncoder()
    # one column, one row per sequence
    seq = np.array(df['sequence']).reshape(-1, 1)
    encoded = enc.fit(seq)
    encoded.transform(seq).toarray()

    array([[0., 1., 0.],
           [1., 0., 0.],
           [0., 0., 1.]])

    encoded.categories_

    [array(['AMAVLT', 'AQAVPW', 'LDTGIN'], dtype=object)]

Since your entries are all unique, every sequence becomes its own category, so you get a matrix that is almost all zeros with a single 1 in each row. You can understand this better if you use pd.get_dummies:

    pd.get_dummies(df['sequence'])

       AMAVLT  AQAVPW  LDTGIN
    0       0       1       0
    1       1       0       0
    2       0       0       1

There are two ways to do this. One way is to simply count the amino acid occurrences and use those counts as predictors. I hope I get the amino acids correct (from school, a long time ago):

    from Bio.SeqUtils.ProtParam import ProteinAnalysis

    # one row per sequence, one column per amino acid
    pd.DataFrame([ProteinAnalysis(i).count_amino_acids() for i in df['sequence']])

       A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
    0  2  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0  0  1  1  0
    1  2  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  1  1  0  0
    2  0  0  1  0  0  1  0  1  0  1  0  1  0  0  0  0  1  0  0  0

The other way is to split each sequence into single characters and do the encoding by position. This requires the sequences to be equally long, and that you have enough memory:

    # one column per position, one character per cell
    byposition = df['sequence'].apply(lambda x: pd.Series(list(x)))
    byposition

       0  1  2  3  4  5
    0  A  Q  A  V  P  W
    1  A  M  A  V  L  T
    2  L  D  T  G  I  N

    pd.get_dummies(byposition)

       0_A  0_L  1_D  1_M  1_Q  2_A  2_T  3_G  3_V  4_I  4_L  4_P  5_N  5_T  5_W
    0    1    0    0    0    1    1    0    0    1    0    0    1    0    0    1
    1    1    0    0    1    0    1    0    0    1    0    1    0    0    1    0
    2    0    1    1    0    0    0    1    1    0    1    0    0    1    0    0
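One thing to watch with the by-position approach: pd.get_dummies only creates columns for the residue and position combinations it actually sees, so a second dataset can end up with different columns. A minimal sketch of one way around that is below. It assumes a fixed alphabet of the 20 standard amino acids, and the helper name one_hot_by_position and the constant AMINO_ACIDS are made up for illustration (they are not part of pandas or scikit-learn).

```python
import numpy as np
import pandas as pd

# Assumption: only the 20 standard amino acids occur, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_by_position(sequences, alphabet=AMINO_ACIDS):
    """Hypothetical helper: one-hot encode equal-length sequences by position.

    Returns a DataFrame with len(alphabet) columns per position, named
    "<position>_<letter>", so the layout does not depend on the data.
    """
    sequences = list(sequences)
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    length = lengths.pop()

    # Map each letter to its column offset within a position block.
    index = {aa: i for i, aa in enumerate(alphabet)}

    encoded = np.zeros((len(sequences), length * len(alphabet)), dtype=np.uint8)
    for row, seq in enumerate(sequences):
        for pos, aa in enumerate(seq):
            # A KeyError here means a letter outside the assumed alphabet.
            encoded[row, pos * len(alphabet) + index[aa]] = 1

    columns = [f"{pos}_{aa}" for pos in range(length) for aa in alphabet]
    return pd.DataFrame(encoded, columns=columns)


df = pd.DataFrame({"sequence": ["AQAVPW", "AMAVLT", "LDTGIN"]})
onehot = one_hot_by_position(df["sequence"])
print(onehot.shape)  # (3, 120): 6 positions x 20 letters
```

The result is already a dataframe, so it can be stored or joined back with pd.concat([df, onehot], axis=1). The trade-off is width: every position gets 20 columns whether or not a residue occurs there, which is where the memory caveat above comes in.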
