
OneHotEncoding Protein Sequences

I have a dataframe of protein sequences and am trying to one-hot encode them and store the result in a new dataframe. My attempt does not give me anything I can store: the encoder returns an unexpected array, and I then get an error.


Answer

You get that strange array because the encoder treats every whole sequence as a single entry and one-hot encodes that, rather than the individual amino acids inside it.
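As a minimal sketch of that behaviour, assuming scikit-learn's `OneHotEncoder` is the encoder in question (the original code is not shown here) and using three made-up toy sequences:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy sequences invented for illustration; not the data from the question.
seqs = np.array(["MKV", "MKT", "AAG"]).reshape(-1, 1)

# Each whole string is treated as one category, so three unique
# sequences produce three columns.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(seqs).toarray()

print(encoder.categories_)
print(encoded)
```

Every row contains a single 1 in the column belonging to its own sequence, which says nothing about the residues themselves.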

Since your entries are all unique, every sequence becomes its own category, so you get a matrix that is almost entirely zeros, with a single 1 per row. You can understand this better if you use pd.get_dummies.
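A small sketch with `pd.get_dummies`, again on made-up toy sequences and with an assumed column name `sequence`, makes the one-column-per-sequence result visible through the column labels:

```python
import pandas as pd

# Toy data and the column name "sequence" are assumptions for illustration.
df = pd.DataFrame({"sequence": ["MKV", "MKT", "AAG"]})

# One indicator column per distinct string in the column.
dummies = pd.get_dummies(df["sequence"])
print(dummies)
```

The columns are the full sequences themselves, so with unique sequences the result is just a reshuffled identity matrix.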

There are two ways to do this. One is to simply count how often each amino acid occurs in a sequence and use those counts as predictors. I hope I have the amino acids right (school was a long time ago).
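The counting approach could look roughly like this. It is a sketch only: the toy sequences, the `sequence` column name, and the `AMINO_ACIDS` list (the 20 standard one-letter codes) are assumptions, not taken from the question.

```python
import pandas as pd

# The 20 standard amino acids as one-letter codes (assumed alphabet).
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

# Toy data invented for illustration.
df = pd.DataFrame({"sequence": ["MKVLA", "MKTAA", "AAGGW"]})

# One row per sequence, one column per amino acid, values are counts.
counts = pd.DataFrame(
    [[seq.count(aa) for aa in AMINO_ACIDS] for seq in df["sequence"]],
    columns=AMINO_ACIDS,
    index=df.index,
)
print(counts)
```

This works for sequences of different lengths, but it discards the order of the residues.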

The other is to split each sequence into its individual residues and encode by position. This requires all sequences to be equally long, and enough memory, since you get one column for every combination of position and amino acid.
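A possible sketch of the per-position encoding, under the same assumptions as before (toy sequences, a `sequence` column, and the 20 standard one-letter codes as the alphabet; the `pos0`, `pos1`, ... names are made up for this example):

```python
import pandas as pd

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # assumed alphabet

# Toy data invented for illustration; all sequences have length 5.
df = pd.DataFrame({"sequence": ["MKVLA", "MKTAA", "AAGGW"]})

# Per-position encoding only makes sense for equal-length sequences.
if df["sequence"].str.len().nunique() != 1:
    raise ValueError("all sequences must have the same length")

# Split every sequence into one column per position.
positions = pd.DataFrame([list(s) for s in df["sequence"]], index=df.index)
positions.columns = [f"pos{i}" for i in range(positions.shape[1])]

# Fix the categories so each position always yields all 20 columns,
# even for amino acids that never occur at that position.
for col in positions.columns:
    positions[col] = pd.Categorical(positions[col], categories=AMINO_ACIDS)

encoded = pd.get_dummies(positions)
print(encoded.shape)
```

The result has 20 columns per position (here 5 x 20 = 100), which is why memory becomes a concern for long sequences.
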
User contributions licensed under: CC BY-SA