I’m in need of some advice on the following issue: I have a DataFrame that looks like this:
   ID                   SEQ  LEN  BEG_GAP  END_GAP
0  A1        AABBCCDDEEFFGG   14        2        4
1  A1        AABBCCDDEEFFGG   14       10       12
2  B1        YYUUUUAAAAMMNN   14        4        6
3  B1        YYUUUUAAAAMMNN   14        8       12
4  C1  LLKKHHUUTTYYYYYYYYAA   20        7        9
5  C1  LLKKHHUUTTYYYYYYYYAA   20       12       15
6  C1  LLKKHHUUTTYYYYYYYYAA   20       17       18
What I need are the pieces of SEQ that lie between the different BEG_GAP and END_GAP positions. I have already worked it out (thanks to a previous question) for sequences that have only one pair of gaps, but here they have multiple.
This is what the sequences should look like:
   ID                   SEQ
0  A1       AA---CDDEE---GG
1  B1       YYUU---A-----NN
2  C1  LLKKHHU---YY----Y--A
Or in an exploded DF:
   ID Seq_slice
0  A1        AA
1  A1     CDDEE
2  A1        GG
3  B1      YYUU
4  B1         A
5  B1        NN
6  C1   LLKKHHU
7  C1        YY
8  C1         Y
9  C1         A
At the moment, I’m using a piece of code (that I got thanks to a previous question) that works only if there’s one gap, and it looks like this:
    import pandas as pd

    df = pd.read_csv("..path_to_the_csv.csv")
    df["BEG_GAP"] = df["BEG_GAP"].astype(int)
    df["END_GAP"] = df["END_GAP"].astype(int)
    df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
    output = df.explode('SEQ').query('SEQ!=""')
But this has the problem that it generates a bunch of sequences that don’t really exist, because each row is sliced on its own gap only and ignores the other gaps of the same ID. For example, this is what it would generate:
   ID   Seq_slice
0  A1          AA
1  A1    CDDEEFFG  #<- this one shouldn't exist! Because there's another gap in 10-12
2  A1  AABBCCDDEE  #<- Also, this one shouldn't exist, it's missing the previous gap.
3  A1          GG
And so on for the other sequences. As you can see, some slices are not being generated and some are wrong, because I don’t know how to make the code take all the gaps into account while analyzing the sequence.
All advice is appreciated, I hope I was clear!
Answer
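Let’s try defining a function and using groupby with apply, so that all the gaps of one ID are handled together. This assumes the rows of each ID are sorted by BEG_GAP and that the gaps do not overlap; each slice resumes at END_GAP itself: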
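    def truncate(data):
        # Every row of one ID carries the same SEQ and LEN, so take the first.
        seq = data.SEQ.iloc[0]
        ll = data.LEN.iloc[0]
        # Pair each slice start (0, then every END_GAP) with its stop
        # (every BEG_GAP, then the sequence length).
        return [seq[x:y] for x, y in zip([0] + list(data.END_GAP), list(data.BEG_GAP) + [ll])]

    (df.groupby('ID').apply(truncate)
       .explode().reset_index(name='Seq_slice')
    )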
Output:
   ID Seq_slice
0  A1        AA
1  A1    CCDDEE
2  A1        GG
3  B1      YYUU
4  B1        AA
5  B1        NN
6  C1   LLKKHHU
7  C1       TYY
8  C1        YY
9  C1        AA
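Note that these slices start again at END_GAP itself, whereas the code in the question skips that position with END_GAP+1, which is why some pieces here are one character longer than the ones asked for (CCDDEE instead of CDDEE). Below is a minimal, self-contained sketch of the same idea that makes this choice explicit. The sample frame is typed in from the question; the names slice_between_gaps, explode_slices and the end_inclusive switch are made up for illustration and are not part of pandas. It also sorts the gaps first and loops over the groups directly instead of using apply, under the assumption that gaps of one ID never overlap.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": ["A1", "A1", "B1", "B1", "C1", "C1", "C1"],
    "SEQ": ["AABBCCDDEEFFGG"] * 2
           + ["YYUUUUAAAAMMNN"] * 2
           + ["LLKKHHUUTTYYYYYYYYAA"] * 3,
    "LEN": [14, 14, 14, 14, 20, 20, 20],
    "BEG_GAP": [2, 10, 4, 8, 7, 12, 17],
    "END_GAP": [4, 12, 6, 12, 9, 15, 18],
})


def slice_between_gaps(group, end_inclusive=False):
    """Return the pieces of SEQ that lie outside every gap of one ID.

    end_inclusive is a hypothetical switch: False resumes each slice at
    END_GAP (as in the answer above), True resumes at END_GAP + 1
    (as in the question's single-gap code).
    """
    group = group.sort_values("BEG_GAP")  # assumes gaps do not overlap
    seq = group["SEQ"].iloc[0]
    offset = 1 if end_inclusive else 0
    starts = [0] + [end + offset for end in group["END_GAP"]]
    stops = list(group["BEG_GAP"]) + [len(seq)]
    # Drop empty pieces, e.g. when a gap touches the start or end of SEQ.
    return [seq[a:b] for a, b in zip(starts, stops) if seq[a:b]]


def explode_slices(frame, end_inclusive=False):
    """Build one (ID, Seq_slice) row per piece, keeping the ID order."""
    rows = [
        (id_, piece)
        for id_, group in frame.groupby("ID", sort=False)
        for piece in slice_between_gaps(group, end_inclusive)
    ]
    return pd.DataFrame(rows, columns=["ID", "Seq_slice"])


result = explode_slices(df)                         # same slices as above
inclusive = explode_slices(df, end_inclusive=True)  # END_GAP + 1 variant
print(result)
print(inclusive)
```

Which of the two variants is right depends on whether END_GAP is meant to be the last position inside the gap or the first position after it, so it is worth checking a couple of rows by hand against your own data.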