How to slice/chop a string using multiple indexes in a pandas DataFrame

I’m in need of some advice on the following issue: I have a DataFrame that looks like this:

   ID                   SEQ LEN BEG_GAP END_GAP  
0  A1        AABBCCDDEEFFGG  14       2       4  
1  A1        AABBCCDDEEFFGG  14      10      12
2  B1        YYUUUUAAAAMMNN  14       4       6
3  B1        YYUUUUAAAAMMNN  14       8      12
4  C1  LLKKHHUUTTYYYYYYYYAA  20       7       9
5  C1  LLKKHHUUTTYYYYYYYYAA  20      12      15
6  C1  LLKKHHUUTTYYYYYYYYAA  20      17      18

What I need are the pieces of SEQ that lie outside each BEG_GAP to END_GAP range. I have already worked this out (thanks to a previous question) for sequences that have only one pair of gaps, but here each sequence has several.

This is what the sequences should look like:

  ID                   SEQ 
0 A1       AA---CDDEE---GG  
1 B1       YYUU---A-----NN  
2 C1  LLKKHHU---YY----Y--A

Or in an exploded DF:

  ID Seq_slice
0 A1        AA
1 A1     CDDEE  
2 A1        GG
3 B1      YYUU
4 B1         A   
5 B1        NN
6 C1   LLKKHHU
7 C1        YY
8 C1         Y
9 C1         A

At the moment, I’m using a piece of code (that I got thanks to a previous question) that works only if there’s one gap, and it looks like this:

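import pandas as pd

df = pd.read_csv("..path_to_the_csv.csv")

# Make sure the gap boundaries are integers so they can be used as slice indexes
df["BEG_GAP"] = df["BEG_GAP"].astype(int)
df["END_GAP"] = df["END_GAP"].astype(int)

# Split each row's SEQ into the part before its gap and the part after it
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)

# One row per slice, dropping empty slices
output = df.explode('SEQ').query('SEQ!=""')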

The problem is that it generates a bunch of sequences that don’t really exist, because they actually have another gap in the middle. For example, this is what it would generate:

  ID   Seq_slice
0 A1          AA
1 A1    CDDEEFFG #<- this one shouldn't exist! Because there's another gap in 10-12
2 A1  AABBCCDDEE #<- Also, this one shouldn't exist, it's missing the previous gap.
3 A1          GG

And so on with the other sequences. As you can see, some slices are not being generated and some are wrong, because I don’t know how to tell the code to take all the gaps into account while analyzing the sequence.

All advice is appreciated. I hope I was clear!

Answer

Let’s try defining a function and applying it to each ID group:

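def truncate(data):
    # SEQ and LEN are repeated on every row of an ID, so take the first one
    seq = data.SEQ.iloc[0]
    ll = data.LEN.iloc[0]
    # Pair each slice start (0, then every END_GAP) with its slice end
    # (every BEG_GAP, then the sequence length).
    # Assumes the rows of each ID are sorted by BEG_GAP.
    return [seq[x:y] for x,y in zip([0]+list(data.END_GAP),
                                    list(data.BEG_GAP)+[ll])]

# One list of slices per ID, then one row per slice
(df.groupby('ID').apply(truncate)
   .explode().reset_index(name='Seq_slice')
)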

Output:

   ID Seq_slice
0  A1        AA
1  A1    CCDDEE
2  A1        GG
3  B1      YYUU
4  B1        AA
5  B1        NN
6  C1   LLKKHHU
7  C1       TYY
8  C1        YY
9  C1        AA
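
For anyone who wants to try this without the CSV, here is a minimal self-contained sketch of the same idea. The DataFrame is rebuilt by hand from the table in the question. The helper names (`split_on_gaps`, `slices_per_id`, `mask_gaps`) and the `end_inclusive` switch are made up for this illustration and are not part of pandas. The switch exists because the question's own code slices from `END_GAP+1`, while the answer above slices from `END_GAP`. Which convention is intended is an assumption you would need to check against your data.

```python
import pandas as pd

# Example data copied by hand from the table in the question
df = pd.DataFrame({
    "ID": ["A1", "A1", "B1", "B1", "C1", "C1", "C1"],
    "SEQ": ["AABBCCDDEEFFGG"] * 2
           + ["YYUUUUAAAAMMNN"] * 2
           + ["LLKKHHUUTTYYYYYYYYAA"] * 3,
    "LEN": [14, 14, 14, 14, 20, 20, 20],
    "BEG_GAP": [2, 10, 4, 8, 7, 12, 17],
    "END_GAP": [4, 12, 6, 12, 9, 15, 18],
})


def split_on_gaps(seq, gaps, end_inclusive=False):
    """Return the non-empty pieces of seq that lie outside the gaps.

    gaps is an iterable of (begin, end) pairs. With end_inclusive=False
    a gap covers seq[begin:end]; with True it covers seq[begin:end + 1].
    """
    pieces, start = [], 0
    for begin, end in sorted(gaps):
        pieces.append(seq[start:begin])
        stop = end + 1 if end_inclusive else end
        start = max(start, stop)  # never move backwards if gaps overlap
    pieces.append(seq[start:])
    return [p for p in pieces if p]


def mask_gaps(seq, gaps, end_inclusive=False):
    """Return seq with every gap position replaced by '-'."""
    chars = list(seq)
    for begin, end in gaps:
        stop = min(end + 1 if end_inclusive else end, len(seq))
        chars[begin:stop] = "-" * max(stop - begin, 0)
    return "".join(chars)


def slices_per_id(frame, end_inclusive=False):
    """Build a long DataFrame with one row per kept slice."""
    rows = []
    for seq_id, group in frame.groupby("ID", sort=False):
        # tolist() yields plain Python ints, which are safe as slice bounds
        gaps = zip(group["BEG_GAP"].tolist(), group["END_GAP"].tolist())
        seq = group["SEQ"].iloc[0]
        for piece in split_on_gaps(seq, gaps, end_inclusive):
            rows.append({"ID": seq_id, "Seq_slice": piece})
    return pd.DataFrame(rows)


out = slices_per_id(df)
print(out)
print(mask_gaps("AABBCCDDEEFFGG", [(2, 4), (10, 12)]))
```

With the default `end_inclusive=False`, `slices_per_id(df)` gives the same slices as the output above. Passing `end_inclusive=True` drops the character at `END_GAP` as well, so the middle slice of A1 becomes `CDDEE` instead of `CCDDEE`.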
