Fill dataframe values per column, by row index, if position is present in range

I have lists of start and stop coordinates of ranges, one list per column, and would like to fill a pandas DataFrame according to whether each row position falls inside one of those ranges.

The number of rows is predetermined and every cell starts as '0'. For example, if a column has the range [1, 3], then rows (index) 1 to 3, inclusive, are filled with '1'.

import pandas as pd

d = {
    'a': [[0,2], [3,7], [13,23], [24,25]],
    'b': [[1,5], [8,12], [15,18], [20,24]],
}
presabsdict = {}

for G in d.keys():
    # Start with 50 rows of '0' for this column
    refpositions = list('0'*50)
    positions = d.get(G)
    for pos in positions:
        pos2 = pos[1]
        pos1 = pos[0]
        poslength = (pos2-pos1)
        # Mark every position from pos1 to pos2 (inclusive) as present
        refpositions[pos1:(pos2+1)] = (list('1'*(poslength+1)))
    presabsdict[G] = refpositions

df = pd.DataFrame.from_dict(presabsdict,orient='index').transpose()
# Row-wise count of the columns in which the position is present
df["Sitespresent"] = df.astype(int).sum(axis=1).astype(int)
print(df)

This is very inefficient for large datasets. The ultimate goal is the 'Sitespresent' column, so a solution that forgoes the DataFrame would also be suitable.

Answer

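You can treat every range as an interval closed on both ends and, for each row position, count how many of those intervals overlap it: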

import pandas as pd

# d is the dict of ranges defined in the question
refpositions = pd.DataFrame({'pos':range(50)})
# All ranges from all columns, as intervals closed on both ends
intervals = pd.arrays.IntervalArray([pd.Interval(start,end) for _, v in d.items() for start, end in v], closed='both')
# Each row position as a single-point interval
pos_as_intv = [pd.Interval(i,i, closed='both') for i in refpositions.pos]

# Walk through overlaps and count
refpositions['total'] = [intervals.overlaps(x).sum() for x in pos_as_intv]
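
Since the question says the DataFrame itself is optional, here is a minimal sketch of another way to get the same per-position counts with plain NumPy, using a difference array: add 1 where a range starts, subtract 1 just past where it stops, then take a running sum. This is a swapped-in technique, not part of the code above. The names n_positions, diff and sites_present are my own, and the sketch assumes every stop value is smaller than n_positions. Like the overlap count above, it counts a position once per range that covers it, so overlapping ranges inside a single column would be counted more than once.

```python
import numpy as np

# Same ranges as in the question; both endpoints are inclusive.
d = {
    'a': [[0, 2], [3, 7], [13, 23], [24, 25]],
    'b': [[1, 5], [8, 12], [15, 18], [20, 24]],
}
n_positions = 50  # assumed number of rows, taken from the question's example

# One extra slot so that stop + 1 is always a valid index.
diff = np.zeros(n_positions + 1, dtype=int)
for ranges in d.values():
    for start, stop in ranges:
        diff[start] += 1      # coverage goes up at the start
        diff[stop + 1] -= 1   # and back down just past the inclusive stop

# The running sum is the number of ranges covering each position.
sites_present = np.cumsum(diff[:n_positions])
print(sites_present)
```

The work here is proportional to the number of ranges plus the number of positions, rather than ranges times positions, which should help on large inputs.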


Answer