I have lists of start and stop coordinates of ranges and would like to fill a pandas DataFrame according to whether each row position falls inside a range.
The number of rows is predetermined and every cell starts as '0'. For example, if a column has the range 1,3, then rows (index) 1-3 are filled with '1'.
import pandas as pd

d = {
    'a': [[0, 2], [3, 7], [13, 23], [24, 25]],
    'b': [[1, 5], [8, 12], [15, 18], [20, 24]],
}
presabsdict = {}
for G in d.keys():
    # Start with 50 positions, all absent ('0')
    refpositions = list('0' * 50)
    positions = d.get(G)
    for pos in positions:
        pos2 = pos[1]
        pos1 = pos[0]
        poslength = (pos2 - pos1)
        # Mark every position from pos1 to pos2 (inclusive) as present ('1')
        refpositions[pos1:(pos2 + 1)] = (list('1' * (poslength + 1)))
    presabsdict[G] = refpositions
df = pd.DataFrame.from_dict(presabsdict, orient='index').transpose()
# Per row, count how many columns have that position marked as present
df["Sitespresent"] = df.astype(int).sum(axis=1).astype(int)
print(df)
This is hugely inefficient for large datasets. The ultimate goal is the 'Sitespresent' column, so a solution that forgoes the DataFrame would also be suitable.
Answer
You can do something like this:
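import pandas as pd
# One row per reference position (0..49); `d` is the dict from the question
refpositions = pd.DataFrame({'pos': range(50)})
# Collect every [start, end] pair from all columns as closed (inclusive) intervals
intervals = pd.arrays.IntervalArray(
    [pd.Interval(start, end, closed='both') for v in d.values() for start, end in v],
    closed='both',
)
# Represent each position as a zero-width closed interval so it can be tested for overlap
pos_as_intv = [pd.Interval(i, i, closed='both') for i in refpositions.pos]
# Walk through overlaps and count
refpositions['total'] = [intervals.overlaps(x).sum() for x in pos_as_intv]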
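Since only the per-position count is needed, one possible alternative is to skip pandas and use a NumPy difference array: add 1 at each start, subtract 1 just past each stop, then take a running sum. This is a different technique from the interval-overlap code above, offered as a minimal sketch. The function name `count_sites` and the variable `n_rows` are made up for illustration, and the sketch assumes every coordinate lies in 0..n-1 and that ranges within one column do not overlap each other (overlapping ranges in the same column would be counted more than once).

```python
import numpy as np

# Same example data as in the question.
d = {
    'a': [[0, 2], [3, 7], [13, 23], [24, 25]],
    'b': [[1, 5], [8, 12], [15, 18], [20, 24]],
}
n_rows = 50  # number of positions, as in the question (hypothetical name)


def count_sites(ranges_by_col, n):
    """Count, for each position 0..n-1, how many inclusive ranges cover it."""
    # Flatten all [start, stop] pairs into one (k, 2) integer array.
    pairs = np.array([p for ranges in ranges_by_col.values() for p in ranges])
    diff = np.zeros(n + 1, dtype=int)
    # +1 where a range starts, -1 just past where it stops (stop is inclusive).
    # np.add.at accumulates correctly even when an index appears several times.
    np.add.at(diff, pairs[:, 0], 1)
    np.add.at(diff, np.minimum(pairs[:, 1] + 1, n), -1)
    # The running sum turns the start/stop markers into per-position coverage.
    return np.cumsum(diff[:-1])


sites_present = count_sites(d, n_rows)
print(sites_present[:10])
```

The work here grows with the number of ranges plus the number of positions, rather than with their product, so it should scale better than testing every position against every interval. If the DataFrame is still wanted, `sites_present` can be assigned to a column directly.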