I have a list of start and stop coordinates of ranges and would like to fill a pandas df according to whether each row index falls inside one of those ranges.
The number of rows is predetermined and every cell starts as ‘0’. If, for example, a range is 1,3 for a column, then rows (index) 1-3 of that column would be filled with ‘1’.
    import pandas as pd

    d = {
        'a': [[0, 2], [3, 7], [13, 23], [24, 25]],
        'b': [[1, 5], [8, 12], [15, 18], [20, 24]],
    }
    presabsdict = {}
    for G in d.keys():
        # Start each column as 50 rows of '0'
        refpositions = list('0' * 50)
        positions = d.get(G)
        for pos in positions:
            pos2 = pos[1]
            pos1 = pos[0]
            poslength = (pos2 - pos1)
            # Mark rows pos1..pos2 (inclusive) as present
            refpositions[pos1:(pos2 + 1)] = list('1' * (poslength + 1))
        presabsdict[G] = refpositions
    df = pd.DataFrame.from_dict(presabsdict, orient='index').transpose()
    df["Sitespresent"] = df.astype(int).sum(axis=1).astype(int)
    print(df)
This is hugely inefficient for large datasets. The ultimate goal is the 'Sitespresent' column, so a solution that forgoes the dataframe would also be suitable.
Answer
You can do something like this:
    import pandas as pd

    refpositions = pd.DataFrame({'pos': range(50)})
    intervals = pd.arrays.IntervalArray(
        [pd.Interval(start, end) for _, v in d.items() for start, end in v],
        closed='both')
    pos_as_intv = [pd.Interval(i, i, closed='both') for i in refpositions.pos]

    # Walk through overlaps and count
    refpositions['total'] = [intervals.overlaps(x).sum() for x in pos_as_intv]
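Since the question says a solution without the dataframe would also do, here is a minimal sketch of a different technique, a difference array followed by a cumulative sum, using NumPy. It is not part of the answer above. The names `count_coverage`, `n_rows` and `sites_present` are made up for illustration, and the sketch assumes that every stop coordinate is below `n_rows` and that ranges within one key do not overlap each other (overlapping ranges under the same key would be counted more than once).

```python
import numpy as np

d = {
    'a': [[0, 2], [3, 7], [13, 23], [24, 25]],
    'b': [[1, 5], [8, 12], [15, 18], [20, 24]],
}
n_rows = 50  # same number of rows as in the question


def count_coverage(ranges_by_key, n_rows):
    """Count, for each row index, how many inclusive [start, stop] ranges cover it.

    Hypothetical helper: assumes 0 <= start <= stop < n_rows for every range.
    """
    # One extra slot so that stop + 1 is always a valid index.
    diff = np.zeros(n_rows + 1, dtype=np.int64)
    for ranges in ranges_by_key.values():
        for start, stop in ranges:
            diff[start] += 1      # coverage begins at start
            diff[stop + 1] -= 1   # coverage ends after stop (inclusive)
    # The running sum turns the +1/-1 markers into per-row counts.
    return np.cumsum(diff[:-1])


sites_present = count_coverage(d, n_rows)
print(sites_present)
```

Each range costs two array updates regardless of its length, and a single pass over the rows finishes the count, so the work grows with the number of ranges plus the number of rows rather than with their product.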