I have the following dataframe:
Bacteria  Year  Feature_Vector
XYRT23    1968  [0 1 0 0 1 1 0 0 0 0 1 1]
XXQY12    1968  [0 1 0 0 0 1 1 0 0 0 1 1]
RTy11R    1968  [1 0 0 0 0 1 1 0 1 1 1 1]
XYRT23    1969  [0 1 0 0 1 1 0 0 0 0 1 1]
XXQY12    1969  [0 0 1 0 0 1 1 0 0 0 1 1]
RTy11R    1969  [1 0 0 0 0 1 1 1 1 1 1 1]
I would like to calculate the pairwise Hamming distance for each pair of bacteria within a given year and save the result in a new dataframe. Example (note: I made up the numbers for the Hamming distance, and I don't actually need the Pair column):
Pair             Year  HammingDistance
XYRT23 - XXQY12  1968  0.24
XYRT23 - RTy11R  1968  0.33
XXQY12 - RTy11R  1968  0.29
XYRT23 - XXQY12  1969  0.22
XYRT23 - RTy11R  1969  0.34
XXQY12 - RTy11R  1969  0.28
I tried something like:
import itertools
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

my_list = df.groupby('Year')['Feature_Vector'].apply(list)
total_list = []
for lists in my_list:
    i = 0
    results = []
    for x in itertools.combinations(lists, 2):
        vec1, vec2 = np.array(x[0]), np.array(x[1])
        # keep only the positions where the two vectors are not both 0
        keepers = np.where(np.logical_not((np.vstack((vec1, vec2)) == 0).all(axis=0)))
        vecx = vec1[keepers].reshape(1, -1)
        vecy = vec2[keepers].reshape(1, -1)
        try:
            score = pairwise_distances(vecx, vecy, metric="hamming")
            print(score)
        except:
            score = 0
        results.append(score)
Answer
The function pairwise_distances can take in a matrix, so it is easier to pass all the feature vectors for a year as one matrix, get back the full pairwise matrix of distances, and keep only the comparisons we need. For example, with a dataset like yours:
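import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

# random 0/1 feature vectors, one numpy array per row
df = pd.DataFrame({'Bacteria': ['XYRT23', 'XXQY12', 'RTy11R'] * 2,
                   'Year': np.repeat(['1968', '1969'], 3),
                   'Feature_Vector': list(np.random.binomial(1, 0.5, (6, 12)))})

type(df['Feature_Vector'][0])
numpy.ndarray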
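To make the returned matrix concrete: the "hamming" metric is the fraction of positions at which two vectors differ. The following is a minimal NumPy-only sketch of that definition, using the 1968 vectors from the question; the helper name hamming_matrix is made up for illustration and is not part of scikit-learn.

```python
import numpy as np

# 1968 feature vectors copied from the question
X = np.array([
    [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1],  # XYRT23
    [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1],  # XXQY12
    [1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1],  # RTy11R
])

def hamming_matrix(X):
    """Hypothetical helper: fraction of differing positions for every pair of rows."""
    # broadcasting compares each row with every other row -> shape (n, n, n_features)
    diff = X[:, None, :] != X[None, :, :]
    # averaging over the feature axis gives the proportion of mismatches
    return diff.mean(axis=2)

dm = hamming_matrix(X)
print(dm)
```

The result is a symmetric 3 x 3 matrix with zeros on the diagonal, so each pair appears twice. That is why only the entries above the diagonal need to be kept.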
Define the pairwise function, which takes the feature column and the row names:
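def pwdist(features, names):
    # full square matrix of Hamming distances between all rows
    dm = pairwise_distances(features.to_list(), metric="hamming")
    m, n = dm.shape
    # set the diagonal and lower triangle to NaN so each pair appears once
    dm[:] = np.where(np.arange(m)[:, None] >= np.arange(n), np.nan, dm)
    dm = pd.DataFrame(dm, index=names, columns=names)
    # long format; the explicit dropna() removes the masked entries
    # even if stack() itself keeps NaN values
    out = dm.stack().dropna().reset_index()
    out.columns = ['Bacteria1', 'Bacteria2', 'distance']
    return out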
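As an alternative to masking with NaN and stacking, the entries above the diagonal can be picked out directly with np.triu_indices. This is only a sketch of that swapped-in technique: the helper name upper_pairs, the labels A/B/C and the distances in the example matrix are all made up for illustration.

```python
import numpy as np
import pandas as pd

def upper_pairs(dm, names):
    """Hypothetical helper: one row per unordered pair from a square distance matrix."""
    names = np.asarray(names)
    # row/column indices strictly above the diagonal: (0,1), (0,2), (1,2), ...
    i, j = np.triu_indices(len(names), k=1)
    return pd.DataFrame({'Bacteria1': names[i],
                         'Bacteria2': names[j],
                         'distance': dm[i, j]})

# small symmetric matrix with made-up distances, for illustration only
dm = np.array([[0.0, 0.2, 0.5],
               [0.2, 0.0, 0.3],
               [0.5, 0.3, 0.0]])
pairs = upper_pairs(dm, ['A', 'B', 'C'])
print(pairs)
```

This avoids relying on how stack() treats NaN values, at the cost of handling the names array by hand.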
Use groupby and apply the function:
df.groupby('Year').apply(lambda x: pwdist(x.Feature_Vector,x.Bacteria.values))
Gives us something like this:
       Bacteria1 Bacteria2  distance
Year
1968 0    XYRT23    XXQY12  0.333333
     1    XYRT23    RTy11R  0.250000
     2    XXQY12    RTy11R  0.416667
1969 0    XYRT23    XXQY12  0.500000
     1    XYRT23    RTy11R  0.333333
     2    XXQY12    RTy11R  0.166667
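If you want exactly the Pair / Year / HammingDistance layout from the question, a plain loop over itertools.combinations within each year is another option. The sketch below uses that different technique rather than the matrix approach above, with the feature vectors copied from the question; it computes the Hamming distance over all 12 positions and does not drop positions where both vectors are 0, as the attempt in the question does.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# data copied from the question
vectors = [
    [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1],  # XYRT23 1968
    [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1],  # XXQY12 1968
    [1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1],  # RTy11R 1968
    [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1],  # XYRT23 1969
    [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1],  # XXQY12 1969
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],  # RTy11R 1969
]
df = pd.DataFrame({'Bacteria': ['XYRT23', 'XXQY12', 'RTy11R'] * 2,
                   'Year': [1968] * 3 + [1969] * 3,
                   'Feature_Vector': [np.array(v) for v in vectors]})

rows = []
for year, grp in df.groupby('Year'):
    labelled = zip(grp['Bacteria'], grp['Feature_Vector'])
    # every unordered pair of bacteria within this year
    for (n1, v1), (n2, v2) in combinations(labelled, 2):
        rows.append({'Pair': f'{n1} - {n2}',
                     'Year': year,
                     # Hamming distance = fraction of positions that differ
                     'HammingDistance': np.mean(v1 != v2)})

result = pd.DataFrame(rows)
print(result)
```

The loop is slower than one pairwise_distances call per year when a year contains many bacteria, but it yields a flat dataframe with Year as an ordinary column.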