Calculating Hamming distance in a given year

Question

I have the following dataframe: I would like to calculate the pairwise Hamming distance for each pair in a given year and save it into a new dataframe. Example: (Note: I made up the numbers for the Hamming distance, and I don't actually need the Pair column.) I tried something like:

Accepted Answer

The function pairwise_distances can take in a matrix, so it might be easier to provide the features for a year as a matrix, get back a pairwise matrix of distances, and subset on the comparisons we need. For example, a dataset like yours:

    import numpy as np
    import pandas as pd
    from sklearn.metrics import pairwise_distances

    df = pd.DataFrame({
        'Bacteria': ['XYRT23', 'XXQY12', 'RTy11R'] * 2,
        'Year': np.repeat(['1968', '1969'], 3),
        'Feature_Vector': list(np.random.binomial(1, 0.5, (6, 12))),
    })

    type(df['Feature_Vector'][0])
    numpy.ndarray

Define the pairwise function that takes in the feature column and the row names:

    def pwdist(features, names):
        # Full square matrix of Hamming distances between all rows
        dm = pairwise_distances(features.to_list(), metric="hamming")
        m, n = dm.shape
        # Blank out the diagonal and lower triangle so each pair appears once
        dm[:] = np.where(np.arange(m)[:, None] >= np.arange(n), np.nan, dm)
        dm = pd.DataFrame(dm, index=names, columns=names)
        # Reshape to long format: one row per remaining pair
        out = dm.stack().reset_index()
        out.columns = ['Bacteria1', 'Bacteria2', 'distance']
        return out

Use groupby and apply the function:

    df.groupby('Year').apply(lambda x: pwdist(x.Feature_Vector, x.Bacteria.values))

This gives us something like:

           Bacteria1 Bacteria2  distance
    Year
    1968 0    XYRT23    XXQY12  0.333333
         1    XYRT23    RTy11R  0.250000
         2    XXQY12    RTy11R  0.416667
    1969 0    XYRT23    XXQY12  0.500000
         1    XYRT23    RTy11R  0.333333
         2    XXQY12    RTy11R  0.166667
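To make it clearer what the "hamming" metric in the answer above returns, here is a minimal standard-library sketch of the same idea. It is not the answer's method, only an illustration: the distance is the fraction of positions at which two vectors differ, so with 12 features a value of 0.333333 means 4 mismatches. The helper names hamming_fraction and pairwise_by_year, the tuple layout of rows, and the short 4-element vectors are all assumptions made up for this example.

```python
from collections import defaultdict
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_by_year(rows):
    """Return (year, name1, name2, distance) for every pair within a year.

    rows is an iterable of (name, year, vector) tuples; this layout is a
    hypothetical stand-in for the dataframe columns in the question.
    """
    by_year = defaultdict(list)
    for name, year, vec in rows:
        by_year[year].append((name, vec))

    out = []
    for year in sorted(by_year):
        # combinations() yields each unordered pair exactly once,
        # which plays the role of keeping only the upper triangle.
        for (n1, v1), (n2, v2) in combinations(by_year[year], 2):
            out.append((year, n1, n2, hamming_fraction(v1, v2)))
    return out


# Made-up example data, shorter than the 12-element vectors in the answer.
rows = [
    ("XYRT23", "1968", [1, 0, 1, 1]),
    ("XXQY12", "1968", [1, 1, 1, 0]),
    ("RTy11R", "1968", [0, 0, 1, 1]),
    ("XYRT23", "1969", [1, 1, 0, 0]),
    ("XXQY12", "1969", [1, 1, 0, 1]),
]

result = pairwise_by_year(rows)
for row in result:
    print(row)
```

If you need the number of differing positions rather than the fraction, multiply the distance by the vector length. For large groups the matrix-based approach in the answer is likely to be faster than this pure-Python loop.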
