I have the following dataframe:
    Bacteria  Year  Feature_Vector
    XYRT23    1968  [0 1 0 0 1 1 0 0 0 0 1 1]
    XXQY12    1968  [0 1 0 0 0 1 1 0 0 0 1 1]
    RTy11R    1968  [1 0 0 0 0 1 1 0 1 1 1 1]
    XYRT23    1969  [0 1 0 0 1 1 0 0 0 0 1 1]
    XXQY12    1969  [0 0 1 0 0 1 1 0 0 0 1 1]
    RTy11R    1969  [1 0 0 0 0 1 1 1 1 1 1 1]
I would like to calculate the pairwise Hamming distance for each pair of bacteria within a given year and save the results in a new dataframe. Example (note: I made up the numbers for the Hamming distance, and I don't actually need the Pair column):
    Pair             Year  HammingDistance
    XYRT23 - XXQY12  1968  0.24
    XYRT23 - RTy11R  1968  0.33
    XXQY12 - RTy11R  1968  0.29
    XYRT23 - XXQY12  1969  0.22
    XYRT23 - RTy11R  1969  0.34
    XXQY12 - RTy11R  1969  0.28
I tried something like:
    import itertools

    import numpy as np
    from sklearn.metrics.pairwise import pairwise_distances

    my_list = df.groupby('Year')['Feature_Vector'].apply(list)

    total_list = []
    for lists in my_list:
        i = 0
        results = []
        for x in itertools.combinations(lists, 2):
            vec1, vec2 = np.array(x[0]), np.array(x[1])
            # Keep only the positions where at least one vector is non-zero
            keepers = np.where(np.logical_not((np.vstack((vec1, vec2)) == 0).all(axis=0)))
            vecx = vec1[keepers].reshape(1, -1)
            vecy = vec2[keepers].reshape(1, -1)
            try:
                score = pairwise_distances(vecx, vecy, metric="hamming")
                print(score)
            except:
                score = 0
            results.append(score)
Answer
The function pairwise_distances can take a matrix, so it is easier to pass all the feature vectors for a year as one matrix, get back the full matrix of pairwise distances, and keep only the comparisons we need. For example, with a dataset like yours:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Bacteria': ['XYRT23', 'XXQY12', 'RTy11R'] * 2,
                       'Year': np.repeat(['1968', '1969'], 3),
                       'Feature_Vector': list(np.random.binomial(1, 0.5, (6, 12)))})

    type(df['Feature_Vector'][0])
    # numpy.ndarray
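For reference, the "hamming" metric used here is the fraction of positions at which two vectors differ. A minimal sketch that checks this by hand for two of the 1968 vectors from the question (the variable names `xyrt23`, `xxqy12`, `manual` and `sk` are chosen only for this illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Two vectors taken from the 1968 rows of the question's dataframe.
xyrt23 = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1])
xxqy12 = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1])

# Fraction of positions where the two vectors disagree (2 of 12 here).
manual = np.mean(xyrt23 != xxqy12)

# The same quantity via scikit-learn; inputs must be 2-D (one row per sample).
sk = pairwise_distances(
    xyrt23.reshape(1, -1), xxqy12.reshape(1, -1), metric="hamming"
)[0, 0]

print(manual, sk)
```

Both values should agree, which is a quick sanity check before running the metric over a whole year of vectors.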
Define a pairwise function that takes the feature column and the row names:
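    from sklearn.metrics.pairwise import pairwise_distances

    def pwdist(features, names):
        # Full square matrix of pairwise Hamming distances
        dm = pairwise_distances(features.to_list(), metric="hamming")
        m, n = dm.shape
        # Blank out the diagonal and lower triangle so each pair appears once
        dm[:] = np.where(np.arange(m)[:, None] >= np.arange(n), np.nan, dm)
        dm = pd.DataFrame(dm, index=names, columns=names)
        # Long format; dropna() removes the masked entries explicitly
        out = dm.stack().dropna().reset_index()
        out.columns = ['Bacteria1', 'Bacteria2', 'distance']
        return out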
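The masking step keeps only the entries strictly above the diagonal, so each unordered pair appears once and self-comparisons are dropped. As a rough sketch of the same selection, `np.triu_indices` can pick those entries directly; the labels and distances below are made up purely for illustration:

```python
import numpy as np

names = np.array(["A", "B", "C"])      # hypothetical row labels
dm = np.array([[0.0, 0.2, 0.5],
               [0.2, 0.0, 0.3],
               [0.5, 0.3, 0.0]])       # made-up symmetric distance matrix

# Row and column indices of the entries strictly above the diagonal (k=1).
rows, cols = np.triu_indices(len(names), k=1)

# One (name1, name2, distance) tuple per unordered pair.
pairs = list(zip(names[rows].tolist(), names[cols].tolist(), dm[rows, cols].tolist()))
print(pairs)
```

This is an alternative to the NaN mask, not what the function above does; either way each pair is reported exactly once.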
Use groupby and apply the function:
    df.groupby('Year').apply(lambda x: pwdist(x.Feature_Vector, x.Bacteria.values))
Gives us something like this:
           Bacteria1 Bacteria2  distance
    Year
    1968 0    XYRT23    XXQY12  0.333333
         1    XYRT23    RTy11R  0.250000
         2    XXQY12    RTy11R  0.416667
    1969 0    XYRT23    XXQY12  0.500000
         1    XYRT23    RTy11R  0.333333
         2    XXQY12    RTy11R  0.166667
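If a flat table with the Pair, Year and HammingDistance columns from the question is wanted, the grouped result can be reshaped. The sketch below assumes the result has a (Year, position) MultiIndex as in the output above; `res` is a hand-built stand-in with the values copied from that output, and `flat` is a name chosen here for illustration:

```python
import pandas as pd

# Stand-in for the grouped result: same index layout and values as the output above.
idx = pd.MultiIndex.from_product([["1968", "1969"], [0, 1, 2]], names=["Year", None])
res = pd.DataFrame(
    {
        "Bacteria1": ["XYRT23", "XYRT23", "XXQY12"] * 2,
        "Bacteria2": ["XXQY12", "RTy11R", "RTy11R"] * 2,
        "distance": [0.333333, 0.25, 0.416667, 0.5, 0.333333, 0.166667],
    },
    index=idx,
)

# Move Year out of the index into an ordinary column.
flat = res.reset_index(level="Year")

# Build the "A - B" label and rename the distance column to match the question.
flat["Pair"] = flat["Bacteria1"] + " - " + flat["Bacteria2"]
flat = flat.rename(columns={"distance": "HammingDistance"})
flat = flat[["Pair", "Year", "HammingDistance"]].reset_index(drop=True)

print(flat)
```

With the real data, `res` would be the value returned by the groupby call rather than this stand-in.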