
Fuzzy matching issue with NaN values

I have a dataframe called RawDatabase whose values I am snapping to a validation list called ValidationLists. I take a specific column from RawDatabase and compare its elements to the validation list. Each entry is snapped to the validation-list entry it most closely resembles.

The code looks like this:

def GetStandardisedField(rawDatabase,validationLists,field):
    print('Standardising ', field,' ...')

    my_list = validationLists[field]

    l1=[]

    for x in rawDatabase[field]:

        choice = process.extractOne(x, my_list)[0]
        l1.append(choice)

    rawDatabase['choice']=l1
    rawDatabase[field] = rawDatabase['choice']
    del rawDatabase['choice']

    return rawDatabase

In an example the rawDatabase[field] looks like:

0       yes
1    YES123
2     nO023
3         n
4       NaN

and the validationList looks like:

YES
NO

I am trying to snap all the values so that the new rawDatabase[field] looks like:

0       YES
1       YES
2        NO
3        NO
4       

However, I run into a problem when I try to snap a NaN value to the validationList (even when I include NaN in the validationList as a test).

What is the best way to handle NaN values (so the NaN value in the snapped dataset is blank)?

Answer
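
Some background on why the loop in the question trips over NaN: a missing value in a pandas column is typically an ordinary float, not a string, so a string matcher has no text to compare. The snippet below is a minimal illustration using only the standard library. It does not call fuzzywuzzy, and the variable name `missing` is only for illustration.

```python
import math

# A float NaN, the same kind of value that usually marks missing data.
missing = float("nan")

print(isinstance(missing, str))  # False: NaN is a float, not text
print(missing == missing)        # False: NaN never compares equal, even to itself
print(math.isnan(missing))       # True: an explicit check detects it reliably
```

This is also a likely reason that adding NaN to the validation list does not help: the failure happens when the query itself is processed as text, before any comparison with the choices.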

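import numpy as np
from fuzzywuzzy import process

l = ['YES', 'NO']
a = []
for x in df.Col1:
    try:
        # extractOne returns (best_match, score); keep only the match
        a.append(process.extractOne(x, l)[0])
    except Exception:
        # NaN is a float, not a string, so the match fails; keep it as NaN
        # (append '' here instead if a blank is wanted)
        a.append(np.nan)

df['target'] = a
df
Out[1261]: 
     Col1 target
0     yes    YES
1  YES123    YES
2   nO023     NO
3       n     NO
4     NaN    NaN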
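
An alternative to catching exceptions is to test for non-string values explicitly before matching. The sketch below is not the answer's code: it swaps fuzzywuzzy for the standard library's `difflib` so it runs without extra packages, and the names `snap_value` and `snap_values` are hypothetical. Similarity scores from `difflib` differ from fuzzywuzzy's, so results on other data may not be identical.

```python
from difflib import SequenceMatcher


def snap_value(value, choices, blank=""):
    """Return the choice most similar to value, or blank for non-strings such as NaN."""
    if not isinstance(value, str):
        # NaN (a float) and other non-text values are left blank, not matched.
        return blank

    def similarity(choice):
        # Case-insensitive similarity ratio between 0.0 and 1.0.
        return SequenceMatcher(None, value.lower(), choice.lower()).ratio()

    return max(choices, key=similarity)


def snap_values(values, choices, blank=""):
    """Snap every value in an iterable to its closest choice."""
    return [snap_value(v, choices, blank) for v in values]


raw = ["yes", "YES123", "nO023", "n", float("nan")]
print(snap_values(raw, ["YES", "NO"]))
# ['YES', 'YES', 'NO', 'NO', '']
```

Applied to the question's dataframe, this would presumably look like `rawDatabase[field] = snap_values(rawDatabase[field], validationLists[field])`. The same `isinstance` guard can also be placed in front of `process.extractOne` if fuzzywuzzy's scoring is preferred.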
User contributions licensed under: CC BY-SA