Some of the data in the dataset are in string format and I should map all of them to the numeric form. I want to convert string data in some columns in the dataset to int int to become usable in the knn method. I wrote this code but It has this error. How can I fix it? thank you for your consideration.
here is the dataset: http://gitlab.rahnemacollege.com/rahnemacollege/tuning-registration-JusticeInWork/raw/master/dataset.csv
this error is in this part of code:
def string_to_int(s): ord3 = lambda x : '%.3d' % ord(x) return int(''.join(map(ord3, s))) for i in range(1,24857): df.iloc[i,0]=string_to_int(df.iloc[i,0]) df.iloc[i,1]=string_to_int(df.iloc[i,1]) df.iloc[i,3]=string_to_int(df.iloc[i,3]) df.iloc[i,8]=string_to_int(df.iloc[i,8]) df.iloc[i,9]=string_to_int(df.iloc[i,9]) df.iloc[i,10]=string_to_int(df.iloc[i,10]) df.iloc[i,11]=string_to_int(df.iloc[i,11]) df.iloc[i,12]=string_to_int(df.iloc[i,12])
the error is:
TypeError Traceback (most recent call last) <ipython-input-7-f5bce11c577a> in <module>() 30 df.iloc[i,10]=string_to_int(df.iloc[i,10]) 31 df.iloc[i,11]=string_to_int(df.iloc[i,11]) ---> 32 df.iloc[i,12]=string_to_int(df.iloc[i,12]) 33 34 <ipython-input-7-f5bce11c577a> in string_to_int(s) 20 def string_to_int(s): 21 ord3 = lambda x : '%.3d' % ord(x) ---> 22 return int(''.join(map(ord3, s))) 23 24 for i in range(1, 24857): TypeError: 'float' object is not iterable
the total code is here:
import pandas as pd import numpy as np import matplotlib.pyplot as plt from google.colab import files !pip install sklearn from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.neighbors import KNeighborsClassifier from sklearn.metrics import classification_report, confusion_matrix #-----------------read file------------------- uploaded = files.upload() with open('dataset.csv', 'r') as data: df3 = pd.read_csv(data , encoding = ('ansi')) lst = ['id', 'Prold', 'ProCreationId', 'CustCreatonRate', 'TaskCreationTimestamp', 'Price', 'ServiceId', 'CategoryId', 'ZoneId', 'TaskState', 'TargetProId', 'isFraud'] df = pd.DataFrame(df3) print (df) #----------------------preprocessing---------------- def string_to_int(s): ord3 = lambda x : '%.3d' % ord(x) return int(''.join(map(ord3, s))) for i in range(1,24857): df.iloc[i,0]=string_to_int(df.iloc[i,0]) df.iloc[i,1]=string_to_int(df.iloc[i,1]) df.iloc[i,3]=string_to_int(df.iloc[i,3]) df.iloc[i,8]=string_to_int(df.iloc[i,8]) df.iloc[i,9]=string_to_int(df.iloc[i,9]) df.iloc[i,10]=string_to_int(df.iloc[i,10]) df.iloc[i,11]=string_to_int(df.iloc[i,11]) df.iloc[i,12]=string_to_int(df.iloc[i,12])
Advertisement
Answer
The NaN
values come from empty strings in the original csv file. To leave those as empty strings instead, you could read the csv with df = pd.read_csv(url, keep_default_na=False)
, although having them as NaN
can make it easier to deal with them.
As noted in the comments however, I am skeptical of the correct interpretation of the encoding standard (if any) used in that data.
But if that is as described in the question, then you can use your function string_to_int
without change, apply it to all '...Id'
columns and skip the NaN
(and optionally convert those to another value):
id_cols = [k for k in df.columns if k.lower().endswith('id')] df2 = df.copy() df2[id_cols] = df2[id_cols].applymap(string_to_int, na_action='ignore') # optional: convert nan to some int value (here: 0) df2[id_cols] = df2[id_cols].fillna(0)
Outcome
>>> df2['TargetProId'].head() 0 1181130851071200850681170691090660551030720870... 1 8911811810612110611210908812010605205108207407... 2 0 3 0 4 0 Name: TargetProId, dtype: object
(Note: the dtype is still object
because the int values are overflowing int64
, and are instead Python’s arbitrary-precision int objects; df2.applymap(type).value_counts()
shows that all 'id'
columns are <class 'int'>
).
Original suggestion
Initially I had this other suggestion for string_to_int()
. It handles non-str values explicitly with a default value. It also uses struct.unpack()
as a basis for more performant decoding, although in this specific case, I doubt it makes much difference.
import struct def string_to_int2(s, default=0): if isinstance(s, str): n = len(s) b = s.encode('ascii') return int(''.join([f'{v:03d}' for v in struct.unpack(f'{n}B', b)])) return default df2 = df.copy() df2[id_cols] = df2[id_cols].applymap(string_to_int2)