
How can I convert columns of string in dataset to int?

Some of the data in the dataset is in string format, and I need to map it all to numeric form so that it can be used with the KNN method. I wrote this code to convert the string data in some columns of the dataset to int, but it raises an error. How can I fix it? Thank you for your consideration.

here is the dataset: http://gitlab.rahnemacollege.com/rahnemacollege/tuning-registration-JusticeInWork/raw/master/dataset.csv

this error is in this part of code:

    def string_to_int(s):
        ord3 = lambda x: '%.3d' % ord(x)
        return int(''.join(map(ord3, s)))

    for i in range(1, 24857):
        df.iloc[i, 0] = string_to_int(df.iloc[i, 0])
        df.iloc[i, 1] = string_to_int(df.iloc[i, 1])
        df.iloc[i, 3] = string_to_int(df.iloc[i, 3])
        df.iloc[i, 8] = string_to_int(df.iloc[i, 8])
        df.iloc[i, 9] = string_to_int(df.iloc[i, 9])
        df.iloc[i, 10] = string_to_int(df.iloc[i, 10])
        df.iloc[i, 11] = string_to_int(df.iloc[i, 11])
        df.iloc[i, 12] = string_to_int(df.iloc[i, 12])

the error is:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-7-f5bce11c577a> in <module>()
         30     df.iloc[i,10]=string_to_int(df.iloc[i,10])
         31     df.iloc[i,11]=string_to_int(df.iloc[i,11])
    ---> 32     df.iloc[i,12]=string_to_int(df.iloc[i,12])
         33
         34

    <ipython-input-7-f5bce11c577a> in string_to_int(s)
         20 def string_to_int(s):
         21     ord3 = lambda x : '%.3d' % ord(x)
    ---> 22     return int(''.join(map(ord3, s)))
         23
         24 for i in range(1, 24857):

    TypeError: 'float' object is not iterable
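The root cause of this error is that some cells in the dataset are empty, and pandas stores empty CSV fields as NaN, which is a float. `map(ord3, s)` then tries to iterate over that float and fails. A minimal reproduction, using the same `string_to_int` from the question:

```python
def string_to_int(s):
    ord3 = lambda x: '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

# Works on a real string: 'a' -> '097', 'b' -> '098'
print(string_to_int('ab'))  # 97098

# Fails on NaN, which is what pandas stores for empty csv cells
nan = float('nan')
try:
    string_to_int(nan)
except TypeError as e:
    print(e)  # 'float' object is not iterable
```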

the total code is here:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import files
!pip install sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
#-----------------read file-------------------
uploaded = files.upload()
with open('dataset.csv', 'r') as data:
    df3 = pd.read_csv(data, encoding='ansi')
    lst = ['id', 'Prold', 'ProCreationId', 'CustCreatonRate', 'TaskCreationTimestamp', 'Price', 'ServiceId', 'CategoryId', 'ZoneId', 'TaskState', 'TargetProId', 'isFraud']
    df = pd.DataFrame(df3)
    print(df)

#----------------------preprocessing----------------

def string_to_int(s):
    ord3 = lambda x: '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

for i in range(1, 24857):
    df.iloc[i, 0] = string_to_int(df.iloc[i, 0])
    df.iloc[i, 1] = string_to_int(df.iloc[i, 1])
    df.iloc[i, 3] = string_to_int(df.iloc[i, 3])
    df.iloc[i, 8] = string_to_int(df.iloc[i, 8])
    df.iloc[i, 9] = string_to_int(df.iloc[i, 9])
    df.iloc[i, 10] = string_to_int(df.iloc[i, 10])
    df.iloc[i, 11] = string_to_int(df.iloc[i, 11])
    df.iloc[i, 12] = string_to_int(df.iloc[i, 12])


Answer

The NaN values come from empty strings in the original csv file. To leave those as empty strings instead, you could read the csv with df = pd.read_csv(url, keep_default_na=False), although having them as NaN can make it easier to deal with them.
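To illustrate the difference `keep_default_na` makes, here is a small sketch with an in-memory CSV (the column names are made up for the example):

```python
import io
import pandas as pd

csv = "a,b\nx,\ny,z\n"

# Default: empty fields become NaN (a float)
df_nan = pd.read_csv(io.StringIO(csv))
print(df_nan['b'].tolist())  # [nan, 'z']

# keep_default_na=False: empty fields stay as empty strings
df_str = pd.read_csv(io.StringIO(csv), keep_default_na=False)
print(df_str['b'].tolist())  # ['', 'z']
```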

As noted in the comments, however, I am skeptical that this is the correct interpretation of the encoding standard (if any) used in that data.

But if it is as described in the question, then you can use your function string_to_int without change, apply it to all '...Id' columns, and skip the NaN values (optionally converting them to another value afterwards):

id_cols = [k for k in df.columns if k.lower().endswith('id')]

df2 = df.copy()
df2[id_cols] = df2[id_cols].applymap(string_to_int, na_action='ignore')

# optional: convert nan to some int value (here: 0)
df2[id_cols] = df2[id_cols].fillna(0)

Outcome

>>> df2['TargetProId'].head()
0    1181130851071200850681170691090660551030720870...
1    8911811810612110611210908812010605205108207407...
2                                                    0
3                                                    0
4                                                    0
Name: TargetProId, dtype: object

(Note: the dtype is still object because the int values overflow int64 and are instead stored as Python's arbitrary-precision int objects; df2[id_cols].applymap(type).value_counts() shows that all 'id' columns contain <class 'int'>.)
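To see why pandas keeps object dtype here, note that a 40-plus-digit encoded id is far beyond what int64 can hold, so pandas falls back to Python's arbitrary-precision int:

```python
import numpy as np
import pandas as pd

# One of the encoded ids from the outcome above
n = int('1181130851071200850681170691090660551030720870')
print(n > np.iinfo(np.int64).max)  # True: it overflows int64

# pandas cannot store it in an int64 column, so the dtype stays object
s = pd.Series([n, 0])
print(s.dtype)  # object
```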

Original suggestion

Initially I had this other suggestion for string_to_int(). It handles non-str values explicitly with a default value. It also uses struct.unpack() as a basis for more performant decoding, although in this specific case, I doubt it makes much difference.

import struct

def string_to_int2(s, default=0):
    if isinstance(s, str):
        n = len(s)
        b = s.encode('ascii')
        return int(''.join([f'{v:03d}' for v in struct.unpack(f'{n}B', b)]))
    return default

df2 = df.copy()
df2[id_cols] = df2[id_cols].applymap(string_to_int2)
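As a quick sanity check, for ASCII input the struct-based version produces the same integer as the original `string_to_int`, and it returns the default for non-string values instead of raising:

```python
import struct

def string_to_int(s):
    ord3 = lambda x: '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

def string_to_int2(s, default=0):
    if isinstance(s, str):
        n = len(s)
        b = s.encode('ascii')
        return int(''.join([f'{v:03d}' for v in struct.unpack(f'{n}B', b)]))
    return default

# Both encode 'abc' as 097 098 099 -> 97098099
assert string_to_int2('abc') == string_to_int('abc')
print(string_to_int2('abc'))         # 97098099
print(string_to_int2(float('nan')))  # 0 (the default, instead of a TypeError)
```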