
How can I convert columns of string in dataset to int?

Some of the data in the dataset is in string format, and I need to map it all to numeric form so that it can be used with the KNN method. I wrote this code to convert the string data in some columns of the dataset to int, but it raises an error. How can I fix it? Thank you for your consideration.

here is the dataset: http://gitlab.rahnemacollege.com/rahnemacollege/tuning-registration-JusticeInWork/raw/master/dataset.csv

this error is in this part of code:

    def string_to_int(s):
        ord3 = lambda x: '%.3d' % ord(x)
        return int(''.join(map(ord3, s)))

    for i in range(1, 24857):
        df.iloc[i, 0] = string_to_int(df.iloc[i, 0])
        df.iloc[i, 1] = string_to_int(df.iloc[i, 1])
        df.iloc[i, 3] = string_to_int(df.iloc[i, 3])
        df.iloc[i, 8] = string_to_int(df.iloc[i, 8])
        df.iloc[i, 9] = string_to_int(df.iloc[i, 9])
        df.iloc[i, 10] = string_to_int(df.iloc[i, 10])
        df.iloc[i, 11] = string_to_int(df.iloc[i, 11])
        df.iloc[i, 12] = string_to_int(df.iloc[i, 12])

the error is:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-7-f5bce11c577a> in <module>()
         30     df.iloc[i,10]=string_to_int(df.iloc[i,10])
         31     df.iloc[i,11]=string_to_int(df.iloc[i,11])
    ---> 32     df.iloc[i,12]=string_to_int(df.iloc[i,12])
         33
         34

    <ipython-input-7-f5bce11c577a> in string_to_int(s)
         20 def string_to_int(s):
         21     ord3 = lambda x : '%.3d' % ord(x)
    ---> 22     return int(''.join(map(ord3, s)))
         23
         24 for i in range(1, 24857):

    TypeError: 'float' object is not iterable
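The root cause of this error is that some cells in the dataset are empty, and pandas stores empty CSV fields as NaN, which is a float. `map(ord3, s)` then tries to iterate over that float and fails. A minimal reproduction, using the same `string_to_int` from the question:

```python
def string_to_int(s):
    ord3 = lambda x: '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

# Works on a real string: 'a' -> '097', 'b' -> '098'
print(string_to_int('ab'))  # 97098

# Fails on NaN, which is what pandas stores for empty csv cells
nan = float('nan')
try:
    string_to_int(nan)
except TypeError as e:
    print(e)  # 'float' object is not iterable
```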

the total code is here:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import files
!pip install sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
#-----------------read file-------------------
uploaded = files.upload()
with open('dataset.csv', 'r') as data:
    df3 = pd.read_csv(data, encoding='ansi')
    lst = ['id', 'Prold', 'ProCreationId', 'CustCreatonRate', 'TaskCreationTimestamp', 'Price', 'ServiceId', 'CategoryId', 'ZoneId', 'TaskState', 'TargetProId', 'isFraud']
    df = pd.DataFrame(df3)
    print(df)

#----------------------preprocessing----------------

def string_to_int(s):
    ord3 = lambda x: '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

for i in range(1, 24857):
    df.iloc[i, 0] = string_to_int(df.iloc[i, 0])
    df.iloc[i, 1] = string_to_int(df.iloc[i, 1])
    df.iloc[i, 3] = string_to_int(df.iloc[i, 3])
    df.iloc[i, 8] = string_to_int(df.iloc[i, 8])
    df.iloc[i, 9] = string_to_int(df.iloc[i, 9])
    df.iloc[i, 10] = string_to_int(df.iloc[i, 10])
    df.iloc[i, 11] = string_to_int(df.iloc[i, 11])
    df.iloc[i, 12] = string_to_int(df.iloc[i, 12])


Answer

The NaN values come from empty strings in the original csv file. To leave those as empty strings instead, you could read the csv with df = pd.read_csv(url, keep_default_na=False), although having them as NaN can make it easier to deal with them.
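To illustrate the difference `keep_default_na` makes, here is a small sketch with an in-memory CSV (the column names are made up for the example):

```python
import io
import pandas as pd

csv = "a,b\nx,\ny,z\n"

# Default: empty fields become NaN (a float)
df_nan = pd.read_csv(io.StringIO(csv))
print(df_nan['b'].tolist())  # [nan, 'z']

# keep_default_na=False: empty fields stay as empty strings
df_str = pd.read_csv(io.StringIO(csv), keep_default_na=False)
print(df_str['b'].tolist())  # ['', 'z']
```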

As noted in the comments, however, I am skeptical that this is the correct interpretation of the encoding standard (if any) used in that data.

But if it is as described in the question, then you can use your function string_to_int without change, apply it to all '...Id' columns, and skip the NaN values (optionally converting them to another value afterwards):

id_cols = [k for k in df.columns if k.lower().endswith('id')]

df2 = df.copy()
df2[id_cols] = df2[id_cols].applymap(string_to_int, na_action='ignore')

# optional: convert nan to some int value (here: 0)
df2[id_cols] = df2[id_cols].fillna(0)

Outcome

>>> df2['TargetProId'].head()
0    1181130851071200850681170691090660551030720870...
1    8911811810612110611210908812010605205108207407...
2                                                    0
3                                                    0
4                                                    0
Name: TargetProId, dtype: object

(Note: the dtype is still object because the int values overflow int64 and are instead stored as Python's arbitrary-precision int objects; df2[id_cols].applymap(type).value_counts() shows that all 'id' columns contain <class 'int'>.)
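To see why pandas keeps object dtype here, note that a 40-plus-digit encoded id is far beyond what int64 can hold, so pandas falls back to Python's arbitrary-precision int:

```python
import numpy as np
import pandas as pd

# One of the encoded ids from the outcome above
n = int('1181130851071200850681170691090660551030720870')
print(n > np.iinfo(np.int64).max)  # True: it overflows int64

# pandas cannot store it in an int64 column, so the dtype stays object
s = pd.Series([n, 0])
print(s.dtype)  # object
```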

Original suggestion

Initially I had this other suggestion for string_to_int(). It handles non-str values explicitly with a default value. It also uses struct.unpack() as a basis for more performant decoding, although in this specific case, I doubt it makes much difference.

import struct

def string_to_int2(s, default=0):
    if isinstance(s, str):
        n = len(s)
        b = s.encode('ascii')
        return int(''.join([f'{v:03d}' for v in struct.unpack(f'{n}B', b)]))
    return default

df2 = df.copy()
df2[id_cols] = df2[id_cols].applymap(string_to_int2)
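As a quick sanity check, for ASCII input the struct-based version produces the same integer as the original `string_to_int`, and it returns the default for non-string values instead of raising:

```python
import struct

def string_to_int(s):
    ord3 = lambda x: '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

def string_to_int2(s, default=0):
    if isinstance(s, str):
        n = len(s)
        b = s.encode('ascii')
        return int(''.join([f'{v:03d}' for v in struct.unpack(f'{n}B', b)]))
    return default

# Both encode 'abc' as 097 098 099 -> 97098099
assert string_to_int2('abc') == string_to_int('abc')
print(string_to_int2('abc'))         # 97098099
print(string_to_int2(float('nan')))  # 0 (the default, instead of a TypeError)
```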