
TypeError: Expected binary or unicode string, got 618.0

I’ve been trying to apply this ML linear model to my dataset (https://www.tensorflow.org/tutorials/estimator/linear).
Language: Python 3.8.3
Libraries: TensorFlow 2.4.0
NumPy: 1.19.3
Pandas
Matplotlib
and the others:

import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from six.moves import urllib

import tensorflow.compat.v2.feature_column as fc
import tensorflow as tf

ss1517 is the name of my dataset. It is a CSV file with 4116 rows and 20 columns, and it has many NaN values (every column contains at least one NaN).

traindata = ss1517.iloc[0:2470, :]  # the first 60% of the rows form the training set
evaldata = ss1517.iloc[2470:4116, :]  # the remaining 40% of the rows form the eval set
ytrain = traindata.pop("AvgOfMajor N")
yeval = evaldata.pop("AvgOfMajor N")

CATEGORICAL_COLUMNS are the categorical columns in my dataset.
NUMERIC_COLUMNS are the numeric columns in my dataset.

CATEGORICAL_COLUMNS = ['Location Name', 'Location Code', 'Borough', 'Register', 'Building Name', 'Schools in Building', 'ENGroupA', 'RangeA']
NUMERIC_COLUMNS = ['Geographical District Code', '# Schools', 'Major N', 'Oth N', 'NoCrim N', 'Prop N', 'Vio N', 'AvgOfOth N', 'AvgOfNoCrim N', 'AvgOfProp N', 'AvgOfVio N']

feature_columns = []  # we use these only to train the linear regression
for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = traindata[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))
for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
  def input_function():# inner function, this will be returned.
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df)) # Create tf.data.Dataset object with data and its label
    if shuffle:
      ds = ds.shuffle(1000) # randomize order of data
    ds = ds.batch(batch_size).repeat(num_epochs)
    return ds # return a batch of dataset
  return input_function # return the input_function

train_input_fn = make_input_fn(traindata, ytrain) 
eval_input_fn = make_input_fn(evaldata, yeval, num_epochs=1, shuffle=False)

linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
linear_est.train(train_input_fn) #train
result = linear_est.evaluate(eval_input_fn) #get model metrics/stats by testing on testing data

clear_output() #clears console output
print(result["accuracy"]) #the result variable is simply dict of stats about our model

I get this error (TypeError: Expected binary or unicode string, got 618.0) every time, whether I fill the NaN values with df.fillna(method="ffill"), df.fillna(method="bfill"), df.fillna(value=0), or df.fillna(value="randomstringvalues"). I also tried dropping the NaN values with df.dropna().
Needless to say, the code does not run when I leave the NaN values in place either.
I have two questions.
First, how should I handle the NaN values so that I no longer see this error (TypeError: Expected binary or unicode string, got 618.0)?
Second, how can I get rid of this error and fit this model to my dataset quickly?
P.S.: I am positive that I did not make any typos.
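
For reference, the NaN handling described above can be made explicit per column type. This is only a sketch under assumptions the thread does not confirm: the "got 618.0" in the message suggests that a float reached a column TensorFlow expected to hold strings, which could happen if a column listed in CATEGORICAL_COLUMNS contains numbers or NaN. The helper name fill_by_kind, the placeholder "missing", the fill value 0.0, and the tiny demo frame are all hypothetical.

```python
import numpy as np
import pandas as pd


def fill_by_kind(df, categorical_columns, numeric_columns, placeholder="missing"):
    """Return a copy of df with NaN filled per column kind (hypothetical helper)."""
    out = df.copy()
    for name in categorical_columns:
        # Replace NaN with a placeholder, then cast every value to str so that
        # a stray float such as 618.0 becomes the string "618.0".
        out[name] = out[name].fillna(placeholder).astype(str)
    for name in numeric_columns:
        # Coerce to numbers, fill NaN with 0.0 (an assumed default), use float32.
        out[name] = pd.to_numeric(out[name], errors="coerce").fillna(0.0).astype("float32")
    return out


# Demo data (made up): one categorical column with mixed types, one numeric column.
demo = pd.DataFrame(
    {
        "Location Code": [618.0, "K618", np.nan],
        "Major N": [1.0, np.nan, 3.0],
    }
)
clean = fill_by_kind(demo, ["Location Code"], ["Major N"])
print(clean.dtypes)
```

Note that fillna and dropna return a new DataFrame unless inplace=True is passed, so the result has to be assigned back (as with clean above) before the train/eval split is made.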

Answer

My guess is that you have some non-Unicode characters in your data, that is, characters that did not decode cleanly. They look like this: � ä

In other words, anything that is not an ordinary letter, number, or symbol. You have two options here: find all of these characters and replace them with something else, or remove them.
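
The two options above (find the characters, then replace or remove them) might look roughly like this in pandas. It is a sketch, not a confirmed fix for the error in the question; the helper names find_non_ascii and strip_non_ascii and the demo frame are hypothetical, and treating every non-ASCII character as suspicious is an assumption that would also flag legitimate accented letters.

```python
import pandas as pd


def find_non_ascii(df):
    """Map each column name to the row labels whose string value has non-ASCII characters."""
    hits = {}
    for name in df.columns:
        rows = [
            idx
            for idx, value in df[name].items()
            if isinstance(value, str) and not value.isascii()
        ]
        if rows:
            hits[name] = rows
    return hits


def strip_non_ascii(value):
    """Drop non-ASCII characters from strings; leave every other value unchanged."""
    if isinstance(value, str):
        return value.encode("ascii", errors="ignore").decode("ascii")
    return value


# Demo data (made up): one value contains the replacement character U+FFFD.
demo = pd.DataFrame(
    {"Borough": ["Bronx", "Br\ufffdooklyn", None], "Major N": [1.0, 2.0, 3.0]}
)
suspicious = find_non_ascii(demo)
cleaned_borough = demo["Borough"].map(strip_non_ascii)
print(suspicious)
print(cleaned_borough.tolist())
```

Inspecting the output of find_non_ascii first shows which cells are affected, which helps in deciding between replacing and removing.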

Alternatively, you can specify the proper encoding when reading the CSV file with pandas.read_csv:

data = pd.read_csv(myfile, encoding='utf-8', quotechar='"', delimiter=',')  # pd is pandas, imported as in the question
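
If the file's encoding is not known in advance, one hedged way to apply the idea above is to try a short list of encodings in order. The function name read_csv_bytes, the candidate list ("utf-8", then "latin-1"), and the sample bytes are assumptions made for illustration. Because "latin-1" can decode any byte sequence, it works as a last resort but may produce wrong characters if the real encoding is something else.

```python
import io

import pandas as pd


def read_csv_bytes(raw, encodings=("utf-8", "latin-1")):
    """Decode raw CSV bytes with the first encoding that works (hypothetical helper)."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding does not fit; try the next candidate
        frame = pd.read_csv(io.StringIO(text), quotechar='"', delimiter=",")
        return frame, encoding
    raise ValueError("none of the candidate encodings could decode the data")


# Demo bytes (made up): "ö" stored as Latin-1, which is not valid UTF-8.
raw_bytes = "Borough,Major N\nMalm\u00f6,1\n".encode("latin-1")
frame, used_encoding = read_csv_bytes(raw_bytes)
print(used_encoding)
print(frame)
```

For a file on disk, the bytes could be obtained with open(path, "rb").read() before calling the helper.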


Answer