I’m trying to convert a Pandas DataFrame into a Spark one. The head of the DataFrame looks like:
```
10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543
10000001,2,0,1,12:36,OK,10002,1,0,9,f,NA,24,24,0,3,9,2,1,1,3,1,3,2,611
10000002,1,0,4,12:19,PA,10003,1,1,7,f,NA,74,74,0,2,15,2,0,2,3,1,2,2,691
```
Code:
```python
dataset = pd.read_csv("data/AS/test_v2.csv")
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(dataset)
```
And I got an error:
```
TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>
```
Answer
You need to make sure your pandas DataFrame columns are appropriate for the type Spark is inferring. If your pandas DataFrame lists something like:
```
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5062 entries, 0 to 5061
Data columns (total 51 columns):
SomeCol    5062 non-null object
Col2       5062 non-null object
```
and you’re getting that error, try:
```python
df[['SomeCol', 'Col2']] = df[['SomeCol', 'Col2']].astype(str)
```
Now, make sure `.astype(str)` is actually the type you want those columns to be. Basically, when the underlying Java code tries to infer the type from a Python object, it uses a few observations and makes a guess. If that guess doesn’t apply to all the data in the column(s) it’s trying to convert from pandas to Spark, it will fail.
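To see the failure mode in isolation, here is a pandas-only sketch (the column values are hypothetical, chosen to mimic a CSV with `NA` strings mixed into a numeric column): an `object` column can hold a mix of Python `float` and `str` values, which is exactly what trips up Spark’s per-row type inference, and `.astype(str)` makes every value a uniform type before the frame is handed to `createDataFrame`.

```python
import pandas as pd

# A hypothetical object column mixing floats and strings -- the kind of
# column where Spark infers DoubleType from one row and StringType from
# another, then fails to merge the two.
df = pd.DataFrame({"SomeCol": [1.5, "NA", 2.7], "Col2": ["a", "b", "c"]})

# dtype says 'object', but the underlying values are of mixed types:
print(df["SomeCol"].dtype)
print({type(v).__name__ for v in df["SomeCol"]})  # {'float', 'str'}

# Force every value in the problem columns to str so the type is uniform.
df[["SomeCol", "Col2"]] = df[["SomeCol", "Col2"]].astype(str)
print({type(v).__name__ for v in df["SomeCol"]})  # {'str'}

# After this, sqlCtx.createDataFrame(df) sees a consistent StringType
# for SomeCol instead of a String/Double conflict.
```

If the columns really should be numeric, the alternative is to clean them on the pandas side (e.g. convert the `NA` strings to proper missing values) rather than stringifying everything.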