I’m messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates
method to work. It keeps returning the error:
“AttributeError: ‘list’ object has no attribute ‘dropDuplicates’”
Not quite sure why as I seem to be following the syntax in the latest documentation.
#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

#dropping duplicates from the dataframe
df1.dropDuplicates().show()
Answer
It is not an import problem. You simply call .dropDuplicates() on the wrong object. While the class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don’t provide a dropDuplicates method. What you want is something like this:
df1 = (sqlContext
    .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
    .dropDuplicates())

df1.collect()
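If it helps to see the difference, here is a minimal sketch that prints the type before and after collect(). It assumes a local SparkContext; the sc, sqlContext, and sample rows are placeholders for illustration, not part of your data.

# A minimal sketch of the DataFrame-vs-list distinction, assuming a local
# SparkContext and made-up sample rows.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "dedup-example")
sqlContext = SQLContext(sc)

rdd1 = sc.parallelize([("a", "1"), ("a", "1"), ("b", "2")])
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2'])

print(type(df1))            # <class 'pyspark.sql.dataframe.DataFrame'>
print(type(df1.collect()))  # <class 'list'> -- no dropDuplicates here

# dropDuplicates must be called on the DataFrame, before collect() or show()
df1.dropDuplicates().show()

In short: keep the object as a DataFrame for as long as you need DataFrame methods, and only call collect() (or show()) at the very end.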