
Remove duplicates from a dataframe in PySpark

I’m messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error:

AttributeError: 'list' object has no attribute 'dropDuplicates'

Not quite sure why as I seem to be following the syntax in the latest documentation.


Answer

It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, but once you apply .collect() the result is a plain Python list, and lists don't have a dropDuplicates method. Call .dropDuplicates() while you still have the DataFrame, and only then .collect() the result.
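A minimal sketch of that ordering, not the asker's original code: the helper name `distinct_rows`, the `demo` function, and the sample data are all made up for illustration. It assumes a recent PySpark where `SparkSession` is available; on 1.4 you would build the DataFrame with `sqlContext.createDataFrame(...)` instead.

```python
def distinct_rows(df, subset=None):
    """Drop duplicates while `df` is still a DataFrame, then collect."""
    # dropDuplicates() returns a new DataFrame; only the final collect()
    # turns the result into a plain Python list of rows.
    return df.dropDuplicates(subset).collect()


def demo():
    # Assumption: PySpark is installed locally. The import lives here so the
    # helper above can be read and reused without starting Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # Hypothetical sample data containing one exact duplicate row.
    df = spark.createDataFrame(
        [("a", 1), ("a", 1), ("b", 2)], ["letter", "number"]
    )

    rows = df.collect()
    # `rows` is a list, so this is the call that raises AttributeError:
    # rows.dropDuplicates()
    print(type(rows))

    # Deduplicate first, collect second. Row order is not guaranteed.
    print(distinct_rows(df))

    spark.stop()
```

Calling `demo()` in an environment with PySpark should show that `collect()` hands back a list, while `distinct_rows(df)` returns the deduplicated rows. Passing `subset=["letter"]` would deduplicate on that column only.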

User contributions licensed under: CC BY-SA