Remove duplicates from a dataframe in PySpark

Question

I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error: "AttributeError: 'list' object has no attribute 'dropDuplicates'". Not quite sure why, as I seem to be following the syntax in the latest documentation.

Accepted Answer

It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, but once you apply .collect() the result is a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:

    df1 = (sqlContext
        .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
        .dropDuplicates())

    df1.collect()
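To see why the error names 'list', here is a minimal sketch in plain Python, no Spark required. The `rows` list is a hypothetical stand-in for what .collect() returns, and the dict.fromkeys step is just one possible way to deduplicate data that has already been collected, assuming the rows are hashable (as tuples are). It is not a replacement for calling dropDuplicates() on the DataFrame before collecting.

```python
# Hypothetical stand-in for the result of .collect(): a plain Python list of rows.
rows = [("a", 1), ("b", 2), ("a", 1)]

# A list has no dropDuplicates method, so this raises the AttributeError
# reported in the question.
try:
    rows.dropDuplicates()
except AttributeError as exc:
    message = str(exc)

# If the data is already a list, duplicates can be dropped in plain Python.
# dict.fromkeys keeps the first occurrence of each row, in original order.
unique_rows = list(dict.fromkeys(rows))
```

For anything beyond a small local result it is usually better to deduplicate on the DataFrame itself, since collecting first pulls every row, duplicates included, to the driver.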

Answer