PySpark: filtering with isin returns empty dataframe

Context: I need to filter a dataframe based on whether a column's values are contained in another dataframe's column, using the isin function.

For Python users working with pandas, that would be: isin().
For R users, that would be: %in%.

So I have a simple Spark dataframe with id and value columns:

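A minimal sketch of such a dataframe, assuming a handful of toy rows in which ids 1 and 3 are repeated:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: ids 1 and 3 appear twice, ids 2 and 4 appear once
df = spark.createDataFrame(
    [(1, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (3, 'e'), (4, 'f')],
    ['id', 'value'],
)
df.show()
# +---+-----+
# | id|value|
# +---+-----+
# |  1|    a|
# |  1|    b|
# |  2|    c|
# |  3|    d|
# |  3|    e|
# |  4|    f|
# +---+-----+
```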

I want to get all ids that appear multiple times. Here's unique_ids, a dataframe of the unique (non-repeated) ids in df:

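A sketch of one way to build it, assuming unique_ids holds the ids that occur exactly once in df:

```python
from pyspark.sql import functions as F

# Ids that occur exactly once in df
unique_ids = (
    df.groupBy('id')
      .count()
      .where(F.col('count') == 1)
      .select('id')
)
unique_ids.show()
# +---+
# | id|
# +---+
# |  2|
# |  4|
# +---+
# (row order may vary)
```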

So the logical operation would be:

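Based on the description, the attempted filter looks like this:

```python
# Intended: keep only the rows whose id is NOT among the unique ids
multiple = df[df.id.isin(unique_ids.id) == False]
multiple.show()
```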

However, I get an empty dataframe:

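With the toy data above, only the header is printed:

```
+---+-----+
| id|value|
+---+-----+
+---+-----+
```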

This “error” works in the opposite way:

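That is, dropping the == False:

```python
df[df.id.isin(unique_ids.id)]
```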

returns all the rows of df.

Answer

The expression df.id.isin(unique_ids.id) == False evaluates the column Column<b'((id IN (id)) = false)'>, which can never be true, because id is always in id. Conversely, df.id.isin(unique_ids.id) evaluates Column<b'(id IN (id))'>, which is always true, and that is why it returns the whole dataframe. The root of the problem is that unique_ids.id is a Column, not a list.

isin(*cols) takes a list of literal values as its argument, not a Column, so to make this approach work you first have to collect the ids into a Python list:

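A sketch of that fix (the variable names are illustrative):

```python
# Materialize the ids as literal Python values
ids = [row.id for row in unique_ids.collect()]

# isin now receives values rather than a Column, so the negation works
df[df.id.isin(ids) == False].show()
# (~df.id.isin(ids) is the more idiomatic way to write the negation)
```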

and you will obtain:

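With the toy data from the question, that yields the rows whose id occurs more than once:

```
+---+-----+
| id|value|
+---+-----+
|  1|    a|
|  1|    b|
|  3|    d|
|  3|    e|
+---+-----+
```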

In any case, I think it would be better if you join both data frames:

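A left anti join keeps only the rows of df whose id has no match in unique_ids, and it avoids collecting anything to the driver; a sketch:

```python
df.join(unique_ids, on='id', how='left_anti').show()
```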

getting:

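which, for the same toy data, gives the same result (row order may differ):

```
+---+-----+
| id|value|
+---+-----+
|  1|    a|
|  1|    b|
|  3|    d|
|  3|    e|
+---+-----+
```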