Skip to content
Advertisement

Pandas .loc[].index

What is the most efficient way (using the least amount of lines possible) to locate and drop multiple strings in a specified column?

Information regarding the .tsv dataset that may help:

‘tconst’ = movie ID

‘region’ = region in which the movie was released in

‘language’ = language of movie

Here is what I have right now:

origin.drop(origin.columns[[1,2,5,6,7]], axis=1, inplace=True)
origin.columns = ['tconst','region','language']
origin.drop(origin.loc[origin['region']!=('US')].index, inplace=True)

I am trying to drop all rows under the ‘region’ column that contain a string value abbreviation of countries that do not speak English. For example, drop all rows under region that are not equal to ‘US’,’UK’,’AUS’,etc. I have tried the & and or operands within the parenthesis and they tend to select and drop only one of the string values that I place inside.

ADDITIONAL QUESTION:

The ‘language’ column contains numerous amounts of null values (I don’t really care about that), however there are some rows that contain ‘en’ for English. IF the ‘region’ happens to be a non-English speaking country BUT the language is in English, how do I prevent those rows from being removed as well?

Advertisement

Answer

Assuming I understood your question correctly and you want to drop the entire row if your specified conditions holds, you can write them as masks like

cond1 = origin['region'].isin(['US', 'UK', 'AUS'])
cond2 = origin['language'] == 'en'

Combining them (you want at least one of the conditions to be true in order to keep a row) with | you get

result = origin[cond1 | cond2]

Of course, this can also be written in a single line if you care to do so:

result = origin[(origin['region'].isin(['US', 'UK', 'AUS'])) | (origin['language'] == 'en')]
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement