
Why does this pandas str.extract pattern work?

I have a dataframe “movies” with column “title”, which contains movie titles and their release year in the following format:

The Pirates (2014)

I’m testing different ways to extract just the title portion, which in the example above would be “The Pirates”, into a new column.

I used pandas Series.str.extract() and found a regex pattern that works, but I’m not sure why it works.

movies['title_only'] = movies['title'].str.extract(r'(.*)[\s]', expand=True)

The above code correctly extracts “The Pirates” into a new column, but why doesn’t it extract only “The” (everything before the first whitespace)?


Answer

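The * is a greedy quantifier, meaning it will match as far into the string as possible. Here (.*) first runs to the end of the title and then backtracks only as far as the last whitespace character, which is the space before “(2014)”. To match only the first word, you can switch to the lazy quantifier *?. Also, note that you don’t need square brackets around the \s: [\s] is equivalent to \s.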
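The difference between the two quantifiers can be seen in a minimal sketch. The second title ("Toy Story (1995)") and the variable names greedy and lazy are made up for illustration, and expand=False is used here only so that each result comes back as a Series.

```python
import pandas as pd

# Hypothetical sample data in the "Title (Year)" format from the question.
movies = pd.DataFrame({"title": ["The Pirates (2014)", "Toy Story (1995)"]})

# Greedy: (.*) grabs as much as it can, then backtracks to the LAST whitespace.
greedy = movies["title"].str.extract(r"(.*)\s", expand=False)

# Lazy: (.*?) grabs as little as it can, so it stops at the FIRST whitespace.
lazy = movies["title"].str.extract(r"(.*?)\s", expand=False)

print(greedy.tolist())  # ['The Pirates', 'Toy Story']
print(lazy.tolist())    # ['The', 'Toy']
```

The greedy pattern returns the full title only because the last whitespace in each string happens to be the one before the year. A title with trailing text after the year would need a more specific pattern.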

According to CAustin

User contributions licensed under: CC BY-SA