I have a dataframe “movies” with column “title”, which contains movie titles and their release year in the following format:
The Pirates (2014)
I’m testing different ways to extract just the title portion, which in the example above would be “The Pirates”, into a new column.
I used pandas Series.str.extract() and found a regex pattern that works, but I’m not sure why it works.
movies['title_only'] = movies['title'].str.extract('(.*)[\s]', expand=True)
The above code correctly extracts “The Pirates” into a new column, but why doesn’t it extract only “The” (everything before the first whitespace)?
Answer
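* is a greedy quantifier, meaning it will match as far into the string as possible. Here, .* first consumes the whole string and then backtracks until the next character is whitespace, so the match ends at the last whitespace (the one before “(2014)”), which is why you get “The Pirates”. To only match the first word, you can switch it to a lazy quantifier, *?. Also, note that you don’t need square brackets around the \s: [\s] == \s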
According to CAustin
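To see the difference for yourself, here is a minimal sketch using Python's built-in re module on the example title. Series.str.extract returns the capture groups of the first match in each string, much like re.search, so the same patterns should behave the same way in the dataframe. The third pattern is my own addition, not part of the answer above, and it assumes every title ends with a four-digit year in parentheses.

```python
import re

title = "The Pirates (2014)"

# Greedy: .* grabs as much as it can, then backtracks to the LAST whitespace.
greedy = re.search(r"(.*)\s", title).group(1)

# Lazy: .*? grabs as little as it can, so it stops at the FIRST whitespace.
lazy = re.search(r"(.*?)\s", title).group(1)

# Assumption: titles always end in "(YYYY)". Anchoring on that suffix
# states the intent directly instead of relying on backtracking.
anchored = re.search(r"^(.*?)\s*\(\d{4}\)$", title).group(1)

print(greedy)    # The Pirates
print(lazy)      # The
print(anchored)  # The Pirates
```

If the anchored pattern suits your data, it can be passed to str.extract in place of the original one; titles without a trailing year would then come back as NaN instead of a partial match.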