I have the following dataframe and I’m trying to extract the string that has the ABC followed by it’s numbers.
| Description |
|---|
| ABC12345679 132465 |
| Test ABC12346548 |
| Test ABC1231321 4645 |
I have tried:
data['extract'] = 'ABC'+data['Description'].str.split('ABC').str[1]
But its giving me what it comes after on instances that there’s more text after the ABC* like so:
| Description |
|---|
| ABC12345679 132465 |
| ABC12346548 |
| ABC1231321 4645 |
And I need the column to only contain data as following:
| Description |
|---|
| ABC12345679 |
| ABC12346548 |
| ABC1231321 |
Any assistance on this?
Advertisement
Answer
We can use regex to extract the necessary part of the string.
Here we are checking for atleast one [A-C] and 0 or more[0-9]
data['extract'] = data.Description.str.extract(r'([A-C]+[0-9]*)')
or (based on need)
data['extract'] = data.Description.str.extract(r'([A-C]+[0-9]+)')
Output
Description extract 0 ABC12345679 132465 ABC12345679 1 Test ABC12346548 ABC12346548 2 Test ABC1231321 4645 ABC1231321