I have a data set which has a column that looks like this
Badge Number 1 3 23 / gold 22 / silver 483
I need only the numbers. Here’s my code:
df = pd.read_excel('badges.xlsx') df['Badge Number'] = df['Badge Number'].str.extract('(d+)') print(df)
I was expecting an output like:
Badge Number 1 3 23 22 483
but I got
Badge Number Nan Nan 23 22 Nan
Just to test, I dumped the dataframe to a .csv and read it back with pd.read_csv(). That gave me just the numbers, as I need (though of course that’s not a solution)
I also tried
df['Badge Number'] = np.where(df['Badge Number'].str.isnumeric(), df['Badge Number'], df['Badge Number'].str.extract('(d+)'))
but that just gave me all 1s. I know I am trying things I don’t even remotely understand, but am hoping there’s a straightforward solution.
Advertisement
Answer
Another option is while reading the XLS it self, specify your column to string.
use dtype={'Badge Number': str}
df = pd.read_excel('badges.xlsx',dtype={'Badge Number': str}) df['Badge Number'] = df['Badge Number'].str.extract('(\d+)')