I have a data set which has a column that looks like this I need only the numbers. Here’s my code: I was expecting an output like: but I got Just to test, I dumped the dataframe to a .csv and read it back with pd.read_csv(). That gave me just the numbers, as I need (though of course that’s not

Pandas: str.extract() giving unexpected NaN

I have a data set which has a column that looks like this

Badge Number
1
3
23 / gold
22 / silver
483

I need only the numbers. Here’s my code:

df = pd.read_excel('badges.xlsx')
df['Badge Number'] = df['Badge Number'].str.extract('(d+)')
print(df)

I was expecting an output like:

Badge Number
1
3
23
22
483

but I got

Badge Number
Nan
Nan
23
22
Nan

Just to test, I dumped the dataframe to a .csv and read it back with pd.read_csv(). That gave me just the numbers, as I need (though of course that’s not a solution)

I also tried

df['Badge Number'] = np.where(df['Badge Number'].str.isnumeric(), df['Badge Number'], df['Badge Number'].str.extract('(d+)'))

but that just gave me all 1s. I know I am trying things I don’t even remotely understand, but am hoping there’s a straightforward solution.

Answer

Another option is while reading the XLS it self, specify your column to string.

use dtype={'Badge Number': str}

df = pd.read_excel('badges.xlsx',dtype={'Badge Number': str})
df['Badge Number'] = df['Badge Number'].str.extract('(\d+)')

Advertisement

Answer