
Performance tuning: string wordcount in df

I have a DataFrame with a column “free text”. I want to count how many characters and words each cell has. Currently, I do it like this:

import pandas as pd

d = {'free text': ["merry had a little lamb", "Little Jonathan found a chicken"]}
df = pd.DataFrame(data=d)
df['Chars'] = df['free text'].apply(str).apply(len)
df['Words'] = df['free text'].apply(lambda x: len(str(x).split()))

The problem is that this is pretty slow. I thought about using np.where, but I wasn’t sure how. I would appreciate your help here.


Answer

IIUC, you can try the pandas string methods str.len() and str.count():

df['Chars'] = df['free text'].str.len()
df['Words'] = df['free text'].str.count(' ') + 1

Sample dataframe used:

import numpy as np
import pandas as pd

d = {'free text': ["merry had a little lamb", "Little Jonathan found a chicken", np.nan]}
df = pd.DataFrame(data=d)
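
Note that str.count(' ') + 1 matches the original len(str(x).split()) only when words are separated by exactly one space and there is no leading or trailing whitespace. Below is a minimal sketch of a whitespace-robust variant that counts runs of non-whitespace characters with the regex \S+ instead. The Series name messy and its contents are hypothetical, made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with irregular whitespace, an empty string, and a missing value.
messy = pd.Series(["  merry  had a\tlittle lamb ", "chicken", "", np.nan])

# Counting literal spaces over-counts when spacing is irregular.
by_spaces = messy.str.count(" ") + 1

# Counting runs of non-whitespace agrees with len(x.split()); NaN stays missing.
by_tokens = messy.str.count(r"\S+")

# Reference values from plain Python split() on the non-missing rows.
expected = [len(s.split()) for s in messy.dropna()]

print(by_tokens.dropna().astype(int).tolist() == expected)
```

If your text is already normalized to single spaces, the simpler str.count(' ') + 1 should give the same counts.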

OR

via NumPy, but here NaN is replaced by an empty string first, so NaN rows get Chars 0 and Words 1 instead of NaN:

arr = df['free text'].to_numpy(na_value='').astype(str)
df['Chars'] = np.char.str_len(arr)
df['Words'] = np.char.count(arr, ' ') + 1

Output of df (using the str.len() / str.count() version):

    free text                           Chars   Words
0   merry had a little lamb             23.0    5.0
1   Little Jonathan found a chicken     31.0    5.0
2   NaN                                 NaN     NaN
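
Since the question is about speed, it may be worth timing the variants on your own data rather than assuming one is faster. Here is a rough harness with hypothetical benchmark data (the two sample sentences repeated) and hypothetical helper names words_apply, words_str and words_numpy. No timings are claimed here, since results depend on data size, dtype and library versions:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: the two sample sentences repeated many times.
rows = ["merry had a little lamb", "Little Jonathan found a chicken"] * 50_000
bench = pd.DataFrame({"free text": rows})
col = bench["free text"]


def words_apply():
    # Original approach from the question: Python-level split per cell.
    return col.apply(lambda x: len(str(x).split()))


def words_str():
    # pandas string-method approach: count spaces and add one.
    return col.str.count(" ") + 1


def words_numpy():
    # NumPy approach: missing values become '' before counting.
    return np.char.count(col.to_numpy(na_value="").astype(str), " ") + 1


for fn in (words_apply, words_str, words_numpy):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per run")
```

All three functions return the same counts on this data because every row uses single spaces and has no missing values.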
User contributions licensed under: CC BY-SA