Find average value of number before word in a list of sentences

I am trying to find the average value of numbers before certain words. I have a list of sentences: ['I had to wait 30 minutes', 'It took too long had to wait 35 minutes', ...] I want to find the average value of the numbers before a certain word which in this case is minutes.

So this would result in 32.5 minutes, and I want to be able to do this for any input word. I already found which words most often occur after a number by replacing every number with the same placeholder (@) and counting which words most frequently follow the @ sign.

I thought I could create bigrams and then look up the number before minutes, but my attempt below does not work:

unigrams  = (
    all_data['PreProcess'].str.lower()
                .str.split(expand=True)
                .stack())

from nltk import bigrams

bgs = bigrams(unigrams)
lake_bgs = filter(lambda item: item[0] == 'minutes', bgs)

from collections import Counter
c = Counter(map(lambda item: item[1], lake_bgs))
print (c.most_common(12))

Answer

Use str.extractall to get the minutes, convert to numeric and then take the mean…

average = pd.to_numeric(df['PreProcess'].str.extractall(r'(?i)(\d+)\s+minutes').squeeze()).mean()
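
Since the question asks for this to work with any input word, here is a minimal sketch of the same regex idea using only the standard library, assuming the sentences are in a plain Python list as in the question. The function name average_before is hypothetical, and returning None when nothing matches is an assumption of this sketch rather than part of the original answer.

```python
import re
from statistics import mean

def average_before(sentences, word):
    """Average of the integers that directly precede `word` (case-insensitive)."""
    # re.escape protects against regex metacharacters in the input word;
    # the trailing \b keeps 'minute' from matching inside 'minutes'.
    pattern = re.compile(rf"(\d+)\s+{re.escape(word)}\b", re.IGNORECASE)
    values = [int(m.group(1)) for s in sentences for m in pattern.finditer(s)]
    # Assumption: return None when the word never follows a number.
    return mean(values) if values else None

sentences = ['I had to wait 30 minutes', 'It took too long had to wait 35 minutes']
print(average_before(sentences, 'minutes'))  # 32.5
```

The pattern only captures whole integers, so a value such as 1.5 would need a wider capture group.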

Answer