I am trying to find the average value of numbers before certain words.
I have a list of sentences:
['I had to wait 30 minutes', 'It took too long had to wait 35 minutes', ...]
I want to find the average value of the numbers before a certain word which in this case is minutes.
So this would result in 32.5 minutes. And I want to be able to do this for any input word. I already found which words most often occur after a number, but I did that by changing all numbers to the same value (@) and seeing which words most frequently occur after the @ sign.
I thought I could maybe create a bigram and then look for the number before minutes, but that does not work right now.
from collections import Counter
from nltk import bigrams

unigrams = (all_data['PreProcess'].str.lower()
            .str.split(expand=True)
            .stack())
bgs = bigrams(unigrams)
lake_bgs = filter(lambda item: item[0] == 'minutes', bgs)
c = Counter(map(lambda item: item[1], lake_bgs))
print(c.most_common(12))
Answer
Use str.extractall to get the minutes, convert to numeric, and then take the mean:
average = pd.to_numeric(df['PreProcess'].str.extractall(r'(?i)(\d+)\s+minutes').squeeze()).mean()
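Since the question asks for this to work with any input word, the same idea can be wrapped in a small helper. This is only a sketch: the function name average_before and the sample Series are made up for illustration, and it assumes the numbers are whole integers separated from the word by whitespace (a value like "2.5 minutes" would not be parsed correctly).

```python
import re
import pandas as pd

def average_before(sentences, word):
    """Mean of the integers that directly precede `word` (case-insensitive).

    `sentences` is assumed to be a pandas Series of strings.
    """
    # re.escape protects against regex metacharacters in the input word;
    # \b stops 'minute' from matching inside 'minutes'.
    pattern = rf'(?i)(\d+)\s+{re.escape(word)}\b'
    # extractall returns one row per match; column 0 is the captured number.
    matches = sentences.str.extractall(pattern)[0]
    return pd.to_numeric(matches).mean()

# Hypothetical sample data taken from the question.
sentences = pd.Series(['I had to wait 30 minutes',
                       'It took too long had to wait 35 minutes'])
print(average_before(sentences, 'minutes'))  # 32.5
```

With the real data this would be called as average_before(df['PreProcess'], 'minutes'), swapping in whatever word is of interest.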