For a school project I am attempting to determine the number of mentions specific words have in Reddit titles and comments. More specifically, stock ticker mentions. Currently the dataframe looks like this (where type could be a string of either title or comment):
body score id created subreddit type mentions 3860 There are much better industrials stocks than ... 1 NaN 2021-03-13 20:32:08+00:00 stocks comment {GE} 3776 I guy I work with told me about PENN about 9 m... 1 NaN 2021-03-13 20:29:30+00:00 investing comment {PENN} 4122 [mp4 link](https://preview.redd.it/ieae3z7suum... 2 NaN 2021-03-13 20:28:43+00:00 StockMarket comment {KB} 2219 If you cant decide, then just buy $GME options 1 NaN 2021-03-13 20:28:12+00:00 wallstreetbets comment {GME} 2229 This sub the most wholesome fucking thing in t... 2 NaN 2021-03-13 20:27:57+00:00 wallstreetbets comment {GME}
Where the mentions column contains a set of tickers mentioned in the body (could be multiple). What I wish to do is to count the number of unique mentions on a per-subreddit per-type (either comment or title) basis. The result I am looking for would be similar to this:
ticker subreddit type count GME wallstreetbets comment 5 GME wallstreetbets title 4 GME investing comment 3 GME investing title 2
Repeated for all unique tickers mentioned.
I had used counters to figure this out utilizing dataframes specific to each instance (ie one dataframe for wallstreetbets comments, one dataframe for wallstreetbets titles) but I could not figure out how to make it work in this fashion when confined to a singular dataframe.
Advertisement
Answer
Sound like a simple groupby
should do it:
df.groupby(['mentions','subreddit','type']).count()
produces
body score id created mentions subreddit type {GE} stocks comment 1 1 0 1 {GME} wallstreetbets comment 2 2 0 2 {KB} StockMarket comment 1 1 0 1 {PENN} investing comment 1 1 0 1