Skip to content
Advertisement

Counting unique mentions in Pandas dataframe column while grouped by multiple other columns

For a school project I am attempting to determine the number of mentions specific words have in Reddit titles and comments. More specifically, stock ticker mentions. Currently the dataframe looks like this (where type could be a string of either title or comment):

                                                   body  score   id                    created       subreddit     type      mentions
3860  There are much better industrials stocks than ...      1  NaN  2021-03-13 20:32:08+00:00          stocks  comment          {GE}
3776  I guy I work with told me about PENN about 9 m...      1  NaN  2021-03-13 20:29:30+00:00       investing  comment        {PENN}
4122  [mp4 link](https://preview.redd.it/ieae3z7suum...      2  NaN  2021-03-13 20:28:43+00:00     StockMarket  comment          {KB}
2219     If you cant decide, then just buy $GME options      1  NaN  2021-03-13 20:28:12+00:00  wallstreetbets  comment        {GME}
2229  This sub the most wholesome fucking thing in t...      2  NaN  2021-03-13 20:27:57+00:00  wallstreetbets  comment         {GME}

Where the mentions column contains a set of tickers mentioned in the body (could be multiple). What I wish to do is to count the number of unique mentions on a per-subreddit per-type (either comment or title) basis. The result I am looking for would be similar to this:

ticker            subreddit             type              count
 GME           wallstreetbets          comment             5
 GME           wallstreetbets           title              4
 GME             investing             comment             3
 GME             investing              title              2

Repeated for all unique tickers mentioned.

I had used counters to figure this out utilizing dataframes specific to each instance (ie one dataframe for wallstreetbets comments, one dataframe for wallstreetbets titles) but I could not figure out how to make it work in this fashion when confined to a singular dataframe.

Advertisement

Answer

Sound like a simple groupby should do it:

df.groupby(['mentions','subreddit','type']).count()

produces

                                        body    score   id  created
mentions    subreddit        type               
{GE}        stocks           comment    1       1       0   1
{GME}       wallstreetbets   comment    2       2       0   2
{KB}        StockMarket      comment    1       1       0   1
{PENN}      investing        comment    1       1       0   1
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement