How to iterate through a pandas dataframe with a nested for loop?

I am attempting to iterate through a Hacker News dataset and create 3 categories (i.e. types of posts) found on the HN forum: ask_posts, show_posts and other_posts.

In short, I am trying to find out the average number of comments per post for each category (described below).

import pandas as pd
import datetime as dt

df = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')

ask_posts = []
show_posts = []
other_post = []
total_ask_comments = 0
total_show_comments = 0

for i, row in df.iterrows():
    title = row.title
    comments = row['num_comments']
    if title.lower().startswith('ask hn'):
        ask_posts.append(title)
        for post in ask_posts:
            total_ask_comments += comments
    elif title.lower().startswith('show hn'):
        show_posts.append(title)
        for post in show_posts:
            total_show_comments += comments
    else:
        other_post.append(title)

avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)


print(total_ask_comments)
print(total_show_comments)

print(avg_ask_comments)
print(avg_show_comments)

The results, respectively, are:

395976587

250362315

and

43328.21829521829

24646.81187241583

These seem quite high and I am not sure whether it is because of an issue with the way I have structured my nested loop. Is this method correct? It is critical that I use a for loop to do this.

Any and all help/verification of my code is appreciated.


Answer

This post doesn’t specifically answer the question about looping through dataframes, but it gives you an alternative solution which is faster.

Looping over Pandas dataframes to gather the information the way you have it is going to be tremendously slow. It’s much, much faster to use filtering to get the information you want.

>>> show_posts = df[df.title.str.contains("show hn", case=False)]
>>> show_posts
              id  ...       created_at
52      12578335  ...   9/26/2016 0:36
58      12578182  ...   9/26/2016 0:01
64      12578098  ...  9/25/2016 23:44
70      12577991  ...  9/25/2016 23:17
140     12577142  ...  9/25/2016 20:06
...          ...  ...              ...
292995  10177714  ...   9/6/2015 14:21
293002  10177631  ...   9/6/2015 13:50
293019  10177511  ...   9/6/2015 13:02
293028  10177459  ...   9/6/2015 12:38
293037  10177421  ...   9/6/2015 12:16

[10189 rows x 7 columns]
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> ask_posts
              id  ...       created_at
10      12578908  ...   9/26/2016 2:53
42      12578522  ...   9/26/2016 1:17
76      12577908  ...  9/25/2016 22:57
80      12577870  ...  9/25/2016 22:48
102     12577647  ...  9/25/2016 21:50
...          ...  ...              ...
293047  10177359  ...   9/6/2015 11:27
293052  10177317  ...   9/6/2015 10:52
293055  10177309  ...   9/6/2015 10:46
293073  10177200  ...    9/6/2015 9:36
293114  10176919  ...    9/6/2015 6:02

[9147 rows x 7 columns]

You can get your numbers very quickly this way:

>>> num_ask_comments = ask_posts.num_comments.sum()
>>> num_ask_comments
95000
>>> num_show_comments = show_posts.num_comments.sum()
>>> num_show_comments
50026
>>> 
>>> total_num_comments = df.num_comments.sum()
>>> total_num_comments
1912761
>>> 
>>> # Get a ratio of the number ask comments to total number of comments
>>> num_ask_comments / total_num_comments
0.04966642460819726
>>> 
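
Since the original question asks for the average number of comments per post in each category, you can also get that directly with .mean() rather than dividing a sum yourself. A minimal sketch (the actual values depend on which filter you settle on):

>>> # Average number of comments per post in each category
>>> avg_ask_comments = ask_posts.num_comments.mean()
>>> avg_show_comments = show_posts.num_comments.mean()
>>> 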

Also you’ll get different numbers with .startswith() vs. .contains() (I’m not sure which you want).

>>> ask_posts = df[df.title.str.lower().str.startswith("ask hn")]
>>> len(ask_posts)
9139
>>> 
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> len(ask_posts)
9147
>>> 
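
If you want to see which titles account for that difference (8 rows, going by the two counts above), you can combine two boolean filters. An illustrative sketch, with starts and contains as throwaway names:

>>> # Posts that mention "ask hn" somewhere in the title but don't start with it
>>> starts = df.title.str.lower().str.startswith("ask hn")
>>> contains = df.title.str.contains("ask hn", case=False)
>>> difference = df[contains & ~starts]
>>> 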

The pattern argument to .contains() can be a regular expression, which is very useful. So we can specify all records that begin with “ask hn” at the very start of the title, and if we’re not sure whether any whitespace will be in front of it, we can do:

>>> ask_posts = df[df.title.str.contains(r"^s*ask hn", case=False)]
>>> len(ask_posts)
9139
>>> 

What’s happening in the filter statements is probably difficult to grasp when you’re starting out with Pandas, for instance the expression in the square brackets of df[df.title.str.contains("show hn", case=False)].

What the expression inside the square brackets (df.title.str.contains("show hn", case=False)) produces is a column of True and False values, commonly called a boolean mask.

That boolean mask is then used to select rows in the dataframe, df[<bool column>], which produces a new dataframe containing only the matching records. We can then use that to extract other information, like the sum of the comments column.
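
To make that two-step process explicit, here is a small sketch (the intermediate names mask and filtered are just for illustration):

>>> mask = df.title.str.contains("show hn", case=False)  # boolean Series: one True/False per row
>>> filtered = df[mask]                                  # new dataframe with only the rows where mask is True
>>> filtered.num_comments.sum()                          # then aggregate as shown above
>>> 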
