I am attempting to iterate through a Hacker News dataset and trying to create three categories (i.e. types of posts) found on the HN forum, viz. ask_posts, show_posts and other_posts.
In short, I am trying to find the average number of comments per post for each category (described below).
import pandas as pd
import datetime as dt

df = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')

ask_posts = []
show_posts = []
other_post = []
total_ask_comments = 0
total_show_comments = 0

for i, row in df.iterrows():
    title = row.title
    comments = row['num_comments']
    if title.lower().startswith('ask hn'):
        ask_posts.append(title)
        for post in ask_posts:
            total_ask_comments += comments
    elif title.lower().startswith('show hn'):
        show_posts.append(title)
        for post in show_posts:
            total_show_comments += comments
    else:
        other_post.append(title)

avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)

print(total_ask_comments)
print(total_show_comments)
print(avg_ask_comments)
print(avg_show_comments)
The results, respectively, are:
395976587
250362315
and
43328.21829521829
24646.81187241583
These seem quite high and I am not sure if that is because of an issue with the way I have structured my nested loop. Is this method correct? It is critical that I use a for loop to do this.
Any and all help/verification of my code is appreciated.
Answer
This post doesn’t specifically answer the question about looping through dataframes, but it gives you an alternative solution which is faster.
Looping over Pandas dataframes to gather information the way you have is going to be tremendously slow. It’s much, much faster to use filtering to get the information you want.
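That said, the inflated totals in the question come from the inner `for post in ask_posts:` loop, which re-adds the current row’s comment count once for every post collected so far. If a for loop is required, dropping the inner loops fixes the counts. A minimal sketch with hypothetical stand-in data (same column names as the question’s CSV):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the real CSV.
df = pd.DataFrame({
    "title": ["Ask HN: A?", "Show HN: B", "Ask HN: C?", "Other"],
    "num_comments": [10, 4, 6, 2],
})

ask_posts, show_posts, other_posts = [], [], []
total_ask_comments = 0
total_show_comments = 0

for i, row in df.iterrows():
    title = row.title
    comments = row["num_comments"]
    if title.lower().startswith("ask hn"):
        ask_posts.append(title)
        total_ask_comments += comments   # add once per matching row, no inner loop
    elif title.lower().startswith("show hn"):
        show_posts.append(title)
        total_show_comments += comments
    else:
        other_posts.append(title)

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)
print(avg_ask_comments)   # 8.0
print(avg_show_comments)  # 4.0
```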
>>> show_posts = df[df.title.str.contains("show hn", case=False)]
>>> show_posts
              id  ...       created_at
52      12578335  ...   9/26/2016 0:36
58      12578182  ...   9/26/2016 0:01
64      12578098  ...  9/25/2016 23:44
70      12577991  ...  9/25/2016 23:17
140     12577142  ...  9/25/2016 20:06
...          ...  ...              ...
292995  10177714  ...   9/6/2015 14:21
293002  10177631  ...   9/6/2015 13:50
293019  10177511  ...   9/6/2015 13:02
293028  10177459  ...   9/6/2015 12:38
293037  10177421  ...   9/6/2015 12:16

[10189 rows x 7 columns]
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> ask_posts
              id  ...       created_at
10      12578908  ...   9/26/2016 2:53
42      12578522  ...   9/26/2016 1:17
76      12577908  ...  9/25/2016 22:57
80      12577870  ...  9/25/2016 22:48
102     12577647  ...  9/25/2016 21:50
...          ...  ...              ...
293047  10177359  ...   9/6/2015 11:27
293052  10177317  ...   9/6/2015 10:52
293055  10177309  ...   9/6/2015 10:46
293073  10177200  ...    9/6/2015 9:36
293114  10176919  ...    9/6/2015 6:02

[9147 rows x 7 columns]
You can get your numbers very quickly this way:
>>> num_ask_comments = ask_posts.num_comments.sum()
>>> num_ask_comments
95000
>>> num_show_comments = show_posts.num_comments.sum()
>>> num_show_comments
50026
>>>
>>> total_num_comments = df.num_comments.sum()
>>> total_num_comments
1912761
>>>
>>> # Get a ratio of the number of ask comments to the total number of comments
>>> num_ask_comments / total_num_comments
0.04966642460819726
>>>
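Since the original goal was the average number of comments per category, .mean() on the filtered frames gives that in one step. A minimal sketch with hypothetical stand-in data (same column names as the question’s CSV):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the real CSV.
df = pd.DataFrame({
    "title": ["Ask HN: A?", "Show HN: B", "Ask HN: C?", "Other"],
    "num_comments": [12, 3, 8, 1],
})

ask_posts = df[df.title.str.contains("ask hn", case=False)]
show_posts = df[df.title.str.contains("show hn", case=False)]

# .mean() averages the column directly, no manual totals or counts needed
avg_ask_comments = ask_posts.num_comments.mean()
avg_show_comments = show_posts.num_comments.mean()
print(avg_ask_comments)   # 10.0
print(avg_show_comments)  # 3.0
```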
Also, you’ll get different numbers with .startswith() vs. .contains() (I’m not sure which you want).
>>> ask_posts = df[df.title.str.lower().str.startswith("ask hn")]
>>> len(ask_posts)
9139
>>>
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> len(ask_posts)
9147
>>>
The pattern argument to .contains() can be a regular expression, which is very useful. So we can specify all records that begin with “ask hn” at the very start of the title, but if we’re not sure whether any whitespace will be in front of it, we can do:
>>> ask_posts = df[df.title.str.contains(r"^\s*ask hn", case=False)]
>>> len(ask_posts)
9139
>>>
What’s happening in the filter statements is probably difficult to grasp when you’re starting out with Pandas. Take, for instance, the expression in the square brackets of df[df.title.str.contains("show hn", case=False)]. What the statement inside the square brackets (df.title.str.contains("show hn", case=False)) produces is a Series of True and False values, a boolean mask. That boolean mask is then used to select rows in the dataframe, df[<boolean mask>], and it produces a new dataframe containing only the matching records. We can then use that to extract other information, like the sum of the comments column.
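To make the two steps explicit, here is a small sketch with hypothetical data, first building the boolean mask, then using it to select rows:

```python
import pandas as pd

# Hypothetical mini-dataset for illustration.
df = pd.DataFrame({
    "title": ["Show HN: A", "Ask HN: B", "Other"],
    "num_comments": [5, 7, 1],
})

# Step 1: the str.contains() call produces a boolean Series (the mask)
mask = df.title.str.contains("show hn", case=False)
print(mask.tolist())                  # [True, False, False]

# Step 2: indexing the dataframe with the mask keeps only rows marked True
show_posts = df[mask]
print(show_posts.num_comments.sum())  # 5
```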