I am attempting to iterate through a Hacker News dataset and trying to create three categories (i.e. types of posts) found on the HN forum, viz. ask_posts, show_posts and other_posts.
In short, I am trying to find the average number of comments per post for each category (described below).
import pandas as pd
import datetime as dt

df = pd.read_csv('HN_posts_year_to_Sep_26_2016.csv')

ask_posts = []
show_posts = []
other_post = []
total_ask_comments = 0
total_show_comments = 0

for i, row in df.iterrows():
    title = row.title
    comments = row['num_comments']
    if title.lower().startswith('ask hn'):
        ask_posts.append(title)
        for post in ask_posts:
            total_ask_comments += comments
    elif title.lower().startswith('show hn'):
        show_posts.append(title)
        for post in show_posts:
            total_show_comments += comments
    else:
        other_post.append(title)

avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)

print(total_ask_comments)
print(total_show_comments)
print(avg_ask_comments)
print(avg_show_comments)
The results, respectively, are:
395976587
250362315
and
43328.21829521829
24646.81187241583
These seem quite high and I am not sure if that is because of an issue with the way I have structured my nested loop. Is this method correct? It is critical that I use a for loop to do this.
Any and all help/verification of my code is appreciated.
Answer
This post doesn’t specifically answer the question about looping through dataframes, but it gives you an alternative solution which is faster.
Looping over Pandas dataframes to gather information the way you have is going to be tremendously slow. It’s much, much faster to use filtering to get the information you want.
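That said, the inflated totals in the question come from the inner `for post in ask_posts:` loop, which re-adds the current row’s comment count once for every post collected so far. If a for loop is required, dropping the inner loops fixes the counts. A minimal sketch with hypothetical stand-in data (same column names as the question’s CSV):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the real CSV.
df = pd.DataFrame({
    "title": ["Ask HN: A?", "Show HN: B", "Ask HN: C?", "Other"],
    "num_comments": [10, 4, 6, 2],
})

ask_posts, show_posts, other_posts = [], [], []
total_ask_comments = 0
total_show_comments = 0

for i, row in df.iterrows():
    title = row.title
    comments = row["num_comments"]
    if title.lower().startswith("ask hn"):
        ask_posts.append(title)
        total_ask_comments += comments   # add once per matching row, no inner loop
    elif title.lower().startswith("show hn"):
        show_posts.append(title)
        total_show_comments += comments
    else:
        other_posts.append(title)

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)
print(avg_ask_comments)   # 8.0
print(avg_show_comments)  # 4.0
```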
>>> show_posts = df[df.title.str.contains("show hn", case=False)]
>>> show_posts
              id  ...       created_at
52      12578335  ...   9/26/2016 0:36
58      12578182  ...   9/26/2016 0:01
64      12578098  ...  9/25/2016 23:44
70      12577991  ...  9/25/2016 23:17
140     12577142  ...  9/25/2016 20:06
...          ...  ...              ...
292995  10177714  ...   9/6/2015 14:21
293002  10177631  ...   9/6/2015 13:50
293019  10177511  ...   9/6/2015 13:02
293028  10177459  ...   9/6/2015 12:38
293037  10177421  ...   9/6/2015 12:16

[10189 rows x 7 columns]
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> ask_posts
              id  ...       created_at
10      12578908  ...   9/26/2016 2:53
42      12578522  ...   9/26/2016 1:17
76      12577908  ...  9/25/2016 22:57
80      12577870  ...  9/25/2016 22:48
102     12577647  ...  9/25/2016 21:50
...          ...  ...              ...
293047  10177359  ...   9/6/2015 11:27
293052  10177317  ...   9/6/2015 10:52
293055  10177309  ...   9/6/2015 10:46
293073  10177200  ...    9/6/2015 9:36
293114  10176919  ...    9/6/2015 6:02

[9147 rows x 7 columns]
You can get your numbers very quickly this way:
>>> num_ask_comments = ask_posts.num_comments.sum()
>>> num_ask_comments
95000
>>> num_show_comments = show_posts.num_comments.sum()
>>> num_show_comments
50026
>>>
>>> total_num_comments = df.num_comments.sum()
>>> total_num_comments
1912761
>>>
>>> # Get a ratio of the number of ask comments to the total number of comments
>>> num_ask_comments / total_num_comments
0.04966642460819726
>>>
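Since the original goal was the average number of comments per category, .mean() on the filtered frames gives that in one step. A minimal sketch with hypothetical stand-in data (same column names as the question’s CSV):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the real CSV.
df = pd.DataFrame({
    "title": ["Ask HN: A?", "Show HN: B", "Ask HN: C?", "Other"],
    "num_comments": [12, 3, 8, 1],
})

ask_posts = df[df.title.str.contains("ask hn", case=False)]
show_posts = df[df.title.str.contains("show hn", case=False)]

# .mean() averages the column directly, no manual totals or counts needed
avg_ask_comments = ask_posts.num_comments.mean()
avg_show_comments = show_posts.num_comments.mean()
print(avg_ask_comments)   # 10.0
print(avg_show_comments)  # 3.0
```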
Also, you’ll get different numbers with .startswith() vs. .contains() (I’m not sure which you want).
>>> ask_posts = df[df.title.str.lower().str.startswith("ask hn")]
>>> len(ask_posts)
9139
>>>
>>> ask_posts = df[df.title.str.contains("ask hn", case=False)]
>>> len(ask_posts)
9147
>>>
The pattern argument to .contains() can be a regular expression, which is very useful. So we can specify all records that begin with “ask hn” at the very start of the title, but if we’re not sure whether any whitespace will be in front of it, we can do:
>>> ask_posts = df[df.title.str.contains(r"^\s*ask hn", case=False)]
>>> len(ask_posts)
9139
>>>
What’s happening in the filter statements is probably difficult to grasp when you’re starting out with Pandas. Take, for instance, the expression in the square brackets of df[df.title.str.contains("show hn", case=False)]. What the statement inside the square brackets (df.title.str.contains("show hn", case=False)) produces is a Series of True and False values, a boolean mask. That boolean mask is then used to select rows in the dataframe, df[<boolean mask>], and it produces a new dataframe containing only the matching records. We can then use that to extract other information, like the sum of the comments column.
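To make the two steps explicit, here is a small sketch with hypothetical data, first building the boolean mask, then using it to select rows:

```python
import pandas as pd

# Hypothetical mini-dataset for illustration.
df = pd.DataFrame({
    "title": ["Show HN: A", "Ask HN: B", "Other"],
    "num_comments": [5, 7, 1],
})

# Step 1: the str.contains() call produces a boolean Series (the mask)
mask = df.title.str.contains("show hn", case=False)
print(mask.tolist())                  # [True, False, False]

# Step 2: indexing the dataframe with the mask keeps only rows marked True
show_posts = df[mask]
print(show_posts.num_comments.sum())  # 5
```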