
How to iterate through a pandas dataframe with a nested for loop?

I am iterating through a Hacker News dataset and trying to sort the posts into the 3 categories (i.e. types of posts) found on the HN forum, viz. ask_posts, show_posts and other_posts.

In short, I am trying to find the average number of comments per post in each category (described below).


The results, respectively, are:

395976587

250362315

and

43328.21829521829

24646.81187241583

These seem quite high, and I am not sure whether that is because of the way I have structured my nested loop. Is this method correct? It is critical that I use a for loop to do this.

Any and all help/verification of my code is appreciated.


Answer

This post doesn’t specifically answer the question about looping through dataframes, but it gives you an alternative solution that is faster.

Looping over a Pandas dataframe row by row to gather this information is going to be very slow. It’s much faster to use filtering to get the information you want.
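As a minimal sketch of what filtering could look like here: the dataframe below is a made-up stand-in for the Hacker News data, and the column name `num_comments` is an assumption (only the `title` column is named in this answer). The three categories are built from two boolean filters.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Hacker News dataset.
# "num_comments" is an assumed column name.
df = pd.DataFrame({
    "title": [
        "Ask HN: How do I learn pandas?",
        "Show HN: My side project",
        "A post about databases",
        "ask hn: best laptop?",
    ],
    "num_comments": [10, 4, 7, 2],
})

# One True/False value per row, case-insensitive substring match.
is_ask = df["title"].str.contains("ask hn", case=False)
is_show = df["title"].str.contains("show hn", case=False)

ask_posts = df[is_ask]
show_posts = df[is_show]
# Everything that matched neither filter.
other_posts = df[~(is_ask | is_show)]
```

No explicit loop is needed: each filter is evaluated over the whole column at once.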


You can get your numbers very quickly this way.
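A sketch of how the totals and averages might then be computed, again on made-up data with an assumed `num_comments` column. The average is the mean of the comment counts over the filtered rows only.

```python
import pandas as pd

# Hypothetical sample data; "num_comments" is an assumed column name.
df = pd.DataFrame({
    "title": [
        "Ask HN: How do I learn pandas?",
        "Show HN: My side project",
        "A post about databases",
        "ask hn: best laptop?",
    ],
    "num_comments": [10, 4, 7, 2],
})

ask_posts = df[df["title"].str.contains("ask hn", case=False)]
show_posts = df[df["title"].str.contains("show hn", case=False)]

# Total and average number of comments per category.
total_ask_comments = ask_posts["num_comments"].sum()
avg_ask_comments = ask_posts["num_comments"].mean()
total_show_comments = show_posts["num_comments"].sum()
avg_show_comments = show_posts["num_comments"].mean()
```

An average computed this way can never exceed the largest comment count in the category, which makes it a useful sanity check against a loop-based result.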


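Also, you’ll get different numbers with .startswith() vs. .contains() (I’m not sure which you want): .startswith() only matches titles that begin with the text, while .contains() matches the text anywhere in the title.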
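To illustrate the difference, here is a small sketch on invented titles. The titles are lowercased before the prefix test so that it is case-insensitive, matching the `case=False` used with `.contains()`.

```python
import pandas as pd

# Hypothetical titles: the second one mentions "Ask HN" but does not start with it.
titles = pd.Series([
    "Ask HN: How do I learn pandas?",
    "Why I stopped reading Ask HN threads",
    "Show HN: My side project",
])

# Matches "ask hn" anywhere in the title.
contains_mask = titles.str.contains("ask hn", case=False)

# Matches only titles that begin with "ask hn" (lowercased first).
startswith_mask = titles.str.lower().str.startswith("ask hn")

# Summing a boolean Series counts the True values.
n_contains = int(contains_mask.sum())
n_startswith = int(startswith_mask.sum())
```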


The pattern argument to .contains() can be a regular expression, which is very useful. So we can select all records that have “ask hn” at the very start of the title, and if we’re not sure whether any whitespace will be in front of it, we can anchor the pattern to the start of the title and allow optional leading whitespace.
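One possible pattern for this is sketched below. The regular expression `^\s*ask hn` is my own choice for the illustration, not necessarily the one originally used: `^` anchors the match to the start of the title and `\s*` allows any amount of leading whitespace.

```python
import pandas as pd

# Hypothetical titles, one with leading spaces.
titles = pd.Series([
    "  Ask HN: leading spaces here",
    "Ask HN: no leading spaces",
    "Tell me about Ask HN",
])

# str.contains treats the pattern as a regular expression by default.
starts_with_ask = titles.str.contains(r"^\s*ask hn", case=False)

matching_titles = titles[starts_with_ask]
```

The third title is excluded because “Ask HN” does not appear at the start.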


What’s happening in the filter statements is probably difficult to grasp when you’re starting out with Pandas. Take the expression in the square brackets of df[df.title.str.contains("show hn", case=False)], for instance.

What the statement inside the square brackets (df.title.str.contains("show hn", case=False)) produces is a column of True and False values, one per row: a boolean filter (usually called a boolean mask).

That boolean column is then used to select rows in the dataframe, df[<bool column>], which produces a new dataframe containing only the rows where the value is True. We can then use that to extract other information, like the sum of the comments column.
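The two steps can be made visible by naming the intermediate boolean column. This is a sketch on made-up data, and the `num_comments` column name is an assumption.

```python
import pandas as pd

# Hypothetical sample data; "num_comments" is an assumed column name.
df = pd.DataFrame({
    "title": ["Show HN: A", "Ask HN: B", "show hn: C"],
    "num_comments": [3, 5, 4],
})

# Step 1: the expression inside the square brackets yields True/False per row.
mask = df.title.str.contains("show hn", case=False)
mask_values = mask.tolist()

# Step 2: indexing with the mask keeps only the rows where it is True.
show_posts = df[mask]

# The result is an ordinary dataframe, so its columns can be summed.
total_show_comments = show_posts["num_comments"].sum()
```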

User contributions licensed under: CC BY-SA