I have a dataframe of taxi data with two columns that looks like this:
JavaScript
x
7
1
Neighborhood Borough Time
2
Midtown Manhattan X
3
Melrose Bronx Y
4
Grant City Staten Island Z
5
Midtown Manhattan A
6
Lincoln Square Manhattan B
7
Basically, each row represents a taxi pickup in that neighborhood in that borough. Now, I want to find the top 5 neighborhoods in each borough with the most number of pickups. I tried this:
JavaScript
1
2
1
df['Neighborhood'].groupby(df['Borough']).value_counts()
2
Which gives me something like this:
JavaScript
1
24
24
1
borough
2
Bronx High Bridge 3424
3
Mott Haven 2515
4
Concourse Village 1443
5
Port Morris 1153
6
Melrose 492
7
North Riverdale 463
8
Eastchester 434
9
Concourse 395
10
Fordham 252
11
Wakefield 214
12
Kingsbridge 212
13
Mount Hope 200
14
Parkchester 191
15
16
17
Staten Island Castleton Corners 4
18
Dongan Hills 4
19
Eltingville 4
20
Graniteville 4
21
Great Kills 4
22
Castleton 3
23
Woodrow 1
24
How do I filter it so that I get only the top 5 from each? I know there are a few questions with a similar title but they weren’t helpful to my case.
Advertisement
Answer
I think you can use nlargest
– you can change 1
to 5
:
JavaScript
1
15
15
1
s = df['Neighborhood'].groupby(df['Borough']).value_counts()
2
print s
3
Borough
4
Bronx Melrose 7
5
Manhattan Midtown 12
6
Lincoln Square 2
7
Staten Island Grant City 11
8
dtype: int64
9
10
print s.groupby(level=[0,1]).nlargest(1)
11
Bronx Bronx Melrose 7
12
Manhattan Manhattan Midtown 12
13
Staten Island Staten Island Grant City 11
14
dtype: int64
15
additional columns were getting created, specified level info