Sorting an entire CSV by frequency of occurrence in one column

I have a large CSV file, which is a log of caller data.

A short snippet of my file:

CompanyName    High Priority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User

I want to sort the entire list by the frequency of occurrence of customers so it will be like:

CompanyName    High Priority     QualityIssue
Customer3         No               Equipment
Customer3         No               User
Customer3         Yes              User
Customer3         Yes              Equipment
Customer1         Yes              User
Customer1         Yes              User
Customer1         No               Neither
Customer2         No               User
Customer4         No               User

I’ve tried groupby, but that only prints out the company name and its frequency, not the other columns. I also tried

df['Totals'] = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in range(len(df))]

and

df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in range(len(df))]

But both of these give me an error:

ValueError: The wrong number of items passed 1, indices imply 24

I’ve looked at something like this:

for key, value in sorted(mydict.items(), key=lambda kv: (kv[1], kv[0])):
    print("%s: %s" % (key, value))

but this only prints out two columns, and I want my entire CSV sorted this way, with the rows ordered by how often each value in the first column occurs.

Thanks for the help in advance!


Answer

This seems to do what you want: add a count column by performing a groupby and transform with value_counts, then sort on that column:

df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df = df.sort_values('count', ascending=False)

Output:

  CompanyName HighPriority QualityIssue count
5   Customer3           No         User     4
3   Customer3           No    Equipment     4
7   Customer3          Yes    Equipment     4
6   Customer3          Yes         User     4
0   Customer1          Yes         User     3
4   Customer1           No      Neither     3
1   Customer1          Yes         User     3
8   Customer4           No         User     1
2   Customer2           No         User     1
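As a side note, passing the string `'count'` to `transform` should do the same job as `pd.Series.value_counts` here, since every row in a group simply receives its group's size. A minimal, self-contained sketch (the DataFrame below just reproduces the sample data from the question):

```python
import pandas as pd

# Reproduce the sample data from the question
df = pd.DataFrame({
    'CompanyName': ['Customer1', 'Customer1', 'Customer2', 'Customer3',
                    'Customer1', 'Customer3', 'Customer3', 'Customer3',
                    'Customer4'],
    'HighPriority': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No'],
    'QualityIssue': ['User', 'User', 'User', 'Equipment', 'Neither',
                     'User', 'User', 'Equipment', 'User'],
})

# 'count' is the built-in shorthand: each row gets its group's size
df['count'] = df.groupby('CompanyName')['CompanyName'].transform('count')
df = df.sort_values('count', ascending=False)
print(df)
```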

You can drop the extraneous column using df.drop:

df.drop('count', axis=1)

Output:

  CompanyName HighPriority QualityIssue
5   Customer3           No         User
3   Customer3           No    Equipment
7   Customer3          Yes    Equipment
6   Customer3          Yes         User
0   Customer1          Yes         User
4   Customer1           No      Neither
1   Customer1          Yes         User
8   Customer4           No         User
2   Customer2           No         User
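If you would rather not add a helper column at all, newer pandas versions (1.1+) let `sort_values` take a `key` function, so the frequency can be computed on the fly. A sketch under that assumption, again using the question's sample data:

```python
import pandas as pd

# The question's sample data
df = pd.DataFrame({
    'CompanyName': ['Customer1', 'Customer1', 'Customer2', 'Customer3',
                    'Customer1', 'Customer3', 'Customer3', 'Customer3',
                    'Customer4'],
    'HighPriority': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No'],
    'QualityIssue': ['User', 'User', 'User', 'Equipment', 'Neither',
                     'User', 'User', 'Equipment', 'User'],
})

# Map each company name to its frequency and sort on that, descending
out = df.sort_values(
    'CompanyName',
    key=lambda s: s.map(s.value_counts()),
    ascending=False,
)
print(out)
```

This keeps the result free of extra columns, at the cost of requiring a recent pandas.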
User contributions licensed under: CC BY-SA