
Increasing performance when searching a pandas DataFrame: count occurrences of the starting string grouped by unique identifier

Current dataframe is as follows:

import pandas as pd

df = pd.read_csv('filename.csv', delimiter=',')

print(df)
idx   uniqueID   String

0     1          'hello'
1     1          'goodbye'
2     1          'happy'
3     2          'hello'
4     2          'happy'
5     3          'goodbye'
6     3          'hello'
7     3          'hello'
8     4          'goodbye'
9     5          'goodbye'

Expected Output:

{'hello': 2, 'goodbye': 3}
Where hello is counted at idx 0 & 3, and goodbye at idx 5, 8 & 9.
In the actual dataset there are more than two starting strings.



I'm thinking of potentially using pandas .groupby() and .where() to filter
for the first time a uniqueID occurs, and then grouping by the string,
but I'm not entirely sure.
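
One way that idea could be sketched is with a duplicated() mask that keeps only each uniqueID's first row (just a rough sketch of what I have in mind, not tested at scale):

# mask rows where the uniqueID has not been seen before, then count per string
first_occurrence = ~df['uniqueID'].duplicated()
counts = df.loc[first_occurrence].groupby('String').size()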

Question: How do I get the counts of the 'starting string' only for the first occurrence of each uniqueID, and with better performance?

So far, I've been doing this with a simple for loop over the data and if/else checks, but this is incredibly slow on a large dataframe.
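
A minimal sketch of what that loop looks like conceptually (the variable names here are only illustrative):

# row-by-row approach: slow on large frames because of iterrows()
counts = {}
seen = set()
for _, row in df.iterrows():
    if row['uniqueID'] not in seen:
        seen.add(row['uniqueID'])
        counts[row['String']] = counts.get(row['String'], 0) + 1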

I'm curious whether there are any functions built into pandas, or another library out there, that would reduce the overall time it takes.


Answer

Achieving better than O(N) is not possible, since every row has to be examined at least once; the gain comes from replacing the Python-level loop with vectorized pandas operations.

You can drop_duplicates, then value_counts:

# keep only the first row for each uniqueID, then count the starting strings
out = df.drop_duplicates('uniqueID')['String'].value_counts()

output:

goodbye    3
hello      2
Name: String, dtype: int64

As a dictionary:

df.drop_duplicates('uniqueID')['String'].value_counts().to_dict()

output: {'goodbye': 3, 'hello': 2}
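
As a self-contained check, here is the same one-liner run against the sample data from the question (the DataFrame is built inline instead of read from a CSV):

import pandas as pd

df = pd.DataFrame({
    'uniqueID': [1, 1, 1, 2, 2, 3, 3, 3, 4, 5],
    'String': ['hello', 'goodbye', 'happy', 'hello', 'happy',
               'goodbye', 'hello', 'hello', 'goodbye', 'goodbye'],
})

# drop all but the first row per uniqueID, then count the starting strings
result = df.drop_duplicates('uniqueID')['String'].value_counts().to_dict()
print(result)  # {'goodbye': 3, 'hello': 2}

A groupby-based variant, df.groupby('uniqueID')['String'].first().value_counts(), gives the same counts here; either way the work stays in pandas' vectorized code paths instead of a Python-level loop.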
