Current dataframe is as follows:
import pandas as pd

df = pd.read_csv('filename.csv', delimiter=',')
print(df)

idx  uniqueID  String
0    1         'hello'
1    1         'goodbye'
2    1         'happy'
3    2         'hello'
4    2         'happy'
5    3         'goodbye'
6    3         'hello'
7    3         'hello'
8    4         'goodbye'
9    5         'goodbye'
Expected Output:
{'hello': 2, 'goodbye': 3}

Here 'hello' is counted from idx 0 & 3, and 'goodbye' from idx 5, 8 & 9. In the actual dataset there are more than two starting strings. I'm thinking of potentially using pandas .groupby() and .where() to filter for the first time a uniqueID occurs, then group by the string? Not entirely sure.
Question: How do I get the counts of the 'starting string' (the String at the first occurrence of each uniqueID) with better performance?
Thus far, I've been doing this with a simple for loop over the data and if/else checks (roughly as sketched at the end of this question), but this is incredibly slow with a large dataframe.
I'm curious whether there are any functions built into pandas, or another library, that would reduce the overall runtime.
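The loop is roughly along these lines (an illustrative reconstruction, not the exact original code):

import pandas as pd

df = pd.read_csv('filename.csv', delimiter=',')

# Track which uniqueIDs have been seen; count the String of each
# first occurrence. Slow because iterrows() builds a Series per row.
seen = set()
counts = {}
for _, row in df.iterrows():
    uid = row['uniqueID']
    if uid not in seen:  # first time this uniqueID appears
        seen.add(uid)
        counts[row['String']] = counts.get(row['String'], 0) + 1
# counts -> {'hello': 2, 'goodbye': 3}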
Answer
Achieving better than O(N) is not possible, since every row has to be inspected at least once; the gain comes from replacing the Python-level loop with vectorized pandas operations.
You can drop_duplicates, then value_counts:
out = df.drop_duplicates('uniqueID')['String'].value_counts()
output:

goodbye    3
hello      2
Name: String, dtype: int64
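Note that drop_duplicates keeps the first row of each uniqueID by default (keep='first'), which matches the 'starting string' semantics here; written out explicitly:

out = df.drop_duplicates(subset='uniqueID', keep='first')['String'].value_counts()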
As a dictionary:
df.drop_duplicates('uniqueID')['String'].value_counts().to_dict()
output: {'goodbye': 3, 'hello': 2}
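Since the question mentioned groupby, an equivalent formulation (not benchmarked here, but also fully vectorized) is:

df.groupby('uniqueID', sort=False)['String'].first().value_counts().to_dict()
# {'goodbye': 3, 'hello': 2}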