I have two columns in a pandas dataframe FeatureID, and Issue ID. There can be multiple issues for a feature. An IssueID is unique and never repeated.
For example (actual data is 1500 rows so let’s say it’s as follows):
IssueID | FeatureID |
---|---|
5612 | 65002 |
5613 | 65401 |
5614 | 65002 |
5615 | 65002 |
5616 | 65401 |
5617 | 65432 |
5618 | 65536 |
5619 | 65002 |
5620 | 65536 |
5621 | 65536 |
etc.. | etc.. |
I am trying to return all of the features that have 2 or more Issue IDs associated with them and I would like to list all of the IssueIDs in a new column.
Using the following I can get the count of the FeatureIDs, but I don’t know how to get from that to also printing the IssueIDs (or just listing them in a new column).
(Assume I’ve already read the csv into a DataFrame)
df2 = pd.DataFrame([['IssueID', 'FeatureID']] Features = df2['FeatureID'].value_counts().loc[lambda x : x>2] print(FeatureS)
How can I display the FeatureIDs, the number of issues related to them, and then the IssueIDs for each FeatureID?
Advertisement
Answer
Use aggregate function.
df2 = df.groupby('FeatureID').agg({'IssueID': [len, list]}).droplevel(0, axis=1).reset_index() FeatureID len list 0 65002 4 [5612, 5614, 5615, 5619] 1 65401 2 [5613, 5616] 2 65432 1 [5617] 3 65536 3 [5618, 5620, 5621] df2[df2['len']>1] FeatureID len list 0 65002 4 [5612, 5614, 5615, 5619] 1 65401 2 [5613, 5616] 3 65536 3 [5618, 5620, 5621]