I have two columns in a pandas dataframe FeatureID, and Issue ID. There can be multiple issues for a feature. An IssueID is unique and never repeated.
For example (actual data is 1500 rows so let’s say it’s as follows):
| IssueID | FeatureID |
|---|---|
| 5612 | 65002 |
| 5613 | 65401 |
| 5614 | 65002 |
| 5615 | 65002 |
| 5616 | 65401 |
| 5617 | 65432 |
| 5618 | 65536 |
| 5619 | 65002 |
| 5620 | 65536 |
| 5621 | 65536 |
| etc.. | etc.. |
I am trying to return all of the features that have 2 or more Issue IDs associated with them and I would like to list all of the IssueIDs in a new column.
Using the following I can get the count of the FeatureIDs, but I don’t know how to get from that to also printing the IssueIDs (or just listing them in a new column).
(Assume I’ve already read the csv into a DataFrame)
df2 = pd.DataFrame([['IssueID', 'FeatureID']] Features = df2['FeatureID'].value_counts().loc[lambda x : x>2] print(FeatureS)
How can I display the FeatureIDs, the number of issues related to them, and then the IssueIDs for each FeatureID?
Advertisement
Answer
Use aggregate function.
df2 = df.groupby('FeatureID').agg({'IssueID': [len, list]}).droplevel(0, axis=1).reset_index()
FeatureID len list
0 65002 4 [5612, 5614, 5615, 5619]
1 65401 2 [5613, 5616]
2 65432 1 [5617]
3 65536 3 [5618, 5620, 5621]
df2[df2['len']>1]
FeatureID len list
0 65002 4 [5612, 5614, 5615, 5619]
1 65401 2 [5613, 5616]
3 65536 3 [5618, 5620, 5621]