I have two columns in a pandas dataframe FeatureID, and Issue ID. There can be multiple issues for a feature. An IssueID is unique and never repeated.
For example (actual data is 1500 rows so let’s say it’s as follows):
IssueID | FeatureID |
---|---|
5612 | 65002 |
5613 | 65401 |
5614 | 65002 |
5615 | 65002 |
5616 | 65401 |
5617 | 65432 |
5618 | 65536 |
5619 | 65002 |
5620 | 65536 |
5621 | 65536 |
etc.. | etc.. |
I am trying to return all of the features that have 2 or more Issue IDs associated with them and I would like to list all of the IssueIDs in a new column.
Using the following I can get the count of the FeatureIDs, but I don’t know how to get from that to also printing the IssueIDs (or just listing them in a new column).
(Assume I’ve already read the csv into a DataFrame)
JavaScript
x
5
1
df2 = pd.DataFrame([['IssueID', 'FeatureID']]
2
3
Features = df2['FeatureID'].value_counts().loc[lambda x : x>2]
4
print(FeatureS)
5
How can I display the FeatureIDs, the number of issues related to them, and then the IssueIDs for each FeatureID?
Advertisement
Answer
Use aggregate function.
JavaScript
1
15
15
1
df2 = df.groupby('FeatureID').agg({'IssueID': [len, list]}).droplevel(0, axis=1).reset_index()
2
3
FeatureID len list
4
0 65002 4 [5612, 5614, 5615, 5619]
5
1 65401 2 [5613, 5616]
6
2 65432 1 [5617]
7
3 65536 3 [5618, 5620, 5621]
8
9
df2[df2['len']>1]
10
11
FeatureID len list
12
0 65002 4 [5612, 5614, 5615, 5619]
13
1 65401 2 [5613, 5616]
14
3 65536 3 [5618, 5620, 5621]
15