Let’s say that I have this dataframe
Name = ['ID', 'Country', 'IBAN','ID_info_1', 'Dan_Age', 'ID_info_1','Dan_city','ID_info_1','Dan_country','ID_info_1', 'ID_info_2', 'ID_info_2','ID_info_2', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ] Value = ['TAMARA_CO', 'GERMANY','FR56', '12', '18','25','Berlin','34', '55','345','432', '43', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP'] Ccy = ['','','','EUR','EUR','EUR','','EUR','','','','EUR','EUR','USD','USD','','CHF', '','DKN'] Group = ['0','0','0','1','1','2','2','3','3','4','1','2','3','4','2','2','2','3','3'] df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group}) print(df) Name Value Ccy Group 0 ID TAMARA_CO 0 1 Country GERMANY 0 2 IBAN FR56 0 3 ID_info_1 12 EUR 1 4 Dan_Age 18 EUR 1 5 ID_info_1 25 EUR 2 6 Dan_city Berlin 2 7 ID_info_1 34 EUR 3 8 Dan_country 55 3 9 ID_info_1 345 4 10 ID_info_2 432 1 11 ID_info_2 43 EUR 2 12 ID_info_2 GER EUR 3 13 Dan_sex M USD 4 14 Dan_Age 22 USD 2 15 Dan_country FRA 2 16 Dan_sex M CHF 2 17 Dan_city Madrid 3 18 Dan_country ESP DKN 3
I want to reduce this dataframe ! I want to reduce only the rows that contains the string “info” by keeping the ones that have the highest level in the column “Group”. So in this dataframe, it would mean that I keep the row “ID_info_1” in the group 4, and “ID_info_1” in the group 3. In addition, I want to change the their value in the column “Group” to 1.
So at the end I would like to get this new dataframe where the indexing is reset too
Name Value Ccy Group 0 ID TAMARA_CO 0 1 Country GERMANY 0 2 IBAN FR56 0 3 ID_info_1 12 EUR 1 4 Dan_Age 18 EUR 1 5 Dan_city Berlin 2 6 Dan_country 55 3 7 ID_info_1 345 1 8 ID_info_2 GER EUR 1 9 Dan_sex M USD 4 10 Dan_Age 22 USD 2 11 Dan_country FRA 2 12 Dan_sex M CHF 2 13 Dan_city Madrid 3 14 Dan_country ESP DKN 3
Anyone has an efficient idea ?
Thank you
Advertisement
Answer
How about this:
# select rows with "info" di = df[df.Name.str.contains('info')] # Find the rows below max for removal di = di[di.groupby('Name')['Group'].transform('max') != di['Group']] # Remove those rows and set a new index as requested df = df.drop(di.index).reset_index(drop=True) # Change group to one on remaining "info" rows df.loc[df.Name.str.contains('info'), 'Group'] = 1