I have a pandas DataFrame of the following format: Input: where (version, branch) is a MultiIndex. PROBLEM DESCRIPTION: I want to group by version and set the values in column X with branch overall to the sum of the values in column X for the remaining branches (having the same version), weighted by th…
Tag: aggregate
How to split in train and test by month
I have a dataframe structured like this: I have data for all days and months from 2018 to 2021, with around 50k observations. How can I aggregate all the data for the same month and perform a train-test split for each month? I.e. for all the data of the months of January, February, March and so on. Answer t…
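One possible way to approach a per-month split is sketched below. The excerpt does not show the actual columns, so the `date` and `value` columns, the 80/20 ratio and the `splits` dictionary are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: one row per day from 2018 to 2021 (the real columns are not shown).
df = pd.DataFrame({"date": pd.date_range("2018-01-01", "2021-12-31", freq="D")})
df["value"] = range(len(df))

# Group all Januaries together, all Februaries together, and so on,
# then draw an 80/20 random split inside each calendar month.
splits = {}
for month, group in df.groupby(df["date"].dt.month):
    train = group.sample(frac=0.8, random_state=0)  # assumed 80% training share
    test = group.drop(train.index)                  # the remaining rows
    splits[month] = (train, test)

january_train, january_test = splits[1]
```

Grouping on `df["date"].dt.month` pools the same month across years; grouping on `df["date"].dt.to_period("M")` instead would keep each year's month separate.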
Pandas: using groupby to calculate a ratio by specific values
Hi, I have a dataframe that looks like this: and I want to calculate a ratio in the column 'count_number', based on the values in the column 'tone', using this formula: ('blue' + 'grey') / 'red' for each unique combination of 'participant_id', R…
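A minimal sketch of one way to compute such a ratio follows. The sample values are invented, and because the remaining grouping columns are cut off in the excerpt, only 'participant_id' is used as the key here; further keys could be added to the `index` list.

```python
import pandas as pd

# Hypothetical sample; only the column names come from the question.
df = pd.DataFrame({
    "participant_id": [1, 1, 1, 2, 2, 2],
    "tone": ["blue", "grey", "red", "blue", "grey", "red"],
    "count_number": [2, 4, 3, 1, 1, 4],
})

# One row per participant, one column per tone.
wide = df.pivot_table(index=["participant_id"], columns="tone",
                      values="count_number", aggfunc="sum", fill_value=0)

# (blue + grey) / red for every participant.
ratio = (wide["blue"] + wide["grey"]) / wide["red"]
```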
Can we use iterables in pandas groupby agg function?
I have a pandas groupby function. I have another input in the form of a dict which has a {column: aggfunc} structure, as shown below: I want to use this dict to apply aggregate functions as follows: Is there some way I can achieve this using the input dict d (maybe by using dict comprehensions)? Answer If dictionar…
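As a rough illustration, a {column: aggfunc} dict can be handed to `agg` directly, and a dict comprehension can turn it into named aggregations. The frame, the `team` key and the contents of `d` below are made up for the example.

```python
import pandas as pd

# Hypothetical frame and dict; the question's real data is not shown.
df = pd.DataFrame({"team": ["a", "a", "b"],
                   "points": [1, 2, 3],
                   "minutes": [10, 20, 30]})
d = {"points": "sum", "minutes": "mean"}

# 1) Pass the dict straight to agg: one aggregate per column.
result = df.groupby("team").agg(d)

# 2) Build named aggregations with a dict comprehension,
#    giving output columns such as "points_sum".
named = df.groupby("team").agg(
    **{f"{col}_{func}": (col, func) for col, func in d.items()}
)
```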
Pandas – What datatype should a duration column (mm:ss) be to use aggregates on it?
I’m doing some NBA analysis and have a “Minutes Played” column for players in a mm:ss format. What dtype should this column be to perform aggregate functions (mean, min, max, etc…) on it? The df has over 20,000 rows, so here is a sample of the column in question: I ran this code to cha…
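One option, sketched here under the assumption that every value is a well-formed mm:ss string, is to convert the column to a timedelta dtype, which supports mean, min and max. The column name and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of a "Minutes Played" column in mm:ss format.
minutes_played = pd.Series(["30:00", "20:30", "10:30"], name="Minutes Played")

# Split into minutes and seconds, convert to total seconds, then to timedelta.
parts = minutes_played.str.split(":", expand=True).astype(int)
duration = pd.to_timedelta(parts[0] * 60 + parts[1], unit="s")

mean_duration = duration.mean()
longest = duration.max()
shortest = duration.min()
```

Computing total seconds first avoids relying on how a bare mm:ss string would be parsed, and it also copes with minute values above 59.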
how to find $avg and $sum for fields which contain NaN value in mongodb?
I can find and limit columns which contain NaN values before using the $group clause in MongoDB when I use the mongo CLI or JavaScript. However, when I use Python and its major library “pymongo”, it does not seem to be able to do the same. As the following code shows, NaN is not part of Python syntax. Whereas it is easy a…
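Python has no NaN literal, but `float("nan")` produces the value, so a pipeline can be written as plain dictionaries. The sketch below only builds the pipeline; the field names `price` and `category` are hypothetical, and whether `$ne` against NaN filters as intended should be checked against the MongoDB version in use.

```python
import math

# Python spells NaN as float("nan") (or math.nan); there is no bare NaN literal.
nan = float("nan")

# Hypothetical field names: "price" is averaged and summed per "category".
pipeline = [
    # Drop documents whose price is NaN before grouping.
    {"$match": {"price": {"$ne": nan}}},
    {"$group": {
        "_id": "$category",
        "avg_price": {"$avg": "$price"},
        "total_price": {"$sum": "$price"},
    }},
]

# With a pymongo collection object this would be run as:
# results = list(collection.aggregate(pipeline))
```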
Aggregate data with two conditions
I have a data frame that looks something like this: What I would like to do is aggregate the data if the dates are the same – but only if the name is different. So the above data frame should actually become: Currently I am almost doing it with: However, this will also aggregate the ones where the name …
How to sort aggregated numpy array?
My first post on stackoverflow + am very new to programming. Apologies in advance for any poor formatting and missing information. :) I aggregated two columns in a csv file (one column of seller names, the other of transactional amounts) to find how much each seller has made in total: I want to sort it in des…
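A small sketch of totalling per seller and sorting in descending order with NumPy is given below. The seller names and amounts are invented, since the CSV itself is not shown.

```python
import numpy as np

# Hypothetical columns read from the CSV.
sellers = np.array(["ann", "bob", "ann", "cy"])
amounts = np.array([10.0, 5.0, 7.0, 20.0])

# Sum the amounts per unique seller.
names, inverse = np.unique(sellers, return_inverse=True)
totals = np.bincount(inverse, weights=amounts)

# argsort is ascending, so reverse it for descending order.
order = np.argsort(totals)[::-1]
sorted_names = names[order]
sorted_totals = totals[order]
```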
Pandas: groupby followed by aggregate – unexpected behaviour when joining strings
Having a pandas data frame containing two columns of type str: which is created as follows: df = pd.DataFrame({"group":[1,2,2,1],"sc":["A","B","C","D"],"wc":["word1", "word2", "word3","…
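For reference, one way to join strings per group is to pass a separate join function for each column. This is only a sketch: the last value of "wc" is cut off in the excerpt, so "word4" is a placeholder, and the separators are assumptions.

```python
import pandas as pd

# "word4" is a placeholder: the fourth value is truncated in the excerpt.
df = pd.DataFrame({"group": [1, 2, 2, 1],
                   "sc": ["A", "B", "C", "D"],
                   "wc": ["word1", "word2", "word3", "word4"]})

# Each column gets its own join: no separator for "sc", a space for "wc".
joined = df.groupby("group").agg({"sc": "".join, "wc": " ".join})
```

Mapping columns to functions explicitly makes it clear that each function receives the values of one column within one group.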
PySpark Dataframe melt columns into rows
As the subject describes, I have a PySpark DataFrame in which I need to melt three columns into rows. Each column essentially represents a single fact in a category. The ultimate goal is to aggregate the data into a single total per category. There are tens of millions of rows in this dataframe, so I need a way t…