Skip to content
Advertisement

How create a column in dataframe1 based on a column created in dataframe2, which is derived by using groupby() on dataframe1

Firstly I apologise for how poorly worded the title is.

I have 2 pandas dataframes. X_train and crunched_X_train.
crunched_X_train is derived from X_train by using groupby() on a column named id.
I then did some stuff on crunched_X_train such that I created a new column in this dataframe named label. For each id in crunched_X_train there was a label in that row. label is a number between 0 and 3. I want to add this label column to the original X_train such that for every id in X_train there will be a label on that row which is got from crunched_X_train.

I’ll try to explain better with the tables.
I have omitted columns from the tables which don’t matter here.
crunched_X_train:

id  label
                        
1   2
2   0
3   1
4   3
5   2

I want X_train to look like, with new column label:

id  label
1   2
1   2    
1   2
.
.
2   0
2   0
3   1
3   1


I hoped I could do it like this but it throws
ValueError: Can only compare identically-labeled Series objects

X_train["label"] = crunched_X_train["label"].loc[crunched_X_train["id"] == X_train["id"]]

I then tried this iterative approach, which appears to be working. But is bad practise and takes forever.

for index, row in X_train.iterrows():
    row["label"] = crunched_X_train["label"].loc[crunched_X_train["id"] == row["id"]]


Any help is appreciated. Thanks.

Advertisement

Answer

After the groupby, the id column is the index of crunched_X_train. Then you can just that index to join to the original dataframe (where id is a column):

X_train.join(crunched_X_train["label"], on="id")

assuming that crunched_X_train has id as the index and that X_train has the a column named id.

See the pandas docs here: pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement