I’m trying to create duplicate rows during a dataframe on conditions.
For example, I have this Dataframe.
students = [
("a", "Ursula"),
("b", "Hayfa, Martin"),
("c", "Kato"),
("d", "Tanek, Ava, Pyto"),
("e", "Aiko"),
("f", "Hunter"),
("g", "Josiah, Derek, Uma, Nell"),
]
df = pd.DataFrame(students, columns=["team", "student"])
print(df)
team student a Ursula b Hayfa, Martin c Kato d Tanek, Ava, Pyto e Aiko f Hunter g Josiah, Derek, Uma, Nell
And I would like to get the following output:
team student name remark a Ursula Ursula b Hayfa, Martin Hayfa with Martin b Hayfa, Martin Martin with Hayfa c Kato Kato d Tanek, Ava, Pyto Tanek with Ava, Pyto d Tanek, Ava, Pyto Ava with Tanek, Pyto d Tanek, Ava, Pyto Pyto with Tanek, Ava e Aiko Aiko f Hunter Hunter g Josiah, Derek, Uma, Nell Josiah with Derek, Uma, Nell g Josiah, Derek, Uma, Nell Derek with Josiah, Uma, Nell g Josiah, Derek, Uma, Nell Uma with Josiah, Derek, Nell g Josiah, Derek, Uma, Nell Nell with Josiah, Derek, Uma
Advertisement
Answer
For pandas 0.25+ is possible use DataFrame.explode with splitted values by Series.str.split and for remark column list comprehension with filtering:
students = df["student"].str.split(", ")
df = df.assign(name=students, remark=students).explode("name").reset_index(drop=True)
df["remark"] = [
"with " + ", ".join(x for x in r if x != n) if len(r) > 1 else ""
for n, r in zip(df["name"], df["remark"])
]
print (df)
And we get the following result:
team student name remark 0 a Ursula Ursula 1 b Hayfa, Martin Hayfa with Martin 2 b Hayfa, Martin Martin with Hayfa 3 c Kato Kato 4 d Tanek, Ava, Pyto Tanek with Ava, Pyto 5 d Tanek, Ava, Pyto Ava with Tanek, Pyto 6 d Tanek, Ava, Pyto Pyto with Tanek, Ava 7 e Aiko Aiko 8 f Hunter Hunter 9 g Josiah, Derek, Uma, Nell Josiah with Derek, Uma, Nell 10 g Josiah, Derek, Uma, Nell Derek with Josiah, Uma, Nell 11 g Josiah, Derek, Uma, Nell Uma with Josiah, Derek, Nell 12 g Josiah, Derek, Uma, Nell Nell with Josiah, Derek, Uma