I was wondering how could I find all values that start with ‘orange’ from all the columns and parse it into new columns.
data = pd.DataFrame({'a':["mango 2","mango 3",'apple 3', 'orange 1345','orange 2456','banana 1', "watermelon 2","mango 2","mango 3"],
'b':["mango 2","mango 3",'apple 3','banana 1', "watermelon 2", 'orange 41134','orange 22145',"mango 2","mango 3"],
'c':['apple 3',"mango 2","mango 3" ,"mango 2","mango 3",'banana 1', "watermelon 2",'orange 2222','orange 2341'],
'd':["mango 2","mango 3","mango 2","mango 3",'apple 3','banana 1', "watermelon 2","mango 2","mango 3",'orange 9087','orange 0021'],
'e':['apple 3', 'orange 1','orange 2','banana 1', "watermelon 2"]})
expected output :
df1 = pd.DataFrame({'category':['orange 1345','orange 2456','orange 41134','orange 22145','orange 2222','orange 2341','orange 9087','orange 0021','orange 1','orange 2'}]
Advertisement
Answer
Let’s try stack then filter by str.contains:
df1 = data.stack()
df1 = (
df1[df1.str.contains('^orange', regex=True)]
.reset_index(drop=True)
.to_frame('category')
)
df1:
category 0 orange 1345 1 orange 41134 2 orange 2222 3 orange 9087 4 orange 1 5 orange 2456 6 orange 22145 7 orange 2341 8 orange 0021 9 orange 2
Or melt for same order as OP:
df1 = data.melt()['value']
df1 = (
df1[df1.str.contains('^orange', regex=True)]
.reset_index(drop=True)
.to_frame('category')
)
df1:
category 0 orange 1345 1 orange 2456 2 orange 41134 3 orange 22145 4 orange 2222 5 orange 2341 6 orange 9087 7 orange 0021 8 orange 1 9 orange 2
regex ^orange:
^asserts position at start of a lineorangematches the charactersorangeliterally (case sensitive)