I was wondering how could I find all values that start with ‘orange’ from all the columns and parse it into new columns.
data = pd.DataFrame({'a':["mango 2","mango 3",'apple 3', 'orange 1345','orange 2456','banana 1', "watermelon 2","mango 2","mango 3"], 'b':["mango 2","mango 3",'apple 3','banana 1', "watermelon 2", 'orange 41134','orange 22145',"mango 2","mango 3"], 'c':['apple 3',"mango 2","mango 3" ,"mango 2","mango 3",'banana 1', "watermelon 2",'orange 2222','orange 2341'], 'd':["mango 2","mango 3","mango 2","mango 3",'apple 3','banana 1', "watermelon 2","mango 2","mango 3",'orange 9087','orange 0021'], 'e':['apple 3', 'orange 1','orange 2','banana 1', "watermelon 2"]})
expected output :
df1 = pd.DataFrame({'category':['orange 1345','orange 2456','orange 41134','orange 22145','orange 2222','orange 2341','orange 9087','orange 0021','orange 1','orange 2'}]
Advertisement
Answer
Let’s try stack
then filter by str.contains
:
df1 = data.stack() df1 = ( df1[df1.str.contains('^orange', regex=True)] .reset_index(drop=True) .to_frame('category') )
df1
:
category 0 orange 1345 1 orange 41134 2 orange 2222 3 orange 9087 4 orange 1 5 orange 2456 6 orange 22145 7 orange 2341 8 orange 0021 9 orange 2
Or melt
for same order as OP:
df1 = data.melt()['value'] df1 = ( df1[df1.str.contains('^orange', regex=True)] .reset_index(drop=True) .to_frame('category') )
df1
:
category 0 orange 1345 1 orange 2456 2 orange 41134 3 orange 22145 4 orange 2222 5 orange 2341 6 orange 9087 7 orange 0021 8 orange 1 9 orange 2
regex ^orange
:
^
asserts position at start of a lineorange
matches the charactersorange
literally (case sensitive)