Skip to content
Advertisement

Can I get a sub-DataFrame according to first letter in columns names?

I want to get only columns whose names start with 'Q1' and those starting with 'Q3', I know that this is possible by doing:

new_df=df[['Q1_1', 'Q1_2', 'Q1_3','Q3_1', 'Q3_2', 'Q3_3']]

But since my real df is too large (more than 70 variables) I search a way to get the new_df by using only desired first letters in the columns titles.

My example dataframe is:

df=pd.DataFrame({
    'Q1_1': [np.random.randint(1,100) for i in range(10)],
    'Q1_2': np.random.random(10),
    'Q1_3': np.random.randint(2, size=10),
    'Q2_1': [np.random.randint(1,100) for i in range(10)],
    'Q2_2': np.random.random(10),
    'Q2_3': np.random.randint(2, size=10),
    'Q3_1': [np.random.randint(1,100) for i in range(10)],
    'Q3_2': np.random.random(10),
    'Q3_3': np.random.randint(2, size=10),
    'Q4_1': [np.random.randint(1,100) for i in range(10)],
    'Q4_2': np.random.random(10),
    'Q4_3': np.random.randint(2, size=10)
})

df has the following display:

    Q1_1    Q1_2    Q1_3    Q2_1    Q2_2    Q2_3    Q3_1    Q3_2    Q3_3    Q4_1    Q4_2    Q4_3
0   92  0.551722    1   36  0.063269    1   95  0.541573    1   91  0.521076    1
1   89  0.951076    1   82  0.853572    1   49  0.782290    1   98  0.232572    0
2   88  0.909953    1   19  0.544450    1   66  0.021061    1   51  0.951225    0
3   66  0.904642    1   17  0.727190    1   85  0.697792    0   35  0.412844    1
4   78  0.802783    1   23  0.634575    1   77  0.759861    0   55  0.460012    0
5   41  0.943271    1   63  0.460578    1   95  0.004986    1   89  0.970059    0
6   54  0.600558    0   18  0.031487    0   84  0.716314    0   84  0.636364    1
7   2   0.458006    0   95  0.029421    0   10  0.927356    1   27  0.031572    1
8   38  0.029658    1   30  0.125706    1   94  0.096702    1   32  0.241613    1
9   52  0.584300    1   85  0.026642    0   78  0.358952    0   70  0.696008    0

I want a simpler way to get the following sub-df:

    Q1_1    Q1_2    Q1_3    Q3_1    Q3_2    Q3_3
0   92  0.551722    1   95  0.541573    1
1   89  0.951076    1   49  0.782290    1
2   88  0.909953    1   66  0.021061    1
3   66  0.904642    1   85  0.697792    0
4   78  0.802783    1   77  0.759861    0
5   41  0.943271    1   95  0.004986    1
6   54  0.600558    0   84  0.716314    0
7   2   0.458006    0   10  0.927356    1
8   38  0.029658    1   94  0.096702    1
9   52  0.584300    1   78  0.358952    0

Please if you need more detail let me know in comments,

Any help from your side will be highly appreciated.

Advertisement

Answer

You can use pd.DataFrame.filter for this:

df.filter(regex = r'Q1_d|Q3_d')

   Q1_1      Q1_2  Q1_3  Q3_1      Q3_2  Q3_3
0     5  0.631041     0    46  0.768563     0
1    32  0.594106     1    46  0.982396     1
2    78  0.703139     1    38  0.252107     0
3    98  0.353230     0    35  0.324079     0
4    77  0.913203     1    11  0.456287     0
5    62  0.565350     1    77  0.387365     0
6    38  0.975652     1    59  0.276421     1
7    97  0.505808     1    84  0.035756     0
8    15  0.525452     0    57  0.675310     1
9    94  0.545259     0    25  0.628030     0
Advertisement