Skip to content
Advertisement

Make a new column for each category in a particular column and repeat this for all columns in a Pandas dataframe

I have a dataset like below-:

df = pd.DataFrame({
'state':['California'] * 4 + ['Florida'] * 5 + ['Minnesota'] * 3 + ['New Hampshire'],
'population':['50-100', '0-50', '150-200', '50-100', '0-50', '150-200',
         '100-150', 'NA', '0-50', 'NA', '100-150', '50-100', 'NA'],
'locale':['rural', 'urban', 'town', 'suburb', 'suburb', 'urban', 'rural', 'suburb', 'NA', 'town', 'town', 'urban', 'rural'] 
})

I want new columns for each category in all columns for each state. An example of a row is below-:

state        population=0-50 population=50-100 population=100-150 population=150-200 locale=rural locale=urban locale=town locale=suburb
California   1               2                 0                  1                  1          1              1            1

EDIT Data dump of 1st 5 rows as asked-:

{'state': {0: 'Connecticut',
  1: 'Connecticut',
  2: 'Connecticut',
  3: 'Connecticut',
  4: 'Connecticut'},
 'locale': {0: 'Suburb', 1: 'Suburb', 2: 'Suburb', 3: 'Suburb', 4: 
'Suburb'},
 'pct_black/hispanic': {0: '[0.6, 0.8[',
  1: '[0.6, 0.8[',
  2: '[0.6, 0.8[',
  3: '[0.6, 0.8[',
  4: '[0.6, 0.8['},
 'pct_free/reduced': {0: '[0.2, 0.4[',
  1: '[0.2, 0.4[',
  2: '[0.2, 0.4[',
  3: '[0.2, 0.4[',
  4: '[0.2, 0.4['},
 'county_connections_ratio': {0: '[0.18, 1[',
  1: '[0.18, 1[',
  2: '[0.18, 1[',
  3: '[0.18, 1[',
  4: '[0.18, 1['},
 'pp_total_raw': {0: 'NA', 1: 'NA', 2: 'NA', 3: 'NA', 4: 'NA'}}

Advertisement

Answer

Use pd.get_dummies + Groupby.sum(), as follows:

(pd.get_dummies(df.set_index('state'))
   .groupby('state').sum()
   .reset_index()
)

Result:

           state  population_0-50  population_100-150  population_150-200  population_50-100  population_NA  locale_NA  locale_rural  locale_suburb  locale_town  locale_urban
0     California                1                   0                   1                  2              0          0             1              1            1             1
1        Florida                2                   1                   1                  0              1          1             1              2            0             1
2      Minnesota                0                   1                   0                  1              1          0             0              0            2             1
3  New Hampshire                0                   0                   0                  0              1          0             1              0            0             0

If you want to exclude the entries with value NA, you can use:

(pd.get_dummies(df[df != 'NA'].set_index('state'))
   .groupby('state').sum()
   .reset_index()
)

Result:

           state  population_0-50  population_100-150  population_150-200  population_50-100  locale_rural  locale_suburb  locale_town  locale_urban
0     California                1                   0                   1                  2             1              1            1             1
1        Florida                2                   1                   1                  0             1              2            0             1
2      Minnesota                0                   1                   0                  1             0              0            2             1
3  New Hampshire                0                   0                   0                  0             1              0            0             0
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement