I have a dataset like below-:
df = pd.DataFrame({
'state':['California'] * 4 + ['Florida'] * 5 + ['Minnesota'] * 3 + ['New Hampshire'],
'population':['50-100', '0-50', '150-200', '50-100', '0-50', '150-200',
'100-150', 'NA', '0-50', 'NA', '100-150', '50-100', 'NA'],
'locale':['rural', 'urban', 'town', 'suburb', 'suburb', 'urban', 'rural', 'suburb', 'NA', 'town', 'town', 'urban', 'rural']
})
I want new columns for each category in all columns for each state. An example of a row is below-:
state population=0-50 population=50-100 population=100-150 population=150-200 locale=rural locale=urban locale=town locale=suburb California 1 2 0 1 1 1 1 1
EDIT Data dump of 1st 5 rows as asked-:
{'state': {0: 'Connecticut',
1: 'Connecticut',
2: 'Connecticut',
3: 'Connecticut',
4: 'Connecticut'},
'locale': {0: 'Suburb', 1: 'Suburb', 2: 'Suburb', 3: 'Suburb', 4:
'Suburb'},
'pct_black/hispanic': {0: '[0.6, 0.8[',
1: '[0.6, 0.8[',
2: '[0.6, 0.8[',
3: '[0.6, 0.8[',
4: '[0.6, 0.8['},
'pct_free/reduced': {0: '[0.2, 0.4[',
1: '[0.2, 0.4[',
2: '[0.2, 0.4[',
3: '[0.2, 0.4[',
4: '[0.2, 0.4['},
'county_connections_ratio': {0: '[0.18, 1[',
1: '[0.18, 1[',
2: '[0.18, 1[',
3: '[0.18, 1[',
4: '[0.18, 1['},
'pp_total_raw': {0: 'NA', 1: 'NA', 2: 'NA', 3: 'NA', 4: 'NA'}}
Advertisement
Answer
Use pd.get_dummies + Groupby.sum(), as follows:
(pd.get_dummies(df.set_index('state'))
.groupby('state').sum()
.reset_index()
)
Result:
state population_0-50 population_100-150 population_150-200 population_50-100 population_NA locale_NA locale_rural locale_suburb locale_town locale_urban 0 California 1 0 1 2 0 0 1 1 1 1 1 Florida 2 1 1 0 1 1 1 2 0 1 2 Minnesota 0 1 0 1 1 0 0 0 2 1 3 New Hampshire 0 0 0 0 1 0 1 0 0 0
If you want to exclude the entries with value NA, you can use:
(pd.get_dummies(df[df != 'NA'].set_index('state'))
.groupby('state').sum()
.reset_index()
)
Result:
state population_0-50 population_100-150 population_150-200 population_50-100 locale_rural locale_suburb locale_town locale_urban 0 California 1 0 1 2 1 1 1 1 1 Florida 2 1 1 0 1 2 0 1 2 Minnesota 0 1 0 1 0 0 2 1 3 New Hampshire 0 0 0 0 1 0 0 0