I have a dataset like below-:
df = pd.DataFrame({ 'state':['California'] * 4 + ['Florida'] * 5 + ['Minnesota'] * 3 + ['New Hampshire'], 'population':['50-100', '0-50', '150-200', '50-100', '0-50', '150-200', '100-150', 'NA', '0-50', 'NA', '100-150', '50-100', 'NA'], 'locale':['rural', 'urban', 'town', 'suburb', 'suburb', 'urban', 'rural', 'suburb', 'NA', 'town', 'town', 'urban', 'rural'] })
I want new columns for each category in all columns for each state. An example of a row is below-:
state population=0-50 population=50-100 population=100-150 population=150-200 locale=rural locale=urban locale=town locale=suburb California 1 2 0 1 1 1 1 1
EDIT Data dump of 1st 5 rows as asked-:
{'state': {0: 'Connecticut', 1: 'Connecticut', 2: 'Connecticut', 3: 'Connecticut', 4: 'Connecticut'}, 'locale': {0: 'Suburb', 1: 'Suburb', 2: 'Suburb', 3: 'Suburb', 4: 'Suburb'}, 'pct_black/hispanic': {0: '[0.6, 0.8[', 1: '[0.6, 0.8[', 2: '[0.6, 0.8[', 3: '[0.6, 0.8[', 4: '[0.6, 0.8['}, 'pct_free/reduced': {0: '[0.2, 0.4[', 1: '[0.2, 0.4[', 2: '[0.2, 0.4[', 3: '[0.2, 0.4[', 4: '[0.2, 0.4['}, 'county_connections_ratio': {0: '[0.18, 1[', 1: '[0.18, 1[', 2: '[0.18, 1[', 3: '[0.18, 1[', 4: '[0.18, 1['}, 'pp_total_raw': {0: 'NA', 1: 'NA', 2: 'NA', 3: 'NA', 4: 'NA'}}
Advertisement
Answer
Use pd.get_dummies
+ Groupby.sum()
, as follows:
(pd.get_dummies(df.set_index('state')) .groupby('state').sum() .reset_index() )
Result:
state population_0-50 population_100-150 population_150-200 population_50-100 population_NA locale_NA locale_rural locale_suburb locale_town locale_urban 0 California 1 0 1 2 0 0 1 1 1 1 1 Florida 2 1 1 0 1 1 1 2 0 1 2 Minnesota 0 1 0 1 1 0 0 0 2 1 3 New Hampshire 0 0 0 0 1 0 1 0 0 0
If you want to exclude the entries with value NA
, you can use:
(pd.get_dummies(df[df != 'NA'].set_index('state')) .groupby('state').sum() .reset_index() )
Result:
state population_0-50 population_100-150 population_150-200 population_50-100 locale_rural locale_suburb locale_town locale_urban 0 California 1 0 1 2 1 1 1 1 1 Florida 2 1 1 0 1 2 0 1 2 Minnesota 0 1 0 1 0 0 2 1 3 New Hampshire 0 0 0 0 1 0 0 0