Skip to content
Advertisement

how to transform dataframe into data set/object

I have a data set in a dataframe that’s almost 9 million rows and 30 columns. As the columns count up, the data becomes more specific thus leading the data in the first columns to be very repetitive. See example:

park_code camp_ground parking_lot
acad campground1 parking_lot1
acad campground1 parking_lot2
acad campground2 parking_lot3
bisc campground3 parking_lot4

I’m looking to feed that information in to a result set like an object for example:

park code: acad
campgrounds: campground 1, campground 2
parking lots: parking_lot1, parking_lot2, parking_lot3

park code: bisc
campgrounds: campground3, ….
…….

etc.

I’m completely at a loss how to do this with pandas, and I’m learning as I go as I’m used to working in SQL and databases not with pandas. If you want to see the code that’s gotten me this far, here it is:

function call:

data_handler.fetch_results(['Wildlife Watching', 'Arts and Culture'], ['Restroom'], ['Acadia National 
    Park'], ['ME'])
def fetch_results(self, activities_selection, amenities_selection, parks_selection, states_selection):
    activities_selection_df = self.activities_df['park_code'][self.activities_df['activity_name'].
    isin(activities_selection)].drop_duplicates()
    amenities_selection_df = self.amenities_parks_df['park_code'][self.amenities_parks_df['amenity_name'].
    isin(amenities_selection)].drop_duplicates()
    states_selection_df = self.activities_df['park_code'][self.activities_df['park_states'].
    isin(states_selection)].drop_duplicates()
    parks_selection_df = self.activities_df['park_code'][self.activities_df['park_name'].
    isin(parks_selection)].drop_duplicates()
    data = activities_selection_df[activities_selection_df.isin(amenities_selection_df) &
                                   activities_selection_df.isin(states_selection_df) & activities_selection_df.
                                   isin(parks_selection_df)].drop_duplicates()
    pandas_select_df = pd.DataFrame(data, columns=['park_code'])

    results_df = pd.merge(pandas_select_df, self.activities_df, on='park_code', how='left')
    results_df = pd.merge(results_df, self.amenities_parks_df[['park_code', 'amenity_name', 'amenity_url']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df, self.campgrounds_df[['park_code', 'campground_name', 'campground_url',
                                                           'campground_road', 'campground_classification',
                                                           'campground_general_ADA',
                                                           'campground_wheelchair_access',
                                                           'campground_rv_info', 'campground_description',
                                                           'campground_cell_reception', 'campground_camp_store',
                                                           'campground_internet', 'campground_potable_water',
                                                           'campground_toilets',
                                                           'campground_campsites_electric',
                                                           'campground_staff_volunteer']], on='park_code',
                                                            how='left')
    results_df = pd.merge(results_df, self.places_df[['park_code', 'places_title', 'places_url']],
                          on='park_code', how='left')
    results_df = pd.merge(results_df, self.parking_lot_df[
        ['park_code', "parking_lots_name", "parking_lots_ADA_facility_description",
         "parking_lots_is_lot_accessible", "parking_lots_number_oversized_spaces",
         "parking_lots_number_ADA_spaces",
         "parking_lots_number_ADA_Step_Free_Spaces", "parking_lots_number_ADA_van_spaces",
         "parking_lots_description"]], on='park_code', how='left')
    # print(self.campgrounds_df.to_string(max_rows=20))
    print(results_df.to_string(max_rows=40))

Any help will be appreciated.

Advertisement

Answer

In general, you can group by park_code and collect other columns into lists, then – transform to a dictionary:

df.groupby('park_code').agg({'camp_ground': list, 'parking_lot': list}).to_dict(orient='index')

Sample result:

{'acad ': {'camp_ground': ['campground1 ', 'campground1 ', 'campground2 '],
  'parking_lot': ['parking_lot1', 'parking_lot2', 'parking_lot3']},
 'bisc ': {'camp_ground': ['campground3 '], 'parking_lot': ['parking_lot4']}}
Advertisement