I have lists of phrases that I would like to convert into columns of a DataFrame, to be used as inputs for a machine learning model. The code should find the unique phrases across all rows of data, create one column per unique phrase, and show a 1 if the phrase is present in a row and a 0 if it is missing.
The phrases will look like the following:
{"TV", "Internet", "Wireless Internet", "Kitchen", "Free Parking on Premises", "Buzzer/Wireless Intercom", "Heating", "Family/Kid Friendly", "Washer,Dryer", "Smoke Detector", "Carbon Monoxide Detector", "First Aid Kit", "Safety Card", "Fire Extinguisher", "Essentials"}
{"TV", "Internet", "Wireless Internet", "Air Conditioning", "Kitchen", "Pets Allowed", "Pets live on this property", "Dog(s)", "Heating", "Family/Kid Friendly", "Washer", "Dryer", "Smoke Detector", "Carbon Monoxide Detector", "Fire Extinguisher", "Essentials", "Shampoo", "Lock on Bedroom Door", "Hangers", "Hair Dryer", "Iron"}
Desired output in the dataframe:
Answer
First create the columns for the DataFrame:
set1 = {"TV", "Internet", "Wireless Internet", "Kitchen", "Free Parking on Premises", "Buzzer/Wireless Intercom", "Heating", "Family/Kid Friendly", "Washer,Dryer", "Smoke Detector", "Carbon Monoxide Detector", "First Aid Kit", "Safety Card", "Fire Extinguisher", "Essentials"}
set2 = {"TV", "Internet", "Wireless Internet", "Air Conditioning", "Kitchen", "Pets Allowed", "Pets live on this property", "Dog(s)", "Heating", "Family/Kid Friendly", "Washer", "Dryer", "Smoke Detector", "Carbon Monoxide Detector", "Fire Extinguisher", "Essentials", "Shampoo", "Lock on Bedroom Door", "Hangers", "Hair Dryer", "Iron"}

# Create a list of iterables for later
list_of_sets = [set1, set2]

# Create a list with the "splat" operator, and then create a set from the list
columns = set([*set1, *set2])

# Optionally remove spaces, commas, etc
columns_optional = set([x.replace(" ", "").replace(",", "").replace("/", "") for x in columns])
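With more than two sets, writing out every `*setN` by hand gets tedious. A minimal sketch, assuming the sets are already collected in a list, takes the union of all of them at once and sorts the result so the column order is repeatable. The short sets here are stand-in data, and `ordered_columns` is a name introduced only for this example:

```python
# Stand-in data; the real sets would hold the full phrase lists
list_of_sets = [
    {"TV", "Internet", "Washer,Dryer"},
    {"TV", "Air Conditioning", "Washer", "Dryer"},
]

# Union of every set in the list, however many there are
columns = set().union(*list_of_sets)

# A set has no defined order, so sort it to get stable column names
ordered_columns = sorted(columns)
print(ordered_columns)
```

Note that "Washer,Dryer" stays a single phrase, distinct from "Washer" and "Dryer", exactly as it appears in the data.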
Now to create the DataFrame rows:
import pandas as pd

def create_rows(list_of_iterables, columns):
    """Iterate through list of iterables (i.e. sets of words) and check if they're in the columns"""
    list_of_df_rows = []
    for iterable in list_of_iterables:
        row_dict = {}
        for col in columns:
            # 1 if the phrase is in this iterable, otherwise 0
            row_dict[col] = 1 if col in iterable else 0
        list_of_df_rows.append(row_dict)
    return list_of_df_rows

# Create DataFrame rows
rows = create_rows(list_of_sets, columns)

# Create the DataFrame: one row per set, one column per unique phrase.
# Pass the columns as a list; their order is arbitrary because a set is unordered.
df = pd.DataFrame(rows, columns=list(columns))
print(df)
>>>
   Air Conditioning  TV  ...  Free Parking on Premises  Washer,Dryer
0                 0   1  ...                         1             1
1                 1   1  ...                         0             0
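As an alternative to building the rows by hand, pandas can produce the same 0/1 indicator columns with `Series.str.get_dummies`. This is a different technique from the loop above, offered only as a sketch. It assumes each set is first joined into one delimited string and that the `"|"` separator never occurs inside a phrase. The short sets are stand-in data:

```python
import pandas as pd

# Stand-in data; the real sets would hold the full phrase lists
list_of_sets = [
    {"TV", "Internet", "Washer,Dryer"},
    {"TV", "Air Conditioning", "Washer", "Dryer"},
]

# Join each set into a single string; "|" is assumed not to appear in any phrase
joined = pd.Series(["|".join(sorted(s)) for s in list_of_sets])

# One indicator column per unique phrase, one row per original set
df = joined.str.get_dummies(sep="|")
print(df)
```

If a phrase could contain `"|"`, a different separator would be needed, or the explicit loop above is the safer choice.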