
Is there a way to create columns from a list of phrases?

I have lists of phrases that I would like to convert into columns in a dataframe, to be used as inputs for a machine learning model. The code should find the unique phrases across all rows of data, create one column per unique phrase, and mark each row with a 1 if the phrase is present and a 0 if it is missing.

The phrases will look like the following:

{"TV", "Internet", "Wireless Internet", "Kitchen", "Free Parking on Premises",
 "Buzzer/Wireless Intercom", "Heating", "Family/Kid Friendly",
 "Washer,Dryer", "Smoke Detector", "Carbon Monoxide Detector",
 "First Aid Kit", "Safety Card", "Fire Extinguisher", "Essentials"
 }

{"TV", "Internet", "Wireless Internet", "Air Conditioning", "Kitchen",
 "Pets Allowed", "Pets live on this property", "Dog(s)", "Heating",
 "Family/Kid Friendly", "Washer", "Dryer", "Smoke Detector",
 "Carbon Monoxide Detector", "Fire Extinguisher", "Essentials",
 "Shampoo", "Lock on Bedroom Door", "Hangers", "Hair Dryer", "Iron"
 }

Desired output in the dataframe:

[Image not available: the desired dataframe, with one column per phrase and 1/0 values in each row]


Answer

First create the columns for the DataFrame:

import pandas as pd

set1 = {"TV", "Internet", "Wireless Internet", "Kitchen", "Free Parking on Premises",
 "Buzzer/Wireless Intercom", "Heating", "Family/Kid Friendly",
 "Washer,Dryer", "Smoke Detector", "Carbon Monoxide Detector",
 "First Aid Kit", "Safety Card", "Fire Extinguisher", "Essentials"
 }

set2 = {"TV", "Internet", "Wireless Internet", "Air Conditioning", "Kitchen",
 "Pets Allowed", "Pets live on this property", "Dog(s)", "Heating",
 "Family/Kid Friendly", "Washer", "Dryer", "Smoke Detector",
 "Carbon Monoxide Detector", "Fire Extinguisher", "Essentials",
 "Shampoo", "Lock on Bedroom Door", "Hangers", "Hair Dryer", "Iron"
 }

# Create a list of iterables for later
list_of_sets = [set1, set2]

# Take the union of all the sets to get the unique phrases
columns = set().union(*list_of_sets)

# Optionally remove spaces, commas, etc
columns_optional = set([x.replace(" ", "").replace(",", "").replace("/", "") for x in columns])
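One caveat with the cleaned names: the phrases inside each row still use the original spelling, so comparing them against cleaned column names would never match. A minimal sketch of one way around this, assuming a hypothetical helper `clean_name` and a small illustrative set of phrases (neither is part of the original answer), is to keep a mapping from each original phrase to its cleaned name:

```python
def clean_name(phrase):
    """Hypothetical helper: drop spaces, commas, and slashes from a phrase."""
    return phrase.replace(" ", "").replace(",", "").replace("/", "")

# Illustrative phrases only; in practice this would be the full set of columns
phrases = {"TV", "Wireless Internet", "Washer,Dryer", "Family/Kid Friendly"}

# Map each original phrase to its cleaned column name
name_map = {phrase: clean_name(phrase) for phrase in sorted(phrases)}

# Guard against two different phrases collapsing into the same cleaned name
assert len(set(name_map.values())) == len(name_map), "cleaned names collide"

print(name_map["Washer,Dryer"])  # WasherDryer
```

With a mapping like this, the DataFrame can be built from the original phrases and renamed afterwards with `df.rename(columns=name_map)`.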

Now to create the DataFrame rows:

def create_rows(list_of_iterables, columns):
    """Build one dict per iterable (e.g. a set of phrases), mapping
    each column to 1 if the phrase is present and 0 otherwise."""
    list_of_df_rows = []
    for iterable in list_of_iterables:
        # A direct membership test replaces the inner loop over items
        row_dict = {col: int(col in iterable) for col in columns}
        list_of_df_rows.append(row_dict)

    return list_of_df_rows

# Create DataFrame rows
rows = create_rows(list_of_sets, columns)

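# Create the DataFrame: one row per set, one column per phrase.
# Sort the columns because a set has no defined order, and recent
# pandas versions reject a set passed as `columns`.
df = pd.DataFrame(rows, columns=sorted(columns))

print(df)
# Output, abbreviated here to the first and last two of the 26 columns:
#    Air Conditioning  Buzzer/Wireless Intercom  ...  Washer,Dryer  Wireless Internet
# 0                 0                         1  ...             1                  1
# 1                 1                         0  ...             0                  1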
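The same 0/1 table can probably be built more compactly by letting pandas align the columns. The following is a minimal sketch rather than the answer's own method; it uses a small illustrative `list_of_sets` and assumes every row is a set of phrase strings:

```python
import pandas as pd

# Small illustrative input; any list of sets of phrases should work
list_of_sets = [
    {"TV", "Kitchen", "Washer,Dryer"},
    {"TV", "Air Conditioning", "Washer", "Dryer"},
]

# One dict per row: every phrase present in that row maps to 1
records = [dict.fromkeys(phrases, 1) for phrases in list_of_sets]

# Phrases absent from a row come out as NaN, so fill with 0 and cast to int
one_hot = pd.DataFrame(records).fillna(0).astype(int)

# Sort the columns so the layout does not depend on set iteration order
one_hot = one_hot[sorted(one_hot.columns)]

print(one_hot)
```

Here pandas takes the union of the dict keys as the columns, so no separate step for collecting the unique phrases is needed.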
User contributions licensed under: CC BY-SA