I am trying to loop through a column in a pandas data frame to remove unnecessary white space in the beginning and end of the strings within the column. My data frame looks like this:
df={'c1': [' ab', 'fg', 'ac ', 'hj-jk ', ' ac', 'df, gh', 'gh', 'ab', 'ad', 'jk-pl', 'ae', 'kl-kl '], 'b2': ['ba', 'bc', 'bd', 'be', 'be', 'be', 'ba'] }

   c1 b2
0  ab, fg
1  ac, hj-jk
2  ac, df,gh
3  gh, be
4  ab, be
5  ad, jk-pl
6  ae, kl-kl
I tried this answer here, but it did not work either. The reason I need to remove the white space from the strings in this column is that I want to one-hot encode the column using the get_dummies() function. My idea was to use the strip() function to remove the white space from each value and then use .str.get_dummies(','):
#function to remove white space from strings
def strip_string(dataframe, column_name):
    for id, item in dataframe[column_name].items():
        a=item.strip()

#removing the white space from the values of the column
strip_string(df, 'c1')

#creating one hot-encoded columns from the values using split(",")
df1=df['c1'].str.get_dummies(',')
But my code returns duplicate columns, which I don't want. I suppose the function to remove the white space is not working? Can anyone help? My current output is:
   ab  ac  df  fg  gh  hj-jk  jk-pl  kl-kl  ab  ac  ad  ae  gh
0   1   0   0   1   0      0      0      0   0   0   0   0   0
1   0   0   0   0   0      1      0      0   0   1   0   0   0
2   0   1   1   0   1      0      0      0   0   0   0   0   0
3   0   0   0   0   0      0      0      0   0   0   0   0   1
4   0   0   0   0   0      0      0      0   1   0   0   0   0
5   0   0   0   0   0      0      1      0   0   0   1   0   0
6   0   0   0   0   0      0      0      1   0   0   0   1   0
Columns 'ab' and 'ac' are duplicated. I want to remove the duplicated columns.
Answer
I would stack, strip, get_dummies, and groupby.max:
If the separator is ', ':
df.stack().str.strip().str.get_dummies(sep=', ').groupby(level=0).max()
else:
df.stack().str.replace(r'\s', '', regex=True).str.get_dummies(sep=',').groupby(level=0).max()
output:
   ab  ac  ba  bc  bd  be  df  fg  gh  hj-jk
0   1   0   1   0   0   0   0   0   0      0
1   0   0   0   1   0   0   0   1   0      0
2   0   1   0   0   1   0   0   0   0      0
3   0   0   0   0   0   1   0   0   0      1
4   0   1   0   0   0   1   0   0   0      0
5   0   0   0   0   0   1   1   0   1      0
6   0   0   1   0   0   0   0   0   1      0
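
To make the chain easier to follow, here is a minimal sketch that runs the same four steps one at a time. The sample frame is an assumption: it pairs the first seven 'c1' values from the question with 'b2', because the two lists in the question have different lengths. The intermediate variable names are only illustrative.

```python
import pandas as pd

# Assumed sample data: first seven 'c1' values from the question, paired with 'b2'.
df = pd.DataFrame({
    'c1': [' ab', 'fg', 'ac ', 'hj-jk ', ' ac', 'df, gh', 'gh'],
    'b2': ['ba', 'bc', 'bd', 'be', 'be', 'be', 'ba'],
})

# 1. stack: reshape to one long Series indexed by (row label, column name)
stacked = df.stack()

# 2. strip: remove leading/trailing white space from every string
cleaned = stacked.str.strip()

# 3. get_dummies: split each string on ', ' and build 0/1 indicator columns
dummies = cleaned.str.get_dummies(sep=', ')

# 4. groupby.max: collapse back to one row per original row label
result = dummies.groupby(level=0).max()
print(result)

# Single-column variant: .str.strip() returns a new Series,
# so the result has to be assigned back to take effect.
df['c1'] = df['c1'].str.strip()
c1_only = df['c1'].str.get_dummies(sep=', ')
print(c1_only)
```

The loop in the question has no effect because item.strip() returns a new string that is stored in the local variable a and never written back to the data frame. Assigning the stripped Series back, as in the last lines of the sketch, is probably the smallest change if only 'c1' needs to be encoded.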