I want to anonymize or replace almost all columns in a pyspark dataframe except a few ones. I know its possible to do something like: However, doing this for all columns is a tedious process. I would rather want to do something along the lines of this: This does however not seem to work. Is there other work a…

Using .withColumn on all remaining columns in DF

I want to anonymize or replace almost all columns in a pyspark dataframe except a few ones.

I know its possible to do something like:

anonymized_df = employee_df.withColumn("EMPLOYEENUMBER", col("EMPLOYEENUMBER"))
                                        .withColumn("NAME1", lit(""))
                                        .withColumn("TELEPHONE", lit(""))
                                        .withColumn("ELECTRONICMAILADDRESS", lit(""))

However, doing this for all columns is a tedious process. I would rather want to do something along the lines of this:

anonymized_df = employee_df.withColumn("EMPLOYEENUMBER", col("EMPLOYEENUMBER"))
                           .withcolumn("*", lit("")) # all other columns replace

This does however not seem to work. Is there other work arounds that achieve this?

I guess one solution would be to could create a list of column names and do something along the lines of:

col_list = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']

for col in col_list:
   employee_df= employee_df.withColumn(col, lit("")))

Other suggestions would be of much help.

Answer

You can use select. syntax-wise it won’t be much different but it will only create 1 snapshot.

keep_cols = ['a', 'b', 'c']
empty_cols = ['d', 'e', 'f'] # or list(set(df.columns) - set(keep_cols))

df = df.select(*keep_cols, *[lit('').alias(x) for x in empty_cols])

Advertisement

Answer