I want to anonymize or replace almost all columns in a PySpark dataframe, except for a few. I know it's possible to do something like:
from pyspark.sql.functions import col, lit

anonymized_df = (employee_df
    .withColumn("EMPLOYEENUMBER", col("EMPLOYEENUMBER"))
    .withColumn("NAME1", lit(""))
    .withColumn("TELEPHONE", lit(""))
    .withColumn("ELECTRONICMAILADDRESS", lit("")))
However, doing this for all columns is a tedious process. I would rather do something along these lines:
anonymized_df = (employee_df
    .withColumn("EMPLOYEENUMBER", col("EMPLOYEENUMBER"))
    .withColumn("*", lit("")))  # replace all other columns
This does not seem to work, however. Are there other workarounds that achieve this?
I guess one solution would be to create a list of column names and do something along these lines:
col_list = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']

for c in col_list:
    employee_df = employee_df.withColumn(c, lit(""))
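For instance (just a sketch, with an illustrative keep_cols name), the list could be built from the dataframe's own columns instead of being typed out by hand:

from pyspark.sql.functions import lit

keep_cols = ['EMPLOYEENUMBER']  # columns to leave untouched
blank_cols = [c for c in employee_df.columns if c not in keep_cols]

for c in blank_cols:
    employee_df = employee_df.withColumn(c, lit(""))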
Other suggestions would be much appreciated.
Answer
You can use select. Syntax-wise it won't be much different, but it will only create one snapshot.
keep_cols = ['a', 'b', 'c']
empty_cols = ['d', 'e', 'f']  # or list(set(df.columns) - set(keep_cols))

df = df.select(*keep_cols, *[lit('').alias(x) for x in empty_cols])
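Applied to the dataframe from the question, assuming the column names used there, that might look like:

from pyspark.sql.functions import lit

keep_cols = ['EMPLOYEENUMBER']
empty_cols = [c for c in employee_df.columns if c not in keep_cols]

anonymized_df = employee_df.select(*keep_cols,
                                   *[lit('').alias(c) for c in empty_cols])

Since everything happens in a single select, Spark builds one projection over the original dataframe rather than one per chained withColumn call. Note that the kept columns come first in the result, so the column order may change.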