Adjusting incorrect data from a CSV file in a PySpark dataframe

Question

I am trying to read a CSV file into a dataframe in PySpark, but the file has mixed data: part of the data in one column belongs in the adjacent column. Is there any way to modify the dataframe in Python to get the expected output dataframe?

[Sample CSV and expected output not preserved]

Accepted Answer

You can do this with regexp_extract from pyspark.sql.functions. My approach would be something like this:

    from pyspark.sql.functions import regexp_extract

    # Read with a different separator so that each row lands in a single column
    df = spark.read.csv('filename', header=True, sep='|')

    # Rename that single column to 'irr' (to make it easy to refer to)
    newcolnames = ['irr']
    for c, n in zip(df.columns, newcolnames):
        df = df.withColumnRenamed(c, n)

    # Pull the digits into ID and the name into Name, then drop the raw column
    (df.withColumn('ID', regexp_extract(df['irr'], r'(\d+)', 1))
       .withColumn('Name', regexp_extract(df['irr'], 'your_regex_pattern', 0))
       .drop('irr')
       .show())
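To see what the two regexp_extract calls do to each row, here is a rough plain-Python sketch using the standard re module. The sample rows and the name pattern are made up for illustration (the original sample CSV is not available), and extract is a hypothetical helper, not part of PySpark. Spark uses Java regular expressions, but for simple patterns like these the behaviour should be the same.

```python
import re


def extract(text, pattern, group):
    """Mimic regexp_extract on one value: return the requested group, or '' if nothing matches."""
    match = re.search(pattern, text)
    return (match.group(group) or "") if match else ""


# Hypothetical raw lines, as they might look once every row is read into a single column.
rows = ["101 Alice,", "102,Bob", "103Carol,"]

ID_PATTERN = r"(\d+)"        # first run of digits, captured as group 1
NAME_PATTERN = r"[A-Za-z]+"  # assumed stand-in for 'your_regex_pattern'

# Build (ID, Name) pairs regardless of which side of the comma the data ended up on.
cleaned = [(extract(r, ID_PATTERN, 1), extract(r, NAME_PATTERN, 0)) for r in rows]

for id_value, name in cleaned:
    print(id_value, name)
```

Once a pattern behaves as expected on a few representative lines, the same pattern string can be tried in regexp_extract on the dataframe column.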

