I am trying to read a CSV file into a DataFrame in PySpark, but the CSV has mixed data: part of each value has spilled into the adjacent column. Is there any way to modify the DataFrame in Python to get the expected output below?
Sample CSV
ID , Name
'1' , 'Alice'
'2B' , 'ob'
'3Ri' , 'chard'
Expected Output
ID, Name
1, 'Alice'
2, 'Bob'
3, 'Richard'
Answer
You can do this by making use of regexp_extract from pyspark.sql.functions.
My approach would be something like this:
from pyspark.sql.functions import regexp_extract

# Read with a separator that does not occur in the file, so the whole
# line ends up in a single column
df = spark.read.csv('filename', header=True, sep='|')

# Rename that single column to 'irr' (to make it easier to refer to)
newcolnames = ['irr']
for c, n in zip(df.columns, newcolnames):
    df = df.withColumnRenamed(c, n)

df.withColumn('ID', regexp_extract(df['irr'], r'(\d+)', 1)) \
  .withColumn('Name', regexp_extract(df['irr'], 'your_regex_pattern', 0)) \
  .drop(df['irr']) \
  .show()
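The answer leaves 'your_regex_pattern' as a placeholder. If the goal is exactly the expected output above, one possible way to rebuild the Name column is to strip every non-letter character from the raw line with regexp_replace, so the characters that spilled into the ID column are glued back onto the name. The sketch below rests on that assumption; the inline sample data and the quote-wrapping are illustrative stand-ins, not taken from the original file.

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit, regexp_extract, regexp_replace

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the single-column DataFrame built above:
# each row of 'irr' holds one raw line of the CSV
df = spark.createDataFrame(
    [("'1' , 'Alice'",), ("'2B' , 'ob'",), ("'3Ri' , 'chard'",)],
    ['irr'],
)

result = (
    df.withColumn('ID', regexp_extract('irr', r'(\d+)', 1))
      # Assumption: every letter in the row belongs to the name, so dropping
      # all non-letter characters reassembles it ("'2B' , 'ob'" -> "Bob")
      .withColumn('Name', concat(lit("'"), regexp_replace('irr', '[^A-Za-z]', ''), lit("'")))
      .drop('irr')
)
result.show()
# Produces IDs 1, 2, 3 with Names 'Alice', 'Bob', 'Richard'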