Adjusting incorrect data from a CSV file in a PySpark dataframe

I am trying to read a CSV file into a dataframe in PySpark, but the file has mixed-up data: part of each Name value has ended up in the adjacent ID column. Is there a way to modify the dataframe in Python to get the expected output?

Sample CSV

ID   , Name  
'1'  , 'Alice'
'2B' , 'ob'
'3Ri' , 'chard'

Expected Output

ID, Name  
1, 'Alice'
2, 'Bob'
3, 'Richard' 

Answer

You can do this by making use of regexp_extract from pyspark.sql.functions.

My approach would be something like this:

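from pyspark.sql.functions import regexp_extract

# Read with a separator that does not occur in the data, so each line lands in a single column
df = spark.read.csv('filename', header=True, sep='|')

# Rename that single column to 'irr' so it is easy to reference
newcolnames = ['irr']
for c, n in zip(df.columns, newcolnames):
    df = df.withColumnRenamed(c, n)

# Wrap the chain in parentheses so it can span several lines.
# r'(\d+)' captures the first run of digits; replace 'your_regex_pattern' with a pattern for the name.
(df.withColumn('ID', regexp_extract(df['irr'], r'(\d+)', 1))
   .withColumn('Name', regexp_extract(df['irr'], 'your_regex_pattern', 0))
   .drop(df['irr'])
   .show())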
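
To check the regex logic without a Spark session, here is a minimal plain-Python sketch run against the sample rows from the question. It is only one possible way to fill in the name pattern, and it assumes that every ID is a run of digits and that names consist only of ASCII letters (no spaces or hyphens). The helper name split_row is hypothetical. In PySpark, the same two patterns could probably be passed to regexp_extract (for the ID) and regexp_replace (for the name), both from pyspark.sql.functions.

```python
import re
from typing import Tuple


def split_row(line: str) -> Tuple[int, str]:
    """Split one raw line into (ID, Name).

    Assumption: the ID is the first run of digits, and the name is every
    ASCII letter on the line, in order, wherever the comma happened to fall.
    """
    id_match = re.search(r"\d+", line)
    if id_match is None:
        raise ValueError(f"no numeric ID found in line: {line!r}")
    # Drop quotes, commas, digits and spaces; what remains is the name.
    name = re.sub(r"[^A-Za-z]", "", line)
    return int(id_match.group()), name


# Data rows from the sample CSV in the question (header line excluded).
raw_lines = ["'1'  , 'Alice'", "'2B' , 'ob'", "'3Ri' , 'chard'"]
rows = [split_row(line) for line in raw_lines]
print(rows)
```

Stripping every non-letter character is simpler than trying to match the two name fragments separately, but it would also merge multi-word names, so the pattern needs adjusting if the real data contains spaces.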
User contributions licensed under: CC BY-SA