I have one numeric feature in a data frame but in excel some of the values contain quotes which need to be removed.
Below table is what my data appears to be in Excel file now I want to remove quotes from last 3 rows using python.
Col1 | Col2 |
---|---|
123 | A |
456 | B |
789 | C |
“123” | D |
“456” | E |
“789” | F |
I have used following code in Python:
df["Col1"] = df['Col1'].replace('"', ' ').astype(int)
But above code gives me error message: invalid literal for int() with base 10: ‘”123″‘.
I have also tried strip() function but still it is not working.
If I do not convert the data type and use below code
df["Col1"] = df['Col1'].replace('"', ' ')
Then the code is getting executed without any error however while saving the file into CSV it is still showing quotes.
Advertisement
Answer
One way is to use converter function while reading Excel file. Something along those lines (assuming that data provided is in Excel file in columns ‘A’ and ‘B’):
import pandas as pd def conversion(value): if type(value) == int: return value else: return value.strip('"') df = pd.read_excel('remove_quotes_excel.xlsx', header=None, converters={0: conversion}) # df 0 1 0 123 A 1 456 B 2 789 C 3 123 D 4 456 E 5 789 F
Both columns are object type, but now (if needed) it is straightforward to convert to int:
df[0] = df[0].astype(int)