Pandas skips lines in read_csv: can I record these to a variable or log file?

I’ve seen similar questions on here but nothing that is quite what I want to do.

I’m reading in a tsv/csv file using

        try:
            dataframe = pd.read_csv(
                filepath_or_buffer=filename_or_obj,
                sep='\t',
                encoding='utf-8',
                skip_blank_lines=True,
                error_bad_lines=False,
                warn_bad_lines=True,
                dtype=data_type_dict,
                engine='python',
                quoting=csv.QUOTE_NONE
            )
        except UnicodeDecodeError:
            dataframe = pd.read_csv(
                filepath_or_buffer=exception_filename_or_obj,
                sep='\t',
                encoding='latin-1',
                skip_blank_lines=True,
                error_bad_lines=False,
                warn_bad_lines=True,
                dtype=data_type_dict,
                engine='python',
                quoting=csv.QUOTE_NONE
            )

I have clearly defined headers within the file, but sometimes the file has unexpected additional columns, and I get messages like the following in the console:

Skipping line 251643: Expected 20 fields in line 251643, saw 21

This is fine for my process; I would just like a way to record these messages or lines to either a dataframe or a log file so that I know which lines have been skipped. Because the files can be submitted by anyone and the problem is one of formatting, I’m not interested in fixing the cause of the message, just in recording the line numbers that fail.

Massive thanks in advance :)

Edit: include try except clause

Answer

To reproduce the issue, I used the following CSV file (dummy.csv):

F1,F2,F3
11,A,10.54
18,B,0.12,low
24,A,19.00
10,C,7.01,low
22,D,39.11,high
49,E,12.12

Note that some lines have an extra field.

Since we are using error_bad_lines=False, no exception is raised for the bad lines, so try/except will not catch them. The "Skipping line" messages are written to stderr, so we need to redirect stderr instead:

from contextlib import redirect_stderr
import pandas as pd
# import io

# The "Skipping line ..." messages go to stderr, so send stderr to a log file.
with open('error_messages.log', 'w') as h:
    # Alternative: capture the messages in memory first.
    # f = io.StringIO()
    # with redirect_stderr(f):
    with redirect_stderr(h):
        df = pd.read_csv(
            filepath_or_buffer='dummy.csv',
            sep=',',                 # change it for your data
            encoding='latin-1',
            skip_blank_lines=True,
            error_bad_lines=False,   # skip bad lines instead of raising
            # dtype=data_type_dict,
            engine='python',
            # quoting=csv.QUOTE_NONE
        )
        # h.write(f.getvalue())      # Write the error messages to log file

print(df)

The code above writes the messages to the log file error_messages.log.

Here is a sample output from the log file:

Skipping line 3: Expected 3 fields in line 3, saw 4
Skipping line 5: Expected 3 fields in line 5, saw 4
Skipping line 6: Expected 3 fields in line 6, saw 4
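
If you want the skipped line numbers in a variable rather than only in the log, one option is to parse the logged messages. The following is a minimal sketch that assumes the messages keep exactly the format shown above (the wording may differ between pandas versions). The log text is inlined here so the example is self-contained, and the names `pattern`, `skipped` and `skipped_line_numbers` are made up for illustration. In practice you would read the text from error_messages.log or from an io.StringIO buffer instead.

```python
import re

# Sample log content, copied from the output above.
# In practice: log_text = open('error_messages.log').read()
log_text = """Skipping line 3: Expected 3 fields in line 3, saw 4
Skipping line 5: Expected 3 fields in line 5, saw 4
Skipping line 6: Expected 3 fields in line 6, saw 4
"""

# Assumed message format: "Skipping line N: Expected E fields in line N, saw S"
pattern = re.compile(
    r"Skipping line (\d+): Expected (\d+) fields in line \d+, saw (\d+)"
)

# One record per skipped line: line number, expected and actual field counts.
skipped = [
    {"line": int(m.group(1)), "expected": int(m.group(2)), "saw": int(m.group(3))}
    for m in pattern.finditer(log_text)
]

skipped_line_numbers = [row["line"] for row in skipped]
print(skipped_line_numbers)  # [3, 5, 6]
```

The list of dicts can be passed straight to pd.DataFrame(skipped) if you would rather keep the result as a dataframe.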

Update

Modified the code based on a suggestion (in comments below)
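
A note for newer pandas versions: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0 in favour of on_bad_lines. From pandas 1.4 onwards, on_bad_lines also accepts a callable when engine='python'. Below is a rough sketch of that alternative, not the approach used above. The callable receives the fields of the bad line (not its line number), so this records the contents of the skipped rows. The names `bad_lines` and `record_bad_line` are made up for illustration, and the CSV is inlined so the example runs on its own.

```python
import io
import pandas as pd

# Same content as dummy.csv, inlined so the sketch is self-contained.
data = """F1,F2,F3
11,A,10.54
18,B,0.12,low
24,A,19.00
10,C,7.01,low
22,D,39.11,high
49,E,12.12
"""

bad_lines = []  # hypothetical name: collects the fields of each rejected row


def record_bad_line(fields):
    """Remember the offending row, then return None so pandas skips it."""
    bad_lines.append(fields)
    return None


df = pd.read_csv(
    io.StringIO(data),
    sep=',',
    engine='python',               # a callable is only supported by the python engine
    on_bad_lines=record_bad_line,  # requires pandas >= 1.4
)

print(df)         # the three well-formed rows
print(bad_lines)  # the rows that had an extra field
```

If you need the line numbers rather than the row contents, redirecting and parsing the messages as shown earlier is probably still the simpler route on versions that print them.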

User contributions licensed under: CC BY-SA