Why does read_csv skiprows value need to be lower than it should be in this case?


I have a log file (test.TXT in this case):

# 1: 5
# 3: x
# F: 5.
# ID: 001
# No.: 2
# No.: 4
# Time: 20191216T122109
# Value: ";"
# Time: 4
# Time: ""
# Time ms: ""
# Date: ""
# Time separator: "T"
# J: 1000000
# Silent: false
# mode: true
Timestamp;T;ID;P
16T122109957;0;6;0006

To read this log file into pandas and skip all of the header lines, I would expect to pass skiprows=16, like so:

pd.read_csv('test.TXT',skiprows=16,sep=';')

But this raises EmptyDataError because it skips past the point where the data starts. To make it work, I had to pass skiprows=11 instead:

pd.read_csv('test.TXT',skiprows=11,sep=';')
      Timestamp  T  ID  P
0  16T122109957  0   6  6

My question is: if the data doesn't start until line 17, why do I need to pass skiprows=11 rather than 16?

Answer

One workaround is to use the comment parameter of pd.read_csv:

from io import StringIO

import pandas as pd

text='''# 1: 5
# 3: x
# F: 5.
# ID: 001
# No.: 2
# No.: 4
# Time: 20191216T122109
# Value: ";"
# Time: 4
# Time: ""
# Time ms: ""
# Date: ""
# Time separator: "T"
# J: 1000000
# Silent: false
# mode: true
Timestamp;T;ID;P
16T122109957;0;6;0006'''

# Lines beginning with '#' are treated as comments and ignored entirely
df = pd.read_csv(StringIO(text), comment='#', sep=';')
df
      Timestamp  T  ID  P
0  16T122109957  0   6  6

Or, naming the header row explicitly:

df = pd.read_csv(StringIO(text), header=0, comment='#', sep=';')

From the docs for the header parameter:

Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
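
To illustrate that note, here is a small sketch using made-up sample data (not the log file from the question). It assumes the default skip_blank_lines=True. The comment lines and the blank line are ignored, so header=0 picks up the first remaining line.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: two comment lines, one blank line, then the real table.
sample = "# note: a\n# note: b\n\nname;value\nx;1\ny;2\n"

# header=0 refers to the first line left after comment and blank lines are
# dropped, not to the first physical line of the file.
demo = pd.read_csv(StringIO(sample), sep=';', comment='#', header=0)
print(demo.columns.tolist())
print(len(demo))
```

One caveat: comment applies anywhere in a line, so a '#' inside a data row would cut that row short.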

I'm not sure what causes the odd skiprows behaviour here.
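
One plausible explanation, offered as an assumption rather than something confirmed above: the default parser still honours quote characters while it is skipping rows. The header line # Value: ";" contains a separator followed by a double quote, which appears to open a quoted field that runs across the following line breaks until # Time separator: "T". Those six physical lines would then count as a single skipped row, which gives 16 - 5 = 11. If that is right, turning off quote handling with quoting=csv.QUOTE_NONE should let skiprows=16 behave as expected. A minimal sketch, reusing the log from the question:

```python
import csv
from io import StringIO

import pandas as pd

text = '''# 1: 5
# 3: x
# F: 5.
# ID: 001
# No.: 2
# No.: 4
# Time: 20191216T122109
# Value: ";"
# Time: 4
# Time: ""
# Time ms: ""
# Date: ""
# Time separator: "T"
# J: 1000000
# Silent: false
# mode: true
Timestamp;T;ID;P
16T122109957;0;6;0006'''

# Assumption: quote handling is what makes skiprows miscount.
# With quoting disabled, each physical line counts as one row,
# so skipping the 16 header lines leaves the column names and the data.
df = pd.read_csv(StringIO(text), skiprows=16, sep=';', quoting=csv.QUOTE_NONE)
print(df)
```

The trade-off is that QUOTE_NONE disables quoting for the data rows as well, so it only suits files whose data contains no quoted fields. The comment='#' approach avoids that restriction.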



Source: stackoverflow