Why is the file row count greater than len(dataframe)?

Good morning,

I’m new to Python and the data analysis world, so bear with me. I’ve been trying to understand why counting the lines in a file gives the right answer, but after reading the file into a DataFrame, len(dataframe) gives the row count minus one.

I’m sure it’s simple but I’ve googled it for about two hours and I didn’t find an answer yet, so would you kindly explain this to me:

import pandas as pd

filename = 'amazon_labelled.txt'
with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

csv = pd.read_csv(filename, sep='\t')
df1 = pd.DataFrame(csv)
print(df1.shape[0])  # 999
print(len(df1))  # 999
print(len(df1.index))  # 999

EDIT: It seems that when converting txt to csv file, some lines went missing:

filename = 'imdb_labelled.txt'
with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

csv = pd.read_csv(filename, sep='\t', header=None)
print(csv.index)  # RangeIndex(start=0, stop=748, step=1)
print(csv)

I’m wondering now, does it have something to do with using sep='\t'?

Answer

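The reason is that read_csv uses the first line of the file as the column names by default, so that line is not counted as a data row. To avoid this and get default integer column names (0, 1, ...) instead, pass the header=None parameter: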

filename = 'amazon_cells_labelled.txt'
with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

# header=None: the first line of the file is kept as the first row of data
df1 = pd.read_csv(filename, sep='\t', header=None)

print(df1.shape[0])  # 1000
print(len(df1))  # 1000
print(len(df1.index))  # 1000

Your code:

# default header: the first line of the file is converted to column names
df1 = pd.read_csv(filename, sep='\t')
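
To see the difference without the original file, here is a minimal sketch on an in-memory sample. The three sample lines and the column names "sentence" and "label" are made up for illustration and are not taken from the real dataset:

```python
import io

import pandas as pd

# Hypothetical three-line, tab-separated sample standing in for the real file.
sample = "Good phone\t1\nBad battery\t0\nWorks fine\t1\n"

# Default (header='infer'): the first line becomes the column labels.
with_header = pd.read_csv(io.StringIO(sample), sep="\t")

# header=None: every line is data; columns are labelled 0, 1, ...
no_header = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# header=None plus names=: every line is data, with illustrative labels.
named = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, names=["sentence", "label"]
)

print(len(with_header))           # 2
print(list(with_header.columns))  # ['Good phone', '1']
print(len(no_header))             # 3
print(len(named))                 # 3
```

So the "missing" row is not lost, it has simply been used as the header.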

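EDIT: The next file contains double-quote characters ("), which pandas parses incorrectly by default: a field that starts with " is read as quoted, so all following lines up to the next closing " are merged into one row. To avoid this, pass quoting=3 (the same as csv.QUOTE_NONE) so that quotes are treated as ordinary characters: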

filename = 'imdb_labelled.txt'
with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

df = pd.read_csv(filename, sep='\t', header=None, quoting=3)
print(len(df.index))  # 1000
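
The quoting effect can also be reproduced on a small in-memory sample. This is only a sketch: the four lines below are invented to imitate a stray opening quote that is closed a few lines later, and the real file may differ in detail:

```python
import csv
import io

import pandas as pd

# Hypothetical sample: line 2 opens a quote that is only closed on line 4.
sample = (
    'Plain line\t1\n'
    '"An opening quote with no close\t0\n'
    'Middle line\t1\n'
    'Ends with a quote"\t1\n'
)

# Default quoting: the quoted span swallows the tabs and newlines inside it,
# so several physical lines collapse into a single row.
default_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# csv.QUOTE_NONE (the integer 3): quotes are ordinary characters.
raw_df = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, quoting=csv.QUOTE_NONE
)

print(csv.QUOTE_NONE)   # 3
print(len(default_df))  # fewer than 4, because lines were merged
print(len(raw_df))      # 4
```

Writing quoting=csv.QUOTE_NONE instead of quoting=3 does the same thing and is a bit easier to read.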
User contributions licensed under: CC BY-SA