Good morning,
I’m new to Python and the data analysis world, so bear with me. I’ve been trying to understand why counting the rows of a file gives the right answer, but after reading the file into a DataFrame, len(dataframe) gives the row count minus 1.
I’m sure it’s simple, but I’ve googled it for about two hours and haven’t found an answer yet, so would you kindly explain this to me:
import pandas as pd

filename = 'amazon_labelled.txt'

# Count the raw lines in the file
with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

# Read the same file as tab-separated values
csv = pd.read_csv(filename, sep='\t')
df1 = pd.DataFrame(csv)
print(df1.shape[0])  # 999
print(len(df1))  # 999
print(len(df1.index))  # 999
EDIT: It seems that some lines went missing when the txt file was read as a csv:
filename = 'imdb_labelled.txt'

with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

csv = pd.read_csv(filename, sep='\t', header=None)
print(csv.index)  # RangeIndex(start=0, stop=748, step=1)
print(csv)
Now I’m wondering: does it have something to do with using sep='\t'?
Answer
The reason is that read_csv uses the first row of the file as the column names by default, so that row is not counted as data. To avoid this and get default integer column names instead, pass the header=None parameter:
filename = 'amazon_cells_labelled.txt'

with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

# header=None: the first row of the file is treated as the first row of data
df1 = pd.read_csv(filename, sep='\t', header=None)
print(df1.shape[0])  # 1000
print(len(df1))  # 1000
print(len(df1.index))  # 1000
Your code:
# default header: the first row of the file is converted to column names
df1 = pd.read_csv(filename, sep='\t')
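The difference can be reproduced without the real file. The following is a minimal sketch, assuming a made-up three-line, tab-separated sample (the sample text and the variable names are illustrative, not taken from the actual data):

```python
import io

import pandas as pd

# Hypothetical three-line, tab-separated sample standing in for the real file.
sample = "good phone\t1\nbad battery\t0\nworks fine\t1\n"

# Default behaviour: the first line is consumed as the column names.
with_header = pd.read_csv(io.StringIO(sample), sep="\t")

# header=None: every line is data and the columns are named 0, 1, ...
no_header = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

print(len(with_header))  # 2
print(len(no_header))  # 3
print(list(with_header.columns))  # ['good phone', '1']
```

The one-row gap between the two lengths is the same gap as 999 versus 1000 in the question.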
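EDIT: The next file contains " characters, so pandas parses it incorrectly. A field that starts with " is read together with all the following rows, up to the next closing ", as one row. To avoid this, pass the quoting=3 parameter (the value of csv.QUOTE_NONE), which turns off quote handling: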
filename = 'imdb_labelled.txt'

with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

# quoting=3 (csv.QUOTE_NONE): treat " as an ordinary character
df = pd.read_csv(filename, sep='\t', header=None, quoting=3)
print(len(df.index))  # 1000
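To see how an unbalanced quote swallows rows, here is a small sketch that assumes a made-up four-line sample in which a quote opened at the start of line 1 is only closed on line 3 (the sample text is illustrative, not from the IMDB file):

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the quote opened on line 1 is closed on line 3.
sample = (
    '"An unfinished quote\t0\n'
    'A plain line\t1\n'
    'The end"\t1\n'
    'Another line\t0\n'
)

# Default quoting: lines 1 to 3 are merged into one quoted field.
merged = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# csv.QUOTE_NONE is the named constant behind quoting=3.
separate = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, quoting=csv.QUOTE_NONE
)

print(len(merged))  # 2
print(len(separate))  # 4
```

With quote handling off, the " characters stay in the text column, so they may need to be stripped afterwards if they are not wanted.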