Good morning,
I’m new to Python and the data analysis world, so bear with me. I’ve been trying to understand why counting the rows of a file gives the right answer, but after loading it into a DataFrame, len(dataframe) gives the row count minus one.
I’m sure it’s simple, but I’ve googled it for about two hours and haven’t found an answer yet, so would you kindly explain this to me:
import pandas as pd

filename = 'amazon_labelled.txt'
with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

csv = pd.read_csv(filename, sep='\t')
df1 = pd.DataFrame(csv)
print(df1.shape[0])  # 999
print(len(df1))  # 999
print(len(df1.index))  # 999
EDIT: It seems that when converting the txt file to csv, some lines go missing:
filename = 'imdb_labelled.txt'
with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

csv = pd.read_csv(filename, sep='\t', header=None)
print(csv.index)  # RangeIndex(start=0, stop=748, step=1)
print(csv)

I’m wondering now: does it have something to do with using sep='\t'?
Answer
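The reason is that read_csv treats the first row of the file as the column names by default, so that row is not counted as data. To avoid this and have the columns named by a range (0, 1, ...), use the header=None parameter: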
filename = 'amazon_cells_labelled.txt'
with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

# first row of the file is kept as the first row of data
df1 = pd.read_csv(filename, sep='\t', header=None)
print(df1.shape[0])  # 1000
print(len(df1))  # 1000
print(len(df1.index))  # 1000
Your code:
# first row of the file is converted to column names
df1 = pd.read_csv(filename, sep='\t')
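The difference between the two calls can be reproduced without the original file. The following is a minimal sketch: the three-line sample string is made up for illustration and only assumes the same tab-separated text/label layout as the files above.

```python
import io
import pandas as pd

# Hypothetical three-line, tab-separated sample standing in for the real file.
sample = "good phone\t1\nbad battery\t0\nworks fine\t1\n"

# Default (header='infer'): the first line is used as the column names.
with_header = pd.read_csv(io.StringIO(sample), sep="\t")

# header=None: every line is data; columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

print(len(sample.splitlines()))    # 3
print(len(with_header))            # 2
print(len(no_header))              # 3
print(list(with_header.columns))   # ['good phone', '1']
```

If meaningful column names are wanted, names=['text', 'label'] can be passed instead (those two names are only an example); when names is given and header is not, read_csv also treats the first line as data.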
EDIT: The next file contains double-quote characters ("), so pandas parses it incorrectly: a field that starts with " is read together with all following rows, up to the next closing ", as a single row. To avoid this, use the quoting=3 parameter (csv.QUOTE_NONE):
filename = 'imdb_labelled.txt'
with open(filename, encoding="utf8") as f:
    row_count = sum(1 for line in f)
print(row_count)  # 1000

df = pd.read_csv(filename, sep='\t', header=None, quoting=3)
print(len(df.index))  # 1000
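The effect of stray double quotes can also be shown on a small made-up sample. This is only a sketch: the five lines below are invented for illustration and are not taken from imdb_labelled.txt.

```python
import csv
import io
import pandas as pd

# Hypothetical sample: line 2 opens a quote that is only closed on line 4.
sample = (
    "plain line\t1\n"
    '"unclosed quote starts here\t0\n'
    "another line\t1\n"
    'ends with quote"\t0\n'
    "last line\t1\n"
)

# Default quoting: lines 2-4 are merged into one quoted field.
default = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# quoting=csv.QUOTE_NONE (the constant equals 3): quotes are ordinary characters.
raw = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                  quoting=csv.QUOTE_NONE)

print(len(sample.splitlines()))  # 5
print(len(default))              # 3
print(len(raw))                  # 5
```

Writing csv.QUOTE_NONE instead of the bare 3 makes the intent easier to read; note that with this setting the quote characters remain in the text column.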