Why is the file row count greater than len(dataframe)?

Question

Good morning, I'm new to Python and the data analysis world, so bear with me. I've been trying to understand why counting the rows of a file gives the right answer, but after converting it to a dataframe, len(dataframe) gives the row count minus 1. I'm sure it's simple, but I've googled it for about two hours and haven't found an answer yet.

Accepted Answer

The reason is that the first row of the CSV is converted to column names. To avoid this and have the columns named by a default range instead, use the header=None parameter:

    import pandas as pd

    filename = 'amazon_cells_labelled.txt'
    with open(filename, encoding="utf8") as f:
        row_count = sum(1 for line in f)
    print(row_count)  # 1000

    # first row of the csv is the first row of data
    df1 = pd.read_csv(filename, sep='\t', header=None)
    print(df1.shape[0])  # 1000
    print(len(df1))  # 1000
    print(len(df1.index))  # 1000

Your code:

    # first row of the csv is converted to column names
    df1 = pd.read_csv(filename, sep='\t')

EDIT: The other files contain " characters, so pandas parses them incorrectly. To stop a field that starts with " and the following rows up to the closing " from being read as one row, use the quoting=3 parameter (3 is the value of csv.QUOTE_NONE):

    filename = 'imdb_labelled.txt'
    with open(filename, encoding="utf8") as f:
        row_count = sum(1 for line in f)
    print(row_count)  # 1000

    df = pd.read_csv(filename, sep='\t', header=None, quoting=3)
    print(len(df.index))  # 1000
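
Both effects can be reproduced without the original files. The sketch below is only an illustration: the sample strings and variable names are made up, and io.StringIO stands in for a file on disk. It first shows the header row being consumed, then a stray pair of double quotes causing two lines to merge.

```python
import csv
import io

import pandas as pd

# Hypothetical sample data (not the original files): tab-separated, no header row.
plain = "Good phone\t1\nBad battery\t0\nWorks fine\t1\n"
row_count = len(plain.splitlines())  # 3 physical lines

# Default behaviour: the first line becomes the column names, so one row "disappears".
with_header = pd.read_csv(io.StringIO(plain), sep="\t")
# header=None: every line is data and the columns are named 0, 1, ...
no_header = pd.read_csv(io.StringIO(plain), sep="\t", header=None)
print(row_count, len(with_header), len(no_header))  # 3 2 3

# Same data, but a double quote opens on line 2 and closes on line 3.
quoted = 'Good phone\t1\n"Bad battery\t0\nWorks fine"\t1\n'
# Default quoting treats everything between the quotes as one field, merging lines.
merged = pd.read_csv(io.StringIO(quoted), sep="\t", header=None)
# csv.QUOTE_NONE (the same as quoting=3) treats quotes as ordinary characters.
raw = pd.read_csv(io.StringIO(quoted), sep="\t", header=None, quoting=csv.QUOTE_NONE)
print(len(merged), len(raw))  # merged has fewer rows than raw
```

With quoting disabled, the quote characters stay in the text column, so they may need to be stripped afterwards if they are not wanted.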

Answer