I am currently using pandas library to read data from a CSV file. The data includes a “data” column which consists of 1’s and 0’s, and a “published_at” column which has unique time and date stamps (I have converted it to become the index of the dataframe). Click here to see picture of the Dataframe from CSV (I deleted the core_id data as it is irrelevant).
In the data, a “1” means yes and “0” means no. I would like to analyze the data by looping through the dataframe from a certain start date to an end date (i.e. 2020-11-26 to 2020-11-27) and count how many times “1” (yes_data) occurred, and how many times “0” (no_data) occurred in each day. And from there, I would like to create a new CSV file or dataframe that contains that data so I can analyze it from there.
The way I tried to approach this is by creating a nested dictionary and trying to populate it by looping through the main dataframe and counting how many times “yes” and “no” occurred per day.
I would like to end up with a dictionary (or dataframe, csv file, whatever..) that has 3 columns: date (i.e. 2020-11-26), “yes” count, and “no” count.
Below is the code I came up with:
yes_data = 0 no_data = 0 date_id = '2020-11-26' # Create a dictionary to populate a new dataframe #new_data = { # date_id: {'yes': yes_data, 'no': no_data}, # "2020-11-27": {'yes': 2, 'no': 2}} new_data = {"":{}} # I tried to convert the csv data to a dictionary but I don't know # if this is necessary so I commented it out # csv_dict = csv_data.to_dict csv_dict = csv_data for day in csv_dict['2020-11-26':'2020-11-27']: new_data[day] = csv_dict[day] for state in csv_dict['2020-11-26':'2020-11-27'].data: if state == 1: yes_data += 1 new_data[day][state] == yes_data elif state == 0: no_data += 1 new_data[day][state] == no_data
However the code does not work (I keep getting errors everywhere..). How can I fix it to do what I’m trying to do? Any help is appreciated. Thank you!
P.S. I’m fairly new to Python, trying my best here!
Advertisement
Answer
Hope you’re doing well. This snippet will help you do the job!
result = {} for index, row in df.iterrows(): # Iterates over the row date = row['published_at'].split(' ')[0] # This line takes only the date of the row ( not hour and minute ...) ans = row['data'] # Finds if the data is zero or one if date not in result: # Creates an entery for this date if it hasn't created yet result[date] = {'yes':0, 'no':0} if ans: # Increases the number of yes if ans == 1 result[date]['yes']+=1 else: # Increases the number of yes if ans == 0 result[date]['no'] +=1
Keep in mind that df is your data frame and row[‘data’] means the 0 or 1. So if you have different names, change it. At the end of this code, you’re gonna have a dictionary with the structure as mentions in the following.
result = {'2020-12-01': {'yes': 2, 'no': 4}, '2020-12-4': {'yes':n, 'no':m} }
Have a good day