I have a csv dataset (with > 8m rows) that I load into a dataframe. The csv has columns like:
...,started_at,ended_at,...
2022-04-01 18:23:32,2022-04-01 22:18:15
2022-04-02 01:16:34,2022-04-02 02:18:32
...
I am able to load the dataset into my dataframe, but then I need to add multiple calculated columns to the dataframe, computed per row. In other words, unlike this SO question, I do not want the rows of the new columns to all share the same initial value (col 1 all NaN, col 2 all "dogs", etc.).
Right now, I can add my columns by doing something like:
df['start_time'] = df.apply(lambda row: add_start_time(row['started_at']), axis=1)
df['start_cat'] = df.apply(lambda row: add_start_cat(row['start_time']), axis=1)
df['is_dark'] = df.apply(lambda row: add_is_dark(row['started_at']), axis=1)
df['duration'] = df.apply(lambda row: calc_dur(row['started_at'], row['ended_at']), axis=1)
But it seems inefficient since the entire dataset is processed N times (once for each call).
It seems that I should be able to calculate all of the new columns in a single go, but I am missing some conceptual approach.
Examples:
def calc_dur(started_at, ended_at):
    # started_at, ended_at are datetime64[ns]; converted at csv load
    diff = ended_at - started_at
    return diff.total_seconds() / 60

def add_start_time(started_at):
    # started_at is datetime64[ns]; converted at csv load
    return started_at.time()

def add_is_dark(started_at):
    # TZ is pytz.timezone('US/Central')
    # chi_town is the astral lookup for Chicago
    st = started_at.replace(tzinfo=TZ)
    chk = sun(chi_town.observer, date=st, tzinfo=chi_town.timezone)
    return st >= chk['dusk'] or st <= chk['dawn']
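As a side note, at least calc_dur and add_start_time can likely be expressed as whole-column operations rather than row-wise apply calls. A minimal sketch, assuming started_at and ended_at are already datetime64[ns] columns (untested against the real data):

# Vectorized equivalents of calc_dur and add_start_time (sketch)
df['duration'] = (df['ended_at'] - df['started_at']).dt.total_seconds() / 60
df['start_time'] = df['started_at'].dt.time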
Update 1
Following the information from MoRe, I was able to get the essentials working. I needed to augment the answer by adding the column names, and then specify the index in the merge.
data = pd.Series(df.apply(lambda x: [
    add_start_time(x['started_at']),
    add_is_dark(x['started_at']),
    yrmo(x['year'], x['month']),
    calc_duration_in_minutes(x['started_at'], x['ended_at']),
    add_start_cat(x['started_at'])
], axis=1))

new_df = pd.DataFrame(data.tolist(), data.index,
                      columns=['start_time', 'is_dark', 'yrmo', 'duration', 'start_cat'])
df = df.merge(new_df, left_index=True, right_index=True)
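An alternative sketch, assuming the same helper functions, that skips the intermediate Series by letting apply expand the returned list into columns and then joining on the index:

# result_type='expand' turns each returned list into one row of a new frame
new_cols = df.apply(
    lambda x: [add_start_time(x['started_at']),
               add_is_dark(x['started_at']),
               yrmo(x['year'], x['month']),
               calc_duration_in_minutes(x['started_at'], x['ended_at']),
               add_start_cat(x['started_at'])],
    axis=1, result_type='expand')
new_cols.columns = ['start_time', 'is_dark', 'yrmo', 'duration', 'start_cat']
df = df.join(new_cols)  # index-aligned, same effect as the merge above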
Answer
import pandas as pd

data = pd.Series(dataframe.apply(lambda x: [
    function1(x[column_name]),
    function2(x[column_name]),
    function3(x[column_name])
], axis=1))
pd.DataFrame(data.tolist(), data.index)
If I understood your meaning correctly, this is your answer. But before anything else, please use the swifter package from pip :) First create a Series of lists, then convert it to columns…
swifter is a simple library (at least I think it is simple) that has only one useful method: apply
import swifter
data.swifter.apply(lambda x: x + 1)
It uses parallelism to improve speed on large datasets… on small ones it isn't helpful and can even be slower.
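As a rough sketch of what that could look like for the duration column from the question (assuming the calc_dur helper defined there):

import swifter  # importing registers the .swifter accessor on DataFrames

# Row-wise duration with swifter; on a small frame it falls back to a plain pandas apply
df['duration'] = df.swifter.apply(
    lambda row: calc_dur(row['started_at'], row['ended_at']), axis=1)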