Skip to content
Advertisement

How to add multiple columns to a dataframe based on calculations

I have a csv dataset (with > 8m rows) that I load into a dataframe. The csv has columns like:

...,started_at,ended_at,...
2022-04-01 18:23:32,2022-04-01 22:18:15
2022-04-02 01:16:34,2022-04-02 02:18:32
...

I am able to load the dataset into my dataframe, but then I need to add multiple calculated columns to the dataframe for each row. In otherwords, unlike this SO question, I do not want the rows of the new columns to have the same initial value (col 1 all NAN, col 2, all “dogs”, etc.).

Right now, I can add my columns by doing something like:

df['start_time'] = df.apply(lambda row: add_start_time(row['started_at']), axis = 1)
df['start_cat'] = df.apply(lambda row: add_start_cat(row['start_time']), axis = 1)
df['is_dark'] = df.apply(lambda row: add_is_dark(row['started_at']), axis = 1)
df['duration'] = df.apply(lamba row: calc_dur(row'[started_at'],row['ended_at']), axis = 1)

But it seems inefficient since the entire dataset is processed N times (once for each call).

It seems that I should be able to calculate all of the new columns in a single go, but I am missing some conceptual approach.

Examples:

def calc_dur(started_at, ended_at):
  # started_at, ended_at are datetime64[ns]; converted at csv load
  diff = ended_at - started_at
  return diff.total_seconds() / 60

def add_start_time(started_at):
  # started_at is datetime64[ns]; converted at csv load
  return started_at.time()

def add_is_dark(started_at):
  # tz is pytz.timezone('US/Central')
  # chi_town is the astral lookup for Chicago
  st = started_at.replace(tzinfo=TZ)
  chk = sun(chi_town.observer, date=st, tzinfo=chi_town.timezone)
  return st >= chk['dusk'] or st <= chk['dawn']

Update 1

Following on the information for MoRe, I was able to get the essential working. I needed to augment by adding the column names, and then with the merge to specify the index.

  data = pd.Series(df.apply(lambda x: [
                    add_start_time(x['started_at']),
                    add_is_dark(x['started_at']),
                    yrmo(x['year'], x['month']),
                    calc_duration_in_minutes(x['started_at'], x['ended_at']),
                    add_start_cat(x['started_at'])
                  ], axis = 1))

  new_df = pd.DataFrame(data.tolist(),
                        data.index, 
                        columns=['start_time','is_dark','yrmo',
                                 'duration','start_cat'])

  df = df.merge(new_df, left_index=True, right_index=True)

Advertisement

Answer

import pandas as pd
data = pd.Series(dataframe.apply(lambda x: [function1(x[column_name]), function2(x[column_name)], function3(x[column_name])], axis = 1))
pd.DataFrame(data.tolist(),data.index)

if i understood your mean correctly, it’s your answer. but before everything please use Swifter pip :) first create a series by lists and convert it to columns…

swifter is a simple library (at least i think it is simple) that only has only one useful method: apply

import swifter
data.swifter.apply(lambda x: x+1)

it use parallel manner to improve speed in large datasets… in small ones, it isn’t good and even is worse

https://pypi.org/project/swifter/

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement