I’d like to iterate row-wise over 2 identically-shaped dataframes, passing the rows from each as vectors to a function without using loops. Essentially something similar to R’s mapply.
I’ve investigated a little and the best that I’ve seen uses map in a list comprehension, but I’m not doing it correctly. Even if we get this to work, though, it seems a bit clunky – is there a more elegant way to do this? Seems like this should be a functionality in pandas.
    import numpy as np
    import pandas as pd
    from scipy import stats

    df1 = pd.DataFrame(np.random.randn(3,3))
    df2 = pd.DataFrame(np.random.randn(3,3))
    sd_array = np.array([0.02, 0.015, 0.2])

    def helper_func(x, y):
        return stats.norm.pdf(x, loc=y, scale=sd_array).prod()

    res_lst = []
    row_cnt = df1.shape[0]
    res = [list(map(helper_func, df1.iloc[i,:], df2.iloc[i,:])) for i in range(row_cnt)]
    res_lst.append(res)
The way I currently have it written doesn’t give an error but also doesn’t return what I want. I should only have 3 values in the output, one for each row of the dataframe.
Answer
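You can just call helper_func(df1, df2) and, inside helper_func, return stats.norm.pdf(x, loc=y, scale=sd_array).prod(axis=1). The pdf is evaluated element-wise over the whole 2D input, and prod(axis=1) then multiplies across the columns of each row, giving one value per row. Be aware that your scale is so small that the returned values are almost always 0. Using scale=100*sd_array in the PDF will at least show some nonzero values.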
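As a sanity check on the vectorized call, here is a minimal sketch that compares it against an explicit row-by-row loop. The seed, the multiplier C, and the np.asarray conversion are assumptions added for illustration (the conversion is only a precaution so the result is a plain NumPy array whatever the input type); they are not part of the original question.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df1 = pd.DataFrame(rng.standard_normal((3, 3)))
df2 = pd.DataFrame(rng.standard_normal((3, 3)))
sd_array = np.array([0.02, 0.015, 0.2])
C = 100  # illustrative multiplier so the densities do not underflow to 0

def helper_func(x, y):
    # Work on plain 2D arrays; sd_array (shape (3,)) broadcasts across each row
    x = np.asarray(x)
    y = np.asarray(y)
    # Element-wise pdf, then product over the columns of each row
    return stats.norm.pdf(x, loc=y, scale=C * sd_array).prod(axis=1)

vectorized = helper_func(df1, df2)

# Reference implementation: one row at a time
looped = np.array([
    stats.norm.pdf(
        df1.iloc[i].to_numpy(), loc=df2.iloc[i].to_numpy(), scale=C * sd_array
    ).prod()
    for i in range(len(df1))
])

print(vectorized.shape)                 # one value per row
print(np.allclose(vectorized, looped))  # both approaches agree
```

The loop is there only to confirm the result; the single call to helper_func is the part you would keep.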
In fact, you don’t need a dataframe in this example:
    import numpy as np
    from scipy import stats

    np.random.seed(1)
    data1 = np.random.randn(3,3)
    data2 = np.random.randn(3,3)
    sd_array = np.array([0.02, 0.015, 0.2])
    C = 100  # for demonstration purposes

    def helper_func(x, y):
        return stats.norm.pdf(x, loc=y, scale=C*sd_array).prod(axis=1)

    res = helper_func(data1, data2)
    print(res)
yields
array([0.0002616 , 0.00068695, 0.00035566])
But when using a dataframe instead of data1 or data2, NumPy/Pandas/SciPy are flexible enough to recognize the 2D array of values and use it as such.
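If the small values in sd_array are intentional, one possible way around the near-zero products is to work in log space: summing stats.norm.logpdf per row gives the logarithm of the per-row product without underflow. This is a sketch under that assumption, not something the question asked for; the name helper_func_log and the seed are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
data1 = rng.standard_normal((3, 3))
data2 = rng.standard_normal((3, 3))
sd_array = np.array([0.02, 0.015, 0.2])

def helper_func_log(x, y):
    # Sum of log-densities per row == log of the product of densities per row
    return stats.norm.logpdf(x, loc=y, scale=sd_array).sum(axis=1)

log_res = helper_func_log(data1, data2)
print(log_res)  # three finite (typically very negative) log-likelihood values
```

Comparing rows by their log values preserves the ordering of the original products, so this is often enough when the results are only ranked or fed into further likelihood calculations.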