I have some code that worked about 3-4 years ago. Since then I’ve upgraded to newer versions of pandas, NumPy, and Python, and it has broken. I’ve isolated what I believe is the issue, but I don’t quite understand why it occurs.
def function_name(S):
    L = df2.reindex(S.index.droplevel(['column1','column2']))*len(S)
    return (-L/np.expm1(-L) - 1)

gb = df.groupby(level=['name1', 'name2'])
dc = gb.transform(function_name)
Problem: the last line “dc” is a pandas.Series with only NaN values. It should have no NaN values.
Relevant information: the gb object is correct and has no NaN or null values. Also, when I print “L” or the return value inside the function, I get the correct values. However, they are lost somewhere in the “dc” line. When I swap ‘transform’ for ‘apply’ I get the correct values out of ‘dc’, but the object has duplicate column labels that make it unusable.
Thanks!
EDIT:
Below is some minimal code I spun up to produce the error.
import pandas as pd
import numpy as np

df1_arrays = [
    np.array(["CAT","CAT","CAT","CAT","CAT","CAT","CAT","CAT"]),
    np.array(["A","A","A","A","B","B","B","B"]),
    np.array(["AAAT","AAAG","AAAC","AAAD","AAAZ","AAAX","AAAW","AAAM"]),
]
df2_arrays = [
    np.array(["A","A","A","A","B","B","B","B"]),
    np.array(["AAAT","AAAG","AAAC","AAAD","AAAZ","AAAX","AAAW","AAAM"]),
]

df1 = pd.Series(np.abs(np.random.randn(8))*100, index=df1_arrays)
df2 = pd.Series(np.abs(np.random.randn(8)), index=df2_arrays)
df1.index.set_names(["mouse", "target", "barcode"], inplace=True)
df2.index.set_names(["target", "barcode"], inplace=True)

def function_name(S):
    lambdas = df2.reindex(S.index.droplevel(['mouse']))*len(S)
    return (-lambdas/np.expm1(-lambdas) - 1)

gb = df1.groupby(level=['mouse','target'])
d_collisions = gb.transform(function_name)
print(d_collisions)

mouse  target  barcode
CAT    A       AAAT      NaN
               AAAG      NaN
               AAAC      NaN
               AAAD      NaN
       B       AAAZ      NaN
               AAAX      NaN
               AAAW      NaN
               AAAM      NaN
Answer
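The NaNs appear because your function returns a Series whose index differs from the index of the group it was given (the 'mouse' level has been dropped). transform aligns the returned values to the group's original index, finds no matching labels, and fills every position with NaN.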
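To see the alignment effect in isolation, here is a minimal sketch using a flat index rather than your MultiIndex. The names `s`, `group_index`, `aligned`, and `positional` are made up for illustration, and this only approximates what transform does internally with each group's result:

```python
import pandas as pd

# A result Series labelled 'a', 'b' (stands in for the function's return value)
s = pd.Series([1.0, 2.0], index=["a", "b"])

# The index the caller expects (stands in for the group's original index)
group_index = pd.Index(["x", "y"])

# Building a Series from another Series aligns on labels:
# none of 'x', 'y' exist in s, so every value becomes NaN
aligned = pd.Series(s, index=group_index)

# Building it from a plain array is positional, so the values survive
positional = pd.Series(s.to_numpy(), index=group_index)

print(aligned)
print(positional)
```

The first Series is all NaN and the second keeps the values, which is the same difference as between returning a Series with mismatched labels and returning an array.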
You can return a numpy array in your function:
def function_name(S):
    lambdas = df2.reindex(S.index.droplevel(['mouse']))*len(S)
    return (-lambdas/np.expm1(-lambdas) - 1).to_numpy()  # convert to array here

gb = df1.groupby(level=['mouse','target'])
d_collisions = gb.transform(function_name)
output:
mouse  target  barcode
CAT    A       AAAT        6.338965
               AAAG        2.815679
               AAAC        0.547306
               AAAD        1.811785
       B       AAAZ        1.881744
               AAAX       10.986611
               AAAW        5.124226
               AAAM        0.250513
dtype: float64
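If you would rather return a Series than an array, another option that should work is to give the result the group's own index before returning it, so the labels match when transform aligns them. This is a sketch based on your example, not something tested against your real data; the seeded `rng` and the `idx1`/`idx2` names are my additions so the run is reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the example is reproducible

barcodes = ["AAAT", "AAAG", "AAAC", "AAAD", "AAAZ", "AAAX", "AAAW", "AAAM"]
idx1 = pd.MultiIndex.from_arrays(
    [["CAT"] * 8, list("AAAABBBB"), barcodes],
    names=["mouse", "target", "barcode"],
)
idx2 = idx1.droplevel("mouse")  # (target, barcode) only

df1 = pd.Series(np.abs(rng.standard_normal(8)) * 100, index=idx1)
df2 = pd.Series(np.abs(rng.standard_normal(8)), index=idx2)

def function_name(S):
    lambdas = df2.reindex(S.index.droplevel("mouse")) * len(S)
    result = -lambdas / np.expm1(-lambdas) - 1
    # Restore the group's full index so the labels line up in transform
    result.index = S.index
    return result

gb = df1.groupby(level=["mouse", "target"])
d_collisions = gb.transform(function_name)
print(d_collisions)
```

Both approaches rely on the reindexed values coming back in the same order as the rows of the group, which holds here because `reindex` follows the order of the labels it is given.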