Below is a portion of mydataframe which has many missing values.
A B S a b c d e a b c d e date 2020-10-15 1.0 2.0 NaN NaN NaN 10.0 11.0 NaN NaN NaN 2020-10-16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2020-10-17 NaN NaN NaN 4.0 NaN NaN NaN NaN 13.0 NaN 2020-10-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2020-10-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2020-10-20 4.0 6.0 4.0 1.0 9.0 10.0 2.0 13.0 4.0 13.0
I would like to replace the NANs in each column using a specific backward fill condition .
For example, in column (A,a) missing values appear for dates 16th, 17th, 18th and 19th. The next value is ‘4’ against 20th. I want this value (the next non missing value in the column) to be distributed among all these dates including 20th at a progressively increasing value of 10%. That is column (A,a) gets values of .655, .720,.793,.872 & .96 approximately for the dates 16th, 17th, 18th, 19th & 20th. This shall be the approach for all columns for all missing values across rows.
I tried using bfill() function but unable to fathom how to incorporate the required formula as an option.
I have checked the link Pandas: filling missing values in time series forward using a formula and a few other links on stackoverflow. This is somewhat similar, but in my case the the number of NANs in a given column are variable in nature and span multiple rows. Compare columns (A,a) with column (A,d) or column (B,d). Given this, I am finding it difficult to adopt the solution to my problem.
Appreciate any inputs.
Advertisement
Answer
Here is a completely vectorized way to do this. It is very efficient and fast: 130 ms on a 1000 x 1000 matrix. This is a good opportunity to expose some interesting techniques using numpy.
First, let’s dig a bit into the requirements, specifically what exactly the value for each cell needs to be.
The example given is [nan, nan, nan, nan, 4.0] –> [.66, .72, .79, .87, .96], which is explained to be a “progressively increasing value of 10%” (in such a way that the total is the “value to spread”: 4.0).
This is a geometric series with rate r = 1 + 0.1: [r^1, r^2, r^3, ...] and then normalized to sum to 1. For example:
r = 1.1 a = 4.0 n = 5 q = np.cumprod(np.repeat(r, n)) a * q / q.sum() # array([0.65518992, 0.72070892, 0.79277981, 0.87205779, 0.95926357])
We’d like to do a direct calculation (to avoid calling Python functions and explicit loops, which would be much slower), so we need to express that normalizing factor q.sum() in closed form. It is a well-established quantity and is:
To generalize, we need 3 quantities to calculate the value of each cell:
a: value to distributei: index of run (0 .. n-1)n: run length- then, the value is
v = a * r**i * (r - 1) / (r**n - 1).
To illustrate with the first column in the OP’s example, where the input is: [1, nan, nan, nan, nan, 4], we would like:
a = [1, 4, 4, 4, 4, 4]i = [0, 0, 1, 2, 3, 4]n = [1, 5, 5, 5, 5, 5]- then, the value
vwould be (rounded at 2 decimals):[1. , 0.66, 0.72, 0.79, 0.87, 0.96].
Now comes the part where we go about getting these three quantities as numpy arrays.
a is the easiest and is simply df.bfill().values. But for i and n, we do have to do a little bit of work, starting by assigning the values to a numpy array:
z = df.values nrows, ncols = z.shape
For i, we start with the cumulative count of NaNs, with reset when values are not NaN. This is strongly inspired by this SO answer for “Cumulative counts in NumPy without iteration”. But we do it for a 2D array, and we also want to add a first row of 0, and discard the last row to satisfy exactly our needs:
def rcount(z):
na = np.isnan(z)
without_reset = na.cumsum(axis=0)
reset_at = ~na
overcount = np.maximum.accumulate(without_reset * reset_at)
result = without_reset - overcount
return result
i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
For n, we need to do some dancing on our own, using first principles of numpy (I’ll break down the steps if I have time):
runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T))))) n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
So, putting it all together:
def spread_bfill(df, r=1.1):
z = df.values
nrows, ncols = z.shape
a = df.bfill().values
i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
v = a * r**i * (r - 1) / (r**n - 1)
return pd.DataFrame(v, columns=df.columns, index=df.index)
On your example data, we get:
>>> spread_bfill(df).round(2) # round(2) for printing purposes
A B
a b c d e a b c d e
S
2020-10-15 1.00 2.00 0.52 1.21 1.17 10.00 11.00 1.68 3.93 1.68
2020-10-16 0.66 0.98 0.57 1.33 1.28 1.64 0.33 1.85 4.32 1.85
2020-10-17 0.72 1.08 0.63 1.46 1.41 1.80 0.36 2.04 4.75 2.04
2020-10-18 0.79 1.19 0.69 0.30 1.55 1.98 0.40 2.24 1.21 2.24
2020-10-19 0.87 1.31 0.76 0.33 1.71 2.18 0.44 2.47 1.33 2.47
2020-10-20 0.96 1.44 0.83 0.37 1.88 2.40 0.48 2.71 1.46 2.71
For inspection, let’s look at each of the 3 quantities in that example:
>>> a [[ 1 2 4 4 9 10 11 13 13 13] [ 4 6 4 4 9 10 2 13 13 13] [ 4 6 4 4 9 10 2 13 13 13] [ 4 6 4 1 9 10 2 13 4 13] [ 4 6 4 1 9 10 2 13 4 13] [ 4 6 4 1 9 10 2 13 4 13]] >>> i [[0 0 0 0 0 0 0 0 0 0] [0 0 1 1 1 0 0 1 1 1] [1 1 2 2 2 1 1 2 2 2] [2 2 3 0 3 2 2 3 0 3] [3 3 4 1 4 3 3 4 1 4] [4 4 5 2 5 4 4 5 2 5]] >>> n [[1 1 6 3 6 1 1 6 3 6] [5 5 6 3 6 5 5 6 3 6] [5 5 6 3 6 5 5 6 3 6] [5 5 6 3 6 5 5 6 3 6] [5 5 6 3 6 5 5 6 3 6] [5 5 6 3 6 5 5 6 3 6]]
And here is a final example, to illustrate what happens if a column ends with 1 or several NaNs (they remain NaN):
np.random.seed(10)
a = np.random.randint(0, 10, (6, 6)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
>>> df
0 1 2 3 4 5
0 NaN NaN NaN NaN NaN 0.0
1 NaN NaN 9.0 NaN 8.0 NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN 8.0 4.0 NaN NaN NaN
4 NaN NaN NaN 6.0 9.0 NaN
5 NaN NaN 2.0 NaN 7.0 8.0
Then:
>>> spread_bfill(df).round(2) # round(2) for printing
0 1 2 3 4 5
0 NaN 1.72 4.29 0.98 3.81 0.00
1 NaN 1.90 4.71 1.08 4.19 1.31
2 NaN 2.09 1.90 1.19 2.72 1.44
3 NaN 2.29 2.10 1.31 2.99 1.59
4 NaN NaN 0.95 1.44 3.29 1.74
5 NaN NaN 1.05 NaN 7.00 1.92
Speed
a = np.random.randint(0, 10, (1000, 1000)).astype(float) a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7]) df = pd.DataFrame(a) %timeit spread_bfill(df) # 130 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)