Below is a portion of my dataframe, which has many missing values.
```
               A                          B
S              a    b    c    d    e      a     b     c     d     e
date
2020-10-15   1.0  2.0  NaN  NaN  NaN   10.0  11.0   NaN   NaN   NaN
2020-10-16   NaN  NaN  NaN  NaN  NaN    NaN   NaN   NaN   NaN   NaN
2020-10-17   NaN  NaN  NaN  4.0  NaN    NaN   NaN   NaN  13.0   NaN
2020-10-18   NaN  NaN  NaN  NaN  NaN    NaN   NaN   NaN   NaN   NaN
2020-10-19   NaN  NaN  NaN  NaN  NaN    NaN   NaN   NaN   NaN   NaN
2020-10-20   4.0  6.0  4.0  1.0  9.0   10.0   2.0  13.0   4.0  13.0
```
I would like to replace the NaNs in each column using a specific backward-fill condition.
For example, in column (A,a) missing values appear for the dates 16th, 17th, 18th and 19th. The next value is 4, against the 20th. I want this value (the next non-missing value in the column) to be distributed among all these dates, including the 20th, at a progressively increasing rate of 10%. That is, column (A,a) gets approximately the values .655, .720, .793, .872 and .960 for the 16th, 17th, 18th, 19th and 20th. This shall be the approach for all columns, for all missing values across rows.
I tried using the bfill() function, but I am unable to work out how to incorporate the required formula as an option.
I have checked the link Pandas: filling missing values in time series forward using a formula and a few other links on Stack Overflow. This is somewhat similar, but in my case the number of NaNs in a given column is variable in nature and spans multiple rows. Compare column (A,a) with column (A,d) or column (B,d). Given this, I am finding it difficult to adapt the solution to my problem.
Appreciate any inputs.
Answer
Here is a completely vectorized way to do this. It is very efficient and fast: about 130 ms on a 1000 x 1000 matrix. It is also a good opportunity to showcase some interesting numpy techniques.
First, let’s dig a bit into the requirements, specifically what exactly the value for each cell needs to be.
The example given is [nan, nan, nan, nan, 4.0] –> [.66, .72, .79, .87, .96], which is explained to be a “progressively increasing value of 10%” (in such a way that the total equals the “value to spread”: 4.0).
This is a geometric series with rate r = 1 + 0.1: [r^1, r^2, r^3, ...], then normalized to sum to 1. For example:
```python
r = 1.1
a = 4.0
n = 5
q = np.cumprod(np.repeat(r, n))
a * q / q.sum()
# array([0.65518992, 0.72070892, 0.79277981, 0.87205779, 0.95926357])
```
We’d like to do a direct calculation (to avoid calling Python functions and explicit loops, which would be much slower), so we need to express that normalizing factor q.sum() in closed form. It is the well-known geometric series sum:

```
q.sum() == r * (r**n - 1) / (r - 1)
```
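As a quick numerical sanity check (a small sketch, not part of the solution itself), we can confirm that the cumulative-product sum used above matches that closed form:

```python
import numpy as np

# Sum of the geometric series [r^1, r^2, ..., r^n] versus its closed form
r, n = 1.1, 5
q = np.cumprod(np.repeat(r, n))        # [r, r^2, ..., r^n]
closed = r * (r**n - 1) / (r - 1)
print(np.isclose(q.sum(), closed))     # True
```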
To generalize, we need 3 quantities to calculate the value of each cell:

- a: value to distribute
- i: index within the run (0 .. n-1)
- n: run length

Then the value is v = a * r**i * (r - 1) / (r**n - 1).
To illustrate with the first column in the OP’s example, where the input is [1, nan, nan, nan, nan, 4], we would like:

- a = [1, 4, 4, 4, 4, 4]
- i = [0, 0, 1, 2, 3, 4]
- n = [1, 5, 5, 5, 5, 5]

Then the value v would be (rounded to 2 decimals): [1., 0.66, 0.72, 0.79, 0.87, 0.96].
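As a quick check (a sketch using the illustrative arrays above, not part of the final solution), plugging these three arrays straight into the formula reproduces the target column:

```python
import numpy as np

r = 1.1
a = np.array([1, 4, 4, 4, 4, 4], dtype=float)  # value to distribute
i = np.array([0, 0, 1, 2, 3, 4])               # index within each run
n = np.array([1, 5, 5, 5, 5, 5])               # run length

v = a * r**i * (r - 1) / (r**n - 1)
print(v.round(2))  # [1.   0.66 0.72 0.79 0.87 0.96]
```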
Now comes the part where we go about getting these three quantities as numpy arrays. a is the easiest: it is simply df.bfill().values. But for i and n, we do have to do a little bit of work, starting by assigning the values to a numpy array:
```python
z = df.values
nrows, ncols = z.shape
```
For i, we start with the cumulative count of NaNs, with a reset wherever values are not NaN. This is strongly inspired by this SO answer for “Cumulative counts in NumPy without iteration”, but we do it for a 2D array, and we also add a first row of 0 and discard the last row to satisfy exactly our needs:
```python
def rcount(z):
    na = np.isnan(z)
    without_reset = na.cumsum(axis=0)
    reset_at = ~na
    overcount = np.maximum.accumulate(without_reset * reset_at)
    result = without_reset - overcount
    return result

i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
```
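To see what rcount produces on its own, here it is applied to a tiny two-column example (the helper is repeated so the snippet is self-contained; the input values are arbitrary):

```python
import numpy as np

def rcount(z):
    # cumulative count of NaNs per column, reset at each non-NaN value
    na = np.isnan(z)
    without_reset = na.cumsum(axis=0)
    reset_at = ~na
    overcount = np.maximum.accumulate(without_reset * reset_at)
    return without_reset - overcount

z = np.array([[np.nan, 1.0],
              [np.nan, np.nan],
              [3.0,    np.nan],
              [np.nan, 4.0]])
print(rcount(z))
# column 0: runs of NaNs count up (1, 2), reset to 0 at 3.0, then 1 again
# column 1: 0 at 1.0, then 1, 2, and reset to 0 at 4.0
```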
For n, we need to do some dancing of our own, using first principles of numpy:
```python
runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
```
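A step-by-step breakdown of that one-liner, on a single small column (a sketch using the OP’s column (A,a) as input):

```python
import numpy as np

z = np.array([[1.0], [np.nan], [np.nan], [np.nan], [np.nan], [4.0]])
nrows, ncols = z.shape

# 1. Append a sentinel row of ones so every run ends at a non-NaN value.
padded = np.vstack((z, np.ones(ncols)))
# 2. Transpose so each column's values are contiguous when flattened,
#    then find the flat indices of the non-NaN entries.
idx = np.flatnonzero(~np.isnan(padded.T))   # [0, 5, 6]
# 3. Gaps between consecutive non-NaN entries (with -1 prepended)
#    are exactly the run lengths.
runlen = np.diff(np.hstack((-1, idx)))      # [1, 5, 1]
# 4. Repeat each run length by itself, reshape column-major,
#    and drop the sentinel row.
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
print(n.ravel())                            # [1 5 5 5 5 5]
```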
So, putting it all together:
```python
def spread_bfill(df, r=1.1):
    z = df.values
    nrows, ncols = z.shape
    a = df.bfill().values
    i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
    runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
    n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
    v = a * r**i * (r - 1) / (r**n - 1)
    return pd.DataFrame(v, columns=df.columns, index=df.index)
```
On your example data, we get:
```
>>> spread_bfill(df).round(2)  # round(2) for printing purposes
               A                             B
               a     b     c     d     e     a     b     c     d     e
S
2020-10-15  1.00  2.00  0.52  1.21  1.17 10.00 11.00  1.68  3.93  1.68
2020-10-16  0.66  0.98  0.57  1.33  1.28  1.64  0.33  1.85  4.32  1.85
2020-10-17  0.72  1.08  0.63  1.46  1.41  1.80  0.36  2.04  4.75  2.04
2020-10-18  0.79  1.19  0.69  0.30  1.55  1.98  0.40  2.24  1.21  2.24
2020-10-19  0.87  1.31  0.76  0.33  1.71  2.18  0.44  2.47  1.33  2.47
2020-10-20  0.96  1.44  0.83  0.37  1.88  2.40  0.48  2.71  1.46  2.71
```
For inspection, let’s look at each of the 3 quantities in that example:
```
>>> a
[[ 1  2  4  4  9 10 11 13 13 13]
 [ 4  6  4  4  9 10  2 13 13 13]
 [ 4  6  4  4  9 10  2 13 13 13]
 [ 4  6  4  1  9 10  2 13  4 13]
 [ 4  6  4  1  9 10  2 13  4 13]
 [ 4  6  4  1  9 10  2 13  4 13]]

>>> i
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 1 0 0 1 1 1]
 [1 1 2 2 2 1 1 2 2 2]
 [2 2 3 0 3 2 2 3 0 3]
 [3 3 4 1 4 3 3 4 1 4]
 [4 4 5 2 5 4 4 5 2 5]]

>>> n
[[1 1 6 3 6 1 1 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]]
```
And here is a final example, to illustrate what happens if a column ends with one or several NaNs (they remain NaN):
```python
np.random.seed(10)
a = np.random.randint(0, 10, (6, 6)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
```

```
>>> df
     0    1    2    3    4    5
0  NaN  NaN  NaN  NaN  NaN  0.0
1  NaN  NaN  9.0  NaN  8.0  NaN
2  NaN  NaN  NaN  NaN  NaN  NaN
3  NaN  8.0  4.0  NaN  NaN  NaN
4  NaN  NaN  NaN  6.0  9.0  NaN
5  NaN  NaN  2.0  NaN  7.0  8.0
```
Then:
```
>>> spread_bfill(df).round(2)  # round(2) for printing
     0     1     2     3     4     5
0  NaN  1.72  4.29  0.98  3.81  0.00
1  NaN  1.90  4.71  1.08  4.19  1.31
2  NaN  2.09  1.90  1.19  2.72  1.44
3  NaN  2.29  2.10  1.31  2.99  1.59
4  NaN   NaN  0.95  1.44  3.29  1.74
5  NaN   NaN  1.05   NaN  7.00  1.92
```
Speed
```python
a = np.random.randint(0, 10, (1000, 1000)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)

%timeit spread_bfill(df)
# 130 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```