Below is a portion of my dataframe, which has many missing values.
```
               A                          B
S              a    b    c    d    e      a     b     c     d     e
date
2020-10-15   1.0  2.0  NaN  NaN  NaN   10.0  11.0   NaN   NaN   NaN
2020-10-16   NaN  NaN  NaN  NaN  NaN    NaN   NaN   NaN   NaN   NaN
2020-10-17   NaN  NaN  NaN  4.0  NaN    NaN   NaN   NaN  13.0   NaN
2020-10-18   NaN  NaN  NaN  NaN  NaN    NaN   NaN   NaN   NaN   NaN
2020-10-19   NaN  NaN  NaN  NaN  NaN    NaN   NaN   NaN   NaN   NaN
2020-10-20   4.0  6.0  4.0  1.0  9.0   10.0   2.0  13.0   4.0  13.0
```
I would like to replace the NaNs in each column using a specific backward-fill condition.
For example, in column (A,a) missing values appear for the dates 16th, 17th, 18th and 19th. The next value is 4, against the 20th. I want this value (the next non-missing value in the column) to be distributed among all these dates, including the 20th, at a progressively increasing rate of 10%. That is, column (A,a) gets approximately the values .655, .720, .793, .872 and .960 for the 16th, 17th, 18th, 19th and 20th. This shall be the approach for all columns, for all missing values across rows.
I tried using the bfill() function, but I am unable to work out how to incorporate the required formula as an option.
I have checked the link Pandas: filling missing values in time series forward using a formula and a few other links on Stack Overflow. This is somewhat similar, but in my case the number of NaNs in a given column is variable in nature and spans multiple rows. Compare column (A,a) with column (A,d) or column (B,d). Given this, I am finding it difficult to adapt the solution to my problem.
Appreciate any inputs.
Answer
Here is a completely vectorized way to do this. It is very efficient and fast: about 130 ms on a 1000 x 1000 matrix. It is also a good opportunity to showcase some interesting numpy techniques.
First, let’s dig a bit into the requirements, specifically what exactly the value for each cell needs to be.
The example given is [nan, nan, nan, nan, 4.0] –> [.66, .72, .79, .87, .96], which is explained to be a “progressively increasing value of 10%” (in such a way that the total equals the “value to spread”: 4.0).
This is a geometric series with rate r = 1 + 0.1: [r^1, r^2, r^3, ...], then normalized to sum to 1. For example:
```python
r = 1.1
a = 4.0
n = 5
q = np.cumprod(np.repeat(r, n))
a * q / q.sum()
# array([0.65518992, 0.72070892, 0.79277981, 0.87205779, 0.95926357])
```
We’d like to do a direct calculation (to avoid calling Python functions and explicit loops, which would be much slower), so we need to express that normalizing factor q.sum() in closed form. It is the well-known geometric series sum:

```
q.sum() == r * (r**n - 1) / (r - 1)
```
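As a quick numerical sanity check (a small sketch, not part of the solution itself), we can confirm that the cumulative-product sum used above matches that closed form:

```python
import numpy as np

# Sum of the geometric series [r^1, r^2, ..., r^n] versus its closed form
r, n = 1.1, 5
q = np.cumprod(np.repeat(r, n))        # [r, r^2, ..., r^n]
closed = r * (r**n - 1) / (r - 1)
print(np.isclose(q.sum(), closed))     # True
```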
To generalize, we need 3 quantities to calculate the value of each cell:

- a: value to distribute
- i: index within the run (0 .. n-1)
- n: run length

Then the value is v = a * r**i * (r - 1) / (r**n - 1).
To illustrate with the first column in the OP’s example, where the input is [1, nan, nan, nan, nan, 4], we would like:

- a = [1, 4, 4, 4, 4, 4]
- i = [0, 0, 1, 2, 3, 4]
- n = [1, 5, 5, 5, 5, 5]

Then the value v would be (rounded to 2 decimals): [1., 0.66, 0.72, 0.79, 0.87, 0.96].
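As a quick check (a sketch using the illustrative arrays above, not part of the final solution), plugging these three arrays straight into the formula reproduces the target column:

```python
import numpy as np

r = 1.1
a = np.array([1, 4, 4, 4, 4, 4], dtype=float)  # value to distribute
i = np.array([0, 0, 1, 2, 3, 4])               # index within each run
n = np.array([1, 5, 5, 5, 5, 5])               # run length

v = a * r**i * (r - 1) / (r**n - 1)
print(v.round(2))  # [1.   0.66 0.72 0.79 0.87 0.96]
```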
Now comes the part where we go about getting these three quantities as numpy arrays. a is the easiest: it is simply df.bfill().values. But for i and n, we do have to do a little bit of work, starting by assigning the values to a numpy array:
```python
z = df.values
nrows, ncols = z.shape
```
For i, we start with the cumulative count of NaNs, with a reset wherever values are not NaN. This is strongly inspired by this SO answer for “Cumulative counts in NumPy without iteration”, but we do it for a 2D array, and we also add a first row of 0 and discard the last row to satisfy exactly our needs:
```python
def rcount(z):
    na = np.isnan(z)
    without_reset = na.cumsum(axis=0)
    reset_at = ~na
    overcount = np.maximum.accumulate(without_reset * reset_at)
    result = without_reset - overcount
    return result

i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
```
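To see what rcount produces on its own, here it is applied to a tiny two-column example (the helper is repeated so the snippet is self-contained; the input values are arbitrary):

```python
import numpy as np

def rcount(z):
    # cumulative count of NaNs per column, reset at each non-NaN value
    na = np.isnan(z)
    without_reset = na.cumsum(axis=0)
    reset_at = ~na
    overcount = np.maximum.accumulate(without_reset * reset_at)
    return without_reset - overcount

z = np.array([[np.nan, 1.0],
              [np.nan, np.nan],
              [3.0,    np.nan],
              [np.nan, 4.0]])
print(rcount(z))
# column 0: runs of NaNs count up (1, 2), reset to 0 at 3.0, then 1 again
# column 1: 0 at 1.0, then 1, 2, and reset to 0 at 4.0
```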
For n, we need to do some dancing of our own, using first principles of numpy:
```python
runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
```
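A step-by-step breakdown of that one-liner, on a single small column (a sketch using the OP’s column (A,a) as input):

```python
import numpy as np

z = np.array([[1.0], [np.nan], [np.nan], [np.nan], [np.nan], [4.0]])
nrows, ncols = z.shape

# 1. Append a sentinel row of ones so every run ends at a non-NaN value.
padded = np.vstack((z, np.ones(ncols)))
# 2. Transpose so each column's values are contiguous when flattened,
#    then find the flat indices of the non-NaN entries.
idx = np.flatnonzero(~np.isnan(padded.T))   # [0, 5, 6]
# 3. Gaps between consecutive non-NaN entries (with -1 prepended)
#    are exactly the run lengths.
runlen = np.diff(np.hstack((-1, idx)))      # [1, 5, 1]
# 4. Repeat each run length by itself, reshape column-major,
#    and drop the sentinel row.
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
print(n.ravel())                            # [1 5 5 5 5 5]
```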
So, putting it all together:
```python
def spread_bfill(df, r=1.1):
    z = df.values
    nrows, ncols = z.shape
    a = df.bfill().values
    i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
    runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
    n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
    v = a * r**i * (r - 1) / (r**n - 1)
    return pd.DataFrame(v, columns=df.columns, index=df.index)
```
On your example data, we get:
```
>>> spread_bfill(df).round(2)  # round(2) for printing purposes
               A                             B
               a     b     c     d     e     a     b     c     d     e
S
2020-10-15  1.00  2.00  0.52  1.21  1.17 10.00 11.00  1.68  3.93  1.68
2020-10-16  0.66  0.98  0.57  1.33  1.28  1.64  0.33  1.85  4.32  1.85
2020-10-17  0.72  1.08  0.63  1.46  1.41  1.80  0.36  2.04  4.75  2.04
2020-10-18  0.79  1.19  0.69  0.30  1.55  1.98  0.40  2.24  1.21  2.24
2020-10-19  0.87  1.31  0.76  0.33  1.71  2.18  0.44  2.47  1.33  2.47
2020-10-20  0.96  1.44  0.83  0.37  1.88  2.40  0.48  2.71  1.46  2.71
```
For inspection, let’s look at each of the 3 quantities in that example:
```
>>> a
[[ 1  2  4  4  9 10 11 13 13 13]
 [ 4  6  4  4  9 10  2 13 13 13]
 [ 4  6  4  4  9 10  2 13 13 13]
 [ 4  6  4  1  9 10  2 13  4 13]
 [ 4  6  4  1  9 10  2 13  4 13]
 [ 4  6  4  1  9 10  2 13  4 13]]

>>> i
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 1 0 0 1 1 1]
 [1 1 2 2 2 1 1 2 2 2]
 [2 2 3 0 3 2 2 3 0 3]
 [3 3 4 1 4 3 3 4 1 4]
 [4 4 5 2 5 4 4 5 2 5]]

>>> n
[[1 1 6 3 6 1 1 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]
 [5 5 6 3 6 5 5 6 3 6]]
```
And here is a final example, to illustrate what happens if a column ends with one or several NaNs (they remain NaN):
```python
np.random.seed(10)
a = np.random.randint(0, 10, (6, 6)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
```

```
>>> df
     0    1    2    3    4    5
0  NaN  NaN  NaN  NaN  NaN  0.0
1  NaN  NaN  9.0  NaN  8.0  NaN
2  NaN  NaN  NaN  NaN  NaN  NaN
3  NaN  8.0  4.0  NaN  NaN  NaN
4  NaN  NaN  NaN  6.0  9.0  NaN
5  NaN  NaN  2.0  NaN  7.0  8.0
```
Then:
```
>>> spread_bfill(df).round(2)  # round(2) for printing
     0     1     2     3     4     5
0  NaN  1.72  4.29  0.98  3.81  0.00
1  NaN  1.90  4.71  1.08  4.19  1.31
2  NaN  2.09  1.90  1.19  2.72  1.44
3  NaN  2.29  2.10  1.31  2.99  1.59
4  NaN   NaN  0.95  1.44  3.29  1.74
5  NaN   NaN  1.05   NaN  7.00  1.92
```
Speed
```python
a = np.random.randint(0, 10, (1000, 1000)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)

%timeit spread_bfill(df)
# 130 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```