Filling missing data using a custom condition in a Pandas time series dataframe

Below is a portion of my dataframe, which has many missing values.


I would like to replace the NaNs in each column using a specific backward-fill condition.

For example, in column (A,a), values are missing for the 16th, 17th, 18th and 19th. The next value is ‘4’, on the 20th. I want this value (the next non-missing value in the column) to be distributed over all of these dates, including the 20th, with each value 10% larger than the previous one. That is, column (A,a) gets approximately .655, .720, .793, .872 and .96 for the 16th, 17th, 18th, 19th and 20th. The same approach should apply to every run of missing values in every column.

I tried using the bfill() function, but I cannot work out how to incorporate the required formula.

I have checked “Pandas: filling missing values in time series forward using a formula” and a few other questions on Stack Overflow. They are somewhat similar, but in my case the number of NaNs in a given column varies and the gaps span multiple rows. Compare column (A,a) with column (A,d) or column (B,d). Given this, I am finding it difficult to adapt that solution to my problem.

Appreciate any inputs.

Answer

Here is a completely vectorized way to do this. It is very efficient and fast: 130 ms on a 1000 x 1000 matrix. This is a good opportunity to expose some interesting techniques using numpy.

First, let’s dig a bit into the requirements, specifically what exactly the value for each cell needs to be.

The example given is [nan, nan, nan, nan, 4.0] –> [.66, .72, .79, .87, .96], which is explained to be a “progressively increasing value of 10%” (in such a way that the total is the “value to spread”: 4.0).

This is a geometric series with rate r = 1 + 0.1: [r^1, r^2, r^3, ...] and then normalized to sum to 1. For example:
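
As a minimal sketch of that normalization (the names q and v and the use of np.arange are my own choices here, not necessarily those of the original snippet):

```python
import numpy as np

r = 1.1   # each weight is 10% larger than the previous one
n = 5     # four missing cells plus the value that follows them
a = 4.0   # value to spread over the run

q = r ** np.arange(1, n + 1)   # [r^1, r^2, ..., r^n]
v = a * q / q.sum()            # normalize the weights so the values sum to a
print(v.round(2))
```

The values sum back to a, so nothing is gained or lost by spreading.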


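We’d like to do a direct calculation (to avoid calling Python functions and explicit loops, which would be much slower), so we need to express that normalizing factor q.sum() in closed form. It is the well-known sum of a geometric series: for a run of length n, 1 + r + ... + r**(n-1) = (r**n - 1) / (r - 1). (Starting the powers at r^1 instead of r^0 multiplies every weight and the sum by the same factor r, which cancels in the normalization.)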
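
A quick numerical check of the closed form (r**n - 1) / (r - 1), which is the normalizing factor that the cell formula further down relies on. This is only a sanity-check sketch; the variable names are illustrative:

```python
import numpy as np

r, n = 1.1, 5

explicit = (r ** np.arange(n)).sum()   # 1 + r + ... + r^(n-1), summed term by term
closed = (r**n - 1) / (r - 1)          # closed form of the same geometric sum

print(explicit, closed)
```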

To generalize, we need 3 quantities to calculate the value of each cell:

  • a: value to distribute
  • i: index of run (0 .. n-1)
  • n: run length
  • then, the value is v = a * r**i * (r - 1) / (r**n - 1).

To illustrate with the first column in the OP’s example, where the input is: [1, nan, nan, nan, nan, 4], we would like:

  • a = [1, 4, 4, 4, 4, 4]
  • i = [0, 0, 1, 2, 3, 4]
  • n = [1, 5, 5, 5, 5, 5]
  • then, the value v would be (rounded at 2 decimals): [1. , 0.66, 0.72, 0.79, 0.87, 0.96].
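
Plugging the three arrays listed above into the formula reproduces those values. A small sketch, with the arrays typed in by hand rather than derived from a dataframe:

```python
import numpy as np

r = 1.1
a = np.array([1, 4, 4, 4, 4, 4], dtype=float)   # value to distribute
i = np.array([0, 0, 1, 2, 3, 4])                # index within the run
n = np.array([1, 5, 5, 5, 5, 5])                # run length

v = a * r**i * (r - 1) / (r**n - 1)
print(v.round(2))
```

Note that a run of length 1 (a value not preceded by NaNs) is left unchanged, since the factor (r - 1) / (r**1 - 1) equals 1.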

Now comes the part where we go about getting these three quantities as numpy arrays.

a is the easiest and is simply df.bfill().values. But for i and n, we do have to do a little bit of work, starting by assigning the values to a numpy array:
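
For instance, on a one-column frame made up for illustration (the column name "x" and the data are assumptions, not the question's actual dataframe):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: one column with a run of NaNs.
df = pd.DataFrame({"x": [1, np.nan, np.nan, np.nan, np.nan, 4]})

a = df.bfill().values   # each NaN takes the next non-missing value below it
z = df.values           # raw values, NaNs included, used to derive i and n

print(a.ravel())
```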


For i, we start with the cumulative count of NaNs, with reset when values are not NaN. This is strongly inspired by this SO answer for “Cumulative counts in NumPy without iteration”. But we do it for a 2D array, and we also want to add a first row of 0, and discard the last row to satisfy exactly our needs:
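
Here is one way this step could look. It is a sketch of the idea described above rather than the exact original code; the helper name nan_run_count and the sample array z are made up for illustration:

```python
import numpy as np

def nan_run_count(isnan):
    """Cumulative count of NaNs down each column, reset to 0 at every non-NaN."""
    c = np.cumsum(isnan, axis=0)
    # c never decreases, so a running maximum over its values at non-NaN rows
    # gives the count reached at the most recent non-NaN row.
    offset = np.maximum.accumulate(np.where(isnan, 0, c), axis=0)
    return c - offset

z = np.array([[1.0,    np.nan],
              [np.nan, 2.0],
              [np.nan, np.nan],
              [np.nan, np.nan],
              [np.nan, 3.0],
              [4.0,    5.0]])

cnt = nan_run_count(np.isnan(z))
# Shift down by one row: prepend a row of zeros and drop the last row.
i = np.vstack([np.zeros((1, z.shape[1]), dtype=int), cnt[:-1]])
print(i.T)
```

After the shift, each cell holds the number of consecutive NaNs directly above it, which is its position within its run.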


For n, we need to do some dancing on our own, using first principles of numpy (I’ll break down the steps if I have time):
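
One possible way to get n with plain numpy, which may well differ from the original steps: find, for each cell, the row of the next non-NaN at or below it and the row of the last non-NaN strictly above it; the run length is the distance between the two. The names nxt, last and prev are illustrative:

```python
import numpy as np

z = np.array([[1.0,    np.nan],
              [np.nan, 2.0],
              [np.nan, np.nan],
              [np.nan, np.nan],
              [np.nan, 3.0],
              [4.0,    5.0]])

isnan = np.isnan(z)
nrows, ncols = z.shape
rows = np.arange(nrows)[:, None]   # column vector of row indices

# Row of the next non-NaN at or below each cell (nrows when there is none).
nxt = np.minimum.accumulate(np.where(isnan, nrows, rows)[::-1], axis=0)[::-1]

# Row of the last non-NaN strictly above each cell (-1 when there is none).
last = np.maximum.accumulate(np.where(isnan, -1, rows), axis=0)
prev = np.vstack([np.full((1, ncols), -1), last[:-1]])

n = nxt - prev   # each run covers rows prev + 1 .. nxt inclusive
print(n.T)
```

For trailing NaNs there is no next value, so n is not meaningful there; that does not matter because a is NaN in those cells anyway.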


So, putting it all together:
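
A self-contained sketch of the whole thing, under the same assumptions as the description above. The function name spread_bfill, the sample frame and its dates are hypothetical, and the way n is derived here (from the row of the next non-NaN) is my own choice:

```python
import numpy as np
import pandas as pd

def spread_bfill(df, r=1.1):
    """Spread each value geometrically over the run of NaNs that precedes it."""
    z = df.values.astype(float)
    isnan = np.isnan(z)
    nrows, ncols = z.shape
    rows = np.arange(nrows)[:, None]

    # a: value to distribute (trailing NaNs stay NaN)
    a = df.bfill().values.astype(float)

    # i: NaN count with reset at non-NaN cells, shifted down by one row
    c = np.cumsum(isnan, axis=0)
    cnt = c - np.maximum.accumulate(np.where(isnan, 0, c), axis=0)
    i = np.vstack([np.zeros((1, ncols), dtype=int), cnt[:-1]])

    # n: run length, using the row of the next non-NaN at or below each cell
    nxt = np.minimum.accumulate(np.where(isnan, nrows, rows)[::-1], axis=0)[::-1]
    n = i + nxt - rows + 1

    v = a * r**i * (r - 1) / (r**n - 1)
    return pd.DataFrame(v, index=df.index, columns=df.columns)

# Hypothetical data, not the question's dataframe.
idx = pd.date_range("2020-01-15", periods=6)
df = pd.DataFrame({"x": [1, np.nan, np.nan, np.nan, np.nan, 4],
                   "y": [np.nan, 2, np.nan, np.nan, 3, 5]}, index=idx)
out = spread_bfill(df)
print(out.round(2))
```

Since every run is normalized to sum to the value it spreads, each column total of the result equals the total of the non-missing values in the input.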


On your example data, we get:


For inspection, let’s look at each of the 3 quantities in that example:


And here is a final example, to illustrate what happens if a column ends with 1 or several NaNs (they remain NaN):
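
The reason trailing NaNs survive is that bfill() has no later value to pull from, so a is NaN there and the product stays NaN whatever i and n are. A small sketch on made-up data (the frame s and the multiplier are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical column that ends with two NaNs.
s = pd.DataFrame({"x": [1, np.nan, 4, np.nan, np.nan]})

a = s.bfill().values[:, 0]   # the trailing NaNs have nothing to be filled from
v = a * 0.5                  # any finite weight times NaN is still NaN

print(a)
```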


Then:


Speed

User contributions licensed under: CC BY-SA