Skip to content
Advertisement

How to fillna in pandas dataframe based on pattern like in excel dragging?

I have dataframe which should be filled by understanding rows understanding like we do in excel. If its continious integer it fill by next number itself.

Is there any function in python like this?

import pandas as pd
d = { 'year': [2019,2020,2019,2020,np.nan,np.nan], 'cat1': [1,2,3,4,np.nan,np.nan], 'cat2': ['c1','c1','c1','c2',np.nan,np.nan]}
df = pd.DataFrame(data=d)
df
    year    cat1    cat2
0   2019.0  1.0     c1
1   2020.0  2.0     c1
2   2019.0  3.0     c1
3   2020.0  4.0     c2
4   NaN     NaN     NaN
5   NaN     NaN     NaN

output required:

    year    cat1    cat2
0   2019.0  1.0     c1
1   2020.0  2.0     c1
2   2019.0  3.0     c1
3   2020.0  4.0     c2
4   2019.0  5.0     c2 #here can be ignored if it can't understand the earlier pattern
5   2020.0  6.0     c2 #here can be ignored if it can't understand the earlier pattern

I tried df.interpolate(method='krogh') #it fill 1,2,3,4,5,6 but incorrect others.

Advertisement

Answer

Here is my solution for the specific use case you mention –

The code for these helper functions for categorical_repeat, continous_interpolate and other is provided below in EXPLANATION > Approach section.

config = {'year':categorical_repeat,    #shortest repeating sequence
          'cat1':continous_interpolate, #curve fitting (linear)
          'cat2':other}                 #forward fill

print(df.agg(config))
     year  cat1 cat2
0  2019.0     1   c1
1  2020.0     2   c1
2  2019.0     3   c1
3  2020.0     4   c2
4  2019.0     5   c2
5  2020.0     6   c2

EXPLANATION:

As I understand, there is no direct way of handling all types of patterns in pandas as excel does. Excel involves linear interpolation for continuous sequences, but it involves other methods for other column patterns.

  1. Continous integer array -> linear interpolation
  2. Repeated cycles -> Smallest repeating sequence
  3. Alphabet (and similar) -> Tiling fixed sequence until the length of df
  4. Unrecognizable pattern -> Forward fill

Here is the dummy dataset that I attempt my approach on –

data = {'A': [2019, 2020, 2019, 2020, 2019, 2020],
        'B': [1, 2, 3, 4, 5, 6],
        'C': [6, 5, 4, 3, 2, 1],
        'D': ['C', 'D', 'E', 'F', 'G', 'H'],
        'E': ['A', 'B', 'C', 'A', 'B', 'C'],
        'F': [1,2,3,3,4,2]
       }

df = pd.DataFrame(data)
empty = pd.DataFrame(columns=df.columns, index=df.index)[:4]
df_new = df.append(empty).reset_index(drop=True)
print(df_new)
      A    B    C    D    E    F
0  2019    1    6    C    A    1
1  2020    2    5    D    B    2
2  2019    3    4    E    C    3
3  2020    4    3    F    A    3
4  2019    5    2    G    B    4
5  2020    6    1    H    C    2
6   NaN  NaN  NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN  NaN  NaN

Approach:

Let’s start with some helper functions –

import numpy as np
import scipy as sp
import pandas as pd


#Curve fitting (linear)
def f(x, m, c):
    return m*x+c     #Modify to extrapolate for exponential sequences etc.

#Interpolate continous linear
def continous_interpolate(s):
    clean = s.dropna()
    popt, pcov = sp.optimize.curve_fit(f, clean.index, clean)
    output = [round(i) for i in f(s.index, *popt)]  #Remove the round() for float values
    return pd.Series(output)

#Smallest Repeating sub-sequence
def pattern(inputv):
    '''
    https://stackoverflow.com/questions/6021274/finding-shortest-repeating-cycle-in-word
    '''
    pattern_end =0
    for j in range(pattern_end+1,len(inputv)):

        pattern_dex = j%(pattern_end+1)
        if(inputv[pattern_dex] != inputv[j]):

            pattern_end = j;
            continue

        if(j == len(inputv)-1):
            return inputv[0:pattern_end+1];
    return inputv;

#Categorical repeat imputation
def categorical_repeat(s):
    clean = s.dropna()
    cycle = pattern(clean)
    
    repetitions = (len(s)//len(cycle))+1
    output = np.tile(cycle, repetitions)[:len(s)]
    return pd.Series(output)

#continous sequence of alphabets
def alphabet(s):
    alp = 'abcdefghijklmnopqrstuvwxyz'
    alp2 = alp*((len(s)//len(alp))+1)
    
    start = s[0]
    idx = alp2.find(start.lower())
    output = alp2[idx:idx+len(s)]

    if start.isupper():
        output = output.upper()
    
    return pd.Series(list(output))

#If no pattern then just ffill
def other(s):
    return s.ffill()

Next, lets create a configuration based on what we want to solve and apply the methods required –

config = {'A':categorical_repeat,
          'B':continous_interpolate, 
          'C':continous_interpolate, 
          'D':alphabet,
          'E':categorical_repeat, 
          'F':other}

output_df = df_new.agg(config)
print(output_df)
      A   B  C  D  E  F
0  2019   1  6  C  A  1
1  2020   2  5  D  B  2
2  2019   3  4  E  C  3
3  2020   4  3  F  A  3
4  2019   5  2  G  B  4
5  2020   6  1  H  C  2
6  2019   7  0  I  A  2
7  2020   8 -1  J  B  2
8  2019   9 -2  K  C  2
9  2020  10 -3  L  A  2
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement