I have a DataFrame similar to the following where I have a column with a non-unique value (in this case address) as well as some other columns containing information about it.
df = pd.DataFrame({'address': {0:'11 Star Street', 1:'22 Milky Way', 2:'88 Dark Drive', 3:'33 Planet Place', 4:'22 Milky Way', 5:'22 Milky Way'}, 'val': {0:10, 1:'', 2:'', 3:20, 4: 20, 5:''}, 'val2': {0:20, 1:'', 2:'', 3:40, 4:10, 5:''}}) address val val2 0 11 Star Street 10 20 1 22 Milky Way 2 88 Dark Drive 3 33 Planet Place 20 40 4 22 Milky Way 20 10 5 22 Milky Way
Some of the addresses appear more than once in the DataFrame and some of those repeated ones are missing information. If a certain row is missing the values, but that address appears in another row in the DataFrame, I’d like to replace the NaN values with those from the same address to get something like this:
address val val2 0 11 Star Street 10 20 1 22 Milky Way 20 10 2 88 Dark Drive 3 33 Planet Place 20 40 4 22 Milky Way 20 10 5 22 Milky Way 20 10
Using something like a dictionary would be infeasible since the DataFrame contains thousands of different addresses.
EDIT: It’s safe to assume that either both values are missing or both are present. In other words, there will never be a row with only val and not val2 or vice-versa. However, an answer that could take that possible circumstance into account would be even better!
Advertisement
Answer
number of ways you can do this, the most easiest is groupby and ffill / bfill the groups.
import numpy as np import pandas as pd df = df.replace('',np.nan,regex=True).groupby('address').apply(lambda x : x.ffill().bfill()) print(df) address val val2 0 11 Star Street 10.0 20.0 1 22 Milky Way 20.0 10.0 2 88 Dark Drive NaN NaN 3 33 Planet Place 20.0 40.0 4 22 Milky Way 20.0 10.0 5 22 Milky Way 20.0 10.0
Another, and more performant method would be using update
along your axis.
vals = df.replace('',np.nan,regex=True).groupby('address').first() print(vals) val val2 address 11 Star Street 10.0 20.0 22 Milky Way 20.0 10.0 33 Planet Place 20.0 40.0 88 Dark Drive NaN NaN df = df.set_index('address') df.update(vals) val val2 address 11 Star Street 10 20 22 Milky Way 20 10 88 Dark Drive 33 Planet Place 20 40 22 Milky Way 20 10 22 Milky Way 20 10