I have a DataFrame similar to the following where I have a column with a non-unique value (in this case address) as well as some other columns containing information about it.
df = pd.DataFrame({'address': {0:'11 Star Street', 1:'22 Milky Way', 2:'88 Dark Drive', 3:'33 Planet Place', 4:'22 Milky Way', 5:'22 Milky Way'}, 'val': {0:10, 1:'', 2:'', 3:20, 4: 20, 5:''}, 'val2': {0:20, 1:'', 2:'', 3:40, 4:10, 5:''}})
address val val2
0 11 Star Street 10 20
1 22 Milky Way
2 88 Dark Drive
3 33 Planet Place 20 40
4 22 Milky Way 20 10
5 22 Milky Way
Some of the addresses appear more than once in the DataFrame and some of those repeated ones are missing information. If a certain row is missing the values, but that address appears in another row in the DataFrame, I’d like to replace the NaN values with those from the same address to get something like this:
address val val2
0 11 Star Street 10 20
1 22 Milky Way 20 10
2 88 Dark Drive
3 33 Planet Place 20 40
4 22 Milky Way 20 10
5 22 Milky Way 20 10
Using something like a dictionary would be infeasible since the DataFrame contains thousands of different addresses.
EDIT: It’s safe to assume that either both values are missing or both are present. In other words, there will never be a row with only val and not val2 or vice-versa. However, an answer that could take that possible circumstance into account would be even better!
Advertisement
Answer
number of ways you can do this, the most easiest is groupby and ffill / bfill the groups.
import numpy as np
import pandas as pd
df = df.replace('',np.nan,regex=True).groupby('address').apply(lambda x : x.ffill().bfill())
print(df)
address val val2
0 11 Star Street 10.0 20.0
1 22 Milky Way 20.0 10.0
2 88 Dark Drive NaN NaN
3 33 Planet Place 20.0 40.0
4 22 Milky Way 20.0 10.0
5 22 Milky Way 20.0 10.0
Another, and more performant method would be using update
along your axis.
vals = df.replace('',np.nan,regex=True).groupby('address').first()
print(vals)
val val2
address
11 Star Street 10.0 20.0
22 Milky Way 20.0 10.0
33 Planet Place 20.0 40.0
88 Dark Drive NaN NaN
df = df.set_index('address')
df.update(vals)
val val2
address
11 Star Street 10 20
22 Milky Way 20 10
88 Dark Drive
33 Planet Place 20 40
22 Milky Way 20 10
22 Milky Way 20 10