Sample dataframe:
JavaScript
x
23
23
1
In [1898]: df = pd.DataFrame({'index':[0,0,5,5,6,6,8,8], 'table_name':['f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person'], 'column_name':['active', 'actv', 'ssn', 'ssn', 'pl', '
2
pl', 'prefix', 'prefix'], 'data_type':['integer', 'integer', 'varchar', 'varchar', 'varchar', 'varchar', 'varchar', 'integer'], 'default':[np.nan, np.nan, np.nan, np.nan, 10, np.nan, np.nan, np.nan], 'max_length :
3
':[np.nan, np.nan, 256, 99, 256, 256, 256, 256]}) :
4
5
In [1899]: df = pd.DataFrame({'index':[0,0,5,5,6,6,8,8], 'table_name':['f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person', 'f_person'], 'column_name':['active', 'actv', 'ssn', 'ssn', 'pl', '
6
pl', 'prefix', 'prefix'], 'data_type':['integer', 'integer', 'varchar', 'varchar', 'varchar', 'bigint', 'varchar', 'integer'], 'default':[np.nan, np.nan, np.nan, np.nan, 10, np.nan, np.nan, np.nan], 'max_length' :
7
np.nan, np.nan, 256, 99, 256, 256, 256, 256]}) : :[
8
9
In [1900]: df = df.set_index('index')
10
11
In [1901]: df
12
Out[1901]:
13
table_name column_name data_type default max_length
14
index
15
0 f_person active integer NaN NaN
16
0 f_person actv integer NaN NaN
17
5 f_person ssn varchar NaN 256.0
18
5 f_person ssn varchar NaN 99.0
19
6 f_person pl varchar 10.0 256.0
20
6 f_person pl bigint NaN 256.0
21
8 f_person prefix varchar NaN 256.0
22
8 f_person prefix integer NaN 256.0
23
If you see here, the rows with common index have atleast one difference amongst them.
For ex:
Rows with index 0
, have difference in column_name
.
Rows with index 5
, have difference in max_length
.
Rows with index 6
, have differences in both data_type
and default
.
Rows with index 8
, have difference in data_type
.
Expected Output:
JavaScript
1
9
1
[
2
{
3
0: {'column_name': ['active', 'actv']},
4
5: {'max_length': [256, 99]},
5
6: {'data_type': ['varchar', 'bigint'], 'default': [10, np.nan]},
6
8: {'data_type': ['varchar', 'integer']}
7
}
8
]
9
This is part of a bigger problem. I’ve kind of solved it till here. Not sure how to proceed further. Any ideas?
Advertisement
Answer
Here is solution similar like @Riccardo Bucco solution with Series.nunique
and because always 2 values per groups output is converting to list only:
JavaScript
1
9
1
f = lambda x: {c: x[c].tolist() for c in x.columns if x[c].nunique(dropna=False) != 1}
2
d = df.groupby('index').apply(f).to_dict()
3
print (d)
4
5
# {0: {'column_name': ['active', 'actv']},
6
# 5: {'max_length': [256.0, 99.0]},
7
# 6: {'data_type': ['varchar', 'bigint'], 'default': [10.0, nan]},
8
# 8: {'data_type': ['varchar', 'integer']}}
9
Solution with always 2 rows per groups index values:
JavaScript
1
24
24
1
m = df.index.duplicated()
2
m1 = df.fillna('miss')[m].ne(df.fillna('miss')[~m])
3
4
s = (df.where(m1)
5
.stack()
6
.groupby(level=[0,1])
7
.agg(lambda x: list(x) if len(x) == 2 else [*x, np.nan]))
8
print (s)
9
index
10
0 column_name [active, actv]
11
5 max_length [256.0, 99.0]
12
6 data_type [varchar, bigint]
13
default [10.0, nan]
14
8 data_type [varchar, integer]
15
dtype: object
16
17
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
18
print (d)
19
{0: {'column_name': ['active', 'actv']},
20
5: {'max_length': [256.0, 99.0]},
21
6: {'data_type': ['varchar', 'bigint'],
22
'default': [10.0, nan]},
23
8: {'data_type': ['varchar', 'integer']}}
24