Python Data Frame summary

I have dataframe (df table below): Every user can post in any category. I have to calculate HOW MANY DISTINCT USERS has a post in category A and at the same time has posts in categories, B, C and D. Table like:

User	Category
1	A
1	B
33	B
33	C
33	D
54	A
54	B
87	A
87	B
87	C
87	D

Total table is 950,000 rows It either gives Nan results in pivot (variant 1) or memory overflow (variant 2). Necessary result is

Category	A	B	C	D
A	3	3	1	1
B	3	4	2	1
C	1	2	2	2
D	1	1	2	2

Diagonal: A-B-C-D – total posted in each category, lower and upper triangles are the same and are intersections of how many users posted in, for instance, B and C.

I have tried

df.pivot_table(values=[‘user’,’category’],index=[‘category’], columns=[‘category’],aggfunc=np.count_nonzero)
a = df.values

d = {(i, j): np.mean(a[:, i] == a[:, j]) for i, j in combinations(range(a.shape[1]), 2)}

res, c, vals = np.zeros((a.shape[1], a.shape[1])),
list(map(list, zip(*d.keys()))), list(d.values())

res[c[0], c[1]] = vals

res_df = pd.DataFrame(res, columns=df.columns, index=df.columns)

Would yo please help with ideas. Thank you.

Answer

One idea is use merge with crosstab like:

df = df.merge(df, on='User')
df = pd.crosstab(df['Category_x'], df['Category_y'])
print (df)
Category_y  A  B  C  D
Category_x            
A           3  3  1  1
B           3  4  2  2
C           1  2  2  2
D           1  2  2  2

Advertisement

Answer