Skip to content
Advertisement

Python Data Frame summary

I have dataframe (df table below): Every user can post in any category. I have to calculate HOW MANY DISTINCT USERS has a post in category A and at the same time has posts in categories, B, C and D. Table like:

User Category
1 A
1 B
33 B
33 C
33 D
54 A
54 B
87 A
87 B
87 C
87 D

Total table is 950,000 rows It either gives Nan results in pivot (variant 1) or memory overflow (variant 2). Necessary result is

Category A B C D
A 3 3 1 1
B 3 4 2 1
C 1 2 2 2
D 1 1 2 2

Diagonal: A-B-C-D – total posted in each category, lower and upper triangles are the same and are intersections of how many users posted in, for instance, B and C.

I have tried

  1. df.pivot_table(values=[‘user’,’category’],index=[‘category’], columns=[‘category’],aggfunc=np.count_nonzero)

  2. a = df.values

d = {(i, j): np.mean(a[:, i] == a[:, j]) for i, j in combinations(range(a.shape[1]), 2)}

res, c, vals = np.zeros((a.shape[1], a.shape[1])),
list(map(list, zip(*d.keys()))), list(d.values())

res[c[0], c[1]] = vals

res_df = pd.DataFrame(res, columns=df.columns, index=df.columns)

Would yo please help with ideas. Thank you.

Advertisement

Answer

One idea is use merge with crosstab like:

df = df.merge(df, on='User')
df = pd.crosstab(df['Category_x'], df['Category_y'])
print (df)
Category_y  A  B  C  D
Category_x            
A           3  3  1  1
B           3  4  2  2
C           1  2  2  2
D           1  2  2  2
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement