I have two ndarrays of size (m x n), and two lists of length m and n respectively. I want to convert the two matrices to a dataframe with four columns. The first two columns correspond to the m and n dimensions, and contain the values from the lists. The next two columns should contain the values from the two matrices. In total, the resulting dataframe should have m times n rows.
Example: If these are the two matrices and two lists,
import numpy as np
import pandas as pd

a1 = np.array([[1, 2], [3, 4], [5, 6]])
a2 = np.array([[10, 20], [30, 40], [50, 60]])
l1 = [5, 7, 99]
l2 = [2, 3]
then the resulting dataframe should look like this:
"l1" "l2" "a1" "a2"
   5    2    1   10
   7    2    3   30
  99    2    5   50
   5    3    2   20
   7    3    4   40
  99    3    6   60
The order of the rows does not matter.
Although I only have two matrices in this specific case, I am curious about a solution which is easily applicable to any number of same size matrices.
Answer
Use np.vstack to join the arrays created by numpy.tile, numpy.repeat and numpy.ravel, transpose the result, and pass it to the DataFrame constructor:
a = np.vstack((np.tile(l1, len(l2)),
               np.repeat(l2, len(l1)),
               np.ravel(a1, 'F'), 
               np.ravel(a2, 'F'))).T
print (a)
[[ 5  2  1 10]
 [ 7  2  3 30]
 [99  2  5 50]
 [ 5  3  2 20]
 [ 7  3  4 40]
 [99  3  6 60]]
df = pd.DataFrame(a, columns=['l1','l2','a1','a2'])
print (df)
   l1  l2  a1  a2
0   5   2   1  10
1   7   2   3  30
2  99   2   5  50
3   5   3   2  20
4   7   3   4  40
5  99   3   6  60
For multiple arrays:
arrays =  [a1, a2]
arr = [np.ravel(a, 'F') for a in arrays]
a = np.vstack((np.tile(l1, len(l2)), 
               np.repeat(l2, len(l1)),
               arr)).T
print (a)
[[ 5  2  1 10]
 [ 7  2  3 30]
 [99  2  5 50]
 [ 5  3  2 20]
 [ 7  3  4 40]
 [99  3  6 60]]
df = pd.DataFrame(a, columns=['l1','l2'] + [f'a{x+1}' for x in range(len(arrays))])
print (df)
   l1  l2  a1  a2
0   5   2   1  10
1   7   2   3  30
2  99   2   5  50
3   5   3   2  20
4   7   3   4  40
5  99   3   6  60
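One thing to keep in mind with the np.vstack approach is that it stacks everything into a single array, so all columns end up with one common dtype (for example, integer labels become floats if any matrix holds floats). A minimal sketch of a variant that builds the frame from a dict of columns instead, so each column keeps its own dtype, is shown here. The helper name arrays_to_frame and its names parameter are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd


def arrays_to_frame(l1, l2, arrays, names=None):
    """Flatten same-shaped (m x n) arrays into one long DataFrame.

    Hypothetical helper for illustration, not part of the original answer.
    """
    m, n = len(l1), len(l2)
    if names is None:
        names = [f"a{i + 1}" for i in range(len(arrays))]
    # Column-major ('F') flattening walks down each column first, so the
    # row labels cycle fastest (tile) and the column labels change slowest (repeat).
    columns = {"l1": np.tile(l1, n), "l2": np.repeat(l2, m)}
    for name, arr in zip(names, arrays):
        arr = np.asarray(arr)
        if arr.shape != (m, n):
            raise ValueError(f"{name} has shape {arr.shape}, expected {(m, n)}")
        columns[name] = arr.ravel(order="F")
    # Building from a dict keeps a separate dtype per column.
    return pd.DataFrame(columns)


a1 = np.array([[1, 2], [3, 4], [5, 6]])
a2 = np.array([[10, 20], [30, 40], [50, 60]])
l1 = [5, 7, 99]
l2 = [2, 3]

df = arrays_to_frame(l1, l2, [a1, a2])
print(df)
```

For the sample data this should produce the same rows, in the same order, as the np.vstack version above. The shape check is an added safeguard, since mismatched matrices would otherwise only fail later with a less direct error.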
A pandas-only solution with concat and DataFrame.unstack:
df = (pd.concat([pd.DataFrame(a1, columns=l2, index=l1).unstack(),
                pd.DataFrame(a2, columns=l2, index=l1).unstack()],
               axis=1, keys=['a1','a2'])
        .rename_axis(['l2','l1']).swaplevel(1,0).reset_index())
print (df)
   l1  l2  a1  a2
0   5   2   1  10
1   7   2   3  30
2  99   2   5  50
3   5   3   2  20
4   7   3   4  40
5  99   3   6  60
For multiple arrays:
arrays =  [a1, a2]
df = (pd.concat([pd.DataFrame(a, columns=l2, index=l1).unstack() for a in arrays],
               axis=1)
        .rename_axis(['l2','l1'])
        .swaplevel(1,0)
        .rename(columns=lambda x: f'a{x+1}')
        .reset_index())
print (df)
   l1  l2  a1  a2
0   5   2   1  10
1   7   2   3  30
2  99   2   5  50
3   5   3   2  20
4   7   3   4  40
5  99   3   6  60
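Since the question says the row order does not matter, another option is to skip the unstack and swaplevel steps altogether. The following is a sketch, assuming that row-major ordering is acceptable: pd.MultiIndex.from_product varies its last level fastest, which lines up with NumPy's default (C-order) ravel.

```python
import numpy as np
import pandas as pd

a1 = np.array([[1, 2], [3, 4], [5, 6]])
a2 = np.array([[10, 20], [30, 40], [50, 60]])
l1 = [5, 7, 99]
l2 = [2, 3]
arrays = [a1, a2]

# Every (l1, l2) pair, with l2 changing fastest: (5, 2), (5, 3), (7, 2), ...
index = pd.MultiIndex.from_product([l1, l2], names=["l1", "l2"])

# Default ravel() is row-major, so element [i, j] lands on the pair (l1[i], l2[j]).
df = pd.DataFrame(
    {f"a{i + 1}": arr.ravel() for i, arr in enumerate(arrays)},
    index=index,
).reset_index()
print(df)
```

The rows come out grouped by l1 rather than by l2, so the order differs from the outputs above while the content is the same.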