I’m a beginner with python in combination with pandas, and I understand the basics.
But I received a couple days ago 3 strange datasets in excel.
As image below:
JavaScript
x
40
40
1
import pandas as pd
2
3
dfinput = pd.DataFrame([
4
["uuid", "79876081-099b-474f-9e8f-ff917fd7394c", "uuid", "a96bc7cb-02b1-4d13-823a-908531cda095", "uuid",
5
"38bc7d20-10be-4774-973c-b3b00234a645", "uuid", "e7b12da6-a47f-4c24-8545-faa24e249a03", "uuid", "6b2c9426-bd6f-4bda-9c53-a86200e051f8"],
6
["variable 1", "value", "variable 1", "value", "variable 1",
7
"value", "variable 1", "value", "variable 1", "value"],
8
["variable 2", "value", "variable 2", "value", "variable 2",
9
"value", "variable 2", "value", "variable 2", "value"],
10
["variable 3", "value", "variable 3", "value", "variable 3",
11
"value", "variable 3", "value", "variable 3", "value"],
12
["variable 4", "value", "variable 4", "value", "variable 4",
13
"value", "variable 4", "value", "variable 4", "value"],
14
["variable 5", "value", "variable 5", "value", "variable 5",
15
"value", "variable 5", "value", "variable 5", "value"],
16
["variable 6", "value", "variable 6", "value", "variable 6",
17
"value", "variable 6", "value", "variable 6", "value"],
18
["variable 7", "value", "variable 7", "value", "variable 7",
19
"value", "variable 7", "value", "variable 7", "value"],
20
["variable 8", "value", "variable 8", "value", "variable 8",
21
"value", "variable 8", "value", "variable 8", "value"],
22
["variable 9", "value", "variable 9", "value", "variable 9",
23
"value", "variable 9", "value", "variable 9", "value"],
24
["variable 10", "value", "variable 10", "value", "variable 10",
25
"value", "variable 10", "value", "variable 10", "value"],
26
["variable A", "value", "variable B", "value", "variable A",
27
"value", "variable A", "value", "variable A", "value"],
28
["variable B", "value", "variable C", "value", "variable C",
29
"value", "variable B", "value", "variable B", "value"],
30
["variable C", "value", "variable D", "value", "variable D",
31
"value", "variable D", "value", "variable C", "value"],
32
["variable D", "value", "Variable E", "value", "Variable E",
33
"value", "Variable F", "value", "Variable E", "value"],
34
["Variable E", "value", "Variable F", "value", "Variable H",
35
"value", "Variable G", "value", "Variable F", "value"],
36
["Variable F", "value", "Variable H", "value", "",
37
"", "Variable H", "value", "Variable G", "value"],
38
["Variable G", "value", "", "", "", "", "", "", "Variable H", "value"]
39
])
40
JavaScript
1
12
12
1
dfoutput = pd.DataFrame([["value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "null"],
2
["value", "value", "value", "value", "value", "value", "value", "value", "value",
3
"value", "null", "value", "value", "value", "value", "value", "null", "value"],
4
["value", "value", "value", "value", "value", "value", "value", "value", "value",
5
"value", "value", "null", "value", "value", "value", "null", "null", "value"],
6
["value", "value", "value", "value", "value", "value", "value", "value", "value",
7
"value", "value", "value", "null", "value", "null", "value", "value", "value"],
8
["value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "value", "null", "value", "value", "value", "value"]],
9
index=['79876081-099b-474f-9e8f-ff917fd7394c', 'a96bc7cb-02b1-4d13-823a-908531cda095',
10
'38bc7d20-10be-4774-973c-b3b00234a645', 'e7b12da6-a47f-4c24-8545-faa24e249a03', '6b2c9426-bd6f-4bda-9c53-a86200e051f8'],
11
columns=['variable 1', 'variable 2', 'variable 3', 'variable 4', 'variable 5', 'variable 6', 'variable 7', 'variable 8', 'variable 9', 'variable 10', 'variable A', 'variable B', 'variable C', 'variable D', 'Variable E', 'Variable F', 'Variable G', 'Variable H'])
12
I did try to loop the columns and create a new dataframe, but got stuck and think I make it unnecessary complex. I can’t get my head around it. Someone dealt with this before? and have a useful direction for me to go?
Advertisement
Answer
You can re-structure your data to your desired outcome with a rather simple manipulation. Note that I am using the dataframe
(dfinput) you posted:
JavaScript
1
10
10
1
# Change first row to headers and Transpose
2
headers = dfinput.iloc[0]
3
one = (pd.DataFrame(dfinput.values[1:], columns=headers)).T
4
5
# Change first row to headers again
6
one.columns = one.iloc[0]
7
8
# Keep only odd indexed rows
9
res = one.iloc[1::2, :]
10
JavaScript
1
10
10
1
res
2
3
uuid variable 1 variable 2 variable 3 variable 4 variable 5 variable 6 variable 7 variable 8 variable 9 variable 10 variable A variable B variable C variable D Variable E Variable F Variable G
4
5
79876081-099b-474f-9e8f-ff917fd7394c value value value value value value value value value value value value value value value value value
6
a96bc7cb-02b1-4d13-823a-908531cda095 value value value value value value value value value value value value value value value value
7
38bc7d20-10be-4774-973c-b3b00234a645 value value value value value value value value value value value value value value value
8
e7b12da6-a47f-4c24-8545-faa24e249a03 value value value value value value value value value value value value value value value value
9
6b2c9426-bd6f-4bda-9c53-a86200e051f8 value value value value value value value value value value value value value value value value value
10