Skip to content
Advertisement

How to clean survey data in pandas

Input:

enter image description here Output:

enter image description here

here’s the data:

d = {'Morning': ["Didn't answer", "Didn't answer", "Didn't answer", 'Morning', "Didn't answer"], 'Afternoon': ["Didn't answer", 'Afternoon', "Didn't answer", 'Afternoon', "Didn't answer"], 'Night': ["Didn't answer", 'Night', "Didn't answer", 'Night', 'Night'], 'Sporadic': ["Didn't answer", "Didn't answer", 'Sporadic', "Didn't answer", "Didn't answer"], 'Constant': ["Didn't answer", "Didn't answer", "Didn't answer", 'Constant', "Didn't answer"]}

         Morning      Afternoon          Night       Sporadic       Constant
0  Didn't answer  Didn't answer  Didn't answer  Didn't answer  Didn't answer
1  Didn't answer      Afternoon          Night  Didn't answer  Didn't answer
2  Didn't answer  Didn't answer  Didn't answer       Sporadic  Didn't answer
3        Morning      Afternoon          Night  Didn't answer       Constant
4  Didn't answer  Didn't answer          Night  Didn't answer  Didn't answer

I want the output to be:

d = {"Time of the day": ["Didn't answer", "['Afternoon', 'Night']", "Sporadic", "['Morning', 'Afternoon', 'Night', 'Constant']", "Night"]}

                                 Time of the day
0                                  Didn't answer
1                         ['Afternoon', 'Night']
2                                       Sporadic
3  ['Morning', 'Afternoon', 'Night', 'Constant']
4                                          Night

so if there’s no answer in every column in a row, the value in the new data frame would be “Didn’t answer” and if there’s at least one answer like “night”, the value in the new data frame would be “night” and if there are multiple answers like ” Morning”, “Night” the value in the new data frame would be a list of the answers

Advertisement

Answer

You can use:

df["ToD"] = (df.replace("Didn't answer", np.nan).stack().groupby(level=0)
               .apply(lambda x: [i for i in x] if len(x) > 1 else x.iloc[0])
               .reindex(df.index, fill_value="Didn't answer"))

Output:

>>> df["ToD"]
0                            Didn't answer
1                       [Afternoon, Night]
2                                 Sporadic
3    [Morning, Afternoon, Night, Constant]
4                                    Night
Name: ToD, dtype: object
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement