I’m working with pandas for the first time. I have a column of survey responses, which can take the values ‘strongly agree’, ‘agree’, ‘disagree’, ‘strongly disagree’, and ‘neither’.
This is the output of describe() and value_counts() for the column:

count       4996
unique         5
top        Agree
freq        1745
dtype: object

Agree                1745
Strongly agree        926
Strongly disagree     918
Disagree              793
Neither               614
dtype: int64
I want to do a linear regression on this question versus overall score. However, I have a feeling that I should convert the column into a Category variable first, given that it’s inherently ordered. Is this correct? If so, how should I do this?
I’ve tried this:
df.EasyToUseQuestionFactor = pd.Categorical.from_array(df.EasyToUseQuestion)
print df.EasyToUseQuestionFactor
This produces output that looks vaguely right, but it seems that the categories are in the wrong order. Is there a way that I can specify ordering? Do I even need to specify ordering?
This is the rest of my code right now:
df = pd.read_csv('./data/responses.csv')
lm1 = ols('OverallScore ~ EasyToUseQuestion', data).fit()
print lm1.rsquared
Answer
Yes, you should convert it to categorical data, and this should do the trick:
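# The keys are lowercase, so lowercase the responses before looking them up
# (the column holds values such as 'Strongly agree').
likert_scale = {'strongly agree': 2, 'agree': 1, 'neither': 0, 'disagree': -1, 'strongly disagree': -2}
df['categorical_data'] = df.EasyToUseQuestion.str.lower().map(likert_scale)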
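On the question of specifying the order: one possible approach is to build an ordered pandas Categorical with an explicit list of categories. The sketch below is only an illustration. The sample rows, the `levels` list, and the `EasyToUseCode` column name are made up for the example, and placing 'Neither' in the middle of the scale is an assumption about how the survey should be read.

```python
import pandas as pd

# Hypothetical sample rows; the labels match the value_counts() output in the question.
df = pd.DataFrame({
    "EasyToUseQuestion": [
        "Agree", "Strongly agree", "Neither",
        "Disagree", "Strongly disagree", "Agree",
    ]
})

# Assumed ordering, from most negative to most positive.
levels = ["Strongly disagree", "Disagree", "Neither", "Agree", "Strongly agree"]

# Passing categories and ordered=True fixes the order explicitly
# instead of relying on the default (sorted) order.
df["EasyToUseQuestionFactor"] = pd.Categorical(
    df["EasyToUseQuestion"], categories=levels, ordered=True
)

# Integer codes follow the category order: 0 for the first level, 4 for the last.
df["EasyToUseCode"] = df["EasyToUseQuestionFactor"].cat.codes

print(df["EasyToUseQuestionFactor"].cat.categories.tolist())
print(df["EasyToUseCode"].tolist())
```

A numeric column like this could then be used in the regression formula in place of the text column. Doing so treats the gaps between adjacent responses as equal, which may or may not suit the survey.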