I have data in an SVMlight-style format (label feature1:value1 feature2:value2 …), like this:
talk.politics.guns a:12 about:1 abrams:1 absolutely:1
talk.politics.mideast I:4 run:10 go:3
I tried sklearn.datasets.load_svmlight_file, but it doesn't seem to work with categorical string features and labels. I am trying to store the data in a pandas DataFrame. Any pointers would be appreciated.
Answer
You can do it by hand. Here is one way to convert the file into a DataFrame:
import os
import pandas as pd

# open() does not expand "~", so resolve the home directory explicitly
svmformat_file = os.path.expanduser("~/svmformat_file_sample")

# Read the file into a list of lines
with open(svmformat_file, mode="r") as fp:
    svmformat_list = fp.readlines()

# For each line, save the label and the key:value pairs to a dict
pandas_list = []
for line in svmformat_list:
    line_split = line.split()  # splits on whitespace and drops the trailing '\n'
    if not line_split:  # skip blank lines
        continue

    line_dict = {"label": line_split[0]}

    for col in line_split[1:]:
        # Split on the last ':' so that keys containing ':' stay intact
        key, value = col.rsplit(":", 1)
        line_dict[key] = value  # values are kept as strings

    pandas_list.append(line_dict)
To get the resulting DataFrame for your example file:
pd.DataFrame(pandas_list)
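Parsed this way, the values are strings, and a feature that is missing from a line shows up as NaN in the DataFrame. If you want numeric columns instead, here is a minimal sketch of the same idea wrapped in a function. The name `svmlight_text_to_frame` is hypothetical, and treating an absent feature as 0 is an assumption that suits count-like data such as word frequencies but may not suit yours:

```python
import pandas as pd


def svmlight_text_to_frame(lines):
    """Parse 'label key:value ...' lines into a DataFrame (hypothetical helper)."""
    rows = []
    for line in lines:
        parts = line.split()
        if not parts:  # skip blank lines
            continue
        row = {"label": parts[0]}
        for token in parts[1:]:
            # Split on the last ':' so that keys containing ':' stay intact
            key, value = token.rsplit(":", 1)
            row[key] = float(value)
        rows.append(row)
    # Assumption: a feature absent from a line means a value of 0
    return pd.DataFrame(rows).fillna(0)


sample = [
    "talk.politics.guns a:12 about:1 abrams:1 absolutely:1",
    "talk.politics.mideast I:4 run:10 go:3",
]
df = svmlight_text_to_frame(sample)
print(df)
```

For a file on disk you can pass the open file object directly, since iterating over it yields lines. Note that a feature literally named "label" would overwrite the label column in this sketch, so rename that key if your vocabulary can contain it.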