I have the following dataframe:
         a     b   x      y language
    0  id1  id_2   3  text1
    1  id2  id_4   6  text2
    2  id3  id_6   9  text3
    3  id4  id_8  12  text4
I am attempting to use langdetect to detect the language of the text elements in column y.
This is the code I have used for that purpose:
    for i,row in df.iterrows():
        df.loc[i].at["language"] = detect(df.loc[i].at["y"])
Unfortunately, there are non-textual elements (including blanks, symbols, numbers and combinations of these) involved in this column, so I get the following traceback:
    LangDetectException                       Traceback (most recent call last)
    <ipython-input-40-3b2637554e5f> in <module>
          1 df["language"]=""
          2 for i,row in df.iterrows():
    ----> 3     df.loc[i].at["language"] = detect(df.loc[i].at["y"])
          4 df.head()

    C:\Anaconda\lib\site-packages\langdetect\detector_factory.py in detect(text)
        128     detector = _factory.create()
        129     detector.append(text)
    --> 130     return detector.detect()
        131
        132

    C:\Anaconda\lib\site-packages\langdetect\detector.py in detect(self)
        134         which has the highest probability.
        135         '''
    --> 136         probabilities = self.get_probabilities()
        137         if probabilities:
        138             return probabilities[0].lang

    C:\Anaconda\lib\site-packages\langdetect\detector.py in get_probabilities(self)
        141     def get_probabilities(self):
        142         if self.langprob is None:
    --> 143             self._detect_block()
        144         return self._sort_probability(self.langprob)
        145

    C:\Anaconda\lib\site-packages\langdetect\detector.py in _detect_block(self)
        148         ngrams = self._extract_ngrams()
        149         if not ngrams:
    --> 150             raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
        151
        152         self.langprob = [0.0] * len(self.langlist)

    LangDetectException: No features in text.
Is there a way I can employ exception handling so the detect function from the langdetect library may be used for those appropriate text elements?
Answer
So, given the following dataframe:
    import pandas as pd

    df = pd.DataFrame(
        {
            "a": {0: "id1", 1: "id2", 2: "id3", 3: "id4"},
            "b": {0: "id_2", 1: "id_4", 2: "id_6", 3: "id_8"},
            "x": {0: 3, 1: 6, 2: 9, 3: 12},
            "y": {0: "text1", 1: "text2", 2: "text3", 3: "text4"},
            "language": {0: "", 1: "", 2: "", 3: ""},
        }
    )
And, for the purposes of this answer, the following mocked exception and function (stand-ins for the ones langdetect provides):
    class LangDetectException(Exception):
        pass

    def detect(x):
        if x == "text2":
            raise LangDetectException
        else:
            return "english"
You can skip rows (row 1 here) in which “y” has non-textual elements, like this:
    for i, row in df.iterrows():
        try:
            df.loc[i, "language"] = detect(row["y"])
        except LangDetectException:
            continue
And so:
    print(df)
    # Outputs
         a     b   x      y language
    0  id1  id_2   3  text1  english
    1  id2  id_4   6  text2
    2  id3  id_6   9  text3  english
    3  id4  id_8  12  text4  english
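The same try/except idea can also be wrapped in a small helper, so the loop body stays short and non-string cells never reach the detector. The sketch below is only an illustration under stated assumptions: `safe_detect` is a hypothetical name, and `detect` and `LangDetectException` are mocks again (this mock simply fails on input without any letters, which is an assumption about the kind of values causing the error, not a description of how langdetect decides).

```python
class LangDetectException(Exception):
    """Mock exception, standing in for the one langdetect raises."""


def detect(text):
    """Mock detector (assumption): fails on text without letters, else returns "en"."""
    if not any(ch.isalpha() for ch in text):
        raise LangDetectException("No features in text.")
    return "en"


def safe_detect(value, default=""):
    """Hypothetical helper: return the detected language, or `default` on failure."""
    # Non-string cells (None, NaN, numbers) are skipped without calling detect().
    if not isinstance(value, str):
        return default
    try:
        return detect(value)
    except LangDetectException:
        # Blanks, symbols and digits end up here; keep the placeholder value.
        return default


# Sample values mimicking a messy "y" column.
values = ["some text", "", "123 !?", None, "more text"]
languages = [safe_detect(v) for v in values]
print(languages)
```

With the real library, the two mocks would presumably be replaced by `from langdetect import detect` and `from langdetect.lang_detect_exception import LangDetectException`, and the column could then be filled in one step with `df["language"] = df["y"].apply(safe_detect)` instead of iterating over rows.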