I have the following dataframe:
     a     b   x      y language
0  id1  id_2   3  text1
1  id2  id_4   6  text2
2  id3  id_6   9  text3
3  id4  id_8  12  text4
I am attempting to use langdetect to detect the language of the text elements in column y.
This is the code I have used for that purpose:
for i,row in df.iterrows():
    df.loc[i].at["language"] = detect(df.loc[i].at["y"])
Unfortunately, there are non-textual elements (including blanks, symbols, numbers and combinations of these) involved in this column, so I get the following traceback:
LangDetectException                       Traceback (most recent call last)
<ipython-input-40-3b2637554e5f> in <module>
      1 df["language"]=""
      2 for i,row in df.iterrows():
----> 3     df.loc[i].at["language"] = detect(df.loc[i].at["y"])
      4 df.head()

C:\Anaconda\lib\site-packages\langdetect\detector_factory.py in detect(text)
    128     detector = _factory.create()
    129     detector.append(text)
--> 130     return detector.detect()
    131
    132

C:\Anaconda\lib\site-packages\langdetect\detector.py in detect(self)
    134         which has the highest probability.
    135         '''
--> 136         probabilities = self.get_probabilities()
    137         if probabilities:
    138             return probabilities[0].lang

C:\Anaconda\lib\site-packages\langdetect\detector.py in get_probabilities(self)
    141     def get_probabilities(self):
    142         if self.langprob is None:
--> 143             self._detect_block()
    144         return self._sort_probability(self.langprob)
    145

C:\Anaconda\lib\site-packages\langdetect\detector.py in _detect_block(self)
    148         ngrams = self._extract_ngrams()
    149         if not ngrams:
--> 150             raise LangDetectException(ErrorCode.CantDetectError, 'No features in text.')
    151
    152         self.langprob = [0.0] * len(self.langlist)

LangDetectException: No features in text.
Is there a way I can employ exception handling so the detect function from the langdetect library may be used for those appropriate text elements?
Answer
So, given the following dataframe:
import pandas as pd

df = pd.DataFrame(
    {
        "a": {0: "id1", 1: "id2", 2: "id3", 3: "id4"},
        "b": {0: "id_2", 1: "id_4", 2: "id_6", 3: "id_8"},
        "x": {0: 3, 1: 6, 2: 9, 3: 12},
        "y": {0: "text1", 1: "text2", 2: "text3", 3: "text4"},
        "language": {0: "", 1: "", 2: "", 3: ""},
    }
)
And, for the purpose of the answer, this mocked exception and function:
class LangDetectException(Exception):
    pass


def detect(x):
    if x == "text2":
        raise LangDetectException
    else:
        return "english"
You can skip rows (row 1 here) in which “y” has non-textual elements, like this:
for i, row in df.iterrows():
    try:
        df.loc[i, "language"] = detect(row["y"])
    except LangDetectException:
        continue
And so:
print(df)
# Outputs
     a     b   x      y language
0  id1  id_2   3  text1  english
1  id2  id_4   6  text2
2  id3  id_6   9  text3  english
3  id4  id_8  12  text4  english
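As a possible variation on the loop above, the try/except can be moved into a small wrapper so that it works on any iterable of values. This is only a sketch: `safe_detect` and its `default` parameter are hypothetical names introduced here, and `detect` and `LangDetectException` are the same mocks used in the answer, not the real library.

```python
class LangDetectException(Exception):
    """Stand-in for langdetect's exception, mocked as in the answer."""


def detect(text):
    # Mocked detector: fails on "text2", otherwise reports "english".
    if text == "text2":
        raise LangDetectException
    return "english"


def safe_detect(text, default=""):
    # Hypothetical helper: return `default` when the value is not a
    # string (a number, None, ...) or when detection raises.
    if not isinstance(text, str):
        return default
    try:
        return detect(text)
    except LangDetectException:
        return default


values = ["text1", "text2", 42, None, "text4"]
languages = [safe_detect(v) for v in values]
print(languages)  # ['english', '', '', '', 'english']
```

With the dataframe from the answer, the same helper could presumably be applied column-wise with df["language"] = df["y"].apply(safe_detect), which avoids the explicit iterrows loop. To use it with the real library, the mocks would be replaced by the actual imports from langdetect.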