I am using guess_language to detect the language of the tweets for a school project. I used pandas to read the .csv file. I have around 30000 rows.
However, my problem is that the guess language can only read one tweet at a time.
guess_language(“Top story: ‘Massive Mental Health Crisis’ “)
‘en’
I am very new at python and been trying to figure out the loop and if statements for this for almost a day now and they keep just returning one tweet.
Thank you and apologies if the question is lame.
I used the code suggested below by Kareem.
from guess_language import guess_language resdf = nodupdf[ nodupdf[‘text’].apply(guess_language) == ‘en’ ]
It worked for the small file (100 csv), but when I applied it on the bigger one. It gave me this error.
TypeError Traceback (most recent call last) in 9 10 for chunk in noeng: —> 11 chunk[‘text’].apply(guess_language)== ‘en’
~Anaconda3libsite-packagespandascoreseries.py in apply(self, func, convert_dtype, args, **kwds) 4040 else: 4041 values = self.astype(object).values -> 4042 mapped = lib.map_infer(values, f, convert=convert_dtype) 4043 4044 if len(mapped) and isinstance(mapped[0], Series):
pandas_libslib.pyx in pandas._libs.lib.map_infer()
~Anaconda3libsite-packagesguess_language__init__.py in guess_language(text, hints) 322 “””Return the ISO 639-1 language code. 323 “”” –> 324 words = WORD_RE.findall(text[:MAX_LENGTH].replace(“’”, “‘”)) 325 return identify(words, find_runs(words), hints) 326
TypeError: ‘float’ object is not subscriptable
Thinking it was a memory error, I used chunk.
noeng=pd.read_csv(r’C:Usersjeannodupdf.csv’, chunksize=10) for chunk in noeng: chunk[‘text’].apply(guess_language)== ‘en’
I still got the same error.
Advertisement
Answer
You can fetch every and process them basically like this
resdf = newdf[ newdf['text'].apply(guess_language) == 'en' ]
the resdf should contain the rows of the original that had a classification of english for its tweets.
The function apply
should apply your function guess_language
on every single tweet and return the column values after being classified, then we use that to get only the indexes of rows who had en
as classification.