
How to parse guess_language to read 30000 tweets?

I am using guess_language to detect the language of tweets for a school project. I used pandas to read the .csv file, which has around 30000 rows.

However, my problem is that guess_language can only read one tweet at a time.

guess_language("Top story: 'Massive Mental Health Crisis' ")
'en'

I am very new to Python and have been trying to figure out the loop and if statements for this for almost a day now, but they keep returning just one tweet.

Thank you and apologies if the question is lame.


I used the code suggested below by Kareem.

from guess_language import guess_language
resdf = nodupdf[nodupdf['text'].apply(guess_language) == 'en']

It worked on a small file (100 rows), but when I applied it to the bigger one, it gave me this error.

TypeError                                 Traceback (most recent call last)
      9
     10 for chunk in noeng:
---> 11     chunk['text'].apply(guess_language) == 'en'

~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4040         else:
   4041             values = self.astype(object).values
-> 4042         mapped = lib.map_infer(values, f, convert=convert_dtype)
   4043
   4044         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

~\Anaconda3\lib\site-packages\guess_language\__init__.py in guess_language(text, hints)
    322     """Return the ISO 639-1 language code.
    323     """
--> 324     words = WORD_RE.findall(text[:MAX_LENGTH].replace("'", "'"))
    325     return identify(words, find_runs(words), hints)
    326

TypeError: 'float' object is not subscriptable


Thinking it was a memory error, I read the file in chunks.

noeng = pd.read_csv(r'C:\Users\jean\nodupdf.csv', chunksize=10)
for chunk in noeng:
    chunk['text'].apply(guess_language) == 'en'

I still got the same error.
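(The traceback points at a data problem rather than a memory problem: pandas reads empty cells as the float NaN, and guess_language then fails when it tries to slice `text[:MAX_LENGTH]` on a float. A minimal sketch of how to spot those rows, assuming the column is named 'text' as in the question:)

```python
import pandas as pd

# A tiny stand-in DataFrame; the middle tweet is missing, so pandas
# stores it as NaN (a float), which is what breaks guess_language.
df = pd.DataFrame(
    {"text": ["Top story: 'Massive Mental Health Crisis'", None, "hello world"]}
)

# Count the rows that would trigger the TypeError
n_missing = df["text"].isna().sum()
print(n_missing)  # 1
```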


Answer

You can fetch and process every tweet basically like this:

resdf =  newdf[ newdf['text'].apply(guess_language) == 'en' ] 

resdf should contain the rows of the original DataFrame whose tweets were classified as English.

Series.apply calls your function guess_language on every single tweet and returns the classifications as a new Series; comparing that Series to 'en' produces a boolean mask, which we then use to select only the rows classified as English.
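(To avoid the TypeError on the larger file, you can drop the missing tweets before applying the classifier. A minimal sketch of that fix; `detect` is a crude stand-in so the example runs without the guess_language library installed — with the real library you would `from guess_language import guess_language` and pass that to apply instead:)

```python
import pandas as pd

def detect(text):
    # stand-in classifier: treats these two sample tweets as English
    return "en" if "story" in text or "hello" in text else "und"

nodupdf = pd.DataFrame(
    {"text": ["Top story: 'Massive Mental Health Crisis'", None, "bonjour"]}
)

# Drop missing tweets first so no float NaN ever reaches the classifier,
# then coerce what remains to str for safety before applying it.
clean = nodupdf.dropna(subset=["text"])
resdf = clean[clean["text"].astype(str).apply(detect) == "en"]
print(len(resdf))  # 1
```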

User contributions licensed under: CC BY-SA
7 people found this helpful