How to strip punctuation from a string, except apostrophes, for NLP

I am using the below “fastest” way of removing punctuation from a string:
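
A minimal sketch of what such an approach commonly looks like, assuming the `str.translate` idiom with `string.punctuation` (the exact snippet is not shown here, and the function name is hypothetical):

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: this is the usual "fast" idiom, not necessarily the asker's code.
_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Remove all ASCII punctuation, apostrophes included."""
    return text.translate(_TABLE)


print(strip_punctuation("You shouldn't, really!"))  # You shouldnt really
```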

However, it removes all punctuation, including apostrophes, so a token such as shouldn't becomes shouldnt.

The problem is that I am using the NLTK library for stopwords, and the standard stopword list doesn’t include such apostrophe-free forms. Instead, it contains the tokens that the NLTK tokenizer would generate if I used it to split my text. For example, for shouldnt the stopwords included are shouldn, shouldn't, and t.
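
The mismatch can be illustrated with a small sketch. It uses a hand-written set containing only the three entries named above as a stand-in for the NLTK list, so it runs without downloading any corpus. The set and the helper name are hypothetical.

```python
# Hypothetical stand-in for NLTK's English stopword list, limited to the
# three entries mentioned in the question.
STOPWORDS_SUBSET = {"shouldn", "shouldn't", "t"}


def is_stopword(token: str) -> bool:
    """Return True if the lowercased token is in the stand-in stopword set."""
    return token.lower() in STOPWORDS_SUBSET


print(is_stopword("shouldn't"))  # True: the apostrophe form is listed
print(is_stopword("shouldnt"))   # False: the stripped form is not listed
```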

I could either add the extra stopwords or strip the apostrophes from the NLTK stopwords. But neither solution seems “correct” to me, as I think apostrophes should be left in place during punctuation cleaning.

Is there a way I can leave the apostrophes when doing fast punctuation cleaning?

Answer
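
One way to do this, sketched under the assumption that the cleaning is done with `str.translate`, is to drop the apostrophe from the set of characters being deleted. The names below are hypothetical and this is not necessarily the code originally posted in this answer.

```python
import string

# ASCII punctuation with the straight apostrophe taken out, so that
# contractions such as "shouldn't" survive the cleaning step.
PUNCT_EXCEPT_APOSTROPHE = string.punctuation.replace("'", "")
_TABLE = str.maketrans("", "", PUNCT_EXCEPT_APOSTROPHE)


def strip_punctuation_keep_apostrophes(text: str) -> str:
    """Remove ASCII punctuation but leave straight apostrophes in place."""
    return text.translate(_TABLE)


print(strip_punctuation_keep_apostrophes("You shouldn't, really!"))
# You shouldn't really
```

Note that `string.punctuation` covers ASCII characters only, so a typographic apostrophe (’) is untouched either way. If the stopword list uses straight apostrophes, such characters may need to be normalized first.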

User contributions licensed under: CC BY-SA