I am using the following “fastest” way to remove punctuation from a string:
text = file_open.translate(str.maketrans("", "", string.punctuation))
However, it removes all punctuation, including apostrophes, from tokens such as shouldn't, turning it into shouldnt.
The problem is that I am using the NLTK library for stopwords, and the standard stopwords don’t include such tokens without apostrophes. Instead, they contain the tokens that the NLTK tokenizer would generate if I used it to split my text. For example, for shouldnt the stopwords included are shouldn, shouldn't, t.
I can either add the additional stopwords or remove the apostrophes from the NLTK stopwords. But neither solution seems “correct” to me, as I think the apostrophes should be left in place when cleaning punctuation.
Is there a way I can leave the apostrophes when doing fast punctuation cleaning?
Answer
>>> from string import punctuation
>>> type(punctuation)
<class 'str'>
>>> my_punctuation = punctuation.replace("'", "")
>>> my_punctuation
'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> "It's right, isn't it?".translate(str.maketrans("", "", my_punctuation))
"It's right isn't it"
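To show how this fits with stopword filtering, here is a minimal sketch that builds the translation table once and reuses it. The helper names (`make_punctuation_table`, `clean_tokens`) and the tiny `STOPWORDS` set are assumptions made for illustration. The set only stands in for NLTK's English stopword list and is not the real list.

```python
import string


def make_punctuation_table(keep="'"):
    """Build a table that deletes ASCII punctuation except the characters in `keep`."""
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    return str.maketrans("", "", to_delete)


# Build the table once; rebuilding it for every string is wasted work.
TABLE = make_punctuation_table()

# Hypothetical stand-in for NLTK's English stopwords (not the real list).
STOPWORDS = {"shouldn't", "shouldn", "t", "it", "is"}


def clean_tokens(text):
    """Lowercase, strip punctuation (keeping apostrophes), split, drop stopwords."""
    cleaned = text.lower().translate(TABLE)
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


print(clean_tokens("You shouldn't stop, should you?"))
# ['you', 'stop', 'should', 'you']
```

Because the apostrophe survives, shouldn't matches the stopword entry directly, so no extra stopwords are needed. Note that `string.punctuation` covers ASCII only, so a typographic apostrophe (’) is never removed by this table, and a token written with it will not match a stopword spelled with the ASCII apostrophe.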