I am writing a Python MapReduce word count program. The problem is that there are many non-alphabetic characters strewn about in the data. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it.
    def mapfn(k, v):
        print v
        import re, string
        pattern = re.compile('[\W_]+')
        v = pattern.match(v)
        print v
        for w in v.split():
            yield w, 1
I'm afraid I am not sure how to use the re library, or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string v (a line of a book) so that I get the line back without any non-alphanumeric chars.
Suggestions?
Answer
Use re.sub
    import re

    regex = re.compile('[^a-zA-Z]')
    # First parameter is the replacement, second parameter is your input string
    regex.sub('', 'ab3d*E')
    # Out: 'abdE'
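Applied to the mapper from the question, this might look like the minimal sketch below. It assumes v is one line of text as a str and that words are separated by whitespace. Replacing with a space instead of an empty string is a choice made here, not part of the snippet above: it keeps words that are separated only by punctuation from being merged.

```python
import re

# Compile once at module level rather than on every call to mapfn.
# [^a-zA-Z]+ matches runs of anything that is not an ASCII letter.
NON_LETTERS = re.compile(r'[^a-zA-Z]+')

def mapfn(k, v):
    # Replace each run of non-letters with a space, then split on whitespace.
    cleaned = NON_LETTERS.sub(' ', v)
    for w in cleaned.split():
        yield w, 1
```

The attempt in the question fails because pattern.match(v) returns a match object (or None), not a string, so the later v.split() has nothing to split. sub returns the modified string, which is what the loop needs.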
Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input…)
    regex = re.compile('[,.!?]')  # etc.
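A sketch of this second approach inside a mapper could look like the following. The character set is only an example and should be adjusted to the data, and mapfn_keep_apostrophes is a hypothetical name used for illustration.

```python
import re

# Example set of punctuation to strip; extend it to suit the data.
# The apostrophe is deliberately absent so words like "don't" survive.
PUNCTUATION = re.compile(r'[,.!?;:"()]')

def mapfn_keep_apostrophes(k, v):
    # Delete the listed characters, then split the line on whitespace.
    for w in PUNCTUATION.sub('', v).split():
        yield w, 1
```

With this version, a line such as "Don't stop, please!" yields the words Don't, stop and please, each paired with a count of 1.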