I want to delete all the URLs in the sentence.
Here is my code:
import re
import ijson

f = open("/content/drive/My Drive/PTT 爬蟲/content/MakeUp/PTT_MakeUp_content_0_1000.json")
objects = ijson.items(f, 'item')

for obj in list(objects):
    article = obj['content']
    ret = re.findall("http[s*]:[a-zA-Z0-9_.+-/#~]+ ", article)  # Question here
    for r in ret:
        article = article.replace(r, "")
    print(article)
But a URL that starts with "http" is still left in the sentence:
article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
How can I fix it?
Answer
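Your pattern fails because [s*] is a character class that matches exactly one character, either s or *, so a URL beginning with plain http: never matches. One simple fix would be to just replace the pattern https?://\S+ with an empty string: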
import re

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
output = re.sub(r'https?://\S+', '', article_example)
print(output)
This prints:
眼影盤長這樣  說真的 很不好拍
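Removing the URL leaves behind the spaces on both sides of it, so the actual string has two consecutive spaces where the URL was. If that matters, one option is to collapse runs of whitespace afterwards. This is only a sketch: the function name remove_urls_and_tidy is made up for illustration, and note that it also turns newlines and tabs into single spaces.

```python
import re


def remove_urls_and_tidy(text):
    """Drop http(s) URLs, then collapse leftover runs of whitespace."""
    without_urls = re.sub(r'https?://\S+', '', text)
    # Replace any run of whitespace with a single space and trim the ends.
    return re.sub(r'\s+', ' ', without_urls).strip()


article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
print(remove_urls_and_tidy(article_example))
```

If your articles contain line breaks that you want to keep, collapse only spaces (for example with the pattern ' +') instead of \s+.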
My pattern assumes that any non-whitespace characters following http:// or https:// are part of the URL.
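To illustrate what that assumption means in practice, here is a small sketch comparing the \S+ pattern with a stricter one that only accepts a hand-picked set of ASCII characters. The helper name strip_urls, the second sample string, and the ASCII character set are all my own assumptions for the example, not a complete URL grammar, so adjust them to your data.

```python
import re

# Pattern from the answer: everything up to the next whitespace is the URL.
GREEDY_URL = re.compile(r'https?://\S+')

# Hypothetical stricter variant: only a hand-picked set of ASCII characters
# may appear in the URL (an assumption, not a complete URL grammar).
ASCII_URL = re.compile(r'https?://[A-Za-z0-9_.+\-/#~%?=&:@]+')


def strip_urls(text, pattern=GREEDY_URL):
    """Remove every match of `pattern` from `text`."""
    return pattern.sub('', text)


spaced = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
glued = "眼影盤長這樣http://i.imgur.com/uxvRo3h.jpg說真的"

print(strip_urls(spaced))            # URL removed, surrounding spaces kept
print(strip_urls(glued))             # \S+ also swallows the text after the URL
print(strip_urls(glued, ASCII_URL))  # only the ASCII URL is removed
```

When the URL is always separated from the surrounding text by whitespace, as in your example, the simpler \S+ pattern is enough. The stricter variant may be worth considering if Chinese text can directly follow a URL without a space.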