
Regular expression for removing all URLs in a string in Python

I want to delete all the URLs in the sentence.

Here is my code:

import re
import ijson
f = open("/content/drive/My Drive/PTT 爬蟲/content/MakeUp/PTT_MakeUp_content_0_1000.json")
objects = ijson.items(f, 'item')

for obj in list(objects):
    article = obj['content']
    ret = re.findall("http[s*]:[a-zA-Z0-9_.+-/#~]+ ", article) # Question here
    for r in ret:
        article = article.replace(r, "")
    print(article)

But a URL with “http” is still left in the sentence.

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"

How can I fix it?


Answer

One simple fix is to replace every match of the pattern https?://\S+ with an empty string:

import re

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"
# \S+ consumes every non-whitespace character after the scheme
output = re.sub(r'https?://\S+', '', article_example)
print(output)

This prints:

眼影盤長這樣  說真的 很不好拍

My pattern assumes that any run of non-whitespace characters following http:// or https:// is part of the URL.
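
As for why the original pattern misses the example: [s*] is a character class that matches exactly one character, either a literal s or a literal *, so plain http: never matches, and the trailing space in the pattern also requires the URL to be followed by a space. Below is a minimal sketch that shows this and wraps the fix in a helper. The names URL_PATTERN and remove_urls are made up for this illustration, and collapsing the leftover double space is an optional extra step, not something the question asked for.

```python
import re

article_example = "眼影盤長這樣 http://i.imgur.com/uxvRo3h.jpg 說真的 很不好拍"

# The pattern from the question: [s*] needs one "s" or "*" right after
# "http", so a plain "http:" URL is never matched.
original_pattern = "http[s*]:[a-zA-Z0-9_.+-/#~]+ "
print(re.findall(original_pattern, article_example))  # → []

# "s?" makes the "s" optional; \S+ takes everything up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def remove_urls(text):
    """Remove http/https URLs, then collapse the runs of spaces left behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r" {2,}", " ", without_urls).strip()


print(remove_urls(article_example))  # → 眼影盤長這樣 說真的 很不好拍
```

In the loop from the question, this would replace the findall/replace pair with article = remove_urls(obj['content']). One caveat that follows from the assumption above: if a URL is immediately followed by text with no space in between, which is common in Chinese sentences, \S+ removes that text as well.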
