Remove certain words from URL

I scraped tweet status URLs and want to strip certain substrings from them. However, my code doesn't work as intended: it seems to remove only the first string in "stopwords".

Code:

stopwords = ['/people', '/photo/1']
link_list = []
for link in links:
    for i in stopwords:
        remove = link.replace(i, "")
        link = remove
        link_list.append(link)

Output:

https://twitter.com/CultOfCurtis/status/1492292326051483648
https://twitter.com/consequence/status/1492245783084773383/photo/1
https://twitter.com/gayesian/status/1492292246456184841

I tried several other approaches after researching, but to no avail. :/

Answer

You just need to dedent the last line so that it runs once per link, after the inner loop has finished:

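stopwords = ['/people', '/photo/1']
link_list = []
for link in links:
    for i in stopwords:
        # Strip each stopword in turn, carrying the result forward.
        link = link.replace(i, "")
    # Append once per link, after every stopword has been removed.
    link_list.append(link)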

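In its original position, inside the inner loop, the append runs once per stopword: it first appends the link with /people removed but /photo/1 still present, then appends it again with /photo/1 removed. Each link therefore ends up in link_list twice, and the first copy is only partially cleaned.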
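
To make the difference concrete, here is a minimal sketch that runs both indentations side by side. The function names clean_buggy and clean_fixed and the one-element links list are assumptions made for illustration (the sample URL is taken from the output shown in the question); they are not part of the original code.

```python
stopwords = ['/people', '/photo/1']

# Sample input, assumed for illustration (URL taken from the question's output).
links = ['https://twitter.com/consequence/status/1492245783084773383/photo/1']


def clean_buggy(links):
    """Hypothetical helper: append inside the inner loop, as in the question."""
    out = []
    for link in links:
        for word in stopwords:
            link = link.replace(word, "")
            out.append(link)  # runs once per stopword
    return out


def clean_fixed(links):
    """Hypothetical helper: append after the inner loop, as in the answer."""
    out = []
    for link in links:
        for word in stopwords:
            link = link.replace(word, "")
        out.append(link)  # runs once per link
    return out


buggy = clean_buggy(links)
fixed = clean_fixed(links)
print(buggy)  # two entries; the first still ends in /photo/1
print(fixed)  # one entry, fully cleaned
```

With two stopwords, the buggy version produces two entries per link, which would explain why a URL that still contains /photo/1 shows up in the output.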

Alternatively, you could use a compiled regular expression that matches any of the stopwords and removes them all in a single pass:

import re

stopwords = ['/people', '/photo/1']
pattern = re.compile('|'.join(map(re.escape, stopwords)))
link_list = [pattern.sub('', link) for link in links]
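
As a rough, self-contained check of the regex approach, the sketch below runs it on a small sample. The links list is an assumption: the first two URLs come from the question's output, and the third has a hypothetical /people suffix added purely to exercise the other stopword.

```python
import re

stopwords = ['/people', '/photo/1']

# Sample input, assumed for illustration. The /people suffix on the last
# URL is hypothetical and only there to exercise the first stopword.
links = [
    'https://twitter.com/CultOfCurtis/status/1492292326051483648',
    'https://twitter.com/consequence/status/1492245783084773383/photo/1',
    'https://twitter.com/gayesian/status/1492292246456184841/people',
]

# re.escape makes each stopword match literally; '|' joins them as alternatives.
pattern = re.compile('|'.join(map(re.escape, stopwords)))

# Replace every match with the empty string, one pass per link.
link_list = [pattern.sub('', link) for link in links]

for link in link_list:
    print(link)
```

Note that, like str.replace, this removes a stopword wherever it occurs in the URL, not only at the end.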