Python found URL is invalid

Hi there, I have the following problem:

I extracted a list of URLs from a .txt file with Python using this:

import re

with open('html.txt') as f:
    urls = f.read()
    links = re.findall('"((http)s?://.*?)"', urls)
for url in links:
    print(url[0])

For some files, the output contains the following:

https://url.com/?download_file=259&#038;order=wc_order_xDxDxD&#038;email=testmail%40gmail.com&#038;key=1234-1234-1234-1234-8c368abd9c22

The problem:

As you can see, it printed out “#038;”. I think that translates to “&”, but there is already a “&” in front of it, and if I follow the link, it is invalid.

However, if I delete every “#038;”, the link works just fine.

How can I print the URLs without “#038;” so that the links work?

Thanks so much

Answer

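This looks like an HTML entity issue rather than a URL encoding issue: “&#038;” is the numeric character reference for “&”. Since you are only printing, you can use the string replace function to drop the leftover “#038;”. Note that replace returns a new string, so you have to print its result.

for url in links:
    print(url[0].replace("#038;", ""))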
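
If the stray text really is the tail of an “&#038;” character reference, a more general option is to decode HTML entities instead of deleting one specific string. The following is only a sketch under that assumption. It uses html.unescape from the standard library, and the helper name extract_urls and the sample string are made up for illustration; they are not from the question.

```python
import html
import re


def extract_urls(text):
    """Return each double-quoted http(s) URL in text with HTML entities decoded."""
    # Same idea as the regex in the question, with a single capture group
    # so that findall returns plain strings instead of tuples.
    raw_urls = re.findall(r'"(https?://.*?)"', text)
    # html.unescape turns character references such as &#038; or &amp; into "&".
    return [html.unescape(u) for u in raw_urls]


# Hypothetical sample input standing in for the contents of html.txt.
sample = '<a href="https://url.com/?download_file=259&#038;order=wc_order_xDxDxD">x</a>'

for url in extract_urls(sample):
    print(url)  # prints https://url.com/?download_file=259&order=wc_order_xDxDxD
```

Unlike a fixed replace, this should also cover other encoded characters such as “&amp;”, should they appear in the file.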
