Hi there, I have the following problem:
I extracted a list of URLs from a .txt file with Python using this:
    import re

    with open('html.txt') as f:
        urls = f.read()
    links = re.findall('"((http)s?://.*?)"', urls)
    for url in links:
        print(url[0])
For some files, the output contains the following:
    https://url.com/?download_file=259&#038;order=wc_order_xDxDxD&#038;email=testmail%40gmail.com&#038;key=1234-1234-1234-1234-8c368abd9c22
The problem:
As you can see, it printed out “#038;”. I think that translates into “&”, but there is already a “&” in front of it, and if I follow the link, it is invalid.
However, if I delete every “#038;”, the link works just fine.
How can I print the links so that they don't contain “#038;” and still work?
Thanks so much
Answer
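This looks like an HTML entity issue rather than a URL-encoding one: “&#038;” is the HTML numeric character reference for “&”, so the file contains HTML-escaped links. Since you are only printing them, you can remove the leftover “#038;” with the string replace function. Note that replace returns a new string, so the result has to be printed or stored.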
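    for url in links:
        # str.replace returns a new string, so print the result;
        # include the trailing ";" so it is not left behind in the link
        print(url[0].replace("#038;", ""))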
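As an alternative to deleting “#038;” by hand, the standard library's html.unescape decodes HTML character references in general, so “&#038;” (and named forms such as “&amp;”) become a plain “&”. The following is a minimal sketch, not the asker's exact script: the helper name extract_urls and the sample string are made up for illustration, and the regex is simplified to a single capturing group so that re.findall returns plain strings instead of tuples.

```python
import html
import re


def extract_urls(text):
    """Return every double-quoted http(s) URL in text, with HTML entities decoded.

    extract_urls is a hypothetical helper name used only for this example.
    """
    # One capturing group, so re.findall returns a list of strings
    # (the original pattern had two groups and therefore returned tuples).
    raw_urls = re.findall(r'"(https?://.*?)"', text)
    # html.unescape turns "&#038;" (and "&amp;") back into "&".
    return [html.unescape(u) for u in raw_urls]


# Made-up sample input resembling the escaped link from the question.
sample = (
    '<a href="https://url.com/?download_file=259&#038;order=wc_order_xDxDxD'
    '&#038;email=testmail%40gmail.com">file</a>'
)

for url in extract_urls(sample):
    print(url)
```

To use this with the file from the question, pass the result of f.read() to extract_urls instead of the sample string. It is probably safest to unescape only once, on text that is still HTML-escaped, since decoding an already clean URL could alter parameters that happen to look like entities.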