Is there a requests/Selenium function that converts an href value into a proper link? For example:
clickLink("https://www.google.com","about")
would return a value like https://www.google.com/about.
In other words, it should fix an href link and convert it into a regular (absolute) link,
e.g.
https://google.com  about  ->  https://google.com/about
//www.pastebin.com/  /  ->  https://www.pastebin.com/
etc
I tried to write one myself, but with no luck:
def fixLink(Link,LinkOriginalPage):
    '''Fixes link. ex. /f/d -> https://www.wtds.com/f/d
    LinkOriginalPage=page Link redirected from'''
    if Link.startswith("https://") or Link.startswith("http://"):
        return "debug1 " + Link # , and exit
    #fix 329 links crawled! - Latest link: https://www.wikipedia.com/https://kl.wikipedia.org/
    if Link.startswith("//"):
        Link="debug2 " + "https:"+Link # example, //www.pastebin.com/ -> https://www.pastebin.com/
        # print(Link)
        return Link # due to glitch
    # now link does not start with //
    # check if link is like a/b/c->site.com/a/b/c
    asciiLetters="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    linkStartsWithValidProtocol=not (Link.startswith("http://") or Link.startswith("https://"))
    linkDoesNotStartWithSlash=Link[0] in asciiLetters
    if linkStartsWithValidProtocol and linkDoesNotStartWithSlash:
        if LinkOriginalPage.endswith("/"):
            Link="debug3 " + LinkOriginalPage+Link
        else:
            Link="debug4 " + LinkOriginalPage+"/"+Link
        return Link
    # now link does not start with ascii letter
    # check if link is like /a/b/c
    if Link.startswith("/"):
        domainOfLink=getDomainFromLink(LinkOriginalPage)
        # print(domainOfLink)
        Link="debug 5|"+LinkOriginalPage+" http://"+domainOfLink+Link
        # print("startswith / "+Link)
        return Link # due to glitch
    # fix div links (widely used bad code practice)
    if Link.startswith("#"):
        #glitch, invalid url like *&YT -> invalid url schema
        #fix div
        domainOfLink=getDomainFromLink(LinkOriginalPage)
        Link="debug 6 "+domainOfLink+Link
        return Link
    # return the output if not returned (nvm)
    return "https://about.io"
Answer
You can use the urljoin function from urllib.parse in the standard library. Here's an example:
from urllib.parse import urljoin

a = "http://www.example.com"
b = "index.html"
print(urljoin(a, b))  # Returns 'http://www.example.com/index.html'
PS. http://www.example.com/ actually exists.
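
To tie this back to the cases in the question, here is a minimal sketch of how a single urljoin call could stand in for the branches of fixLink. The helper name fix_link and the example.com URLs are only illustrative assumptions, not part of any library; the other URLs are the ones mentioned in the question.

```python
from urllib.parse import urljoin


def fix_link(href, page_url):
    """Resolve an href against the URL of the page it was found on.

    fix_link is a hypothetical wrapper name; urljoin does all the work.
    """
    return urljoin(page_url, href)


# (page the link was found on, href value taken from the <a> tag)
cases = [
    ("https://www.google.com", "about"),                      # bare relative name
    ("https://www.google.com/", "//www.pastebin.com/"),       # scheme-relative link
    ("https://www.wtds.com/x/y", "/f/d"),                     # root-relative link
    ("https://www.wikipedia.com/", "https://kl.wikipedia.org/"),  # already absolute
    ("https://www.example.com/page", "#section"),             # fragment-only link
]

for page_url, href in cases:
    print(href, "->", fix_link(href, page_url))
```

One thing to be aware of: a relative href replaces the last path segment of the page URL unless that URL ends with a slash, so fix_link("c", "https://www.example.com/a/b") gives https://www.example.com/a/c, while fix_link("c", "https://www.example.com/a/b/") gives https://www.example.com/a/b/c. This is how browsers resolve links too, and it differs from simply appending "/" + Link as in the question's code.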