Is there a requests or Selenium function that converts an href value into a full URL? For example:
clickLink("https://www.google.com","about")
would return https://www.google.com/about.
In other words, it should take an href as it appears in the page and resolve it to a regular absolute link, e.g.:
https://google.com  about  ->  https://google.com/about
//www.pastebin.com/  /  ->  https://www.pastebin.com/
and so on.
I tried to write one myself, but with no luck:
def fixLink(Link,LinkOriginalPage):
    '''Fixes link. ex. /f/d -> https://www.wtds.com/f/d
    LinkOriginalPage=page Link redirected from'''
    if Link.startswith("https://") or Link.startswith("http://"):
        return "debug1 " + Link # , and exit
    #fix 329 links crawled! - Latest link: https://www.wikipedia.com/https://kl.wikipedia.org/
    if Link.startswith("//"):
        Link="debug2 " + "https:"+Link
        # example, //www.pastebin.com/ -> http://www.pastebin.com/
        # print(Link)
        return Link # due to glitch
    # now link does not start with //
    # check if link is like a/b/c->site.com/a/b/c
    asciiLetters="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    linkStartsWithValidProtocol=not (Link.startswith("http://") or Link.startswith("https://"))
    linkDoesNotStartWithSlash=Link[0] in asciiLetters
    if linkStartsWithValidProtocol and linkDoesNotStartWithSlash:
        if LinkOriginalPage.endswith("/"):
            Link="debug3 " + LinkOriginalPage+Link
        else:
            Link="debug4 " + LinkOriginalPage+"/"+Link
        return Link
    # now link does not start with ascii letter
    # check if link is like /a/b/c
    if Link.startswith("/"):
        domainOfLink=getDomainFromLink(LinkOriginalPage)
        # print(domainOfLink)
        Link="debug 5|"+LinkOriginalPage+" http://"+domainOfLink+Link
        # print("startswith / "+Link)
        return Link # due to glitch
    # fix div links (widely used bad code practice)
    if Link.startswith("#"):
        #glitch, invalud url like *&YT -> invalud url schema
        #fix div
        domainOfLink=getDomainFromLink(LinkOriginalPage)
        Link="debug 6 "+domainOfLink+Link
        return Link
    # return the output if not returned (nvm)
    return "https://about.io"
Answer
You can use the urljoin function from urllib.parse. Here's an example:
from urllib.parse import urljoin

a = "http://www.example.com"
b = "index.html"
print(urljoin(a, b))  # Returns 'http://www.example.com/index.html'
PS. http://www.example.com/ actually exists.
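
To see how urljoin handles the cases from the question, here is a minimal sketch. The wrapper name fix_link and the page URL https://www.wtds.com/a/page.html are only assumptions made up for illustration; the only library call used is urllib.parse.urljoin.

```python
from urllib.parse import urljoin


def fix_link(link_original_page, link):
    """Resolve an href found on link_original_page into an absolute URL.

    fix_link is a hypothetical helper name; it only delegates to urljoin.
    """
    return urljoin(link_original_page, link)


# Hypothetical page URL, used only for this illustration.
page = "https://www.wtds.com/a/page.html"

# Relative href on a bare domain.
print(fix_link("https://google.com", "about"))

# Protocol-relative href: the scheme is taken from the page URL.
print(fix_link("https://google.com", "//www.pastebin.com/"))

# Root-relative href: the path of the page is replaced.
print(fix_link(page, "/f/d"))

# Fragment-only href: the fragment is appended to the page URL.
print(fix_link(page, "#top"))

# Already absolute href: returned unchanged.
print(fix_link("https://www.wikipedia.com/", "https://kl.wikipedia.org/"))
```

Because urljoin resolves an href against the page it came from, the separate startswith checks in the question should not be needed for these cases, and an href that is already absolute is not glued onto the page URL.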