
How do I write a function to fix an href link? [closed]

Is there a requests or Selenium function that converts an href value into a proper link? For example, a call like

clickLink("https://www.google.com","about")

would return a value like https://www.google.com/about.

In other words, it should take an href as written in the page and convert it to a regular absolute link, for example:

https://google.com + about -> https://google.com/about
//www.pastebin.com/ + / -> https://www.pastebin.com/

and so on.

I tried to write one myself, but with no luck:

def fixLink(Link,LinkOriginalPage):
    '''Fixes a link, e.g. /f/d -> https://www.wtds.com/f/d
    LinkOriginalPage is the page the link was found on.'''
    if Link.startswith("https://") or Link.startswith("http://"):
        return "debug1 " + Link # , and exit
        #fix 329 links crawled! - Latest link: https://www.wikipedia.com/https://kl.wikipedia.org/
    if Link.startswith("//"):
        Link="debug2 " + "https:"+Link # example, //www.pastebin.com/ -> https://www.pastebin.com/
        # print(Link)
        return Link # due to glitch
    # now link does not start with //
    # check if link is like a/b/c->site.com/a/b/c
    asciiLetters="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    linkHasNoProtocol=not (Link.startswith("http://") or Link.startswith("https://"))
    linkStartsWithLetter=Link[0] in asciiLetters
    if linkHasNoProtocol and linkStartsWithLetter:
        if LinkOriginalPage.endswith("/"):
            Link="debug3 " + LinkOriginalPage+Link
        else:
            Link="debug4 " + LinkOriginalPage+"/"+Link
        return Link
    # now link does not start with ascii letter
    # check if link is like /a/b/c
    if Link.startswith("/"):
        domainOfLink=getDomainFromLink(LinkOriginalPage)
        # print(domainOfLink)
        Link="debug 5|"+LinkOriginalPage+" http://"+domainOfLink+Link
        # print("startswith / "+Link)
        return Link # due to glitch
    # fix div links (widely used bad code practice)
    if Link.startswith("#"):
        # glitch: invalid URL like *&YT -> invalid URL schema
        #fix div
        domainOfLink=getDomainFromLink(LinkOriginalPage)
        Link="debug 6 "+domainOfLink+Link
        return Link
    # return the output if not returned (nvm)
    return "https://about.io"


Answer

You can use the urljoin function from urllib.parse, which resolves an href against the URL of the page it came from. Here's an example.

from urllib.parse import urljoin
a = "http://www.example.com"
b = "index.html"
print(urljoin(a,b))
# Prints http://www.example.com/index.html

PS. http://www.example.com/ actually exists.
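
To connect this back to the cases in the question, here is a minimal sketch of a wrapper around urljoin. The name fix_link and its argument order are assumptions chosen to mirror the question's fixLink; they are not part of urllib. Note that urljoin takes the page URL first and the href second.

```python
from urllib.parse import urljoin


def fix_link(link, original_page):
    """Resolve an href found on original_page into an absolute URL.

    fix_link is a hypothetical helper name, not part of urllib.
    """
    # urljoin expects the base (page) URL first, then the href.
    return urljoin(original_page, link)


page = "https://www.wtds.com/a/b"

# Plain relative href on a bare domain
print(fix_link("about", "https://www.google.com"))  # https://www.google.com/about
# Protocol-relative href: the scheme is taken from the page URL
print(fix_link("//www.pastebin.com/", page))        # https://www.pastebin.com/
# Root-relative href: the path is replaced, the domain is kept
print(fix_link("/f/d", page))                       # https://www.wtds.com/f/d
# Already absolute href: returned unchanged
print(fix_link("https://kl.wikipedia.org/", "https://www.wikipedia.com/"))  # https://kl.wikipedia.org/
# Fragment-only href: appended to the page URL
print(fix_link("#top", page))                       # https://www.wtds.com/a/b#top
# Relative href on a nested page: the last path segment is replaced
print(fix_link("c", page))                          # https://www.wtds.com/a/c
```

The last case differs from simply appending the href to the page URL: a relative href is resolved against the directory of the page, so "c" on /a/b becomes /a/c rather than /a/b/c, the same way a browser would follow it.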

User contributions licensed under: CC BY-SA