Is there a requests/Selenium function that converts an href value into a proper link? For example, a call like:
clickLink("https://www.google.com","about")
would return a value like https://www.google.com/about.
In other words, it should fix an href value and convert it to a regular, absolute link.
e.g.
https://google.com about https://google.com/about
//www.pastebin.com/ / https://www.pastebin.com/
etc.
I tried to write one myself, but with no luck:
def fixLink(Link,LinkOriginalPage):
    '''Fixes link. ex. /f/d -> https://www.wtds.com/f/d
    LinkOriginalPage=page Link redirected from'''
    if Link.startswith("https://") or Link.startswith("http://"):
        return "debug1 " + Link # , and exit
    #fix 329 links crawled! - Latest link: https://www.wikipedia.com/https://kl.wikipedia.org/
    if Link.startswith("//"):
        Link="debug2 " + "https:"+Link # example, //www.pastebin.com/ -> https://www.pastebin.com/
        # print(Link)
        return Link # due to glitch
    # now link does not start with //
    # check if link is like a/b/c->site.com/a/b/c
    asciiLetters="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    linkStartsWithValidProtocol=not (Link.startswith("http://") or Link.startswith("https://"))
    linkDoesNotStartWithSlash=Link[0] in asciiLetters
    if linkStartsWithValidProtocol and linkDoesNotStartWithSlash:
        if LinkOriginalPage.endswith("/"):
            Link="debug3 " + LinkOriginalPage+Link
        else:
            Link="debug4 " + LinkOriginalPage+"/"+Link
        return Link
    # now link does not start with ascii letter
    # check if link is like /a/b/c
    if Link.startswith("/"):
        domainOfLink=getDomainFromLink(LinkOriginalPage)
        # print(domainOfLink)
        Link="debug 5|"+LinkOriginalPage+" http://"+domainOfLink+Link
        # print("startswith / "+Link)
        return Link # due to glitch
    # fix div links (widely used bad code practice)
    if Link.startswith("#"):
        #glitch, invalid url like *&YT -> invalid url schema
        #fix div
        domainOfLink=getDomainFromLink(LinkOriginalPage)
        Link="debug 6 "+domainOfLink+Link
        return Link
    # return the output if not returned (nvm)
    return "https://about.io"
Answer
You can use the urljoin function from urllib.parse. Here's an example:
from urllib.parse import urljoin
a = "http://www.example.com"
b = "index.html"
print(urljoin(a,b))
# Prints 'http://www.example.com/index.html'
PS. http://www.example.com/ actually exists.
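To connect this back to the cases in the question, here is a minimal sketch of how urljoin could replace the hand-written fixLink. The wrapper name fix_link and the page URLs passed to it are made up for illustration; only urljoin itself comes from the standard library.

```python
from urllib.parse import urljoin


def fix_link(page_url: str, href: str) -> str:
    """Resolve an href value against the URL of the page it was found on.

    fix_link is a hypothetical helper name; all the work is done by urljoin.
    """
    return urljoin(page_url, href)


if __name__ == "__main__":
    # (page URL, href value) pairs modelled on the cases in the question;
    # the page URLs are illustrative assumptions.
    examples = [
        ("https://www.google.com", "about"),                        # relative path
        ("https://www.google.com", "//www.pastebin.com/"),          # protocol-relative
        ("https://www.wtds.com/a/b", "/f/d"),                       # root-relative
        ("https://www.wikipedia.com/", "https://kl.wikipedia.org/"),  # already absolute
        ("https://www.example.com/page", "#top"),                   # fragment only
    ]
    for page_url, href in examples:
        print(href, "->", fix_link(page_url, href))
```

One difference from the fixLink attempt is worth knowing: a relative href is resolved against the directory of the page, not appended to the full page URL. For instance, joining "c" onto https://www.example.com/a/b gives https://www.example.com/a/c, which is also how browsers resolve such links.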