This is the regex code:
without_header = re.findall('/sports/[a-z0-9/.-:]*[0-9.]+cms', without_header_url)
It returns every URL that doesn’t have the https prefix in front. For example:
/sports/cricket/ipl/top-stories/kxip-vs-csk-shane-watson-faf-du-plessis-infuse-life-into-csks-ipl-campaign-shape-confidence-boosting-win-over-kxip/articleshow/78481088.cms
/sports/football/epl/top-stories/epl-manchester-united-humiliated-as-mourinhos-spurs-win-6-1-at-old-trafford/articleshow/78481304.cms
I want to prepend “https://example.com” to each of these. I don’t want a for loop; is there an efficient way of doing it using re.sub?
Answer
You may use this regex in re.sub:
(?<!:/)(/sports/[a-z0-9/.:-]*[0-9.]+cms)
Code:
s = re.sub(r'(?<!:/)(/sports/[a-z0-9/.:-]*[0-9.]+cms)', r'https://example.com\1', s)
RegEx Details:
(?<!:/) : Negative lookbehind to assert that we don’t have :/ at the previous position
(/sports/[a-z0-9/.:-]*[0-9.]+cms) : Match your text and capture it in group #1
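A minimal, self-contained sketch of the substitution above. The helper name add_host and the shortened article slugs are placeholders made up for illustration; the pattern is the one from the answer. A function is used as the replacement here, which is an alternative to the \1 backreference and avoids any escaping issues if the host string itself ever contains a backslash.

```python
import re

# Same pattern as in the answer: optional guard against ":/" just before
# the path, then the "/sports/....cms" path captured in group 1.
PATTERN = re.compile(r'(?<!:/)(/sports/[a-z0-9/.:-]*[0-9.]+cms)')


def add_host(text, host="https://example.com"):
    """Prefix every matched relative path in `text` with `host`.

    `add_host` is a hypothetical helper name, not part of the original answer.
    """
    # m.group(1) is the captured relative path; the host is added literally.
    return PATTERN.sub(lambda m: host + m.group(1), text)


# Shortened, made-up slugs standing in for the real article paths.
sample = (
    "/sports/cricket/ipl/top-stories/some-article/articleshow/78481088.cms "
    "/sports/football/epl/top-stories/another-article/articleshow/78481304.cms"
)

result = add_host(sample)
print(result)
```

Because the lookbehind only inspects the two characters immediately before /sports/, this is best applied to text that contains only relative paths, such as the output of the findall call in the question.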