Hi, I’m struggling to understand why my regex isn’t working.
I have URLs that contain DOIs, like so:
https://link.springer.com/10.1007/s00737-021-01116-5
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435
https://journals.sagepub.com/doi/pdf/10.1177/1062860613484171
https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural+Resources+Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3
https://dx.doi.org/10.1108/14664100110397304?nols=y
https://onlinelibrary.wiley.com/doi/10.1111/jocn.15833
https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true
I’m using, for example, this regex, but it always returns an empty list:
print(re.findall(r'/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i', 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
Where have I gone wrong?
Answer
It looks like you come from another programming language that has the notion of regex literals, which are delimited with forward slashes and have the modifiers following the closing slash (hence the trailing /i).
In Python there is no such thing, and these slashes and modifier(s) are taken as literal characters. For a modifier like i, use the optional flags parameter of re.findall instead (here, flags=re.I).
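To make the point about delimiters and flags concrete, here is a small sketch. The sample string "DX.DOI.ORG" and the short pattern matched against it are made up for illustration only; they are not part of the question.

```python
import re

url = "https://dx.doi.org/10.1108/02652320410549638?nols=y"

# JavaScript-style literal: in Python the "/" delimiters and the trailing "i"
# are simply part of the pattern, so nothing matches.
js_style = r'/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i'
print(re.findall(js_style, url))

# Case-insensitivity is requested through the flags argument instead.
# (Illustrative pattern and input, not taken from the question.)
print(re.findall(r'[a-z]+\.org', "DX.DOI.ORG"))               # no flag
print(re.findall(r'[a-z]+\.org', "DX.DOI.ORG", flags=re.I))   # with re.I
```

The first call returns an empty list, and only the last call finds a match.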
Secondly, ^ matches the start of the input string, but the URLs you have as input do not start with 10, so that anchor has to go. Instead, you could require that the 10 follow a word boundary (\b), i.e. that it is not preceded by an alphanumeric character or an underscore.
Similarly, $ matches the end of the input string, but you have URLs that continue with URL parameters, like ?nols=y, so again the part you are interested in does not run to the end of the input. So that has to go too.
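The effect of the two anchors can be checked in isolation. In this sketch the slashes are already removed, the dot is written as \. so that only the anchors differ between the two patterns, and the variable names are just for illustration.

```python
import re

url = "https://dx.doi.org/10.1108/02652320410549638?nols=y"

# Anchored at both ends: the URL neither starts with "10" nor ends with the DOI.
anchored = re.findall(r'^10\.\d{4,9}/[-._;()/:A-Z0-9]+$', url, flags=re.I)

# \b instead of ^, and no $: the match may start and stop inside the string.
unanchored = re.findall(r'\b10\.\d{4,9}/[-._;()/:A-Z0-9]+', url, flags=re.I)

print(anchored)    # []
print(unanchored)  # ['10.1108/02652320410549638']
```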
The dot has a special meaning in regex (it matches any character), but you clearly intended to match a literal dot, so it should be escaped as \.
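A quick way to see the difference, using a made-up input string (an assumption for illustration) in which one candidate has an x where the dot should be:

```python
import re

# Hypothetical input: the first token is not a DOI prefix, the second one is.
text = "10x1108/abc 10.1108/abc"

print(re.findall(r'10.\d{4}/\w+', text))   # ['10x1108/abc', '10.1108/abc']
print(re.findall(r'10\.\d{4}/\w+', text))  # ['10.1108/abc']
```

With the unescaped dot, the x is accepted as well; with the escaped dot, only the real prefix matches.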
Finally, alphanumeric characters can be matched with \w, which covers both lowercase and uppercase Latin letters as well as digits and the underscore, so you can shorten the character class a bit and do without any flags such as i (re.I).
This leaves us with:
print(re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+', 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
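As a rough check, the same pattern can be run over several of the URLs from the question. This is only a sketch: the names DOI_PATTERN, urls and dois are chosen here for illustration, and the openurl link contains no DOI, so it yields None.

```python
import re

# Pattern from the answer: word boundary, "10.", 4-9 digits, "/", then the suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-.;()/:\w]+')

urls = [
    "https://link.springer.com/10.1007/s00737-021-01116-5",
    "https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435",
    "https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural+Resources+Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3",
    "https://dx.doi.org/10.1108/14664100110397304?nols=y",
    "https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true",
]

dois = []
for url in urls:
    match = DOI_PATTERN.search(url)
    # Keep None for URLs without a DOI so the results line up with the inputs.
    dois.append(match.group(0) if match else None)

for url, doi in zip(urls, dois):
    print(doi, "<-", url)
```

Using search per URL, rather than findall on one long string, keeps each result paired with its source URL and makes the URLs without a DOI easy to spot.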