How to remove urls between texts in pandas dataframe rows?

Question

I am trying to solve an NLP problem. The text column of my dataframe has many rows containing URLs like http.somethingsomething. Some of the URLs have no space between them and the surrounding text, for example ':http:\something', ';http:\something', ',http:\something'. So sometimes there is a ',' directly before the URL with no space, and sometimes another character, but mostly one of ',', '.', ':' or ';'. and url either at

Accepted Answer

A simple approach would be to just remove any URL starting with http or https:

    df["text"] = df["text"].str.replace(r'\s*https?://\S+(\s+|$)', ' ', regex=True).str.strip()

There is some subtle logic in the above line of code which merits some explanation. The pattern matches a URL with optional whitespace on the left and mandatory whitespace on the right (except when the URL runs to the end of the string). Each match is replaced with a single space, and strip() removes any whitespace this would leave dangling at the start or end. Note that regex=True is needed in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.
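The sample rows in the question write URLs as http:\something rather than http://something, and often glue a punctuation mark to the front, so a pattern that requires :// would not match them. Here is a minimal sketch of a looser variant, using only Python's re module so it can be checked on plain strings. The helper name strip_urls and the choice to drop one leading ',', '.', ':' or ';' along with the URL are assumptions for illustration, not part of the answer above.

```python
import re

# Assumption: a URL starts with "http:" or "https:" and runs to the next
# whitespace. Matching only the scheme and colon covers both "http://..."
# and the "http:\..." form seen in the sample rows.
# Assumption: one punctuation mark glued to the front of the URL (, . : ;)
# is dropped together with it.
URL_PATTERN = re.compile(r"\s*[,.:;]?https?:\S+(?:\s+|$)")


def strip_urls(text: str) -> str:
    """Replace each URL (and the whitespace around it) with one space, then trim."""
    return URL_PATTERN.sub(" ", text).strip()


samples = [
    r"we always try to bring the heavy metal rt http:\something11",
    r"inec office in abia set ablaze :http:\itsjustaurl",
    r".http:\something11 we always try to bring the heavy metal rt",
    "see https://example.com/page for details",
]
cleaned = [strip_urls(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

On a dataframe, the same helper could be applied with df["text"].apply(strip_urls). Alternatively, the pattern string could be passed to .str.replace(..., regex=True) followed by .str.strip(), as in the one-liner above.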

id	text	target
1	we always try to bring the heavy metal rt `http:\something11`	1
4	on plus side look at the sky last night it was ablaze `;http:\somethingdifferent`	1
6	inec office in abia set ablaze `:http:\itsjustaurl`	1
3	`.http:\something11` we always try to bring the heavy metal rt	1

Answer