
Extract full text from different tags and outside them

I want to extract all text from README files that I have already scraped from GitHub. Some of the text sits between HTML tags, but a lot of it is outside (between) tags. The tags vary because these are different READMEs and the authors do not follow any particular rules. I want to extract the text inside tags as well as everything outside any tag.
The example:

JavaScript

And I want to extract all of the text, like so:

JavaScript

and so on… I've tried BeautifulSoup with get_text() or just soup.text

JavaScript

but it doesn't work, and I get only the text within tags:

JavaScript


Answer

Instead of .text you could use .get_text() with the parameters separator and strip, and also replace the <!-- and --> markers to get the text of the comments, if needed:

JavaScript

Example

JavaScript

Output

JavaScript
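
As a minimal sketch of the approach described in this answer (not the answer's original code): the README fragment, the function name extract_text and the keep_comments flag below are made up for illustration. It assumes the bs4 package is installed and uses the built-in html.parser backend. By default get_text() skips HTML comments, which is why the comment delimiters are removed before parsing when the comment text is wanted.

```python
from bs4 import BeautifulSoup

# Hypothetical README fragment: text inside tags, text outside any tag,
# and text hidden in an HTML comment.
readme_html = """<h1>Project</h1>
Plain text outside any tag.
<p>Text inside a paragraph.</p>
<!-- Note kept in a comment -->
Trailing text."""


def extract_text(html, keep_comments=False):
    """Return all text of an HTML fragment as one space-separated string."""
    if keep_comments:
        # Drop the comment delimiters so the parser treats the
        # comment body as ordinary text.
        html = html.replace("<!--", "").replace("-->", "")
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " keeps neighbouring strings from running together;
    # strip=True trims each string and skips whitespace-only ones.
    return soup.get_text(separator=" ", strip=True)


without_comments = extract_text(readme_html)
with_comments = extract_text(readme_html, keep_comments=True)
print(without_comments)
print(with_comments)
```

Both calls return the text inside and outside the tags; only the second one also contains the comment text. Note that the plain string replace is crude: it also strips delimiters that appear inside code samples or scripts, so it may need adjusting for real READMEs.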
User contributions licensed under: CC BY-SA