I am using the following code to extract text from a web page:

    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request

    def tag_visible(element):
        # Skip text that lives inside non-visible tags
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        # Skip HTML comments
        if isinstance(element, Comment):
            return False
        return True

    def text_from_html(body):
        soup = BeautifulSoup(body, 'html.parser')
        texts = soup.findAll(text=True)
        visible_texts = filter(tag_visible, texts)
        return u" ".join(t.strip() for t in visible_texts)
        #return ' '.join(texts)

    html = urllib.request.urlopen('https://ordoabchao.ca/volume-one/babylon').read()
    text = text_from_html(html)

The problem is that when I open text, I get all the links from the buttons at the top of the page, which I don't want. How can I modify the code above to exclude them?
I also get the footnotes, which I may want, but as a separate text. Is there a way to separate the footnotes from the main text?
Thanks
Answer
If you want to extract all the text, you can use the get_text() method:

    from bs4 import BeautifulSoup
    import requests
    url = 'https://ordoabchao.ca/volume-one/babylon'
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'lxml')

    for p in soup.select('.sqs-block-content p'):
        print(p.get_text(strip=True))

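The question also asks about keeping the footnotes apart from the main text. Here is a minimal sketch of one way to do that, assuming the footnotes sit inside a container you can identify by class. The `footnotes` class name, the `split_text` helper, and the sample markup are all hypothetical; inspect the real page to find the actual container. Because only paragraphs inside `.sqs-block-content` are selected, the navigation links are skipped as well.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page's structure may differ.
SAMPLE_HTML = """
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <div class="sqs-block-content">
    <p>Main paragraph one.</p>
    <p>Main paragraph two.</p>
  </div>
  <div class="sqs-block-content footnotes">
    <p>[1] A footnote.</p>
  </div>
</body></html>
"""

def split_text(html, content_selector=".sqs-block-content p",
               footnote_class="footnotes"):
    """Return (main_paragraphs, footnote_paragraphs) as two lists of strings.

    footnote_class is an assumed class name, not taken from the real page.
    """
    soup = BeautifulSoup(html, "html.parser")
    main, notes = [], []
    for p in soup.select(content_selector):
        text = p.get_text(strip=True)
        if not text:
            continue  # ignore empty paragraphs
        # A paragraph counts as a footnote if any ancestor carries the class
        if p.find_parent(class_=footnote_class) is not None:
            notes.append(text)
        else:
            main.append(text)
    return main, notes

main, notes = split_text(SAMPLE_HTML)
```

With the real page you would pass `res.text` instead of `SAMPLE_HTML` and adjust `footnote_class` to whatever the site actually uses.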
To save the text to a file, you can use a pandas DataFrame:

    import pandas as pd

    lst = []
    for p in soup.select('.sqs-block-content p'):
        txt = p.get_text(strip=True)
        lst.append({'Text': txt})

    # Write tab-separated output; to_csv returns None when given a path,
    # so there is nothing useful to assign
    pd.DataFrame(lst).to_csv('out.txt', sep='\t', index=False)

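If you only need a plain text file with one paragraph per line, pandas is not strictly necessary. Below is a small sketch using only the standard library; `save_paragraphs` is a hypothetical helper name, and the list of strings is assumed to come from the `get_text()` loop above.

```python
from pathlib import Path

def save_paragraphs(paragraphs, path):
    """Write one paragraph per line, skipping empty strings.

    Returns the number of lines written.
    """
    lines = [p for p in paragraphs if p]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

For example, `save_paragraphs([p.get_text(strip=True) for p in soup.select('.sqs-block-content p')], 'out.txt')` would write the same paragraphs without a header row.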