I am training on how to scrape some data in python and here’s my try:
import requests from bs4 import BeautifulSoup url = 'https://learndataanalysis.org/python-tutorial/page/10' r = requests.get(url) soup = BeautifulSoup(r.content, 'lxml') links = [i['href'] for i in soup.select('h2.entry-title a')] print(links)
The code gets the links of the webpage. I can use this line to get the titles of each tutorial:
[i.text for i in soup.select('h2.entry-title a')]
How can I make a list of both the links and the titles and finally to export the results to excel file?
Simply I need a column for the titles of the articles and the other column for the link of each article.
Advertisement
Answer
You can actually do it with a single list comprehension.
Basically, what you have is the right approach, you just need to create a list of lists using your list comprehension.
For each match returned by soup.select
, you can extract both the text
and href
together.
Then, using the csv
module, you can pass this list of lists to csv.writerows
to create the CSV file for viewing in Excel or other tools, data processing, etc.
You can also optionally prepend a header to the list of lists, if you want, e.g. ['Title', 'URL']
.
Here is a full working example:
from bs4 import BeautifulSoup import csv import requests url = 'https://learndataanalysis.org/python-tutorial/page/10' r = requests.get(url) soup = BeautifulSoup(r.content, 'lxml') data = [[i.text, i['href']] for i in soup.select('h2.entry-title a')] # optional, if you want to add a header line data.insert(0, ['Title', 'URL']) with open('output_data.csv', 'w') as output_file: writer = csv.writer(output_file, delimiter=',', quoting=csv.QUOTE_ALL) writer.writerows(data)
Note that csv.QUOTE_ALL
isn’t strictly necessary, but its often a good idea to force quoting on all fields.
If you instead want to export to an XLSX format, its best to use the pandas
module instead:
import pandas as pd df = pd.DataFrame(data, columns=['Title', 'URL']) df.to_excel('output_data.xlsx')
This will by default also export the row numbers. If you prefer to omit them, you can use the pandas.ExcelWriter
class, as in this post.
Edit:
If you want also want to extract the dates, then you can do so with a separate list comprehension (since the date information is in a different HTML element altogether).
Then, you can use zip
to combine the information together.
data = [[i.text, i['href']] for i in soup.select('h2.entry-title a')] dates = [i.text for i in soup.select('span.published')] data = [i + [j] for i, j in zip(data, dates)]