Export results to excel file title and link requests python [closed]

Tags: , , , ,



I am training on how to scrape some data in python and here’s my try:

import requests
from bs4 import BeautifulSoup

url = 'https://learndataanalysis.org/python-tutorial/page/10'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
links = [i['href'] for i in soup.select('h2.entry-title a')]
print(links)

The code gets the links of the webpage. I can use this line to get the titles of each tutorial:

[i.text for i in soup.select('h2.entry-title a')]

How can I make a list of both the links and the titles and finally to export the results to excel file?

Simply I need a column for the titles of the articles and the other column for the link of each article.

Answer

You can actually do it with a single list comprehension.

Basically, what you have is the right approach, you just need to create a list of lists using your list comprehension.

For each match returned by soup.select, you can extract both the text and href together.

Then, using the csv module, you can pass this list of lists to csv.writerows to create the CSV file for viewing in Excel or other tools, data processing, etc.

You can also optionally prepend a header to the list of lists, if you want, e.g. ['Title', 'URL'].

Here is a full working example:

from bs4 import BeautifulSoup

import csv
import requests

url = 'https://learndataanalysis.org/python-tutorial/page/10'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

data = [[i.text, i['href']] for i in soup.select('h2.entry-title a')]

# optional, if you want to add a header line
data.insert(0, ['Title', 'URL'])

with open('output_data.csv', 'w') as output_file:
    writer = csv.writer(output_file, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerows(data)

Note that csv.QUOTE_ALL isn’t strictly necessary, but its often a good idea to force quoting on all fields.


If you instead want to export to an XLSX format, its best to use the pandas module instead:

import pandas as pd
df = pd.DataFrame(data, columns=['Title', 'URL'])
df.to_excel('output_data.xlsx')

This will by default also export the row numbers. If you prefer to omit them, you can use the pandas.ExcelWriter class, as in this post.


Edit:

If you want also want to extract the dates, then you can do so with a separate list comprehension (since the date information is in a different HTML element altogether).

Then, you can use zip to combine the information together.

data = [[i.text, i['href']] for i in soup.select('h2.entry-title a')]
dates = [i.text for i in soup.select('span.published')]
data = [i + [j] for i, j in zip(data, dates)]


Source: stackoverflow