Skip to content
Advertisement

Yielding values from consecutive parallel parse functions via meta in Scrapy

In my scrapy code I’m trying to yield the following figures from parliament’s website where all the members of parliament (MPs) are listed. Opening the links for each MP, I’m making parallel requests to get the figures I’m trying to count. I’m intending to yield each three figures below in the company of the name and the party of the MP

Here are the figures I’m trying to scrape

  1. How many bill proposals that each MP has their signature on
  2. How many question proposals that each MP has their signature on
  3. How many times that each MP spoke on the parliament

In order to count and yield out how many bills has each member of parliament has their signature on, I’m trying to write a scraper on the members of parliament which works with 3 layers:

  • Starting with the link where all MPs are listed
  • From (1) accessing the individual page of each MP where the three information defined above is displayed
  • 3a) Requesting the page with bill proposals and counting the number of them by len function 3b) Requesting the page with question proposals and counting the number of them by len function 3c) Requesting the page with speeches and counting the number of them by len function

What I want: I want to yield the inquiries of 3a,3b,3c with the name and the party of the MP in the same raw

  • Problem 1) When I get an output to csv it only creates fields of speech count, name, part. It doesn’t show me the fields of bill proposals and question proposals

  • Problem 2) There are two empty values for each MP, which I guess corresponds to the values I described above at Problem1

  • Problem 3) What is the better way of restructuring my code to output the three values in the same line, rather than printing each MP three times for each value that I’m scraping

enter image description here

from scrapy import Spider
from scrapy.http import Request

import logging


class MvSpider(Spider):
    name = 'mv2'
    allowed_domains = ['tbmm.gov.tr']
    start_urls = ['https://www.tbmm.gov.tr/Milletvekilleri/liste']

    def parse(self, response):
        mv_list =  mv_list = response.xpath("//ul[@class='list-group list-group-flush']") #taking all MPs listed

        for mv in mv_list:
            name = mv.xpath("./li/div/div/a/text()").get() # MP's name taken
            party = mv.xpath("./li/div/div[@class='col-md-4 text-right']/text()").get().strip() #MP's party name taken
            partial_link = mv.xpath('.//div[@class="col-md-8"]/a/@href').get()
            full_link = response.urljoin(partial_link)

            yield Request(full_link, callback = self.mv_analysis, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })


    def mv_analysis(self, response):
        name = response.meta.get('name')
        party = response.meta.get('party')

        billprop_link_path = response.xpath(".//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
        billprop_link = response.urljoin(billprop_link_path)

        questionprop_link_path = response.xpath(".//a[contains(text(),'Sahibi Olduğu Yazılı Soru Önergeleri')]/@href").get()
        questionprop_link = response.urljoin(questionprop_link_path)

        speech_link_path = response.xpath(".//a[contains(text(),'Genel Kurul Konuşmaları')]/@href").get()
        speech_link = response.urljoin(speech_link_path)

        yield Request(billprop_link, callback = self.bill_prop_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })  #number of bill proposals to be requested

        yield Request(questionprop_link, callback = self.quest_prop_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        }) #number of question propoesals to be requested


        yield Request(speech_link, callback = self.speech_counter, meta = {
                                                                            'name': name,
                                                                            'party': party
                                                                        })  #number of speeches to be requested




# COUNTING FUNCTIONS


    def bill_prop_counter(self,response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        billproposals = response.xpath("//tr[@valign='TOP']")

        yield  { 'bill_prop_count': len(billproposals),
                'name': name,
                'party': party}

    def quest_prop_counter(self, response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        researchproposals = response.xpath("//tr[@valign='TOP']")

        yield {'res_prop_count': len(researchproposals),
               'name': name,
               'party': party}

    def speech_counter(self, response):

        name = response.meta.get('name')
        party = response.meta.get('party')

        speeches = response.xpath("//tr[@valign='TOP']")

        yield { 'speech_count' : len(speeches),
               'name': name,
               'party': party}

Advertisement

Answer

This is happening because you are yielding dicts instead of item objects, so spider engine will not have a guide of fields you want to have as default.

In order to make the csv output fields bill_prop_count and res_prop_count, you should make the following changes in your code:

1 – Create a base item object with all desirable fields – you can create this in the items.py file of your scrapy project:

from scrapy import Item, Field


class MvItem(Item):
    name = Field()
    party = Field()
    bill_prop_count = Field()
    res_prop_count = Field()
    speech_count = Field()

2 – Import the item object created to the spider code & yield items populated with the dict, instead of single dicts:

from your_project.items import MvItem

...

# COUNTING FUNCTIONS
def bill_prop_counter(self,response):
    name = response.meta.get('name')
    party = response.meta.get('party')

    billproposals = response.xpath("//tr[@valign='TOP']")

    yield MvItem(**{ 'bill_prop_count': len(billproposals),
            'name': name,
            'party': party})

def quest_prop_counter(self, response):
    name = response.meta.get('name')
    party = response.meta.get('party')

    researchproposals = response.xpath("//tr[@valign='TOP']")

    yield MvItem(**{'res_prop_count': len(researchproposals),
           'name': name,
           'party': party})

def speech_counter(self, response):
    name = response.meta.get('name')
    party = response.meta.get('party')

    speeches = response.xpath("//tr[@valign='TOP']")

    yield MvItem(**{ 'speech_count' : len(speeches),
           'name': name,
           'party': party})

The output csv will have all possible columns for the item:

bill_prop_count,name,party,res_prop_count,speech_count
,Abdullah DOĞRU,AK Parti,,11
,Mehmet Şükrü ERDİNÇ,AK Parti,,3
,Muharrem VARLI,MHP,,13
,Muharrem VARLI,MHP,0,
,Jülide SARIEROĞLU,AK Parti,,3
,İbrahim Halil FIRAT,AK Parti,,7
20,Burhanettin BULUT,CHP,,
,Ünal DEMİRTAŞ,CHP,,22
...

Now if you want to have all the three counts in the same row, you’ll have to change the design of your spider. Possibly one counting function at the time passing the item in the meta attribute.

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement