For my thesis, I'm using Scrapy as a crawler for several Online Social Networks. It provides an extensive pipeline for scraping, processing, and storing pieces of information (or Items).
I found myself using the Item pipeline in a way it wasn't designed for: I require multiple requests to different pages to complete an item (e.g., to crawl a user profile, I also need to crawl its posts, friends, photos, et cetera). This seems to be a common request.
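To make the snippets below a bit more concrete, here is a rough sketch of the kind of item and spider they assume. The spider class, the start URL, and the field names are placeholders of mine (only MyCoolItem and the two fields used in the loaders appear in the snippets), assuming a reasonably recent Scrapy:

~~~ {.sourceCode .python}
import scrapy
from scrapy import Request
from scrapy.loader import ItemLoader

class MyCoolItem(scrapy.Item):
    # Placeholder fields; a real profile item would also carry
    # posts, friends, photos, and so on.
    some_paragraphs = scrapy.Field()
    other_stuff = scrapy.Field()

class ProfileSpider(scrapy.Spider):
    name = 'profiles'
    start_urls = ['http://example.org/foo']

    # The callback methods shown below (myFirstRequest, callnext,
    # load_first, ...) would live on this spider class.
~~~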
Naïve Approach
A naïve approach is to chain requests and callbacks. You can simply pass the item (or, in my case, an ItemLoader) around in the request.meta dict:
~~~ {.sourceCode .python}
def myFirstRequest(self, response):
    l = ItemLoader(item=MyCoolItem(), response=response)
    l.add_css('some_paragraphs', '.content > p')

    # ...

    yield Request('http://example.org/foo/bar', meta={'loader': l}, callback=self.secondRequest)
~~~
In the second callback, you can then recover the ItemLoader from the response's meta attribute:
~~~ {.sourceCode .python}
def secondRequest(self, response):
    # Recover the ItemLoader
    l = response.meta['loader']

    l.add_css('other_stuff', '.foobar')

    # Complete the loader, yielding the completed item
    yield l.load_item()
~~~
You can repeat these steps as often as you like, with one major consequence: the requests now run strictly one after another, and retrieving the item depends on every step succeeding. If one request fails (for whatever reason), the whole item is lost.
Since in my case some of the requests may be unavailable (i.e., they may yield 40x responses depending on the profile data), this solution fails often, and some of my partially extracted items are lost.
Workaround: Request Stack
My current workaround is as follows: pass the list of subsequent requests around, and provide each request with an errback to catch erroneous responses and continue execution.
The solution contains a new instance method, which performs two tasks:
- Call the next request as long as the stack isn't empty.
- Yield the item once the stack is exhausted.
The method is used as both the callback and the errback of each request. The other part is a callstack object, which contains the request URLs and the actual processing callbacks. It is passed to all requests via the meta attribute.
~~~ {.sourceCode .python}
def callnext(self, response):
    ''' Call the next target for the item loader, or yield the item if completed. '''

    # Get the meta object from the request, as the response
    # does not contain it.
    meta = response.request.meta

    # Items remaining in the stack? Execute them
    if len(meta['callstack']) > 0:
        target = meta['callstack'].pop(0)
        yield Request(target['url'], meta=meta, callback=target['callback'], errback=self.callnext)
    else:
        yield meta['loader'].load_item()

def load_first(self, response):
    # Recover item(loader)
    l = response.meta['loader']

    # Use just as before
    l.add_css(...)

    # Build the call stack
    callstack = [
        {'url': '<some_url>',
         'callback': self.load_second},
        {'url': '<some_url>',
         'callback': self.load_third}
    ]

    # Store the stack in the request meta, so that callnext can pick it up
    response.meta['callstack'] = callstack

    return self.callnext(response)

def load_second(self, response):
    # Recover item(loader)
    l = response.meta['loader']

    # Use just as before
    l.add_css(...)

    return self.callnext(response)

def load_third(self, response):
    # ...

    return self.callnext(response)
~~~
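The snippets above don't show how the chain gets started, so here is one way it could look, as a sketch with placeholder URLs and selectors of my own: the loader and an initially empty call stack go into the first request's meta, so that callnext (used as errback) can still salvage a partial item even if the very first follow-up request fails.

~~~ {.sourceCode .python}
def parse(self, response):
    # Build the loader from the first page, just like in the naïve approach
    l = ItemLoader(item=MyCoolItem(), response=response)
    l.add_css('some_paragraphs', '.content > p')

    # Placeholder URL for the next page of the profile. The call stack
    # starts out empty, so if even this request fails, callnext (as the
    # errback) still yields the partially filled item.
    yield Request('http://example.org/foo/bar',
                  meta={'loader': l, 'callstack': []},
                  callback=self.load_first,
                  errback=self.callnext)
~~~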
Note: This solution still runs the requests strictly one after another, in the order of the call stack.
If you have a better way of running these requests, please let me know! =)