
Scrapy




Yesterday, during another boring phone call, I googled for "fun python packages" and bumped into this nice article: "20 Python libraries you can't live without". While I already knew many of the packages mentioned there, one caught my interest: Scrapy.

Scrapy seems to be an elegant way not only to parse web pages but also to crawl from page to page, mainly on sites which have some sort of 'Next' or 'Older posts' link you want to click through, e.g. to retrieve all pages of a blog. I installed Scrapy on Windows and ran into one import error, so, as mentioned in the FAQ and elsewhere, I had to install pypiwin32 manually:

Code:
pip install pypiwin32


Based on the example on the Scrapy home page, I wrote a little script to retrieve the titles and URLs from my German blog "Axel Unterwegs". I then enhanced it to write them into a table-of-contents style HTML file, after figuring out how to override the initialisation and closing methods of my spider class.
Code:

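import scrapy

# Run with: scrapy runspider <this_file>.py (Python 3)

header = """
<html><head>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head><body>
"""
footer = """
</body></html>
"""

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://axelunterwegs.blogspot.co.uk/']

    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        # Open the TOC file once per crawl and write the HTML preamble.
        self.file = open('blogspider.html', 'w', encoding='utf-8')
        self.file.write(header)

    def parse(self, response):
        # Each post title sits in an <h3 class="post-title"> wrapping a link.
        for title in response.css('h3.post-title'):
            t = title.css('a ::text').extract_first()
            url = title.css('a ::attr(href)').extract_first()
            if t is None or url is None:
                continue  # skip headings that do not wrap a link
            self.file.write("<a target=\"_NEW_\" href=\"%s\">%s</a>\n<br/>" % (url, t))
            yield {'title': t, 'url': url}

        # Follow the 'Older posts' link, if any, and parse that page the same way.
        for next_page in response.css('a.blog-pager-older-link'):
            yield response.follow(next_page, self.parse)

    def closed(self, reason):
        # Scrapy calls closed() automatically when the spider finishes;
        # a method named spider_closed() only runs if connected to the signal.
        self.file.write(footer)
        self.file.close()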

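One thing the spider does not handle is post titles or URLs that contain characters such as & or quotes, which would end up unescaped in the generated HTML. Below is a minimal sketch of how the TOC lines could be built with escaping, using only the standard library. The names render_link and render_toc are made up for this illustration (they are not part of Scrapy), and the sample entry is invented rather than taken from the blog.

```python
from html import escape

HEADER = """<html><head>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head><body>
"""
FOOTER = "</body></html>\n"


def render_link(title, url):
    """Return one TOC line with title and URL escaped for HTML output."""
    # quote=True also escapes double quotes, so the href attribute stays intact.
    return '<a target="_NEW_" href="%s">%s</a>\n<br/>' % (
        escape(url, quote=True),
        escape(title or "(untitled)"),  # hypothetical fallback for a missing title
    )


def render_toc(entries):
    """Build the complete HTML page from an iterable of (title, url) pairs."""
    return HEADER + "".join(render_link(t, u) for t, u in entries) + FOOTER


if __name__ == "__main__":
    # Invented sample data, not real posts from the blog.
    sample = [("Berge & Täler", "http://example.com/?a=1&b=2")]
    print(render_toc(sample))
```

Inside the spider, parse() could then write render_link(t, url) to the file instead of formatting the string by hand.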

So, here is the TOC of my German blog. I tried to do the same for my English blog on Wordpress but have been struggling so far. One challenge is that the modern Wordpress UI no longer has any 'Older posts' type of button; older postings are loaded as soon as you scroll down. The parsing doesn't work yet either, but maybe I'll figure it out some time later.






