
Scrapy Tutorial: The Official Example

아르비스 2020. 7. 13. 17:05

There are Scrapy examples out there for Naver and Daum, but for some reason they didn't work properly..

so I decided to go back and start from the official Tutorial example.

 

The official example is code that walks the links on http://quotes.toscrape.com and scrapes each quote's text and author.

 

1) Create a project

> scrapy startproject tutorial




New Scrapy project 'tutorial', using template directory 'd:\utils\python\python38\lib\site-packages\scrapy\templates\project', created in:
    D:\db\Scrapy\tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
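
For orientation, the command above generates the standard Scrapy 2.x project skeleton, roughly like this (comments mark where the later steps happen):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (step 3)
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # spiders live here (steps 2 and 4)
            __init__.py
```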

 

2) Generate a spider

  Run from inside the project (tutorial) folder

\tutorial> scrapy genspider dmoz 'dmoz.org'

Created spider 'dmoz' using template 'basic' in module:
  tutorial.spiders.dmoz

 

3) Define the items to scrape

  Define the items to collect in items.py (this example defines title, link, and desc).

import scrapy

...

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

A separate DmozItem class is defined for the scraped fields.

 

 

4) Edit the spider

 Edit the spider file auto-generated by genspider (spiders/dmoz.py).

# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ["https://dmoztools.net/Computers/Programming/Languages/Python/Books/",
                  "https://dmoztools.net/Computers/Programming/Languages/Python/Resources/"]

    def parse(self, response):
        for sel in response.xpath('//*[@class="title-and-desc"]'):
            title = sel.xpath('a/div[@class="site-title"]/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('div[@class="site-descr "]/text()').extract()

            item = DmozItem()
            item['title'] = title
            item['link'] = link
            item['desc'] = desc
            yield item
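The XPath expressions in parse() can be sanity-checked offline before crawling. The sketch below is a minimal illustration using only the standard library's xml.etree.ElementTree, which accepts the attribute-predicate subset of XPath used here; the HTML fragment is hypothetical, modeled on the dmoztools listing markup (note the trailing space in "site-descr " is deliberate, matching the site's actual class name):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking the dmoztools listing markup.
HTML = """
<div>
  <div class="title-and-desc">
    <a href="http://example.com/python">
      <div class="site-title">Example Python Site</div>
    </a>
    <div class="site-descr ">A sample description.</div>
  </div>
</div>
"""

root = ET.fromstring(HTML)
items = []
for sel in root.iterfind('.//div[@class="title-and-desc"]'):
    # Same selection structure as DmozSpider.parse(), minus the response object.
    title = sel.find('a/div[@class="site-title"]').text
    link = sel.find('a').get('href')
    desc = sel.find('div[@class="site-descr "]').text
    items.append({'title': title, 'link': link, 'desc': desc})

print(items)
```

With Scrapy installed, the same check is more convenient interactively via `scrapy shell "https://dmoztools.net/Computers/Programming/Languages/Python/Books/"` and `response.xpath(...)`, which supports full XPath through parsel/lxml.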

5) Run

  Run the spider with scrapy crawl {name}

\tutorial> scrapy crawl dmoz

2020-07-13 16:27:33 [scrapy.utils.log] INFO: Scrapy 2.0.0 started (bot: tutorial)
2020-07-13 16:27:33 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:21:23) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.19041-SP0
2020-07-13 16:27:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-13 16:27:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders']}
2020-07-13 16:27:33 [scrapy.extensions.telnet] INFO: Telnet Password: 7c37d53c5bae9aa1
2020-07-13 16:27:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
 ....
 {'desc': ['\r\n'
          '\t\t\t\r\n'
          '                                    Scripts, examples and news '
          'about Python programming for the Windows platform.\r\n'
          '                                    ',
          '\r\n                                  '],
 'link': ['http://win32com.goermezer.de/'],
 'title': ['Social Bug ']}
2020-07-13 16:27:36 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-13 16:27:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 755,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 14766,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'elapsed_time_seconds': 1.904345,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 7, 13, 7, 27, 36, 231065),
 'item_scraped_count': 22,
 'log_count/DEBUG': 25,
 'log_count/INFO': 10,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2020, 7, 13, 7, 27, 34, 326720)}
2020-07-13 16:27:36 [scrapy.core.engine] INFO: Spider closed (finished)

 

6) Save as JSON

   Run as scrapy crawl {name} -o {filename}

\tutorial> scrapy crawl dmoz -o items.json

2020-07-13 16:28:15 [scrapy.utils.log] INFO: Scrapy 2.0.0 started (bot: tutorial)
2020-07-13 16:28:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:21:23) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.19041-SP0
2020-07-13 16:28:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-13 16:28:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'FEED_FORMAT': 'json',
 'FEED_URI': 'items.json',
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders']}
2020-07-13 16:28:15 [scrapy.extensions.telnet] INFO: Telnet Password: 3518d1fdc9f852c1
2020-07-13 16:28:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-07-13 16:28:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-13 16:28:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-13 16:28:16 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-13 16:28:16 [scrapy.core.engine] INFO: Spider opened
2020-07-13 16:28:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-13 16:28:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-13 16:28:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dmoztools.net/robots.txt> (referer: None)
2020-07-13 16:28:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dmoztools.net/Computers/Programming/Languages/Python/Resources/> (referer: None)
2020-07-13 16:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dmoztools.net/Computers/Programming/Languages/Python/Resources/>
{'desc': ['\r\n'
          '\t\t\t\r\n'
          '                                    Contains links to assorted '
          'resources from the Python universe, compiled by PythonWare.\r\n'
          '                                    ',
          '\r\n                                  '],
 'link': ['http://www.pythonware.com/daily/'],
 'title': ["eff-bot's Daily Python URL "]}
 ....

 

An items.json file is created:

[
{"title": ["eff-bot's Daily Python URL "], "link": ["http://www.pythonware.com/daily/"], "desc": ["\r\n\t\t\t\r\n                                    Contains links to assorted resources from the Python universe, compiled by PythonWare.\r\n                                    ", "\r\n                                  "]},
{"title": ["O'Reilly Python Center "], "link": ["http://oreilly.com/python/"], "desc": ["\r\n\t\t\t\r\n                                    Features Python books, resources, news and articles.\r\n                                    ", "\r\n                                  "]},
{"title": ["Python Developer's Guide "], "link": ["https://www.python.org/dev/"], "desc": ["\r\n\t\t\t\r\n                                    Resources for reporting bugs, accessing the Python source tree with CVS and taking part in the development of Python.\r\n                                    ", "\r\n                                  "]},
{"title": ["Social Bug "], "link": ["http://win32com.goermezer.de/"], "desc": ["\r\n\t\t\t\r\n                                    Scripts, examples and news about Python programming for the Windows platform.\r\n                                    ", "\r\n                                  "]},
{"title": ["Data Structures and Algorithms with Object-Oriented Design Patterns in Python "], "link": ["http://www.brpreiss.com/books/opus7/html/book.html"], "desc": ["\r\n\t\t\t\r\n                                    The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n                                    ", "\r\n                                  "]},
....
]

The title, link, and desc fields defined earlier are populated in the output as expected.
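The raw desc strings still carry the page's indentation (\r\n, tabs, long runs of spaces), and each field arrives as a list because .extract() returns all matches. A small post-processing pass with the standard json module can normalize them; the record and the cleanup rule below are illustrative:

```python
import json

# One record copied (abbreviated) from the items.json output above.
raw = r'''
[
  {"title": ["Social Bug "],
   "link": ["http://win32com.goermezer.de/"],
   "desc": ["\r\n\t\t\t\r\n    Scripts, examples and news about Python programming for the Windows platform.\r\n    ",
            "\r\n    "]}
]
'''

items = json.loads(raw)
for item in items:
    # Unwrap the single-element lists and trim stray whitespace.
    item['title'] = item['title'][0].strip()
    item['link'] = item['link'][0]
    # Collapse whitespace runs and drop the empty trailing fragment.
    item['desc'] = ' '.join(' '.join(part.split()) for part in item['desc']).strip()

print(items[0])
```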

 

Parsing the target site and defining the right XPath expressions seem to be the key.