Scrapy was not easy to understand from reading alone, so I ran a hands-on test.
Check the Scrapy version
$ scrapy version
Scrapy 2.0.0
Create a project (as a sample, search Amazon for headphones)
$ scrapy startproject headphones
New Scrapy project 'headphones', using template directory 'd:\utils\python\python38\lib\site-packages\scrapy\templates\project', created in:
    D:\db\Scrapy\headphones
You can start your first spider with:
    cd headphones
    scrapy genspider example example.com
Running the command above creates the following files.
headphones/
    scrapy.cfg
    headphones/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Now create a spider.
As the output above suggests, use scrapy genspider.
Run it from inside the project folder (after cd headphones).
Note: the spider name must not be the same as the project name.
Create a spider named headphone.
PS D:\db\Scrapy\headphones> scrapy genspider headphone www.amazon.com
Created spider 'headphone' using template 'basic' in module:
headphones.spiders.headphone
The spiders folder now contains a spider file named headphone.py.
headphones/
    scrapy.cfg
    headphones/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            headphone.py
The generated headphone.py looks like this.
# -*- coding: utf-8 -*-
import scrapy


class HeadphoneSpider(scrapy.Spider):
    name = 'headphone'
    allowed_domains = ['www.amazon.com']
    start_urls = ['http://www.amazon.com/']

    def parse(self, response):
        pass
The page to crawl is the Amazon search-results page for headphones.
Its URL becomes the spider's start URL (the value of start_urls).
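To see what that start URL actually asks for, it can be taken apart with the standard library. The sketch below is only an illustration: the URL is the one used in start_urls in this post, while the helper name describe_url is hypothetical and not part of Scrapy.

```python
from urllib.parse import urlsplit, parse_qs

# Search URL used as the spider's start URL in this post.
START_URL = (
    "https://www.amazon.com/s/ref=nb_sb_noss_2"
    "?url=search-alias%3Daps&field-keywords=headphones"
    "&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2"
)


def describe_url(url):
    """Return (host, path, decoded query parameters) for a URL.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # parse_qs percent-decodes the values and groups them by key;
    # each key appears once here, so keep only the first value.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.netloc, parts.path, query


host, path, query = describe_url(START_URL)
print("host:", host)
print("path:", path)
for key, value in query.items():
    print(f"{key} = {value}")
```

The host printed here is the same value listed in allowed_domains, and the field-keywords parameter carries the search term.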
Then extend parse() and update the file as follows.
# -*- coding: utf-8 -*-
import scrapy


class HeadphoneSpider(scrapy.Spider):
    name = 'headphone'
    allowed_domains = ['www.amazon.com']
    start_urls = ['https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2']

    def parse(self, response):
        # Collect the src attribute of every <img> element on the page.
        img_urls = response.css('img::attr(src)').extract()
        # Write one URL per line; 'w' mode overwrites an existing urls.txt.
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")
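The parse method above collects the src of every img element on the fetched page
and writes those image URLs to urls.txt, one per line, creating the file if it does not exist.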
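To make the selector less of a black box, here is a rough stand-in for what response.css('img::attr(src)') returns, written with Python's built-in html.parser instead of Scrapy so it runs without a crawl. This is not how Scrapy implements selectors. The class name ImgSrcCollector, the helper collect_img_srcs, and the example.com sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        # Self-closing tags such as <img .../> also end up here.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.srcs.append(value)


def collect_img_srcs(html):
    """Return the src values of all <img> tags found in an HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.srcs


# Made-up sample page; the second <img> has no src and is skipped.
SAMPLE_HTML = """
<div>
  <img src="https://example.com/a.jpg" alt="first">
  <p>no image here</p>
  <img alt="no src attribute">
  <img src="https://example.com/b.png"/>
</div>
"""

img_urls = collect_img_srcs(SAMPLE_HTML)
for u in img_urls:
    print(u)
```

As with the CSS selector, img elements without a src attribute contribute nothing to the list.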
Now run the following from the project's top-level directory (the one containing scrapy.cfg).
$ scrapy crawl headphone
Here 'headphone' is the value assigned to name in the spider file.
The result is written to urls.txt.
https://images-na.ssl-images-amazon.com/images/G/01/gno/sprites/nav-sprite-global_bluebeacon-1x_optimized_layout1._CB468670774_.png
https://m.media-amazon.com/images/I/61JxBr0UreL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61oEU8lUE9L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/41Y2dfI6K1L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61JxBr0UreL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/815PM0CRkHL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/61oEU8lUE9L._AC_UL320_.jpg
https://m.media-amazon.com/images/I/41y+E9b0E+L._AC_UL320_.jpg
https://m.media-amazon.com/images/I/71B285Hk0XL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/51xZdrJfCzL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/51JgVRxaT3L._AC_UL320_.jpg
https://m.media-amazon.com/images/I/61iG2R1TdwL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/7153nIHtgJL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/61JblyazDuL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/61szYUD0neL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61RWUr-10QL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61514mttMWL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71SUYBmMsRL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/617XS3ZQgUL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51IdLe-+6kL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61DhYDhXXfL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71p9Q1N0WgL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81tObBdoMhL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51ntWa1Q0sL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61zQ40Pgf0L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61Uefs4HTZL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71qNnPAiAAL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51RmMzFVwCL._SS400_.jpg
https://m.media-amazon.com/images/I/41oDZTzxzML._SS400_.jpg
https://m.media-amazon.com/images/I/41S+8gxQN6L._SS400_.jpg
https://m.media-amazon.com/images/I/31eRRGvr1WL._SS400_.jpg
https://assoc-na.associates-amazon.com/abid/um?s=000-0000000-0000000&m=ATVPDKIKX0DER
The output file looks like the listing above.
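The listing mixes product images with a navigation sprite and a tracking URL, and some images appear twice with different suffixes (for example _AC_UY218_ and _AC_UL320_). A possible clean-up step is sketched below. It assumes, based only on the sample output above, that product images live on m.media-amazon.com and that the part of the file name before the first dot identifies the image. The function name unique_image_ids is hypothetical.

```python
from urllib.parse import urlsplit

# A few lines copied from the urls.txt output shown in this post.
SAMPLE_LINES = [
    "https://images-na.ssl-images-amazon.com/images/G/01/gno/sprites/nav-sprite-global_bluebeacon-1x_optimized_layout1._CB468670774_.png",
    "https://m.media-amazon.com/images/I/61JxBr0UreL._AC_UY218_.jpg",
    "https://m.media-amazon.com/images/I/61oEU8lUE9L._AC_UY218_.jpg",
    "https://m.media-amazon.com/images/I/61JxBr0UreL._AC_UL320_.jpg",
    "https://assoc-na.associates-amazon.com/abid/um?s=000-0000000-0000000&m=ATVPDKIKX0DER",
]


def unique_image_ids(urls, host="m.media-amazon.com"):
    """Return image IDs from the given host, de-duplicated, first-seen order.

    Assumption (inferred from the sample only): the ID is the file name
    up to its first dot, and the remainder is a size or variant suffix.
    """
    seen = {}
    for url in urls:
        parts = urlsplit(url.strip())
        if parts.netloc != host:
            continue  # skip sprites, tracking pixels, other hosts
        filename = parts.path.rsplit("/", 1)[-1]
        image_id = filename.split(".", 1)[0]
        # dict preserves insertion order, so the first occurrence wins.
        seen.setdefault(image_id, url)
    return list(seen)


ids = unique_image_ids(SAMPLE_LINES)
print(ids)
```

To run it on the real output, the sample list could be replaced with the lines read from urls.txt.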