Scraping Sample (Searching for Headphones on Amazon)

아르비스 2020. 7. 14. 08:41

Scrapy is not easy to understand at first, so I ran a small test.

 

Check the Scrapy version

$ scrapy version

Scrapy 2.0.0

 

Create a project (the sample searches for headphones on Amazon)

$ scrapy startproject headphones


New Scrapy project 'headphones', using template directory 'd:\utils\python\python38\lib\site-packages\scrapy\templates\project', created in:
    D:\db\Scrapy\headphones

You can start your first spider with:
    cd headphones
    scrapy genspider example example.com

Running the command above creates the following files.

headphones/
    scrapy.cfg    
    headphones/
        __init__.py
        items.py          
        middlewares.py    
        pipelines.py      
        settings.py       
        spiders/          
            __init__.py

 

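Now create a spider.

As the output above suggests, use scrapy genspider. Run it after cd headphones, that is, from inside the project folder.

※ Note: the spider name must not be the same as the project name.

Create a spider named headphone.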

PS D:\db\Scrapy\headphones> scrapy genspider headphone www.amazon.com

Created spider 'headphone' using template 'basic' in module:
  headphones.spiders.headphone

A spider file named headphone.py is now created in the spiders folder.

headphones/
    scrapy.cfg    
    headphones/
        __init__.py
        items.py          
        middlewares.py    
        pipelines.py      
        settings.py       
        spiders/          
            __init__.py
            headphone.py

 

The contents of headphone.py are as follows.

# -*- coding: utf-8 -*-
import scrapy


class HeadphoneSpider(scrapy.Spider):
    name = 'headphone'
    allowed_domains = ['www.amazon.com']
    start_urls = ['http://www.amazon.com/']

    def parse(self, response):
        pass

 

The following Amazon search URL is the page to crawl.

https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2
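The search URL is easier to read once its query string is decoded. The following is a minimal sketch using only the standard library. with_keyword is a hypothetical helper, and it assumes that swapping field-keywords and the k: part of rh is enough to search for another word, which this post does not verify.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

SEARCH_URL = (
    "https://www.amazon.com/s/ref=nb_sb_noss_2"
    "?url=search-alias%3Daps&field-keywords=headphones"
    "&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2"
)


def decode_query(url):
    """Return the query parameters as an ordered list of (key, value) pairs."""
    return parse_qsl(urlsplit(url).query)


def with_keyword(url, keyword):
    """Hypothetical helper: rebuild the URL with a different search keyword.

    Assumption: only 'field-keywords' and the 'k:' part of 'rh' carry the keyword.
    """
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params["field-keywords"] = keyword
    params["rh"] = f"i:aps,k:{keyword}"
    # urlencode percent-encodes '=', ':' and ',' again.
    return parts._replace(query=urlencode(params)).geturl()


if __name__ == "__main__":
    for key, value in decode_query(SEARCH_URL):
        print(f"{key} = {value}")
    print(with_keyword(SEARCH_URL, "earbuds"))
```

Decoded, the keyword appears twice: once in field-keywords and once inside rh.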

 

Set this URL as the start URL.

Then modify parse() so that the file looks like this.

# -*- coding: utf-8 -*-
import scrapy


class HeadphoneSpider(scrapy.Spider):
    name = 'headphone'
    allowed_domains = ['www.amazon.com']
    start_urls = ['https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2']

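    def parse(self, response):
        # Collect the src attribute of every <img> element on the page.
        img_urls = response.css('img::attr(src)').getall()
        # Write one URL per line; 'w' creates (or overwrites) urls.txt.
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")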

The parse method above collects the image URLs found on the requested page and writes them to urls.txt, creating the file if it does not exist.
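To see what the selector img::attr(src) picks out, here is a rough standard-library approximation. This is not how Scrapy works internally (it relies on its own selector library). ImgSrcCollector, extract_img_srcs, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> also end up here by default.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value is not None:
                self.srcs.append(value)
                break  # one src per <img> is enough


def extract_img_srcs(html):
    """Return the src values of all <img> tags that have one."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.srcs


# Hypothetical page fragment, not taken from Amazon.
SAMPLE_HTML = """
<div>
  <img src="https://example.com/a.jpg">
  <img alt="image without a src attribute">
  <img src="https://example.com/b.jpg"/>
</div>
"""

if __name__ == "__main__":
    for src in extract_img_srcs(SAMPLE_HTML):
        print(src)
```

As with the spider, an img element without a src attribute contributes nothing to the result.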

 

 

Now run the following from the project's top-level directory.

$ scrapy crawl headphone

Here, 'headphone' is the value assigned to name in the spider file.
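In other words, the lookup key is the name attribute, not the class name (HeadphoneSpider) or the file name (headphone.py). The toy sketch below illustrates that idea only. It is not Scrapy's actual loader, and Spider, EarbudSpider, and build_registry are stand-ins invented for this example.

```python
class Spider:
    """Stand-in for scrapy.Spider; only the name attribute matters here."""
    name = None


class HeadphoneSpider(Spider):
    name = "headphone"


class EarbudSpider(Spider):  # hypothetical second spider
    name = "earbud"


def build_registry(spider_classes):
    """Index spider classes by their name attribute."""
    registry = {}
    for cls in spider_classes:
        if cls.name in registry:
            # Rejecting duplicates is this sketch's choice.
            raise ValueError(f"duplicate spider name: {cls.name}")
        registry[cls.name] = cls
    return registry


registry = build_registry([HeadphoneSpider, EarbudSpider])

if __name__ == "__main__":
    # The string given on the command line is matched against name.
    print(registry["headphone"].__name__)
```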

 

 

The result is written to urls.txt.

https://images-na.ssl-images-amazon.com/images/G/01/gno/sprites/nav-sprite-global_bluebeacon-1x_optimized_layout1._CB468670774_.png
https://m.media-amazon.com/images/I/61JxBr0UreL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61oEU8lUE9L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/41Y2dfI6K1L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61JxBr0UreL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/815PM0CRkHL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/61oEU8lUE9L._AC_UL320_.jpg
https://m.media-amazon.com/images/I/41y+E9b0E+L._AC_UL320_.jpg
https://m.media-amazon.com/images/I/71B285Hk0XL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/51xZdrJfCzL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/51JgVRxaT3L._AC_UL320_.jpg
https://m.media-amazon.com/images/I/61iG2R1TdwL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/7153nIHtgJL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/61JblyazDuL._AC_UL320_.jpg
https://m.media-amazon.com/images/I/61szYUD0neL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61RWUr-10QL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61514mttMWL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71SUYBmMsRL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/617XS3ZQgUL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51IdLe-+6kL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61DhYDhXXfL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71p9Q1N0WgL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/81tObBdoMhL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51ntWa1Q0sL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61zQ40Pgf0L._AC_UY218_.jpg
https://m.media-amazon.com/images/I/61Uefs4HTZL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/71qNnPAiAAL._AC_UY218_.jpg
https://m.media-amazon.com/images/I/51RmMzFVwCL._SS400_.jpg
https://m.media-amazon.com/images/I/41oDZTzxzML._SS400_.jpg
https://m.media-amazon.com/images/I/41S+8gxQN6L._SS400_.jpg
https://m.media-amazon.com/images/I/31eRRGvr1WL._SS400_.jpg
https://assoc-na.associates-amazon.com/abid/um?s=000-0000000-0000000&m=ATVPDKIKX0DER

The result is generated as shown above.
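The list mixes product images with a navigation sprite and a tracking URL, and some products appear at two sizes (for example 61JxBr0UreL as both UY218 and UL320). Below is a small post-processing sketch. It assumes that product images live on m.media-amazon.com under /images/I/ and that the part before the first dot is an image ID. Both are inferred from the output above rather than from any documentation, and unique_product_images is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit

# Assumed layout: /images/I/<image id>._<size code>_.jpg
IMAGE_ID = re.compile(r"^/images/I/([^./]+)\.")


def unique_product_images(urls):
    """Map each image ID to the first URL seen for it, skipping other URLs."""
    seen = {}
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        if parts.netloc != "m.media-amazon.com":
            continue  # sprites, tracking pixels, and so on
        match = IMAGE_ID.match(parts.path)
        if match and match.group(1) not in seen:
            seen[match.group(1)] = url
    return seen


# A few lines copied from the urls.txt output above.
SAMPLE = [
    "https://images-na.ssl-images-amazon.com/images/G/01/gno/sprites/nav-sprite-global_bluebeacon-1x_optimized_layout1._CB468670774_.png",
    "https://m.media-amazon.com/images/I/61JxBr0UreL._AC_UY218_.jpg",
    "https://m.media-amazon.com/images/I/61oEU8lUE9L._AC_UY218_.jpg",
    "https://m.media-amazon.com/images/I/61JxBr0UreL._AC_UL320_.jpg",
    "https://assoc-na.associates-amazon.com/abid/um?s=000-0000000-0000000&m=ATVPDKIKX0DER",
]

if __name__ == "__main__":
    for image_id, url in unique_product_images(SAMPLE).items():
        print(image_id, url)
```

To process the real file, pass an open file object, for example unique_product_images(open('urls.txt')).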