How To Create A Python Scrapy Project

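To create a project in Scrapy, first make sure you have had a good introduction to the framework. That ensures Scrapy is installed and ready to go. Once you are set up, we will look at how to create a new Python Scrapy project and what to do once it exists. The process is similar for every Scrapy project, and it is good practice for web scraping with Scrapy.

Starting the Project

To start the project, we run the scrapy startproject command followed by the name we want to give the project. The target website is located at https://books.toscrape.com.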

```
scrapy $scrapy startproject bookstoscrape
New Scrapy project 'bookstoscrape', using template directory '\python\python39\lib\site-packages\scrapy\templates\project', created in:
    C:\python\scrapy\bookstoscrape

You can start your first spider with:
    cd bookstoscrape
    scrapy genspider example example.com
```

We can open the project in PyCharm, and the project folder structure should look familiar to you at this point.
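For readers without PyCharm at hand, the layout that scrapy startproject generates can be sketched as a short listing. This is a minimal sketch that assumes Scrapy's stock project template; expected_layout is a hypothetical helper name used only for illustration, and it builds path names rather than touching the disk.

```python
from pathlib import PurePosixPath


def expected_layout(project_name):
    """Return the files a default `scrapy startproject` template creates.

    Hypothetical helper for illustration: it only builds path names and
    assumes Scrapy's stock project template.
    """
    root = PurePosixPath(project_name)
    package = root / project_name  # inner package shares the project name
    return [
        root / "scrapy.cfg",                  # project config at the top level
        package / "__init__.py",
        package / "items.py",                 # Item definitions
        package / "middlewares.py",           # spider and downloader middlewares
        package / "pipelines.py",             # item pipelines
        package / "settings.py",              # project settings
        package / "spiders" / "__init__.py",  # generated spiders live here
    ]


for path in expected_layout("bookstoscrape"):
    print(path)
```

The spiders package is where the genspider command places new spider modules.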

genspider

Once the project is created, you need to generate one or more Spiders for it. This is done with the scrapy genspider command.

    bookstoscrape $scrapy genspider books books.toscrape.com
    Created spider 'books' using template 'basic' in module:
      bookstoscrape.spiders.books



books.py

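Here is the default boilerplate code for a newly generated spider in Scrapy. It is convenient to have the structure of the code set up for us.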

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass
```

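Testing XPath and CSS Selectors

Before adding code to the Spider that was created for us, you first need to figure out which selectors will return the data you want. This is done by opening the Scrapy shell, inspecting the source markup of the target page, and testing selectors in the browser console.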

    bookstoscrape $scrapy shell 'https://books.toscrape.com/'
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x000001F2C93E31F0>
    [s]   item       {}
    [s]   request    <GET https://books.toscrape.com/>
    [s]   response   <200 https://books.toscrape.com/>
    [s]   settings   <scrapy.settings.Settings object at 0x000001F2C93E3430>
    [s]   spider     <BooksSpider 'books' at 0x1f2c98485b0>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser

Inspecting the HTML Source

Right-click anywhere on the page and choose Inspect to examine any element you like.

We are interested in each book and its associated data, all of which is contained in an article element.

Testing XPath and CSS Selectors in the Browser Console

Both Firefox and Chrome provide XPath and CSS selector helpers that you can use in the console.

$x('the xpath')

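From inspecting the source above, we know that each book item on the page lives inside an article tag whose class is product_pod. With XPath, the expression $x('//article') returns all 20 book items on this first page.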

$$('the css selector')

If you prefer the CSS selector version, which gives the same result, then $$('.product_pod') does the trick.

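Testing Selectors in the Scrapy Shell

Once we have an idea of how the XPath or CSS selectors behave in the browser console, we can test them in the Scrapy shell, which is a great tool. Enter response.xpath('//article') or response.css('.product_pod') in the Scrapy shell and you will see that both return 20 Selector objects. That makes sense, because there are 20 book items on the page being scraped.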

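From Shell to Spider

It pays to try these XPath and CSS selectors in both the browser console and the Scrapy shell. Doing so gives you a good idea of what will work once you start adding your own custom code to the Spider boilerplate that Scrapy provides.

Building the parse() Method

The purpose of the parse() method is to look at the response that was returned and parse its contents. There are many ways to build this part of the Spider, from very basic to more advanced once you start adding Items and Item Loaders. Initially, the only goal is to return or yield a Python dictionary from that function. Here we look at an example that uses yield, with our custom code added to the template's parse() method.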

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article'):
            yield {
                'booktitle': book.xpath('.//a/text()').get()
            }
```
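Because parse() uses yield, it is a generator, and Scrapy receives one item per pass through the loop. The sketch below uses a plain Python list in place of a Scrapy response to show why a return inside the loop would hand back only the first item. The function names are illustrative assumptions; the sample titles come from the first page of the target site.

```python
def parse_with_yield(titles):
    """Generator: produces one dictionary per title."""
    for title in titles:
        yield {'booktitle': title}


def parse_with_return(titles):
    """Plain function: exits on the first pass through the loop."""
    for title in titles:
        return {'booktitle': title}


titles = ['Tipping the Velvet', 'Soumission', 'Sharp Objects']

yielded = list(parse_with_yield(titles))
returned = parse_with_return(titles)

print(len(yielded))  # one dictionary per title
print(returned)      # only the first title's dictionary
```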

Scrapy crawl {your spider}

We can now run the Spider using the scrapy crawl command.

bookstoscrape $scrapy crawl books

There will be a lot of output in the console, but you should be able to find all of the book titles.

    {'booktitle': 'A Light in the ...'}
    {'booktitle': 'Tipping the Velvet'}
    {'booktitle': 'Soumission'}
    {'booktitle': 'Sharp Objects'}
    {'booktitle': 'Sapiens: A Brief History ...'}
    {'booktitle': 'The Requiem Red'}
    {'booktitle': 'The Dirty Little Secrets ...'}
    {'booktitle': 'The Coming Woman: A ...'}
    {'booktitle': 'The Boys in the ...'}
    {'booktitle': 'The Black Maria'}
    {'booktitle': 'Starving Hearts (Triangular Trade ...'}
    {'booktitle': "Shakespeare's Sonnets"}
    {'booktitle': 'Set Me Free'}
    {'booktitle': "Scott Pilgrim's Precious Little ..."}
    {'booktitle': 'Rip it Up and ...'}
    {'booktitle': 'Our Band Could Be ...'}
    {'booktitle': 'Olio'}
    {'booktitle': 'Mesaerion: The Best Science ...'}
    {'booktitle': 'Libertarianism for Beginners'}
    {'booktitle': "It's Only the Himalayas"}

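My yield Statement Is Not Iterating!

Important! The example above uses a yield statement rather than a return statement. Also notice that the yield uses an XPath sub-query. When you run an XPath sub-query inside a loop, you must include a leading period in the XPath selector. If you omit the leading period, the query is evaluated against the whole document instead of the current element, so you get the same first result once for every pass through the loop.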
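The scoping difference can be seen in isolation with a small stand-in. The sketch below does not use Scrapy's selectors; it assumes the standard library's xml.etree.ElementTree and a simplified, made-up fragment of markup, and contrasts a query that starts at the current element with one that starts at the document root.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the page: two product_pod articles.
PAGE = """
<section>
  <article class="product_pod"><h3><a>Tipping the Velvet</a></h3></article>
  <article class="product_pod"><h3><a>Soumission</a></h3></article>
</section>
"""

root = ET.fromstring(PAGE)
articles = root.findall('.//article')

# Scoped to each article: the query starts at the current node.
scoped = [article.find('.//a').text for article in articles]

# Scoped to the document: the same first link comes back on every pass,
# which mirrors a Scrapy sub-query written without the leading period.
unscoped = [root.find('.//a').text for _ in articles]

print(scoped)    # ['Tipping the Velvet', 'Soumission']
print(unscoped)  # ['Tipping the Velvet', 'Tipping the Velvet']
```

In the spider, book.xpath('.//a/text()') corresponds to the scoped query, while book.xpath('//a/text()') would behave like the unscoped one.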



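Go Big, Then Go Small

When working with XPath and CSS selectors, start with a broad query and narrow down from there, rather than writing a fresh query for every piece of information you want to scrape. For example, our initial query selected 20 article elements, and we can drill into each one individually. You would not look at the page, decide you want the title, rating, price, and availability of every book, and then write 80 different selectors. Instead, you grab the 20 books at the top level and then pull 4 pieces of data from each. The code below shows how to build these sub-queries on top of the original XPath query.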

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article'):
            yield {
                'booktitle': book.xpath('.//a/text()').get(),
                'bookrating': book.xpath('.//p').attrib['class'],
                'bookprice': book.xpath('.//div[2]/p/text()').get(),
                'bookavailability': book.xpath('.//div[2]/p[2]/i/following-sibling::text()').get().strip()
            }
```

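The bookavailability selector is a little tricky. We are trying to get the text that comes after the i (icon) element, but that text sits in a kind of no man's land: it is not wrapped in an element of its own. For this, we can use the following-sibling::text() selector. We also added the strip() function to remove the surrounding whitespace, but we will soon learn how to use Item Loaders to handle this more cleanly.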

```markup
<p class="instock availability">
    <i class="icon-ok"></i>
    In stock
</p>
```
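One thing to watch in the spider above is that .get() returns None when a selector matches nothing, so chaining .strip() directly onto it would raise an AttributeError for that book. A small guard avoids this. The clean_text helper below is a hypothetical name, not part of Scrapy, and is only a sketch of one way to tolerate a missing value.

```python
def clean_text(value, default=''):
    """Collapse runs of whitespace and tolerate a missing value.

    Hypothetical helper, not part of Scrapy: `value` is whatever a
    selector's .get() returned, which may be None when nothing matched.
    """
    if value is None:
        return default
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with single spaces also trims both ends.
    return ' '.join(value.split())


print(repr(clean_text('\n    \n        In stock\n    \n')))
print(repr(clean_text(None)))
```

In the spider this would wrap the selector call, for example clean_text(book.xpath('.//div[2]/p[2]/i/following-sibling::text()').get()).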


Scrapy Output

To actually save the data we capture, we can add the -o flag to the scrapy crawl command and write the output to a CSV or JSON file.

```
bookstoscrape $scrapy crawl books -o books.json
```
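As an alternative to passing -o on every run, a project can declare its exports in settings.py through the FEEDS setting. The fragment below is a sketch that assumes your installed Scrapy version supports FEEDS and the overwrite option; the file names are only examples, and the feed exports documentation for your version is the place to confirm the available keys.

```python
# settings.py (fragment)
# Assumes a Scrapy version that supports the FEEDS setting.
FEEDS = {
    'books.json': {
        'format': 'json',
        'encoding': 'utf8',   # write characters such as £ literally
        'overwrite': True,    # replace the file instead of appending to it
    },
    'books.csv': {
        'format': 'csv',
    },
}
```

With this in place, running scrapy crawl books writes both files without the -o flag.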

Once you run this command, a new file appears in the Scrapy project containing all of the data you just collected.


books.json result
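The final result is a JSON file containing 20 objects, each with 4 attributes for the title, rating, price, and availability. You can now practice your data science skills on the datasets you collect.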

    [
      {"booktitle": "A Light in the ...", "bookrating": "star-rating Three", "bookprice": "£51.77", "bookavailability": "In stock"},
      {"booktitle": "Tipping the Velvet", "bookrating": "star-rating One", "bookprice": "£53.74", "bookavailability": "In stock"},
      {"booktitle": "Soumission", "bookrating": "star-rating One", "bookprice": "£50.10", "bookavailability": "In stock"},
      {"booktitle": "Sharp Objects", "bookrating": "star-rating Four", "bookprice": "£47.82", "bookavailability": "In stock"},
      {"booktitle": "Sapiens: A Brief History ...", "bookrating": "star-rating Five", "bookprice": "£54.23", "bookavailability": "In stock"},
      {"booktitle": "The Requiem Red", "bookrating": "star-rating One", "bookprice": "£22.65", "bookavailability": "In stock"},
      {"booktitle": "The Dirty Little Secrets ...", "bookrating": "star-rating Four", "bookprice": "£33.34", "bookavailability": "In stock"},
      {"booktitle": "The Coming Woman: A ...", "bookrating": "star-rating Three", "bookprice": "£17.93", "bookavailability": "In stock"},
      {"booktitle": "The Boys in the ...", "bookrating": "star-rating Four", "bookprice": "£22.60", "bookavailability": "In stock"},
      {"booktitle": "The Black Maria", "bookrating": "star-rating One", "bookprice": "£52.15", "bookavailability": "In stock"},
      {"booktitle": "Starving Hearts (Triangular Trade ...", "bookrating": "star-rating Two", "bookprice": "£13.99", "bookavailability": "In stock"},
      {"booktitle": "Shakespeare's Sonnets", "bookrating": "star-rating Four", "bookprice": "£20.66", "bookavailability": "In stock"},
      {"booktitle": "Set Me Free", "bookrating": "star-rating Five", "bookprice": "£17.46", "bookavailability": "In stock"},
      {"booktitle": "Scott Pilgrim's Precious Little ...", "bookrating": "star-rating Five", "bookprice": "£52.29", "bookavailability": "In stock"},
      {"booktitle": "Rip it Up and ...", "bookrating": "star-rating Five", "bookprice": "£35.02", "bookavailability": "In stock"},
      {"booktitle": "Our Band Could Be ...", "bookrating": "star-rating Three", "bookprice": "£57.25", "bookavailability": "In stock"},
      {"booktitle": "Olio", "bookrating": "star-rating One", "bookprice": "£23.88", "bookavailability": "In stock"},
      {"booktitle": "Mesaerion: The Best Science ...", "bookrating": "star-rating One", "bookprice": "£37.59", "bookavailability": "In stock"},
      {"booktitle": "Libertarianism for Beginners", "bookrating": "star-rating Two", "bookprice": "£51.33", "bookavailability": "In stock"},
      {"booktitle": "It's Only the Himalayas", "bookrating": "star-rating Two", "bookprice": "£45.17", "bookavailability": "In stock"}
    ]
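Before analysing this data, the rating and price strings usually need to be turned into numbers. The sketch below is one possible post-processing step, not part of the spider: it embeds two records copied from the output above so that it runs on its own, and the normalize function and RATING_WORDS mapping are illustrative names. In practice you would read the whole file with json.load instead.

```python
import json

# Two records copied from books.json above, embedded to keep the
# example self-contained.
SAMPLE = '''
[
  {"booktitle": "A Light in the ...", "bookrating": "star-rating Three",
   "bookprice": "£51.77", "bookavailability": "In stock"},
  {"booktitle": "Tipping the Velvet", "bookrating": "star-rating One",
   "bookprice": "£53.74", "bookavailability": "In stock"}
]
'''

RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def normalize(book):
    """Return a copy of the record with a numeric rating and price."""
    rating_word = book['bookrating'].split()[-1]  # 'star-rating Three' -> 'Three'
    return {
        **book,
        'bookrating': RATING_WORDS[rating_word],
        'bookprice': float(book['bookprice'].lstrip('£')),
    }


books = [normalize(book) for book in json.loads(SAMPLE)]
for book in books:
    print(book['booktitle'], book['bookrating'], book['bookprice'])
```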