Scraping a Novel from a Website with Python, Part 2_帥帥子

Using regular expressions

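The re.compile function

The compile function compiles a regular expression into a Pattern object. The methods of that object, such as match(), search(), findall(), and sub(), then do the actual matching.

Syntax: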

re.compile(pattern[, flags])

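Parameters:

pattern : a regular expression given as a string
flags : optional; the matching mode, such as ignoring case or multi-line mode. The available flags are:
re.I ignore case
re.L make \w, \W, \b, \B and case-insensitive matching depend on the current locale (Python 3 accepts this flag only with bytes patterns)
re.M multi-line mode: ^ and $ also match at the start and end of each line
re.S make . match any character, including a newline (by default . does not match a newline)
re.U make \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character properties database (already the default for str patterns in Python 3)
re.X verbose mode: for readability, whitespace in the pattern and comments after # are ignored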
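
To make the flags concrete, here is a small sketch of re.I, re.S and re.X. The sample strings and the variable names (title, without_s, with_s, link) are made up for illustration and are not taken from the scraper below.

```python
import re

# re.I: case-insensitive matching
title = re.compile(r"chapter \d+", re.I)
print(title.findall("Chapter 1, CHAPTER 2, chapter 3"))

# re.S: "." also matches a newline, so one match can span several lines
html = "<p>line one\nline two</p>"
without_s = re.compile(r"<p>(.+)</p>")
with_s = re.compile(r"<p>(.+)</p>", re.S)
print(without_s.findall(html))  # empty list: "." stops at the newline
print(with_s.findall(html))     # one match containing both lines

# re.X: whitespace and "#" comments inside the pattern are ignored
link = re.compile(r"""
    <a\ href="      # opening tag; a literal space must be escaped in verbose mode
    ([^"]+\.html)   # group 1: relative path ending in .html
    ">
    ([^<]+)         # group 2: link text
    </a>
""", re.X)
print(link.findall('<a href="184612.html">Chapter 1</a>'))
```

Flags can be combined with the | operator, for example re.I | re.S.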

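findall

Finds every substring of the string that matches the regular expression and returns them as a list. If nothing matches, it returns an empty list.

Note: match and search return at most one match; findall returns all of them.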
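
The difference between the three methods can be checked on a short string. This is only an illustrative sketch; the pattern and the sample text are invented for the example.

```python
import re

digits = re.compile(r"\d+")
text = "ch1 ch22 ch333"

# match() only tries at the very start of the string
print(digits.match(text))           # None, because the text starts with "c"

# search() returns the first match found anywhere in the string
print(digits.search(text).group())  # 1

# findall() returns every non-overlapping match as a list of strings
print(digits.findall(text))         # ['1', '22', '333']
```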

Syntax (as a method of a compiled Pattern object):

findall(string[, pos[, endpos]])

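Parameters:

string : the string to search.
pos : optional; the index in the string at which the search starts. Defaults to 0.
endpos : optional; the index in the string at which the search stops. Defaults to the length of the string.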
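
A short sketch of how pos and endpos narrow the search window, and of what findall returns when the pattern contains several groups (which is what the scraper below relies on). The sample strings and the name row are hypothetical.

```python
import re

digits = re.compile(r"\d+")
text = "a1 b22 c333 d4444"

print(digits.findall(text))         # ['1', '22', '333', '4444']
print(digits.findall(text, 3))      # ['22', '333', '4444'], search starts at index 3
print(digits.findall(text, 3, 10))  # ['22', '33'], only text[3:10] is searched

# With more than one group, findall returns a list of tuples, one per match
row = re.compile(r'<a href="(.+\.html)">(.+)</a>')
print(row.findall('<td><a href="184612.html">Chapter 1</a></td>'))
```

Note that endpos can cut a match short: with endpos=10 the last number is seen as 33, not 333.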

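Python crawler for a novel site: downloading a novel (regular expressions)

Approach:

Find the index page of the novel you want to download and open the page source to analyze it (for example: http://www.kanunu8.com/files/old/2011/2447.html).
Work out what you need to extract. Start with the URLs: only the last part changes from chapter to chapter, so first collect each chapter's relative path, then join it with the base URL to form the URL of each chapter.
Fetch the content of each chapter and clean it up.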
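
The URL-building step can be tried offline before touching the network. The sketch below uses a hand-written HTML fragment (sample_html) as a stand-in for the real index page, so the markup, the chapter titles and the names chapter_link and chapters are assumptions; urljoin from the standard library is used here in place of plain string concatenation.

```python
import re
from urllib.parse import urljoin

index_url = "http://www.kanunu8.com/book4/10509/"

# Hypothetical fragment standing in for the downloaded index page
sample_html = (
    '<tr><td width="25%"><a href="184612.html">Chapter 1</a></td>\n'
    '<td><a href="184613.html">Chapter 2</a></td></tr>\n'
)

# Group 1 is the relative path, group 2 is the chapter title
chapter_link = re.compile(r'<a href="([^"]+\.html)">([^<]+)</a>')

# Join each relative path with the index URL to get the full chapter URL
chapters = [
    (title, urljoin(index_url, href))
    for href, title in chapter_link.findall(sample_html)
]

for title, chapter_url in chapters:
    print(title, chapter_url)
```

urljoin also copes with absolute paths and full URLs in href, which simple concatenation would get wrong.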

Source code

import re

import requests

# The site to scrape (index page of the novel)
url = 'http://www.kanunu8.com/book4/10509/'
# Fetch the raw bytes first, then decode them as GBK
txt = requests.get(url).content.decode('gbk')
# .content is raw bytes, for example:
# ...<head>\r\n<title>\xd6\xd0\xb9\xfa\xba\xcf\xbb\xef\xc8\xcb
# print(txt)

m1 = re.compile(r'<td colspan="4" align="center">(.+)')
# print(m1.findall(txt))
m2 = re.compile(r'<td( width="25%")?><a href="(.+\.html)">(.+)</a></td>')
print(m2.findall(txt))

# Get the table of contents: each chapter title and its relative path
raw = m2.findall(txt)
sanguo = []
for i in raw:
    print([i[2], url + i[1]])
    # ['第五章成功之母', 'http://www.kanunu8.com/book4/10509/184616.html']
    # Build the URL of each chapter
    sanguo.append([i[2], url + i[1]])

print("*" * 100)
print(sanguo)
# [['第一章夢的起源', 'http://www.kanunu8.com/book4/10509/184612.html'],
#  ['第二章偶像兄弟', 'http://www.kanunu8.com/book4/10509/184613.html'],
#  ['第三章戀愛必修', 'http://www.kanunu8.com/book4/10509/184614.html'],
#  ['第四章愛的代價', 'http://www.kanunu8.com/book4/10509/184615.html'],
#  ['第五章成功之母', 'http://www.kanunu8.com/book4/10509/184616.html'],
#  ['第六章命運轉折', 'http://www.kanunu8.com/book4/10509/184617.html'],
#  ['第七章被迫下海', 'http://www.kanunu8.com/book4/10509/184618.html'],
#  ['第八章漸行漸遠', 'http://www.kanunu8.com/book4/10509/184619.html'],
#  ['第九章三箭合一', 'http://www.kanunu8.com/book4/10509/184620.html'],
#  ['第十章夢想起航', 'http://www.kanunu8.com/book4/10509/184621.html'],
#  ['第十一章領航夢想', 'http://www.kanunu8.com/book4/10509/184622.html'],
#  ['第十二章平地波瀾', 'http://www.kanunu8.com/book4/10509/184623.html'],
#  ['第十三章新的招牌', 'http://www.kanunu8.com/book4/10509/184624.html'],
#  ['第十四章神的弱點', 'http://www.kanunu8.com/book4/10509/184625.html'],
#  ['第十五章裂隙初現', 'http://www.kanunu8.com/book4/10509/184626.html'],
#  ['第十六章上市之爭', 'http://www.kanunu8.com/book4/10509/184627.html'],
#  ['第十七章夢想巔峯', 'http://www.kanunu8.com/book4/10509/184628.html'],
#  ['第十八章乾綱獨斷', 'http://www.kanunu8.com/book4/10509/184629.html'],
#  ['第十九章一劍穿心', 'http://www.kanunu8.com/book4/10509/184630.html'],
#  ['第二十章渡盡劫波', 'http://www.kanunu8.com/book4/10509/184631.html'],
#  ['尾\u3000聲', 'http://www.kanunu8.com/book4/10509/184632.html']]

# Assumption: the tag and entity names in the next three patterns are missing
# from the original listing. <p>...</p>, <br /> and four &nbsp; are a guess
# and should be verified against the real page source.

# Match the body text of each chapter; the body is inside a <p> tag
m3 = re.compile(r'<p>(.+)</p>', re.S)
# The <br /> tags in the text are replaced with nothing
m4 = re.compile(r'<br />')
# The &nbsp; indentation is removed as well
m5 = re.compile(r'&nbsp;&nbsp;&nbsp;&nbsp;')

# Create (or append to) the text file 中國合夥人1.txt
with open('中國合夥人1.txt', 'a', encoding='utf-8') as f:
    for i in sanguo:
        # i[1] is the chapter URL
        i_url = i[1]
        print("Downloading--->%s" % i[0])
        # Fetch the raw bytes of the chapter page, then decode them
        r_nr = requests.get(i_url).content.decode('gbk')
        # Match the body text (the part inside <p>...</p>)
        n_nr = m3.findall(r_nr)
        print(n_nr)
        # Skip the chapter if the pattern found nothing, to avoid an IndexError
        if not n_nr:
            continue
        # Remove <br />. sub() vs. replace(): sub() accepts a regular expression
        n = m4.sub('', n_nr[0])
        # Remove &nbsp; as well
        n2 = m5.sub('', n)
        n2 = n2.replace('\n', '')
        # Write to the file; i[0] is the chapter title
        f.write('\n' + i[0] + '\n')
        f.write(n2)
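
The clean-up step can also be tested without any network access. The following is a minimal sketch, not the script's own method: it assumes the chapter body sits in a <p> element and uses <br> tags and &nbsp; entities, and the names BODY, BREAK, clean_chapter and sample are hypothetical. Unlike the script above, it keeps one line per paragraph instead of deleting every newline.

```python
import html
import re

# Non-greedy, so the match stops at the first closing </p>
BODY = re.compile(r"<p>(.+?)</p>", re.S)
# One regex covers <br>, <br/>, <br /> and upper-case variants
BREAK = re.compile(r"<br\s*/?>", re.I)


def clean_chapter(page):
    """Return the chapter text from one page, or '' if no body is found."""
    found = BODY.search(page)
    if found is None:
        return ""
    text = BREAK.sub("", found.group(1))  # sub(): the pattern is a regex
    text = html.unescape(text)            # turns &nbsp; into "\xa0"
    text = text.replace("\xa0", "")       # replace(): a fixed string is enough
    # Drop empty lines and surrounding whitespace, keep one line per paragraph
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


sample = (
    "<html><p>&nbsp;&nbsp;First line.<br />\r\n"
    "&nbsp;&nbsp;Second line.<BR></p></html>"
)
print(clean_chapter(sample))
```

Returning an empty string when nothing matches lets the caller decide whether to skip the chapter or stop, instead of failing on an empty result list.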
