How to Scrape 51CTO Data with Python and Store It in MySQL

Views: 27 | Date: 2022-07-13 09:39:14

Environment

1. Install Python 3.7

2. Install the requests, bs4, and pymysql modules

Procedure

1. Install the environment and modules

See https://www.jb51.net/article/194104.htm

2. Write the code

# Insert data from 51CTO blog pages into a MySQL database
import re

import bs4
import pymysql
import requests

# Connect to the database (host, account, and password from the example setup)
db = pymysql.connect(host='172.171.13.229', user='root', passwd='abc123',
                     db='test', port=3306, charset='utf8')
# Get a cursor
cursor = db.cursor()


def open_url(url):
    # Send a browser-like User-Agent so the request looks like a normal page visit
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/57.0.2987.98 Safari/537.36'}
    res = requests.get(url, headers=headers)
    return res


# Scrape the page content
def find_text(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Blog titles
    titles = []
    targets = soup.find_all('a', class_='tit')
    for each in targets:
        each = each.text.strip()
        if '置顶' in each:  # '置顶' is the "pinned" badge; keep only the title before it
            each = each.split(' ')[0]
        titles.append(each)

    # Read counts
    reads = []
    read1 = soup.find_all('p', class_='read fl on')
    read2 = soup.find_all('p', class_='read fl')
    for each in read1:
        reads.append(each.text)
    for each in read2:
        reads.append(each.text)

    # Comment counts
    comments = []
    targets = soup.find_all('p', class_='comment fl')
    for each in targets:
        comments.append(each.text)

    # Favorites
    collects = []
    targets = soup.find_all('p', class_='collect fl')
    for each in targets:
        collects.append(each.text)

    # Publication dates: the text after the full-width colon
    dates = []
    targets = soup.find_all('a', class_='time fl')
    for each in targets:
        each = each.text.split('：')[1]
        dates.append(each)

    # Insert statement; values are passed as parameters so that quotes
    # inside a title cannot break the SQL
    sql = '''insert into blog (blog_title, read_number, comment_number, collect, dates)
             values (%s, %s, %s, %s, %s);'''

    for title, read, comment, collect, date in zip(titles, reads, comments, collects, dates):
        # Strip whitespace, including the \xa0 characters found in the page text
        read = re.sub(r'\s', '', read)
        comment = re.sub(r'\s', '', comment)
        collect = re.sub(r'\s', '', collect)
        cursor.execute(sql, (title, read, comment, collect, date))
    db.commit()


# Get the total number of pages
def find_depth(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    depth = soup.find('li', class_='next').previous_sibling.previous_sibling.text
    return int(depth)


# Main function
def main():
    host = 'https://blog.51cto.com/13760351'
    res = open_url(host)  # open the first page
    depth = find_depth(res)  # total number of pages
    # Scrape every page
    for i in range(1, depth + 1):
        url = host + '/p' + str(i)  # full page URL
        res = open_url(url)  # open the page
        find_text(res)  # scrape and store its data
    # Close the cursor
    cursor.close()
    # Close the database connection
    db.close()


if __name__ == '__main__':
    main()
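Passing the scraped values to the database driver as query parameters, rather than formatting them into the SQL string, keeps a title that contains a quote character from breaking the insert. The following is a minimal sketch of that pattern only. It assumes the standard library's sqlite3 module as a stand-in for MySQL so that it runs without a server, and the sample rows are made up for illustration. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server (assumption for illustration)
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE blog (blog_title TEXT, read_number TEXT, '
            'comment_number TEXT, collect TEXT, dates TEXT)')

# Hypothetical scraped lists, one entry per blog post
titles = ["It's a test post", 'Second post']
reads = ['120', '45']
comments = ['3', '0']
collects = ['1', '0']
dates = ['2020-07-01', '2020-07-02']

# sqlite3 uses ? placeholders; with pymysql the same statement would use %s
sql = ('INSERT INTO blog (blog_title, read_number, comment_number, collect, dates) '
       'VALUES (?, ?, ?, ?, ?)')

# zip pairs the lists position by position, one tuple per row
rows = list(zip(titles, reads, comments, collects, dates))
cur.executemany(sql, rows)
conn.commit()

stored = cur.execute(
    'SELECT blog_title, read_number FROM blog ORDER BY rowid').fetchall()
print(stored)
conn.close()
```

The first title contains an apostrophe, which would end the string literal early if it were pasted into the statement with % formatting. As a parameter it is stored unchanged.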

3. Create the corresponding table in MySQL

CREATE TABLE `blog` (
  `row_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `blog_title` varchar(52) DEFAULT NULL COMMENT 'blog title',
  `read_number` varchar(26) DEFAULT NULL COMMENT 'read count',
  `comment_number` varchar(16) DEFAULT NULL COMMENT 'comment count',
  `collect` varchar(16) DEFAULT NULL COMMENT 'favorite count',
  `dates` varchar(16) DEFAULT NULL COMMENT 'publication date',
  PRIMARY KEY (`row_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

4. Run the code and check the result

Improved version

What changes:

1. Some database fields only need to keep the digits.

2. Scraped values are strings by default. Fields that hold counts are better stored as integers, which makes later database operations easier.
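Both improvements come down to dropping every non-digit character from a scraped counter and converting what remains to an integer. A minimal sketch of that step follows. The helper name to_int and the sample strings are hypothetical and only approximate the shape of the scraped text.

```python
import re


def to_int(text):
    """Keep only the digits in a scraped string and return them as an int."""
    # \D matches any non-digit character, including spaces and \xa0
    digits = re.sub(r'\D', '', text)
    if not digits:
        # int('') would fail with a less helpful message
        raise ValueError('no digits found in %r' % text)
    return int(digits)


# Hypothetical sample strings shaped like scraped counter text
print(to_int('阅读\xa01234'))
print(to_int(' 评论 5 '))
```

Raising an explicit error for text without digits is a design choice for this sketch. A scraper could instead store 0 or NULL for such rows.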

1. The code is as follows:

import re

import bs4
import pymysql
import requests

# Connect to the database
db = pymysql.connect(host='172.171.13.229', user='root', passwd='abc123',
                     db='test', port=3306, charset='utf8')
# Get a cursor
cursor = db.cursor()


def open_url(url):
    # Send a browser-like User-Agent so the request looks like a normal page visit
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/57.0.2987.98 Safari/537.36'}
    res = requests.get(url, headers=headers)
    return res


# Scrape the page content
def find_text(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Blog titles
    titles = []
    targets = soup.find_all('a', class_='tit')
    for each in targets:
        each = each.text.strip()
        if '置顶' in each:  # '置顶' is the "pinned" badge; keep only the title before it
            each = each.split(' ')[0]
        titles.append(each)

    # Read counts
    reads = []
    read1 = soup.find_all('p', class_='read fl on')
    read2 = soup.find_all('p', class_='read fl')
    for each in read1:
        reads.append(each.text)
    for each in read2:
        reads.append(each.text)

    # Comment counts
    comments = []
    targets = soup.find_all('p', class_='comment fl')
    for each in targets:
        comments.append(each.text)

    # Favorites
    collects = []
    targets = soup.find_all('p', class_='collect fl')
    for each in targets:
        collects.append(each.text)

    # Publication dates: the text after the full-width colon
    dates = []
    targets = soup.find_all('a', class_='time fl')
    for each in targets:
        each = each.text.split('：')[1]
        dates.append(each)

    # Insert statement; values are passed as parameters so that quotes
    # inside a title cannot break the SQL
    sql = '''insert into blogs (blog_title, read_number, comment_number, collect, dates)
             values (%s, %s, %s, %s, %s);'''

    for title, read, comment, collect, date in zip(titles, reads, comments, collects, dates):
        # Strip whitespace (including \xa0), then keep digits only and convert to int
        read = re.sub(r'\s', '', read)
        read = int(re.sub(r'\D', '', read))
        comment = re.sub(r'\s', '', comment)
        comment = int(re.sub(r'\D', '', comment))
        collect = re.sub(r'\s', '', collect)
        collect = int(re.sub(r'\D', '', collect))
        date = re.sub(r'\s', '', date)
        cursor.execute(sql, (title, read, comment, collect, date))
    db.commit()


# Get the total number of pages
def find_depth(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    depth = soup.find('li', class_='next').previous_sibling.previous_sibling.text
    return int(depth)


# Main function
def main():
    host = 'https://blog.51cto.com/13760351'
    res = open_url(host)  # open the first page
    depth = find_depth(res)  # total number of pages
    # Scrape every page
    for i in range(1, depth + 1):
        url = host + '/p' + str(i)  # full page URL
        res = open_url(url)  # open the page
        find_text(res)  # scrape and store its data
    # Close the cursor
    cursor.close()
    # Close the database connection
    db.close()


# Program entry point
if __name__ == '__main__':
    main()

2. Create the corresponding table

CREATE TABLE `blogs` (
  `row_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `blog_title` varchar(52) DEFAULT NULL COMMENT 'blog title',
  `read_number` int(26) DEFAULT NULL COMMENT 'read count',
  `comment_number` int(16) DEFAULT NULL COMMENT 'comment count',
  `collect` int(16) DEFAULT NULL COMMENT 'favorite count',
  `dates` varchar(16) DEFAULT NULL COMMENT 'publication date',
  PRIMARY KEY (`row_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

3. Run the code and verify

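Upgraded version

To let beginners use this program, the project can be packaged as an exe file, so that other people can run it on their own computers without setting up Python.

1. Modify the code:

# Change the end of the script to the following,
# and add "import time" to the imports at the top:
if __name__ == '__main__':
    main()
    print('\n\t\tAll data has been stored in the database!\n')
    time.sleep(5)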

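2. Install the packaging module pyinstaller (from cmd)

pip install pyinstaller -i https://pypi.tuna.tsinghua.edu.cn/simple/

3. Package the Python code

1. Change to the directory that contains the code to be packaged

2. In the cmd window, run pyinstaller -F test03.py (test03 is the project name)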

4. Locate the exe

Packaging creates a dist directory, and the packaged exe is inside it.

5. Run the exe and check the result

Check the database

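Summary:

1. This post builds on the previous one. The steps are: scrape the first page, scrape the remaining pages, then refine the details and package the script as an exe file.

2. Most scraped web data ends up in a database, so this approach is practical.

3. The code in this post can still be improved further. The important thing is to understand the method.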
