
Example: Using Django Together with Scrapy to Scrape Data and Store It in the Database

Views: 110 | Date: 2024-09-11 11:40:06

1. Create the Scrapy project in the root directory of the Django project. Here django_12 is the Django project, ABCkg is the Scrapy crawler project, and app1 is a Django app.

2. Add the following code to Scrapy's settings.py

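import os
import sys

# Put the parent of the current working directory (the Django project root) on the import path
sys.path.append(os.path.dirname(os.path.abspath('.')))
os.environ['DJANGO_SETTINGS_MODULE'] = 'django_12.settings'  # <project name>.settings

import django
django.setup()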
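
Because os.path.abspath('.') depends on the directory the command is run from, the path above is only correct when scrapy is started from the Scrapy project folder. A minimal sketch of an alternative that derives the Django root from the location of settings.py itself is shown here. It assumes the layout django_12/ABCkg/ABCkg/settings.py, and the helper name django_root_for is hypothetical:

```python
from pathlib import Path


def django_root_for(settings_file, levels_up=2):
    """Return the directory `levels_up` levels above the folder holding settings_file.

    With the assumed layout django_12/ABCkg/ABCkg/settings.py:
    parents[0] is ABCkg/ABCkg, parents[1] is ABCkg, parents[2] is django_12.
    """
    return Path(settings_file).resolve().parents[levels_up]


# Illustrative path only; inside settings.py you would pass __file__ instead
root = django_root_for('/tmp/django_12/ABCkg/ABCkg/settings.py')
print(root.name)
```

In settings.py this could be used as sys.path.append(str(django_root_for(__file__))), which no longer depends on the working directory.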

3. Write the spider. The code below uses ABCkg as the example (abckg.py)

# -*- coding: utf-8 -*-
import scrapy

from ABCkg.items import AbckgItem


class AbckgSpider(scrapy.Spider):
    name = 'abckg'  # spider name
    allowed_domains = ['www.abckg.com']  # domains the spider is allowed to crawl
    start_urls = ['http://www.abckg.com/']  # URL of the first request

    def parse(self, response):
        """
        Parse a listing page.
        :param response: the response for the page
        :return: one request per detail page, plus a request for the next listing page
        """
        print('Response: {}'.format(response))
        listtile = response.xpath("//*[@id='container']/div/div/h2/a/text()").extract()
        listurl = response.xpath("//*[@id='container']/div/div/h2/a/@href").extract()
        for title, url in zip(listtile, listurl):
            item = AbckgItem()
            item['title'] = title
            item['url'] = url
            # Carry the half-filled item to the detail-page callback via meta
            yield scrapy.Request(url=url, callback=self.parse_content, method='GET',
                                 dont_filter=True, meta={'item': item})
        # Follow the next page; extract_first() returns None when no link matches
        nextpage = response.xpath("//*[@id='container']/div[1]/div[10]/a[last()]/@href").extract_first()
        if nextpage:
            print('About to request: {}'.format(nextpage))
            yield scrapy.Request(url=nextpage, callback=self.parse, method='GET', dont_filter=True)

    # Fetch the detail page
    def parse_content(self, response):
        item = response.meta['item']
        # Note: this XPath matches only the element whose id is post-1192
        item['content'] = response.xpath("//*[@id='post-1192']/dd/p").extract()
        print('Item: {}'.format(item))
        yield item
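
The spider passes each extracted href straight to scrapy.Request, which works only if the site emits absolute URLs. The following sketch, using only the standard library, shows one way to resolve hrefs against the page URL and drop missing values. The helper name absolute_links and the sample paths are illustrative assumptions, not taken from the site:

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve each href against page_url, skipping None and empty strings."""
    return [urljoin(page_url, href) for href in hrefs if href]


# Sample values are made up for illustration
links = absolute_links(
    'http://www.abckg.com/',
    ['/page/2', 'http://www.abckg.com/1192.html', None, ''],
)
print(links)
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the URL of the current response.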

4. Import the Django model class in Scrapy's items.py. First install scrapy-djangoitem:

pip install scrapy-djangoitem

from app1 import models
from scrapy_djangoitem import DjangoItem


class AbckgItem(DjangoItem):
    # A plain Scrapy item would declare its fields like this:
    # title = scrapy.Field()
    # url = scrapy.Field()
    # content = scrapy.Field()

    # Fixed convention for binding a Django model: the attribute must be named
    # django_model and point at the model class (here the ABCkg table in app1)
    django_model = models.ABCkg

5. Call save() in pipelines.py

import json


# Receives the items yielded by the spider's callbacks
class AbckgPipeline(object):

    def open_spider(self, spider):
        # Open the JSON file only for the abckg spider
        self.f = None
        if spider.name == 'abckg':
            self.f = open('abckg.json', mode='w', encoding='utf-8')

    def process_item(self, item, spider):
        # Optional: also write each item to the JSON file
        # if spider.name == 'abckg':
        #     self.f.write(json.dumps(dict(item), ensure_ascii=False))
        item.save()  # store the item through the Django model
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        # Close the file only if it was opened
        if self.f:
            self.f.close()
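
Scrapy runs a pipeline only when it is listed in the ITEM_PIPELINES setting, a step the article does not show. A minimal fragment for the Scrapy settings.py might look like this, assuming the pipeline class lives in ABCkg/pipelines.py; the value 300 is an arbitrary ordering priority (lower values run first):

```python
# Fragment for the Scrapy project's settings.py.
# The dotted path assumes the class is defined in ABCkg/pipelines.py.
ITEM_PIPELINES = {
    'ABCkg.pipelines.AbckgPipeline': 300,
}
```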

6. Define a model class in Django's models.py whose fields correspond to the scraped data, choosing suitable types and lengths

from django.db import models


class ABCkg(models.Model):
    title = models.CharField(max_length=30, verbose_name='Title')
    url = models.CharField(max_length=100, verbose_name='URL')
    content = models.CharField(max_length=200, verbose_name='Content')

    class Meta:
        verbose_name_plural = 'ABCkg spider'

    def __str__(self):
        return self.title
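
The spider stores content as the list returned by .extract(), while the model declares fixed-length character fields. The following is a rough sketch of normalizing scraped values before saving, under the assumption that truncating to max_length is acceptable. The helper fit_to_limits and the ABCKG_LIMITS table are hypothetical and simply mirror the lengths declared in the model:

```python
def fit_to_limits(values, limits):
    """Return a copy of `values` with lists joined into one string and
    strings cut to the length given in `limits` (fields without a limit
    are left unchanged)."""
    fitted = {}
    for field, value in values.items():
        if isinstance(value, (list, tuple)):
            value = ''.join(value)  # e.g. the list produced by .extract()
        limit = limits.get(field)
        if limit is not None:
            value = value[:limit]
        fitted[field] = value
    return fitted


# Lengths copied from the max_length values of the ABCkg model
ABCKG_LIMITS = {'title': 30, 'url': 100, 'content': 200}

row = fit_to_limits(
    {
        'title': 'T' * 40,
        'url': 'http://www.abckg.com/1.html',  # made-up sample URL
        'content': ['<p>a</p>', '<p>b</p>'],
    },
    ABCKG_LIMITS,
)
print(len(row['title']), row['content'])
```

Truncation loses data, so for long article bodies a models.TextField for content may be the better choice than a 200-character CharField.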

7. Start the spider from the command line: scrapy crawl <spider name> (here, scrapy crawl abckg)

8. Log in to the Django admin site to see the scraped data.
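
The admin site lists only models that have been registered, which the steps above do not show. A minimal registration might look like the following, assuming it is placed in app1/admin.py; it runs only inside the Django project:

```python
# app1/admin.py (assumed location)
from django.contrib import admin

from .models import ABCkg

# Make the ABCkg table visible and editable in the admin site
admin.site.register(ABCkg)
```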
