

A first pass at word2vec in Python

Date: 2022-07-22

1. Introduction

Setting up the original word2vec environment looked fairly involved at first: I spent half a day fighting with Cygwin without really getting anywhere. Then it occurred to me that there was no need to build the C version at all; a Python implementation would do. That led me to gensim: install the gensim package and you can use word2vec right away. Note that gensim's Word2Vec covers the CBOW and skip-gram training modes; if you need other variants of the model family, you will have to look at word2vec implementations in other languages.

2. Preparing the corpus

With the gensim package in hand, most tutorials online simply pass in a txt file, but few explain what that file should actually look like, what format the data takes, or provide a sample file to download. It turns out the txt file is just a large body of text that has already been word-segmented. For my own training corpus, I took 7,000 news articles I had previously scraped with a crawler and segmented them. Note that the tokens must be separated by spaces.


Segmentation here is done with jieba.

The code for this step:

import jieba

f1 = open('fenci.txt')
f2 = open('fenci_result.txt', 'a')
lines = f1.readlines()  # read the whole file
for line in lines:
    # strip tabs, newlines and any existing spaces before segmenting
    # (the original code called replace() without assigning the result,
    # which has no effect on strings)
    line = line.replace('\t', '').replace('\n', '').replace(' ', '')
    seg_list = jieba.cut(line, cut_all=False)
    f2.write(' '.join(seg_list))
f1.close()
f2.close()
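The same loop can be sketched a little more robustly with context managers, so the files are closed even if an error occurs mid-way. `tokenize` below is a hypothetical stand-in for `jieba.cut` (it just splits into characters) so the sketch stays self-contained; in the real pipeline you would call `jieba.cut(line, cut_all=False)` instead.

```python
def tokenize(text):
    # hypothetical stand-in for jieba.cut: yields one character per token;
    # replace with jieba.cut(text, cut_all=False) in the real pipeline
    return list(text)

def segment_file(src, dst):
    # read the raw corpus line by line and write one space-separated,
    # segmented document per line
    with open(src, encoding='utf-8') as f_in, \
         open(dst, 'w', encoding='utf-8') as f_out:
        for line in f_in:
            line = line.replace('\t', '').replace(' ', '').strip()
            if line:
                f_out.write(' '.join(tokenize(line)) + '\n')
```

Writing one document per line (rather than appending everything onto one line) also keeps the file usable with gensim's line-oriented corpus readers.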

One more point: the corpus must be large. Most corpora you find online run to several gigabytes. At first I used a single news article as the corpus and the results were useless: every similarity came out as 0. With 7,000 articles, the segmented fenci_result.txt came to about 20 MB. That is still small, but it was enough to get a reasonable first result.
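To make the expected input format concrete, here is a minimal sketch (with made-up toy documents, since the article's actual corpus is not available) that writes a file in the same shape as fenci_result.txt — one segmented document per line, tokens separated by single spaces — and checks that it parses back into token lists:

```python
import os
import tempfile

# toy documents, already "segmented": each document is a list of tokens
docs = [['the', 'cat', 'sat'], ['the', 'dog', 'ran']]

path = os.path.join(tempfile.mkdtemp(), 'toy_corpus.txt')

# write in the same shape as fenci_result.txt: one document per line,
# tokens joined by single spaces
with open(path, 'w', encoding='utf-8') as f:
    for doc in docs:
        f.write(' '.join(doc) + '\n')

# reading it back, each line splits into the original token list
with open(path, encoding='utf-8') as f:
    parsed = [line.split() for line in f]
```

This round trip is exactly what gensim's corpus readers rely on: whitespace-delimited tokens, one sentence or document per line.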

3. Training a model with gensim's word2vec

The relevant code:

# -*- coding: utf-8 -*-
from gensim.models import word2vec
import logging

# main program
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',
                    level=logging.INFO)
sentences = word2vec.Text8Corpus(u'fenci_result.txt')  # load the corpus
# train with 200-dimensional vectors; sg defaults to 0, i.e. CBOW (the log
# below confirms sg=0), default window=5; pass sg=1 for skip-gram.
# (In gensim >= 4.0 the parameter is vector_size instead of size.)
model = word2vec.Word2Vec(sentences, size=200)
print model

# similarity / relatedness of two words
try:
    y1 = model.similarity(u'國家', u'國務院')
except KeyError:
    y1 = 0
print u'similarity between 國家 and 國務院:', y1
print '-----\n'

# the most related words for a given word
y2 = model.most_similar(u'控煙', topn=20)  # 20 most related words
print u'words most related to 控煙:\n'
for item in y2:
    print item[0], item[1]
print '-----\n'

# analogies: 書 is to 不錯 as 質量 is to ?
print u'書 - 不錯, 質量 -'
y3 = model.most_similar([u'質量', u'不錯'], [u'書'], topn=3)
for item in y3:
    print item[0], item[1]
print '----\n'

# find the word that does not belong
y4 = model.doesnt_match(u'書 書籍 教材 很'.split())
print u'odd one out:', y4
print '-----\n'

# save the model for reuse
model.save(u'書評.model')
# corresponding load:
# model_2 = word2vec.Word2Vec.load(u'書評.model')
# store the vectors in a format the C tools can parse
# model.save_word2vec_format(u'書評.model.bin', binary=True)
# corresponding load:
# model_3 = word2vec.Word2Vec.load_word2vec_format(u'書評.model.bin', binary=True)

The output:

'D:\program files\python2.7.0\python.exe' 'D:/pycharm workspace/畢設/cluster_test/word2vec.py'
UserWarning: detected Windows; aliasing chunkize to chunkize_serial
UserWarning: Pattern library is not installed, lemmatization won't be available.
2016-12-12 15:37:43,331: INFO: collecting all words and their counts
2016-12-12 15:37:45,236: INFO: collected 99865 word types from a corpus of 3561156 raw words and 357 sentences
2016-12-12 15:37:45,236: INFO: Loading a fresh vocabulary
2016-12-12 15:37:45,413: INFO: min_count=5 retains 29982 unique words (30% of original 99865, drops 69883)
2016-12-12 15:37:45,413: INFO: min_count=5 leaves 3444018 word corpus (96% of original 3561156, drops 117138)
2016-12-12 15:37:45,615: INFO: sample=0.001 downsamples 29 most-common words
2016-12-12 15:37:45,615: INFO: downsampling leaves estimated 2804247 word corpus (81.4% of prior 3444018)
2016-12-12 15:37:45,615: INFO: estimated required memory for 29982 words and 200 dimensions: 62962200 bytes
2016-12-12 15:37:46,782: INFO: training model with 3 workers on 29982 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2016-12-12 15:37:47,818: INFO: PROGRESS: at 1.96% examples, 267531 words/s, in_qsize 6, out_qsize 0
[... PROGRESS lines elided ...]
2016-12-12 15:38:34,802: INFO: PROGRESS: at 98.71% examples, 288375 words/s, in_qsize 4, out_qsize 0
2016-12-12 15:38:35,293: INFO: training on 17805780 raw words (14021191 effective words) took 48.5s, 289037 effective words/s
2016-12-12 15:38:35,293: INFO: precomputing L2-norms of word weight vectors
Word2Vec(vocab=29982, size=200, alpha=0.025)
similarity between 國家 and 國務院: 0.387535493256
-----
words most related to 控煙:
禁煙 0.6038454175
防煙 0.585186183453
執行 0.530897378922
煙控 0.516572892666
廣而告之 0.508533298969
履約 0.507428050041
執法 0.494115233421
禁煙令 0.471616715193
修法 0.465247869492
該項 0.457907706499
落實 0.457776963711
控制 0.455987215042
這方面 0.450040221214
立法 0.44820779562
控煙辦 0.436062157154
執行力 0.432559013367
控煙會 0.430508673191
進展 0.430286765099
監管 0.429748386145
懲罰 0.429243773222
-----
書 - 不錯, 質量 -
生存 0.613928854465
穩定 0.595371186733
整體 0.592055797577
----
odd one out: 很
-----
2016-12-12 15:38:35,515: INFO: saving Word2Vec object under 書評.model, separately None
2016-12-12 15:38:36,490: INFO: saved 書評.model
Process finished with exit code 0
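The similarity scores in the output above are cosine similarities between word vectors. As a minimal, self-contained sketch (using made-up 3-dimensional toy vectors rather than the trained 200-dimensional ones), this is the computation that model.similarity performs under the hood:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors: same direction gives a score close to 1.0,
# orthogonal vectors give a score close to 0.0
same_dir = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

So the 0.3875 between 國家 and 國務院 means their vectors point in broadly similar, but far from identical, directions, which matches the modest corpus size.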


That wraps up this first pass at word2vec in Python; I hope it serves as a useful reference.

Tags: python