Computing TF-IDF with nltk

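TF-IDF (term frequency-inverse document frequency) is a statistical measure of how important a word is to one document within a corpus.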

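A word's importance increases in proportion to the number of times it appears in the document, but decreases with how frequently it appears across the corpus.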
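One common way to write this down is shown below. This is the unsmoothed, natural-log variant, which as far as I know is also what nltk's TextCollection computes; other weightings (smoothed idf, base-10 logarithms) are in use as well. The symbols are introduced here for illustration: n_{t,d} is the number of times term t occurs in document d, |d| is the number of tokens in d, and N is the number of documents in the corpus.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{\left|\{\, d' : t \in d' \,\}\right|}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A term that occurs in every document has idf = ln(1) = 0, so its TF-IDF is 0 no matter how often it appears.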

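The main idea of TF-IDF is this: if a word or phrase has a high term frequency (TF) in one article but rarely appears in other articles, it is considered to discriminate well between categories and is therefore suitable for classification.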
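To make the idea concrete, here is a minimal pure-Python sketch that needs no nltk. The function names tf, idf and tf_idf are hypothetical helpers written for this illustration, documents are assumed to be non-empty lists of tokens, and the unsmoothed natural-log idf is an assumption about which variant to use.

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: natural log of (number of documents /
    # number of documents containing the term); 0.0 if no document has it.
    matches = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / matches) if matches else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

sentences = ['this is sentence one', 'this is sentence two', 'this is sentence three']
docs = [s.split() for s in sentences]  # simple whitespace tokenization

# 'this' occurs in every document, so it carries no weight.
print(tf_idf('this', docs[1], docs))
# 'two' occurs only in the second document, so it scores above zero there.
print(tf_idf('two', docs[1], docs))
```

Here 'this' scores 0.0 because it appears in all three sentences, while 'two' scores 0.25 * ln(3), roughly 0.27, because it appears in only one.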

Reference code:

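import nltk
from nltk.text import TextCollection

# Point nltk at a local copy of its data (word_tokenize needs the tokenizer models).
# Use a raw string: in a normal string literal, '\xxx' is an invalid escape sequence.
nltk.data.path = [r'D:\xxx\nltk_data\nltk_data-gh-pages\packages']

sents = ['this is sentence one', 'this is sentence two', 'this is sentence three']
# Tokenize each sentence into a list of words; each list is one document
sents = [nltk.word_tokenize(sent) for sent in sents]
corpus = TextCollection(sents)

# Term frequency of 'two' within the given document
print(corpus.tf('two', nltk.word_tokenize('this is sentence two')))
# Inverse document frequency of 'two' across the corpus
print(corpus.idf('two'))
# TF-IDF: the product of the two values above
print(corpus.tf_idf('two', nltk.word_tokenize('this is sentence two')))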

Further reading: An Introduction to the TF-IDF Algorithm and Its Implementation

