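TF-IDF (term frequency-inverse document frequency) is a statistical measure of how important a word is to one document within a corpus.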
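A word's importance increases in proportion to the number of times it appears in the document, but decreases the more frequently the word appears across the corpus as a whole.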
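One common way to write this down is shown below. It is only one of several variants in use (others smooth or rescale the terms differently). The symbols are introduced here for illustration: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of tokens in $d$, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

With this form, a word that appears in every document has $n_t = N$, so its idf is $\log 1 = 0$ and its score is zero no matter how often it occurs.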
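The main idea of TF-IDF is this: if a word or phrase has a high term frequency (TF) in one article but rarely appears in other articles, it is considered to discriminate well between categories and is therefore well suited to classification.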
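To make the idea concrete, here is a minimal pure-Python sketch. It assumes the plain, unsmoothed variant (TF as a relative count, IDF as the natural log of the number of documents divided by the number of documents containing the word) and whitespace tokenization. The helper names `tf`, `idf`, and `tf_idf` are chosen for this illustration and do not come from any library.

```python
import math

def tf(term, doc):
    """Term frequency: fraction of the tokens in `doc` that equal `term`."""
    return doc.count(term) / len(doc) if doc else 0.0

def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing `term`)."""
    containing = sum(1 for doc in docs if term in doc)
    # Assumption: a term that occurs in no document gets an idf of 0.0
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf(term, doc, docs):
    """Product of the two quantities above."""
    return tf(term, doc) * idf(term, docs)

# Three toy documents, tokenized by splitting on whitespace
docs = [s.split() for s in ("this is sentence one",
                            "this is sentence two",
                            "this is sentence three")]

# Score every word of the second document and pick the highest-scoring one
scores = {word: tf_idf(word, docs[1], docs) for word in docs[1]}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Words shared by all three sentences ("this", "is", "sentence") score zero, so the only word with a positive score in the second sentence is "two", the one that distinguishes it from the others.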
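Reference code:

import nltk
from nltk.text import TextCollection

# Point NLTK at a local copy of its data packages (word_tokenize needs the tokenizer data).
# A raw string is used so the backslashes in the Windows path are not read as escape sequences.
nltk.data.path = [r'D:\xxx\nltk_data\nltk_data-gh-pages\packages']

sents = ['this is sentence one', 'this is sentence two', 'this is sentence three']
# Tokenize each sentence; TextCollection expects a list of token lists
sents = [nltk.word_tokenize(sent) for sent in sents]
corpus = TextCollection(sents)

# Term frequency of 'two' within one sentence
print(corpus.tf('two', nltk.word_tokenize('this is sentence two')))
# Inverse document frequency of 'two' across the corpus
print(corpus.idf('two'))
# TF-IDF score of 'two' for that sentence
print(corpus.tf_idf('two', nltk.word_tokenize('this is sentence two')))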
Further reading: "TF-IDF算法介绍及实现" (An Introduction to the TF-IDF Algorithm and Its Implementation)