- Size: 4KB; File type: .py; Coins: 1; Downloads: 0; Published: 2021-05-04
- Language: Python
- Tags: 20newsgroup python
Resource description
http://blog.csdn.net/abcjennifer/article/details/23615947
Code snippet and file information
#first extract the 20 news_group dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups
#all categories
#newsgroup_train = fetch_20newsgroups(subset='train')
#part categories
categories = ['comp.graphics',
              'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
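from sklearn import metrics

def calculate_result(actual, pred):
    # average='weighted' is set explicitly: recent scikit-learn versions need
    # an averaging mode for multiclass targets
    m_precision = metrics.precision_score(actual, pred, average='weighted')
    m_recall = metrics.recall_score(actual, pred, average='weighted')
    print('predict info:')
    print('precision:{0:.3f}'.format(m_precision))
    print('recall:{0:0.3f}'.format(m_recall))
    print('f1-score:{0:.3f}'.format(metrics.f1_score(actual, pred, average='weighted')))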
#print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))
#newsgroup_train.data holds the original documents, but we need to extract
#TF-IDF vectors in order to model the text data
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
#vectorizer = TfidfVectorizer(sublinear_tf=True,
#                             max_df=0.5,
#                             stop_words='english')
#however, a Tf-Idf feature extractor fitted separately on the training set and the
#testing set gives them different numbers of features (because their documents have
#different vocabularies), so we use HashingVectorizer
#alternate_sign=False replaces the non_negative=True option, which newer
#scikit-learn releases no longer accept
vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                               n_features=100)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
#returns the feature matrix 'fea_train' of shape [n_samples, n_features]
print('Size of fea_train:' + repr(fea_train.shape))
#11314 documents, 130107 vectors for all categories
#nnz / (rows * cols) is the share of non-zero entries in the matrix
print('The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100))
#####
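
The comments above argue that hashing keeps the training and testing sets at the same number of features even when their vocabularies differ. The following is a minimal standard-library sketch of that idea, not scikit-learn's actual implementation: the helper name hash_vectorize, the sample documents, and the use of zlib.crc32 as the hash function are all assumptions made for illustration. It also repeats the nnz / (rows * cols) calculation on the toy data.

```python
import zlib

N_FEATURES = 100  # same fixed width as n_features in the snippet above


def hash_vectorize(doc, n_features=N_FEATURES):
    """Map a document to a fixed-length count vector via the hashing trick.

    Hypothetical helper for illustration only.
    """
    vec = [0] * n_features
    for token in doc.lower().split():
        # crc32 is a stand-in hash chosen for this sketch; any stable hash works
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vec[index] += 1
    return vec


# Made-up documents; the test document shares no words with the training ones.
train_docs = ["the graphics card renders images", "windows drivers for the card"]
test_docs = ["an unseen vocabulary appears only here"]

train_vecs = [hash_vectorize(d) for d in train_docs]
test_vecs = [hash_vectorize(d) for d in test_docs]

# Both sets share one dimensionality, even though their vocabularies differ.
print(len(train_vecs[0]), len(test_vecs[0]))

# Percentage of non-zero entries, mirroring the nnz / (rows * cols) formula.
nnz = sum(1 for v in train_vecs for x in v if x != 0)
density = nnz / float(len(train_vecs) * N_FEATURES) * 100
print("non-zero entries: {0:.3f}%".format(density))
```

Because no vocabulary is stored, nothing has to be fitted before transforming new documents. The trade-off is that different tokens can collide in the same column, which becomes more likely with a width as small as 100.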