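Corpus for an SVM-Based Chinese Text Auto-Classification System

Size: 10.19 MB

File type: .zip

Coins: 2

Downloads: 0

Published: 2023-10-07
Language: Other
Tags: corpus, automatic classification, svm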

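Resource description

A corpus for an SVM-based Chinese text auto-classification system. It covers 17 categories, all crawled by the uploader. Details: http://blog.csdn.net/yinchuandong2/article/details/17717449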
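To illustrate how a categorized corpus like this one might feed an SVM, here is a minimal sketch in Python 3 using only the standard library. It is not the uploader's system: the character-bigram features, the one-vs-rest scheme, the subgradient-descent trainer, its hyperparameters, and the toy documents (including the 地理 label) are all assumptions made for illustration.

```python
from collections import Counter


def char_bigrams(text):
    """Count overlapping character bigrams (a simple stand-in for word segmentation)."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def dot(weights, features):
    """Sparse dot product between a weight dict and a feature Counter."""
    return sum(weights.get(f, 0.0) * v for f, v in features.items())


def train_binary_svm(samples, epochs=20, lr=0.1, lam=0.01):
    """Minimise L2-regularised hinge loss by subgradient descent.

    samples is a list of (features, y) pairs with y in {+1, -1}.
    """
    w = {}
    for _ in range(epochs):
        for x, y in samples:
            violated = y * dot(w, x) < 1.0
            for f in w:                      # L2 shrinkage on every step
                w[f] *= 1.0 - lr * lam
            if violated:                     # hinge subgradient step
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + lr * y * v
    return w


def train_one_vs_rest(docs):
    """docs is a list of (text, label); returns {label: weight dict}."""
    feats = [(char_bigrams(text), label) for text, label in docs]
    return {
        label: train_binary_svm([(x, 1 if lab == label else -1) for x, lab in feats])
        for label in sorted({lab for _, lab in docs})
    }


def predict(models, text):
    """Pick the label whose binary classifier gives the highest score."""
    x = char_bigrams(text)
    return max(models, key=lambda label: dot(models[label], x))


# Hypothetical toy documents; the real corpus has 17 categories.
docs = [
    ("水稻种植技术与农田灌溉", "农林"),
    ("森林病虫害防治与林业管理", "农林"),
    ("板块构造与地震带分布", "地理"),
    ("河流地貌与气候类型", "地理"),
]
models = train_one_vs_rest(docs)
print(predict(models, "水稻灌溉"))  # prints 农林
```

In practice a word segmenter and a tuned SVM library would likely do better than this sketch; it only shows the shape of the pipeline, from labeled text to features to a per-category linear classifier.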

Code snippet and file information

import urllib2
import urllib
import re
import chardet
import sys

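class HTML_Tool:
    # NOTE: the HTML tags inside these patterns were stripped when this page was
    # extracted. The tag alternatives below are reconstructed from the pattern
    # names and are assumptions, not the uploader's verified originals.
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")       # whitespace, links, images: dropped
    EndCharToNoneRex = re.compile("<.*?>")                           # any remaining tag: dropped
    BgnPartRex = re.compile("<p.*?>")                                # paragraph start: newline plus indent
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")  # block boundaries: newline
    CharToNextTabRex = re.compile("<td>")                            # table cell: tab
    # HTML entities to decode. The entity names are restored (the page had rendered
    # them); the fourth entry duplicated "&" and is assumed to have meant &quot;.
    replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"), ("&quot;", "\""), ("&nbsp;", " ")]

    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("", x)
        x = self.BgnPartRex.sub("\n    ", x)
        x = self.CharToNewLineRex.sub("\n", x)
        x = self.CharToNextTabRex.sub("\t", x)
        x = self.EndCharToNoneRex.sub("", x)

        for t in self.replaceTab:
            x = x.replace(t[0], t[1])
        return x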

class crawler:
    def __init__(self):
        self.page = 11    # number used for the next output file
        self.myTool = HTML_Tool()
        self.urllist = []
        self.index = 1    # index-page counter


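    def downloadpage(self, url):
        # Fetch one index page, download every article it links to, then request the next index page.
        myResponse = urllib2.urlopen(url)
        myPage = myResponse.read()
        typeEncode = sys.getfilesystemencoding()
        # chardet may report encoding None, so fall back to UTF-8
        infoencode = chardet.detect(myPage).get('encoding') or 'utf-8'
        html = myPage.decode(infoencode, 'ignore').encode(typeEncode)
        # The article-link pattern was lost when this page was extracted (its HTML was
        # stripped); the empty string marks the gap and must be replaced before running.
        links = re.findall('', html)
        for link in links:
            link = 'http://studa.net' + link
            self.download(link)
        self.index += 1  # was "=+ 1", which reset the index to 1 on every call
        url = "http://www.studa.net/dilidizhi/index0" + str(self.index) + ".html"
        self.getIndexPage(url)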

    def download(self, url):
        # Fetch an article and its second page ("-2.html"), then save each as a numbered .txt file.
        print url
        url2 = url.replace(".html", "-2.html")
        myResponse1 = urllib2.urlopen(url)
        myPage1 = myResponse1.read()
        myResponse2 = urllib2.urlopen(url2)
        myPage2 = myResponse2.read()
        typeEncode = sys.getfilesystemencoding()
        # chardet may report encoding None, so fall back to UTF-8
        infoencode = chardet.detect(myPage1).get('encoding') or 'utf-8'
        html1 = myPage1.decode(infoencode, 'ignore').encode(typeEncode)
        html2 = myPage2.decode(infoencode, 'ignore').encode(typeEncode)
        # The tags that delimited the article body were stripped when this page was
        # extracted; only the capture group survives, so supply the real markup around it.
        myItems1 = re.findall('(.*?)', html1, re.S)
        myItems2 = re.findall('(.*?)', html2, re.S)
        with open(str(self.page) + '.txt', 'w+') as file_object1:
            file_object1.write(self.myTool.Replace_Char(myItems1[0]))
        self.page += 1
        with open(str(self.page) + '.txt', 'w+') as file_object2:
            file_object2.write(self.myTool.Replace_Char(myItems2[0]))
        self.page += 1

    def getIndexPage(self, url):
        print url
        # page starts at 11 and grows by 2 per article, so it never equals 200; test with >=
        if self.page >= 200:
            sys.exit()
        self.downloadpage(url)


crawler().getIndexPage("http://www.studa.net/dilidizhi/index.html")
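The snippet above targets Python 2 (urllib2, print statements). As a rough Python 3 counterpart to the Replace_Char step, the sketch below applies the same sequence of substitutions and then decodes entities with the standard library's html.unescape. The tag patterns are assumptions inferred from the regex names, since the page's extraction stripped HTML from the listing, and clean_html is a hypothetical name.

```python
import html
import re

# Assumed patterns: each one plays the role its counterpart's name suggests.
DROP_REX = re.compile(r"[\t\n ]|<a\b[^>]*>|<img\b[^>]*>")    # whitespace, link and image tags
PARAGRAPH_REX = re.compile(r"<p[^>]*>")                      # paragraph start
NEWLINE_REX = re.compile(r"<br\s*/?>|</p>|<tr>|<div>|</div>")  # block boundaries
TAB_REX = re.compile(r"<td>")                                # table cell
ANY_TAG_REX = re.compile(r"<[^>]*>")                         # whatever tags remain


def clean_html(fragment):
    """Turn an HTML fragment into plain text, in the order Replace_Char uses."""
    text = DROP_REX.sub("", fragment)
    text = PARAGRAPH_REX.sub("\n    ", text)
    text = NEWLINE_REX.sub("\n", text)
    text = TAB_REX.sub("\t", text)
    text = ANY_TAG_REX.sub("", text)
    # Decode entities last so that a decoded "<" is not mistaken for a tag.
    return html.unescape(text)


print(repr(clean_html('<p class="x">A &amp; B<br/>C</p>')))  # prints '\n    A&B\nC\n'
```

Because spaces are removed before the paragraph pattern runs, `<p class="x">` has already become `<pclass="x">` by then, which is why the paragraph pattern does not use a word boundary. Dropping all spaces suits Chinese text but would fuse words in space-delimited languages.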

Attr             Size  Date       Time   Name
----------- ---------  ---------- -----  ----
     dir            0  2013-12-30 05:51  article\
     dir            0  2013-12-30 05:55  article\农林\
     file        4731  2013-12-30 05:55  article\农林\65.txt
     file         634  2013-12-30 05:55  article\农林\50.txt
     file        4482  2013-12-30 05:55  article\农林\20.txt
     file        3430  2013-12-30 05:55  article\农林\34.txt
     file        7697  2013-12-30 05:55  article\农林\72.txt
     file        5572  2013-12-30 05:55  article\农林\52.txt
     file        4169  2013-12-30 05:55  article\农林\9.txt
     file        3933  2013-12-30 05:55  article\农林\45.txt
     file        5998  2013-12-30 05:55  article\农林\10.txt
     file        2957  2013-12-30 05:55  article\农林\27.txt
     file        5434  2013-12-30 05:55  article\农林\23.txt
     file        1594  2013-12-30 05:55  article\农林\75.txt
     file        4278  2013-12-30 05:55  article\农林\63.txt
     file        2596  2013-12-30 05:55  article\农林\1.txt
     file        6722  2013-12-30 05:55  article\农林\74.txt
     file        3596  2013-12-30 05:55  article\农林\16.txt
     file        1990  2013-12-30 05:55  article\农林\44.txt
     file        4474  2013-12-30 05:55  article\农林\78.txt
     file        2224  2013-12-30 05:55  article\农林\2.txt
     file        4001  2013-12-30 05:55  article\农林\15.txt
     file        2907  2013-12-30 05:55  article\农林\67.txt
     file        6233  2013-12-30 05:55  article\农林\80.txt
     file        9977  2013-12-30 05:55  article\农林\71.txt
     file        1636  2013-12-30 05:55  article\农林\13.txt
     file        2492  2013-12-30 05:55  article\农林\76.txt
     file        7723  2013-12-30 05:55  article\农林\51.txt
     file        2667  2013-12-30 05:55  article\农林\32.txt
     file        2896  2013-12-30 05:55  article\农林\56.txt
     file        2145  2013-12-30 05:55  article\农林\40.txt
............ 3222 more file entries omitted
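The listing shows one folder per category under article\, each holding numbered .txt files. A minimal Python 3 sketch for reading such a layout into parallel lists of texts and labels could look like the following. The function names are hypothetical, and the candidate encodings (UTF-8, then GBK) are an assumption: the crawler writes files in the filesystem encoding of the machine it ran on, which this page does not state.

```python
from pathlib import Path


def read_text(path, encodings=("utf-8", "gbk")):
    """Decode a file with the first candidate encoding that works (assumed candidates)."""
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep whatever decodes cleanly under the first candidate.
    return raw.decode(encodings[0], errors="ignore")


def load_corpus(root):
    """Return (texts, labels) from a root/<category>/<n>.txt layout."""
    texts, labels = [], []
    for category_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for txt in sorted(category_dir.glob("*.txt")):
            texts.append(read_text(txt))
            labels.append(category_dir.name)  # the folder name is the category label
    return texts, labels
```

After unpacking the archive, a call such as `texts, labels = load_corpus("article")` would give one label per document, ready for feature extraction.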
