Chapter-level analysis of Dream of the Red Chamber (红楼梦).

Size: 9 KB

File type: .py

Published: 2021-10-08
Language: Python
Tags: data analysis

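Resource description

A chapter-level analysis of Dream of the Red Chamber. The source text of the novel is not included, so the script cannot be run as shipped. The code itself is complete, covers a range of functions, and produces several different views of the data, so it can serve as a reference.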

Code snippet and file information

# -*- coding: utf-8 -*-
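"""
Created on Tue Jun 19 15:38:44 2018

@author: lenovo

Cluster the chapters using a TF-IDF matrix.

Cosine similarity: measures how similar two vectors are by the cosine of
the angle between them.
When the cosine between two text vectors equals 1, the two texts have the
same term-weight profile (for example, exact duplicates);
when the cosine is close to 1, the two texts are similar; the smaller the
cosine, the less related the two texts are.

k-means clustering: for a given sample set A, partition A into K clusters
A_1, A_2, ..., A_K according to the distances between samples,
so that the points inside a cluster lie as close together as possible and
the clusters lie as far apart as possible.
K-means is an unsupervised clustering algorithm.
Its goal is for every point to belong to the cluster A_i whose mean (the
cluster centre) is nearest to that point.
The cluster analysis here uses the nltk library.

The program below clusters the data with k-means, obtains the cluster that
each chapter belongs to,
and uses a bar chart to show how many chapters fall into each cluster.

Also covered: MDS dimensionality reduction, PCA dimensionality reduction,
hierarchical clustering (HC).
"""

# [1] Load the packages and prepare the data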

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
import jieba   # must be installed first: pip install jieba
from pandas import read_csv
from scipy.cluster.hierarchy import dendrogram, ward
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.manifold import MDS
from sklearn.decomposition import PCA
import nltk
from nltk.cluster.kmeans import KMeansClusterer

## Set the plot font and the pandas display options
font = FontProperties(fname="C:/Windows/Fonts/Hiragino Sans GB W3.otf", size=14)

pd.set_option("display.max_rows", 8)
pd.options.mode.chained_assignment = None  # default='warn'

## Read the stop words and the custom dictionary
stopword = read_csv(r"C:\Users\yubg\OneDrive\2018book\syl-hlm\my_stop_words.txt", header=None, names=["Stopwords"])
mydict = read_csv(r"C:\Users\yubg\OneDrive\2018book\syl-hlm\red_dictionary.txt", header=None, names=["Dictionary"])

## Read the text of the novel, one paragraph per row
RedDream = read_csv(r"C:\Users\yubg\OneDrive\2018book\syl-hlm\red_UTF82.txt", header=None, names=["Reddream"])


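# Drop blank rows and unneeded paragraphs, then reset the index
np.sum(pd.isnull(RedDream))  # check whether there are blank rows; drop them if so
indexjuan = RedDream.Reddream.str.contains("^第+.+卷")  # flag the volume-heading rows with a regular expression
RedDream = RedDream[~indexjuan].reset_index(drop=True)  ## drop the flagged rows and reset the index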


## Find the first and last row index of each chapter
## Title line of each chapter
indexhui = RedDream.Reddream.str.match("^第+.+回")
chapnames = RedDream.Reddream[indexhui].reset_index(drop=True)

## Process the chapter titles: split each string on spaces
chapnamesplit = chapnames.str.split(" ").reset_index(drop=True)

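## Build the table that holds the data
Red_df = pd.DataFrame(list(chapnamesplit), columns=["Chapter", "Leftname", "Rightname"])
## Add new variables
Red_df["Chapter2"] = np.arange(1, 121)  # chapter numbers 1 to 120
Red_df["ChapName"] = Red_df.Leftname + "," + Red_df.Rightname
## Row (paragraph) index at which each chapter starts
Red_df["StartCid"] = indexhui[indexhui == True].index
## Row index at which each chapter ends: one row before the next chapter starts
Red_df["endCid"] = Red_df["StartCid"][1:len(Red_df["StartCid"])].reset_index(drop=True) - 1
## The last chapter ends at the last row of the text
Red_df.loc[Red_df.index[-1], "endCid"] = RedDream.index[-1]
## Length of each chapter in paragraphs
Red_df["Lengthchaps"] = Red_df.endCid - Red_df.StartCid
Red_df["Artical"] = "Artical"  # placeholder, filled in per chapter below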

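## Content of each chapter
for ii in Red_df.index:
    ## Row indices of the chapter body (the title row is skipped)
    chapid = np.arange(Red_df.StartCid[ii] + 1, int(Red_df.endCid[ii]))
    ## Join the paragraphs with "" and strip the full-width spaces
    Red_df.loc[ii, "Artical"] = "".join(list(RedDream.Reddream[chapid])).replace("\u3000", "")
## Count the characters in each chapter
Red_df["lenzi"] = Red_df.Artical.apply(len)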


## Segment the full text of the novel into words
## Shape (rows, columns) of the table
rowcol = Red_df.shape
## Predefine the list
Red_df["cutword"]
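
The script's header describes comparing chapters by the cosine of the angle between their TF-IDF vectors. The following is a minimal sketch of that idea in plain Python, not the scikit-learn calls the script imports. The helper names `tfidf_vectors` and `cosine_similarity`, the smoothed IDF formula, and the toy token lists are all assumptions made for illustration; real input would be the jieba-segmented chapters.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenised documents into sparse TF-IDF dicts (term -> weight)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Smoothed IDF (an assumption), so a term found in every document
        # still keeps a positive weight.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm_u * norm_v)


# Hypothetical, already segmented "chapters" (illustrative tokens only).
toy_chapters = [
    ["宝玉", "黛玉", "花", "诗"],
    ["宝玉", "黛玉", "花", "诗"],
    ["宝玉", "黛玉", "诗", "酒"],
    ["凤姐", "银子", "账"],
]
vecs = tfidf_vectors(toy_chapters)
sim_same = cosine_similarity(vecs[0], vecs[1])   # identical token lists
sim_close = cosine_similarity(vecs[0], vecs[2])  # three shared terms
sim_far = cosine_similarity(vecs[0], vecs[3])    # no shared terms
```

Storing each vector as a dict keeps only the non-zero weights, which suits chapter vocabularies where most terms are absent from any one chapter.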

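The header also describes k-means: every point is assigned to the cluster whose mean is nearest, and the number of chapters per cluster is then charted. The sketch below shows those two alternating steps in plain Python with Euclidean distance. It is not nltk's `KMeansClusterer`, which the script imports; the function name `kmeans`, the hand-picked initial centres, and the 2-D toy points are assumptions for illustration.

```python
import math
from collections import Counter


def kmeans(points, initial_centres, max_iter=100):
    """Plain k-means on equal-length tuples; returns (labels, centres)."""
    centres = [tuple(c) for c in initial_centres]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centre.
        new_labels = [
            min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centres are stable too
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centre
                centres[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres


# Hypothetical 2-D "chapter vectors"; real TF-IDF vectors have far more dimensions.
toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
labels, centres = kmeans(toy_points, initial_centres=[toy_points[0], toy_points[3]])

# Chapters per cluster: the counts a bar chart of cluster sizes would display.
cluster_sizes = Counter(labels)
```

Passing the initial centres explicitly makes the result reproducible; library implementations usually pick them at random and repeat the run several times.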