Crawler source code: paginated scraping with a MySQL database connection

Size: 7KB

File type: .py

Published: 2021-06-01
Language: Python
Tags: python crawler

Resource description

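What this crawler does: pick any film on Douban, scrape the film's details, automatically extract the link to its short-comment page, follow that link to collect viewers' comments, and finally store the scraped data in a database. Development environment: Python 3 + PyCharm + Windows + MySQL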
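
The database step described here does not appear in the code excerpt on this page. A minimal sketch of how it could look, assuming a hypothetical table `movie_comment(movie_name, comment)` and a DB-API connection whose driver uses `%s` placeholders (PyMySQL does). The function name, table, and columns are illustrative and are not taken from the original file:

```python
def save_comments(conn, movie_name, comments):
    """Insert one row per comment and return the number of rows written.

    conn is any DB-API connection whose driver uses %s placeholders;
    the table and column names below are hypothetical.
    """
    sql = "INSERT INTO movie_comment (movie_name, comment) VALUES (%s, %s)"
    # Strip surrounding whitespace and skip empty comments.
    rows = [(movie_name, c.strip()) for c in comments if c and c.strip()]
    if not rows:
        return 0
    cur = conn.cursor()
    try:
        # Parameterized query: the driver escapes quotes in the comment text.
        cur.executemany(sql, rows)
        conn.commit()
    except Exception:
        # Undo the partial batch so the table is not left half-written.
        conn.rollback()
        raise
    finally:
        cur.close()
    return len(rows)
```

With PyMySQL, `conn` would typically come from `pymysql.connect(host=..., user=..., password=..., database=..., charset="utf8mb4")`; the connection values depend on the local MySQL setup.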

Code snippet and file information

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import re
import os
import pymysql
#from itertools import islice

# Fetch a page and build a BeautifulSoup object from it
def get_soup(url):
    res = requests.get(url)  # fetch the page
    res.encoding = 'utf-8'  # force UTF-8 to avoid decoding problems
    #print(res.text)
    soup = BeautifulSoup(res.text, 'html.parser')
    # print(soup)
    return (soup, res)  # res is returned as well, for the regex step later on

# Scrape part of the film's details
# Note: the separator strings passed to f.write() are empty in the listing;
# "\n" is assumed here.
def get_movie_comment(movie_url):
    (soup, res) = get_soup(movie_url)
    f = open("F:\\SpiderDB\\movie.txt", "a", encoding='utf-8')
    for items in soup.select('#content'):
        #print(items)
        movie_name = items.select('span')[0].text
        print(movie_name)
        f.write(movie_name + "\n")
        for info in soup.select('#info'):
            #print(info)
            director = info.select('.attrs')[0].text
            print(director)
            editor = info.select('.attrs')[1].text
            actors = info.select('.attrs')[2].text.strip()
            actor = actors.split("/")[0:2]  # keep only the first two actors
            actor = ''.join(actor)
            print(actor)
            style = ' '.join([style.text for style in info.select('span[property="v:genre"]')])  # genres, joined in one line
            print(style)
            time = info.select('span[property="v:initialReleaseDate"]')[0].text
            time = re.split(r'\(', time)[0]  # drop the "(region)" suffix after the date
            print(time)
            comment_url = items.select('.mod-hd span.pl a')[0]['href']
            #print(type(comment_url))
            print(comment_url)
            f.write(director + "\n")
            f.write(editor + "\n")
            f.write(actor + "\n")
            f.write(style + "\n")
            f.write(time + "\n")
            f.write(comment_url + '\n')
            f.close()
            get_eval(comment_url)
            for i in range(20, 200, 20):  # start=20, step 20
                new_url = comment_url.replace("status=P", "start={}").format(i)
                #print(new_url)
                get_comment(new_url)  # get_comment is not part of this excerpt

# Scrape the overall rating breakdown from the comment page
# Note: the separator strings passed to f.write() are empty in the listing;
# "\n" is assumed here.
def get_eval(comment_url):
    (soup1, res1) = get_soup(comment_url)
    # Page heading with the trailing " 短评" ("short comments") removed
    movie_comment = soup1.select('#content h1')[0].text.rsplit(' 短评')[0]
    f = open("F:\\SpiderDB\\movie.txt", "a", encoding='utf-8')
    f.write(movie_comment + "\n")
    for it in soup1.select('.comment-filter'):
        #print(it)
        good_eval = it.select('.filter-name')[1].text   # rating label
        good_cp = it.select('span.comment-percent')[1].text    # share of positive ratings
        common_eval = it.select('.filter-name')[2].text
        common_cp = it.select('span.comment-percent')[2].text
        bad_eval = it.select('.filter-name')[3].text
        bad_cp = it.select('span.comment-percent')[3].text
        # print(common_eval)
        # print(bad_eval)
        f.write(good_eval + ":")
        f.write(good_cp + "\n")
        f.write(common_eval + ":")
        f.write(common_cp + "\n")
        f.write(bad_eval + ":")
        f.write(bad_cp)
        f  # the excerpt is cut off here
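
The paging loop in get_movie_comment builds each page URL by replacing the `status=P` substring with `start=N`. A sketch of the same idea using `urllib.parse`, which sets `start` without depending on the exact query text. `page_urls` is a hypothetical helper, and unlike the original it keeps the other query parameters instead of replacing `status=P`:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(comment_url, first=20, stop=200, step=20):
    """Return comment-page URLs with start=first, first+step, ... (below stop)."""
    parts = urlsplit(comment_url)
    # Keep every existing query parameter except a previous "start".
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "start"
    ]
    urls = []
    for offset in range(first, stop, step):
        query = urlencode(params + [("start", str(offset))])
        urls.append(
            urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))
        )
    return urls
```

Each returned URL could then be passed to the comment scraper in place of the string produced by `replace(...).format(i)`. The defaults mirror the loop's `range(20, 200, 20)`.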
