Multithreaded Crawler

Size: 7KB

File type: .py

Published: 2021-06-15
Language: Python
Resource description

A multithreaded Python crawler that uses the threading and queue modules to synchronize threads.
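
The producer/consumer arrangement this description refers to can be sketched roughly as follows. This is a minimal illustration and not the author's code. It assumes Python 3 (where the module is named queue rather than Python 2's Queue), performs no network access, and uses hypothetical names (SENTINEL, producer, consumer, run). The bounded queue size of 12 is taken from the snippet's version notes.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-work marker placed on the queue


def producer(q, items):
    """Put each work item on the queue; put() blocks while the queue is full."""
    for item in items:
        q.put(item)
    q.put(SENTINEL)  # tell the consumer there is nothing more to process


def consumer(q, results, lock):
    """Take items off the queue until the sentinel arrives."""
    while True:
        item = q.get()  # blocks while the queue is empty
        if item is SENTINEL:
            break
        # Stand-in for downloading and saving an image.
        with lock:  # guard the shared list, as a Lock would guard a shared counter
            results.append("saved:" + item)


def run(items, maxsize=12):
    """Run one producer and one consumer over a bounded queue."""
    q = queue.Queue(maxsize=maxsize)
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=producer, args=(q, items)),
        threading.Thread(target=consumer, args=(q, results, lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a single consumer the results keep the order in which the producer queued them. The bounded queue keeps the producer from running far ahead of the consumer, since put() waits whenever 12 items are already pending.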
Code snippet and file information

"""
A web crawler for tao nv lang. Grabs all of the MMs' images into a hierarchical structure:
MM's name (folder) --> MM's albums (folder) --> MM's images (files)

Run the code in three steps:
1. Create an instance of the web crawler class.
2. Select a range of pages with a start index and an end index.
3. Call the method save_all(start, end).

All images are saved under the current working directory.

Note that folder names contain Chinese characters and may be unreadable
if your computer does not support the Chinese language.

Author:	Alien_gmx
Date:	10/26/2015
Version: 2.0
Version description:
Version 2.0 is a multithreaded implementation of the crawler.

1. Modified the inner functions get_album_contents and get_images_urls.
2. Implemented one producer and one consumer.
3. Set the max queue size to 12.
"""
import urllib
import urllib2
import re
import os
from threading import Thread, Lock, Condition
from Queue import Queue


# module-level state shared between the producer and consumer threads
images_index = 0
images_q = Queue(maxsize=12)
threadLock = Lock()
#condition = Condition()
root_path = os.getcwd()

class crawler_producer(Thread):
    def __init__(self, start_index, end_index):
        Thread.__init__(self)
        self.mainurl = 'http://mm.taobao.com/json/request_top_list.htm'
        self.start_index = start_index
        self.end_index = end_index

    # get page information by page index
    def get_page(self, index):
        url = self.mainurl + '?page=' + str(index)
        req = urllib2.Request(url)
        resp = urllib2.urlopen(req)
        # the page is encoded as GBK, a Chinese character encoding
        return resp.read().decode('gbk')

    # get page information from a single url
    def get_details(self, url):
        resp = urllib2.urlopen(url)
        return resp.read().decode('gbk')

    # get main page contents: [0] url of mm's page, [1] user id, [2] name, [3] age, [4] city
    def get_contents(self, index):
        page = self.get_page(index)
        pattern = re.compile('(.*?).*?(.*?).*?(.*?)',
                             re.S)
        items = re.findall(pattern, page)
        content = []
        for i in items:
            content.append([i[0], i[1], i[2], i[3], i[4]])
        return content

    def new_folder(self, path, name):
        os.chdir(path)
        name = name.strip()
        name = name.strip('.')  # strip leading and trailing '.' from the folder name
        is_exists = os.path.exists(name)
        if not is_exists:
            print 'Creating a new folder:', name
            print path
            os.makedirs(name)
        else:
            print 'The folder named', name, 'exists'

        return_path = path + '/' + name
        return return_path

    def get_images_urls(self, user_id, album_id, path_album):
        ma
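
The new_folder method in the snippet above calls os.chdir, which changes the working directory for the whole process and is therefore shared by every thread. As a hedged alternative, the sketch below builds the same name/album folder hierarchy from full paths without changing directory. It assumes Python 3 (for the exist_ok argument of os.makedirs), and make_folder is a hypothetical name that does not appear in the original file.

```python
import os


def make_folder(parent, name):
    """Create parent/name if it does not exist and return the full path.

    Leading and trailing whitespace and '.' are stripped from the name,
    mirroring the cleanup done in the original new_folder method.
    """
    name = name.strip().strip('.')
    path = os.path.join(parent, name)
    # exist_ok=True makes the call safe if the folder is already there
    os.makedirs(path, exist_ok=True)
    return path
```

Calling make_folder(make_folder(root, name), album) would produce the nested name/album layout described in the docstring, where root, name and album stand for whatever values the crawler supplies.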