Crawler: web crawler software for easily fetching web resources

Size: 4KB

File type: .java

Published: 2021-06-06
Language: Java
Tags: crawler


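Resource overview

A web crawler that makes it easy to fetch web resources. Web crawlers download pages from the World Wide Web on behalf of search engines. They are generally divided into traditional (general-purpose) crawlers and focused crawlers.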
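
One way to picture the difference between the two kinds of crawler is as a difference in which links are admitted to the queue of pages to visit: a traditional crawler admits every unseen link, while a focused crawler admits only links judged relevant to a topic. The sketch below is illustrative only and is not part of the downloadable file; the class name `FrontierSketch` and the keyword test are hypothetical, and a real focused crawler would normally use a much richer relevance measure than a substring match.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical URL queue whose admission policy distinguishes the two crawler types. */
public class FrontierSketch {
    private final Queue<String> waiting = new ArrayDeque<>(); // URLs still to visit
    private final Set<String> seen = new HashSet<>();         // URLs already admitted
    private final Predicate<String> admit;                    // admission policy

    public FrontierSketch(Predicate<String> admit) {
        this.admit = admit;
    }

    /** Traditional policy: every link is admitted. */
    public static FrontierSketch traditional() {
        return new FrontierSketch(link -> true);
    }

    /** Focused policy (assumed for illustration): admit links containing a keyword. */
    public static FrontierSketch focused(String keyword) {
        String needle = keyword.toLowerCase();
        return new FrontierSketch(link -> link.toLowerCase().contains(needle));
    }

    /** Queues the link if the policy admits it and it has not been queued before. */
    public boolean offer(String link) {
        if (!admit.test(link) || !seen.add(link)) {
            return false;
        }
        waiting.add(link);
        return true;
    }

    /** Returns the next URL to visit, or null when the queue is empty. */
    public String next() {
        return waiting.poll();
    }

    public int size() {
        return waiting.size();
    }
}
```

With `FrontierSketch.focused("java")`, a link such as `http://example.com/python` would be rejected, whereas `FrontierSketch.traditional()` would queue it.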

Code snippet and file information

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

public class Crawler {

	private List<String> urlWaiting = new ArrayList<String>();		// URLs waiting to be processed
	private List<String> urlProcessed = new ArrayList<String>();	// URLs that have been processed
	private List<String> urlError = new ArrayList<String>();		// URLs that resulted in an error
	
	private int numFindUrl = 0;		// number of URLs found so far

	public Crawler() {}

	
	/**
	 * Start crawling: process queued URLs until the waiting list is empty,
	 * then log summary counts.
	 */
	public void begin() {
		
		while (!urlWaiting.isEmpty()) {
			processURL(urlWaiting.remove(0));
		}
		
		log("finish crawling");
		log("the number of urls that were found:" + numFindUrl);
		log("the number of urls that were processed:" + urlProcessed.size());
		log("the number of urls that resulted in an error:" + urlError.size());
	}

	/**
	 * Called internally to process a URL
	 * 
	 * @param strUrl
	 *            The URL to be processed.
	 */
	public void processURL(String strUrl) {
		URL url = null;
		try {
			url = new URL(strUrl);
			log("Processing: " + url);
			// open a connection to the URL and identify the crawler
			URLConnection connection = url.openConnection();
			connection.setRequestProperty("User-Agent", "Test Crawler for Course NIR");

			// skip anything that is not a text/* content type
			if ((connection.getContentType() != null)
					&& !connection.getContentType().toLowerCase()
							.startsWith("text/")) {
				log("Not processing because content type is: "
						+ connection.getContentType());
				return;
			}

			// read the URL's contents
			InputStream is = connection.getInputStream();
			Reader r = new InputStreamReader(is);
			// parse the page content
			HTMLEditorKit.Parser parse = new HTMLParse().getParser();
			parse.parse(r, new Parser(url), true);
		} catch (IOException e) {
			urlError
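
The snippet is cut off here, and the `HTMLParse` and `Parser` classes it relies on are not shown. As a rough illustration of how links might be pulled out of a page with the same Swing HTML classes, here is a minimal sketch. It is not the author's implementation: `LinkExtractorSketch`, `LinkCollector`, and `extractLinks` are hypothetical names, and it swaps in the public `ParserDelegator` parser instead of the unseen `HTMLParse` wrapper.

```java
import java.io.IOException;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtractorSketch {

    /** Callback that records the target of every <a href="..."> tag it sees. */
    static class LinkCollector extends HTMLEditorKit.ParserCallback {
        private final URL base;                       // page URL, used to resolve relative links
        final List<String> links = new ArrayList<>(); // absolute link targets, in document order

        LinkCollector(URL base) {
            this.base = base;
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag != HTML.Tag.A) {
                return; // only anchor tags are of interest
            }
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href == null) {
                return; // anchor without a target
            }
            try {
                // resolve the href against the page URL
                links.add(new URL(base, href.toString()).toString());
            } catch (MalformedURLException e) {
                // ignore links that cannot be turned into a URL
            }
        }
    }

    /** Parses the HTML from the reader and returns the links found in it. */
    public static List<String> extractLinks(URL base, Reader html) throws IOException {
        LinkCollector collector = new LinkCollector(base);
        new ParserDelegator().parse(html, collector, true);
        return collector.links;
    }
}
```

In a crawler shaped like the one above, each returned link would presumably be appended to the waiting list if it had not already been queued or processed.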
