Resource overview

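This system is built on the SpringBoot framework, integrated with several other technologies. It uses the webmagic framework to implement a single-node web crawler whose life cycle consists of link extraction, page download, content extraction, and persistence. Pages are fetched by multiple threads, and Redis queues and sets provide page deduplication and incremental crawling. The search engine's indexing and search subsystem is built on the full-text search engine framework ElasticSearch, with the IK analyzer handling word segmentation. ElasticSearch is a distributed, highly scalable, real-time enterprise engine for search and data analysis. It can search many kinds of documents, offers scalable search, and supports efficient searching, analysis, and exploration of large volumes of data. Finally, a simple web search page is implemented to simulate a search engine client.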
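The deduplication and incremental-crawling idea described above can be sketched as follows. This is a minimal illustration only, not the project's code: the class `UrlFrontier` and its methods are hypothetical, and a `HashSet` plus an `ArrayDeque` stand in for the Redis set and Redis queue so that the example runs without a Redis server.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/**
 * In-memory stand-in for the Redis queue + set pair (hypothetical class,
 * not taken from the project). In a real deployment both structures would
 * live in Redis so that they survive restarts and can be shared.
 */
final class UrlFrontier {
    // Plays the role of the Redis set: every URL accepted so far.
    private final Set<String> seen = new HashSet<>();
    // Plays the role of the Redis queue: URLs waiting to be downloaded.
    private final Queue<String> pending = new ArrayDeque<>();

    /** Enqueues the URL only if it was never seen before; returns true if enqueued. */
    public synchronized boolean offer(String url) {
        if (url == null || url.isEmpty()) {
            return false;
        }
        // Set.add returns false for a duplicate, so a repeated URL is dropped here.
        if (!seen.add(url)) {
            return false;
        }
        pending.add(url);
        return true;
    }

    /** Returns the next URL to download, or null when nothing is waiting. */
    public synchronized String poll() {
        return pending.poll();
    }

    /** Number of URLs still waiting to be downloaded. */
    public synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        UrlFrontier frontier = new UrlFrontier();
        String[] extractedLinks = {
            "https://example.com/a",
            "https://example.com/b",
            "https://example.com/a" // duplicate link found on another page
        };
        for (String link : extractedLinks) {
            String outcome = frontier.offer(link) ? "queued" : "skipped (already seen)";
            System.out.println(link + " -> " + outcome);
        }
        // Worker threads would call poll() in a loop; methods are synchronized for that reason.
        System.out.println("next to download: " + frontier.poll());
    }
}
```

Because the set of seen URLs is kept separately from the pending queue, a URL that has already been downloaded is still rejected on a later pass, which is what makes a repeated crawl incremental: only links that were never seen are queued again.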

Code snippet and file information

/*
 * Copyright 2007-present the original author or authors.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      https://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.net.*;
import java.io.*;
import java.nio.channels.*;
import java.util.Properties;

public class MavenWrapperDownloader {

    private static final String WRAPPER_VERSION = "0.5.6";
    /**
     * Default URL to download the maven-wrapper.jar from, if no 'downloadUrl' is provided.
     */
    private static final String DEFAULT_DOWNLOAD_URL = "https://repo.maven.apache.org/maven2/io/takari/maven-wrapper/"
            + WRAPPER_VERSION + "/maven-wrapper-" + WRAPPER_VERSION + ".jar";

    /**
     * Path to the maven-wrapper.properties file, which might contain a downloadUrl property to
     * use instead of the default one.
     */
    private static final String MAVEN_WRAPPER_PROPERTIES_PATH =
            ".mvn/wrapper/maven-wrapper.properties";

    /**
     * Path where the maven-wrapper.jar will be saved to.
     */
    private static final String MAVEN_WRAPPER_JAR_PATH =
            ".mvn/wrapper/maven-wrapper.jar";

    /**
     * Name of the property which should be used to override the default download url for the wrapper.
     */
    private static final String PROPERTY_NAME_WRAPPER_URL = "wrapperUrl";

    public static void main(String args[]) {
        System.out.println("- Downloader started");
        File baseDirectory = new File(args[0]);
        System.out.println("- Using base directory: " + baseDirectory.getAbsolutePath());

        // If the maven-wrapper.properties exists, read it and check if it contains a custom
        // wrapperUrl parameter.
        File mavenWrapperPropertyFile = new File(baseDirectory, MAVEN_WRAPPER_PROPERTIES_PATH);
        String url = DEFAULT_DOWNLOAD_URL;
        if (mavenWrapperPropertyFile.exists()) {
            FileInputStream mavenWrapperPropertyFileInputStream = null;
            try {
                mavenWrapperPropertyFileInputStream = new FileInputStream(mavenWrapperPropertyFile);
                Properties mavenWrapperProperties = new Properties();
                mavenWrapperProperties.load(mavenWrapperPropertyFileInputStream);
                // Fall back to the default URL when the property is absent.
                url = mavenWrapperProperties.getProperty(PROPERTY_NAME_WRAPPER_URL, url);
            } catch (IOException e) {
                System.out.println("- ERROR loading '" + MAVEN_WRAPPER_PROPERTIES_PATH + "'");
            } finally {
                try {
                    if (mavenWrapperPropertyFileInputStream != null) {
 

  Attribute      Size        Date  Time  Name
----------- ---------  ---------- -----  ----
        Dir         0  2020-06-14 12:54  search-engine\
        Dir         0  2020-06-14 12:53  search-engine\search-engine\
       File       333  2020-04-23 10:57  search-engine\search-engine\.gitignore
        Dir         0  2020-06-14 12:53  search-engine\search-engine\.idea\
       File       184  2020-04-29 21:00  search-engine\search-engine\.idea\.gitignore
        Dir         0  2020-06-14 12:53  search-engine\search-engine\.idea\artifacts\
       File       485  2020-04-23 11:16  search-engine\search-engine\.idea\artifacts\search_engine_war.xml
       File     21281  2020-05-05 00:31  search-engine\search-engine\.idea\artifacts\search_engine_war_exploded.xml
       File       830  2020-04-23 11:16  search-engine\search-engine\.idea\compiler.xml
        Dir         0  2020-06-14 12:53  search-engine\search-engine\.idea\dataSources\
        Dir         0  2020-06-14 12:53  search-engine\search-engine\.idea\dataSources\7f4171d0-2398-4e1d-a9d2-9c84daaa0f0d\
        Dir         0  2020-06-14 12:53  search-engine\search-engine\.idea\dataSources\7f4171d0-2398-4e1d-a9d2-9c84daaa0f0d\storage_v2\
        Dir         0  2020-06-14 12:53  search-engine\search-engine\.idea\dataSources\7f4171d0-2398-4e1d-a9d2-9c84daaa0f0d\storage_v2\_src_\
        Dir         0  2020-06-14 12:53  search-engine\search-engine\.idea\dataSources\7f4171d0-2398-4e1d-a9d2-9c84daaa0f0d\storage_v2\_src_\schema\
       File        76  2020-04-25 17:59  search-engine\search-engine\.idea\dataSources\7f4171d0-2398-4e1d-a9d2-9c84daaa0f0d\storage_v2\_src_\schema\information_schema.FNRwLQ.meta
       File     29120  2020-05-05 01:49  search-engine\search-engine\.idea\dataSources\7f4171d0-2398-4e1d-a9d2-9c84daaa0f0d.xml
       File       984  2020-04-25 18:01  search-engine\search-engine\.idea\dataSources.local.xml
       File       525  2020-04-25 17:58  search-engine\search-engine\.idea\dataSources.xml
        Dir         0  2020-06-14 12:53  search-engine\search-engine\.idea\dictionaries\
       File       490  2020-04-30 18:16  search-engine\search-engine\.idea\dictionaries\qirui.xml
       File       267  2020-04-29 21:01  search-engine\search-engine\.idea\encodings.xml
       File       864  2020-04-29 21:08  search-engine\search-engine\.idea\jarRepositories.xml
        Dir         0  2020-06-14 12:53  search-engine\search-engine\.idea\libraries\
       File       462  2020-04-29 21:08  search-engine\search-engine\.idea\libraries\Maven__antlr_antlr_2_7_7.xml
       File       568  2020-04-29 21:08  search-engine\search-engine\.idea\libraries\Maven__ch_qos_logback_logback_classic_1_2_3.xml
       File       547  2020-04-29 21:08  search-engine\search-engine\.idea\libraries\Maven__ch_qos_logback_logback_core_1_2_3.xml
       File       543  2020-04-29 21:08  search-engine\search-engine\.idea\libraries\Maven__commons_codec_commons_codec_1_13.xml
       File       616  2020-04-29 21:08  search-engine\search-engine\.idea\libraries\Maven__commons_collections_commons_collections_3_2_2.xml
       File       517  2020-04-29 21:08  search-engine\search-engine\.idea\libraries\Maven__commons_io_commons_io_1_3_2.xml
       File       514  2020-04-29 21:08  search-engine\search-engine\.idea\libraries\Maven__com_alibaba_fastjson_1_2_28.xml
       File       499  2020-04-29 21:08  search-engine\search-engine\.idea\libraries\Maven__com_carrotsearch_hppc_0_7_1.xml
............ information for 300 more files omitted here
