
randak/charlotte

Composer install command:

composer require randak/charlotte

Package summary

Charlotte crawls through your website and captures data in a database, including links between pages, H1 elements, scripts, stylesheets, and keywords.

README

Author: Kristian Randall <kristian.l.randall@gmail.com>. Copyright 2014.

PHP-based web crawler for site analysis. Crawls your website and stores information about your pages, scripts and stylesheets in a Neo4j graph database. (Can be extended to use any database.)

Installation

Install using Composer:

composer require randak/charlotte:dev-master

Depending on your composer settings, you may need to run composer require everyman/neo4jphp:dev-master before you can install Charlotte. If you get an error about that package not being available, this is the likely solution.

In addition to installing Charlotte, you'll also need Neo4j, whether it be on the same machine or another server.

Configuration

After installation, you will need to set up your configuration. Currently, there is an example config file in the examples folder. The config will look something like this:

    crawler:
        start: http://www.example.com
        exclude:
            - "/^javascript\:void\(0\)$/"
            - "/^#.*/"
            - "/^\\/$/"
            - "/\.(pdf|zip|zi|png|jpg|jpeg|doc|ppt)$/i"
    connections:
        Neo4j:
            host: localhost
            port: 7474

Set the start value to the homepage URL of the website you want to crawl.

The exclude patterns are regular expressions that match URLs you don't want to crawl. In the example above, they skip javascript:void(0) links, any URL that starts with a #, the bare root path /, and links to certain file types (PDF, ZIP, images, and so on).
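
How Charlotte applies these patterns internally is not shown in this README, but the idea can be sketched as follows. This is a minimal illustration, assuming each pattern is tested against a link's URL with preg_match() and that a single match is enough to skip it. isExcluded() is a hypothetical helper, not part of Charlotte's API. The patterns are written as PHP single-quoted strings, so the root-path pattern appears in its parsed form.

```php
<?php

// Exclude patterns from the example config, as PHP single-quoted strings.
$patterns = [
    '/^javascript\:void\(0\)$/',
    '/^#.*/',
    '/^\/$/',
    '/\.(pdf|zip|zi|png|jpg|jpeg|doc|ppt)$/i',
];

/**
 * Hypothetical helper (an assumption, not Charlotte's actual code):
 * returns true if $url matches at least one exclude pattern.
 */
function isExcluded(string $url, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $url) === 1) {
            return true;
        }
    }
    return false;
}

// Show the decision for a few sample URLs.
foreach (['#top', '/', '/docs/report.PDF', '/about'] as $url) {
    echo $url, ' => ', isExcluded($url, $patterns) ? 'skip' : 'crawl', PHP_EOL;
}
```

In this sketch only /about would be crawled. The file-type pattern carries the i modifier, so an uppercase extension such as .PDF is skipped as well.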

Usage

Charlotte is currently designed to be run from the command line only.

Create a file called crawl.php.

touch crawl.php

Insert the following:

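    <?php

    // Load the Composer autoloader; adjust the path to your project.
    require 'path/to/vendor/autoload.php';

    use Everyman\Neo4j\Client;
    use Charlotte\Charlotte;
    use Charlotte\Processor\Neo4jProcessor;
    use Symfony\Component\Yaml\Parser;

    // Parse the YAML config; adjust the path if config.yml lives elsewhere.
    $yaml = new Parser();
    $config = $yaml->parse(file_get_contents(__DIR__ . "/config.yml"));

    // Connect to Neo4j using the host and port from the config.
    $conn = $config["connections"]["Neo4j"];
    $client = new Client($conn["host"], $conn["port"]);

    // Configure the crawler and have it store its results in Neo4j.
    $charlotte = new Charlotte();
    $charlotte->setConfig($config);
    $charlotte->setProcessor(new Neo4jProcessor($client));

    // Start crawling.
    $charlotte->traverse();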

Make sure everything is set up in your config.yml and that your Neo4j database is running and reachable.
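
One way to run these checks before a crawl is sketched below. It is only an illustration under stated assumptions: findConfigProblems() and isPortOpen() are hypothetical helpers, not part of Charlotte, and the key names simply mirror the example config above. The first function inspects the parsed config array. The second attempts a plain TCP connection, which shows only that something is listening on the port, not that it is a healthy Neo4j instance.

```php
<?php

/**
 * Hypothetical helper (not part of Charlotte): returns a list of problems
 * found in the parsed config array. An empty list means it looks usable.
 */
function findConfigProblems(array $config): array
{
    $problems = [];

    $start = $config['crawler']['start'] ?? null;
    if (!is_string($start) || filter_var($start, FILTER_VALIDATE_URL) === false) {
        $problems[] = 'crawler.start must be a full URL';
    }

    foreach ((array) ($config['crawler']['exclude'] ?? []) as $pattern) {
        // preg_match() returns false when the pattern itself is invalid.
        if (!is_string($pattern) || @preg_match($pattern, '') === false) {
            $problems[] = 'invalid exclude pattern: ' . var_export($pattern, true);
        }
    }

    $conn = $config['connections']['Neo4j'] ?? [];
    if (empty($conn['host']) || empty($conn['port'])) {
        $problems[] = 'connections.Neo4j needs host and port';
    }

    return $problems;
}

/** Returns true if a TCP connection to $host:$port succeeds within $timeout seconds. */
function isPortOpen(string $host, int $port, float $timeout = 2.0): bool
{
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }
    fclose($socket);
    return true;
}
```

In crawl.php, both could be called right after the config is parsed, for example by printing each entry returned by findConfigProblems($config) and exiting if the list is not empty or if isPortOpen($conn["host"], (int) $conn["port"]) returns false.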

Run the script.

php crawl.php

Contributions

Contributions are welcome! If you'd like to contribute, please create an issue first.

Disclaimer

This crawler is intended to be used only on websites which are owned or operated by you. The developer(s) of this tool are not responsible for any use of this code that violates any laws or otherwise causes harm, and are not liable for any misuse.

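Statistics

  • Total downloads: 5
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 0
  • Clicks: 0
  • Dependent projects: 0
  • Recommendations: 0

GitHub information

  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Language: PHP

Other information

  • License: GPL-3.0
  • Last updated: 2014-03-21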
