jmajors/robotstxt

Composer install command:

composer require jmajors/robotstxt

Package description

A small package for parsing websites' robots.txt files

README

This is a small package to make parsing robots.txt rules easier. The URL matching follows the rules outlined by Google in their webmasters guide.

Quick example:

// basic usage
$robots  = new Robots\RobotsTxt();
$allowed = $robots->isAllowed("https://www.example.com/some/path"); // true
$allowed = $robots->isAllowed("https://www.another.com/example");   // false

Setup

Install via composer:

$ composer require jmajors/robotstxt

Make sure composer's autoloader is included in your project:

require __DIR__ . '/vendor/autoload.php';

That's it.

Usage

This package provides a single class for checking whether a crawler may visit a particular URL. Call isAllowed(string $url): it returns true if the URL's path is not matched by any Disallow rule in the site's robots.txt (you're free to crawl), and false if the path is disallowed (no crawling!). Here's an example:

<?php
use Robots\RobotsTxt;

$robotsTxt = new RobotsTxt();
$allowed = $robotsTxt->isAllowed("https://www.example.com/this/is/fine"); // returns true
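
To make the matching rules concrete, here is a minimal sketch of how a path can be tested against Disallow rules using the wildcard semantics from Google's guide: rules are case-sensitive prefixes, `*` matches any run of characters, and a trailing `$` anchors the end of the path. The function names pathMatchesRule and isPathDisallowed are hypothetical and are not part of this package; its internals may differ, and Allow rules are ignored here.

```php
<?php
// Hypothetical helper, not part of jmajors/robotstxt: tests one robots.txt
// rule against a URL path using Google's documented wildcard semantics.
function pathMatchesRule(string $path, string $rule): bool
{
    if ($rule === '') {
        return false; // an empty Disallow value matches nothing
    }
    // A trailing "$" anchors the rule to the end of the path.
    $anchored = substr($rule, -1) === '$';
    if ($anchored) {
        $rule = substr($rule, 0, -1);
    }
    // Quote each literal segment, then let "*" match any run of characters.
    $segments = array_map(
        function (string $segment): string {
            return preg_quote($segment, '#');
        },
        explode('*', $rule)
    );
    $regex = '#^' . implode('.*', $segments) . ($anchored ? '$' : '') . '#';

    return preg_match($regex, $path) === 1;
}

// Hypothetical helper: true if any Disallow rule matches the path.
function isPathDisallowed(string $path, array $disallowRules): bool
{
    foreach ($disallowRules as $rule) {
        if (pathMatchesRule($path, $rule)) {
            return true;
        }
    }
    return false;
}
```

With these helpers, a rule of /fish would match /fish.php but not /Fish.asp, and /*.php$ would match /index.php but not /index.php?x=1.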

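Additionally, setUserAgent($userAgent) sets the User-Agent header sent with the request. It returns the RobotsTxt instance, so calls can be chained. It does not appear to change which rules are checked: per the TODOs, matching rules by user agent is not implemented yet.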

$robotsTxt = new RobotsTxt();
$userAgent = 'RobotsTxtBot/1.0; (+https://github.com/jasonmajors/robotstxt)';
// set a user agent
$robotsTxt->setUserAgent($userAgent);
$allowed = $robotsTxt->isAllowed("https://www.example.com/not/sure/if/allowed");

// Alternatively...
$allowed = $robotsTxt->setUserAgent($userAgent)->isAllowed("https://www.someplace.com/a/path");

If for some reason there's no robots.txt file at the root of the domain, a MissingRobotsTxtException will be thrown.

<?php
// Typical usage
use Robots\RobotsTxt;
use Robots\Exceptions\MissingRobotsTxtException;
...

$robotsTxt = new RobotsTxt();
$userAgent = 'RobotsTxtBot/1.0; (+https://github.com/jasonmajors/robotstxt)';

try {
    $allowed = $robotsTxt->setUserAgent($userAgent)->isAllowed("https://www.example.com/some/path");
} catch (MissingRobotsTxtException $e) {
    $error = $e->getMessage();
    // Handle the error
}
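
The exception above depends on robots.txt living at the root of the host. As a rough sketch of how that location can be derived from any page URL, the hypothetical helper below (robotsTxtUrl is an assumed name, not part of this package) keeps the scheme, host, and any explicit port, and drops the path and query.

```php
<?php
// Hypothetical helper, not part of jmajors/robotstxt: derives the root
// robots.txt URL for the host that serves $url.
function robotsTxtUrl(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: {$url}");
    }
    // Keep an explicit port if one was given; path and query are dropped.
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    return $parts['scheme'] . '://' . $parts['host'] . $port . '/robots.txt';
}
```

Relative URLs are rejected because there is no host to fetch robots.txt from.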

Further, getDisallowed will return an array of the disallowed paths for User-Agent: *:

$robots     = new RobotsTxt();
$disallowed = $robots->getDisallowed("https://www.example.com");
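
For illustration, the sketch below shows one way the Disallow paths for User-Agent: * could be collected from the text of a robots.txt file. parseWildcardDisallows is a hypothetical function written for this example, not the package's implementation; it assumes consecutive User-agent lines share one group of rules and that empty Disallow values are skipped.

```php
<?php
// Hypothetical helper, not the package's implementation: collects the
// Disallow paths from the "User-agent: *" group of a robots.txt body.
function parseWildcardDisallows(string $robotsTxt): array
{
    $disallowed   = [];
    $inWildcard   = false; // currently inside a group that includes "*"
    $lastWasAgent = false; // previous directive was a User-agent line

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Drop comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines share one group of rules.
            if (!$lastWasAgent) {
                $inWildcard = false;
            }
            if ($value === '*') {
                $inWildcard = true;
            }
            $lastWasAgent = true;
            continue;
        }
        $lastWasAgent = false;
        if ($inWildcard && $field === 'disallow' && $value !== '') {
            $disallowed[] = $value;
        }
    }
    return $disallowed;
}
```

Rules that belong only to other user agents are left out of the result, which mirrors the User-Agent: * behavior described above.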

TODOs

  • Add ability to check disallowed paths based on user agent
  • Return a list of user agents in the file

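Statistics

  • Total downloads: 23
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 3
  • Clicks: 0
  • Dependent projects: 0
  • Recommendations: 0

GitHub information

  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Language: PHP

Other information

  • License: MIT
  • Last updated: 2017-01-30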
