Crawl API

Diffbot can crawl entire domains and process all crawled pages. For the difference between crawling and processing, see here.

To programmatically create or update crawljobs, use this API.

A full tutorial on using this API can be found here, and a working app powered by it at http://search.sitepoint.tools.

The Crawl API is also known as Crawlbot.

Crawl API Class

class Swader\Diffbot\Api\Crawl

The Crawl API is used to create new crawljobs or modify existing ones. Unlike the more entity-specific APIs, the Crawl API is atypical and does not extend Swader\Diffbot\Abstracts\Api.

Note that everything you can do with the Crawl API can also be done in the Diffbot UI.

Swader\Diffbot\Api\Crawl::__construct($name = null, $api = null)
Parameters:
  • $name (string) – [Optional] The name of crawljob to be created or modified.
  • $api (Swader\Diffbot\Interfaces\Api) – [Optional] The API to use while processing the crawled links.

The $name argument is optional. If omitted, the second argument is ignored and Swader\Diffbot\Api\Crawl::call will return a list of all crawljobs on the given Diffbot token, with their information, as a Swader\Diffbot\Entity\EntityIterator collection of Swader\Diffbot\Entity\JobCrawl instances.
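For example, listing all crawljobs on the current token might look like this (a minimal sketch; the accessor used is one of the JobCrawl accessors documented further below):

// ... set up Diffbot

// No name given: call() returns an EntityIterator of JobCrawl entities,
// one per crawljob registered under the current token
$jobs = $diffbot->crawl()->call();

foreach ($jobs as $job) {
    // each $job is a Swader\Diffbot\Entity\JobCrawl instance
    echo $job->getMaxToCrawl(), "\n";
}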

The $api argument is also optional, but must be an instance of Swader\Diffbot\Interfaces\Api if provided:

<?php

// ... set up Diffbot

$api = $diffbot->createArticleApi('crawl');
$crawljob = $diffbot->crawl('myCrawlJob', $api);

// ... crawljob setup
// $crawljob->setSeeds( ... )

$crawljob->call();

Swader\Diffbot\Api\Crawl::getName()
Returns: string

Returns the unique name of the crawljob. This name is later used to download datasets, or to modify the job.

Swader\Diffbot\Api\Crawl::setApi($api)
Parameters:
  • $api (Swader\Diffbot\Interfaces\Api) – The API with which the crawled pages should be processed
Returns: $this

The API cannot be modified after a crawljob has been created, so calling this method on an existing crawljob has no effect (see https://www.diffbot.com/dev/docs/crawl/api.jsp).

The $api passed into this class will be used on Diffbot’s end to process all the pages the crawljob provides. For example, if you set http://sitepoint.com as the seed URL (see Swader\Diffbot\Api\Crawl::setSeeds), and an instance of the Swader\Diffbot\Api\Article API as the $api argument, all pages found on http://sitepoint.com will be processed with the Article API. The results won’t be returned - rather, they’ll be saved on Diffbot’s servers for searching later (see Swader\Diffbot\Api\Search).

The other APIs require a URL parameter in their constructor, but when crawling, it is Crawlbot that provides the URLs. To get around this requirement, use the string “crawl” instead of a URL when instantiating a new API for use with the Crawl API:

// ...
$api = $diffbot->createArticleApi('crawl');
// ...
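Putting the two together, attaching an API to a crawljob before creating it might look like this (a minimal sketch; the job name is illustrative):

// ...
$api = $diffbot->createArticleApi('crawl');

$crawljob = $diffbot->crawl('myCrawlJob');
$crawljob->setApi($api);
// ... further crawljob setup, then $crawljob->call()
// ...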

Swader\Diffbot\Api\Crawl::setSeeds(array $seeds)
Parameters:
  • $seeds (array) – An array of URLs (seeds) to crawl for matching links
Returns: $this

By default, Crawlbot will restrict spidering to the entire domain of each seed URL (a seed of “http://blog.diffbot.com” will also include URLs at “http://www.diffbot.com”):

// ...
$crawljob->setSeeds(['http://sitepoint.com', 'http://blog.diffbot.com']);
// ...

Swader\Diffbot\Api\Crawl::setUrlCrawlPatterns(array $pattern = null)
Parameters:
  • $pattern (array) – [Optional] Array of strings to limit pages crawled to those whose URLs contain any of the content strings.
Returns: $this

You can use the exclamation point to specify a negative string, e.g. !product to exclude URLs containing the string “product,” and the ^ and $ characters to limit matches to the beginning or end of the URL.

The use of a urlCrawlPattern will allow Crawlbot to spider outside of the seed domain(s); it will follow all matching URLs regardless of domain:

// ...
$crawljob->setUrlCrawlPatterns(['!author', '!page']);
// ...

Swader\Diffbot\Api\Crawl::setUrlCrawlRegex($regex)
Parameters:
  • $regex (string) – A regular expression string
Returns: $this

Specify a regular expression to limit pages crawled to those URLs that match your expression. This will override any urlCrawlPattern value.

The use of a urlCrawlRegEx will allow Crawlbot to spider outside of the seed domain; it will follow all matching URLs regardless of domain.
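For instance, a sketch limiting the crawl to blog URLs on a single domain (the expression is illustrative):

// ...
$crawljob->setUrlCrawlRegex('^http://www\.sitepoint\.com/blog/');
// ...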

Swader\Diffbot\Api\Crawl::setUrlProcessPatterns(array $pattern = null)
Parameters:
  • $pattern (array) – [Optional] Array of strings to search for in URLs
Returns: $this

Only URLs containing one or more of the strings specified will be processed by Diffbot. You can use the exclamation point to specify a negative string, e.g. !/category to exclude URLs containing the string “/category,” and the ^ and $ characters to limit matches to the beginning or end of the URL.
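A sketch processing only article URLs while skipping category listings (the patterns are illustrative):

// ...
$crawljob->setUrlProcessPatterns(['/article/', '!/category']);
// ...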

Swader\Diffbot\Api\Crawl::setUrlProcessRegex($regex)
Parameters:
  • $regex (string) – A regular expression string
Returns: $this

Specify a regular expression to limit pages processed to those URLs that match your expression. This will override any urlProcessPattern value.
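As with the crawl regex above, a sketch with an illustrative expression:

// ...
// Only process URLs whose path ends in a numeric ID
$crawljob->setUrlProcessRegex('/[0-9]+$');
// ...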

Swader\Diffbot\Api\Crawl::setPageProcessPatterns(array $pattern = null)
Parameters:
  • $pattern (array) – [Optional] Array of strings to look for in page HTML
Returns: $this

Specify strings to look for in the HTML of the pages of the crawled URLs. Only pages containing one or more of those strings will be processed by the designated API. Very useful for limiting processing to pages with a certain class present (e.g. class=article) to further narrow down processing scope and reduce expenses (fewer API calls).
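For example, to process only pages whose HTML contains that class (the value is illustrative):

// ...
$crawljob->setPageProcessPatterns(['class=article']);
// ...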

Swader\Diffbot\Api\Crawl::setMaxHops($input = -1)
Parameters:
  • $input (int) – [Optional] Maximum number of hops
Returns: $this

Specify the depth of your crawl. A maxHops=0 will limit processing to the seed URL(s) only – no other links will be processed; maxHops=1 will process all (otherwise matching) pages whose links appear on seed URL(s); maxHops=2 will process pages whose links appear on those pages; and so on. By default, Crawlbot will crawl and process links at any depth.
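For example, to process only the seed URLs and the pages they link to directly:

// ...
$crawljob->setMaxHops(1);
// ...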

Swader\Diffbot\Api\Crawl::setMaxToCrawl($input = 100000)
Parameters:
  • $input (int) – [Optional] Maximum number of URLs to spider
Returns: $this

Note that spidering (crawling) does not affect the API quota; reducing this limit only affects how long a crawljob runs (the job finishes sooner if the limit is reached sooner). For the difference between crawling and processing, see here.

Swader\Diffbot\Api\Crawl::setMaxToProcess($input = 100000)
Parameters:
  • $input (int) – [Optional] Maximum number of URLs to process
Returns: $this

Useful for limiting the number of API calls made, thus limiting expenses. For the difference between crawling and processing, see here.
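A sketch combining both limits (the numbers are illustrative); since both setters return $this, the calls can be chained:

// ...
$crawljob
    ->setMaxToCrawl(50000)
    ->setMaxToProcess(10000);
// ...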

Swader\Diffbot\Api\Crawl::notify($string)
Parameters:
  • $string (string) – An email address or a URL
Returns: $this
Throws: InvalidArgumentException if the input is neither a valid email address nor a valid URL

If the input is an email address, a message will be sent to that address when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes.

If the input is a URL, that URL will receive a POST request with X-Crawl-Name and X-Crawl-Status in the headers, and the full JSON response in the POST body.

This method can be called once with an email and another time with a URL in order to define both an email notification hook and a URL notification hook. An InvalidArgumentException will be thrown if the argument isn’t a valid string (neither a URL nor an email address).
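A sketch registering both notification hooks (the address and URL are illustrative):

// ...
$crawljob
    ->notify('me@example.com')
    ->notify('http://example.com/diffbot-hook');
// ...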

Swader\Diffbot\Api\Crawl::setCrawlDelay($input = 0.25)
Parameters:
  • $input (float) – [Optional] Delay between individual URL fetches, in floating point seconds. Defaults to 0.25 seconds.
Returns: $this
Throws: InvalidArgumentException if the input parameter is not a number

Wait this many seconds between each URL crawled from a single IP address. Specify the number of seconds as an integer or floating-point number.
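For example, to wait half a second between fetches:

// ...
$crawljob->setCrawlDelay(0.5);
// ...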

Swader\Diffbot\Api\Crawl::setRepeat($input)
Parameters:
  • $input (float) – The wait period between crawljob restarts, expressed in floating point days. E.g. 0.5 is 12 hours, 7 is a week, 14.5 is 2 weeks and 12 hours, etc. By default, crawls will not be repeated.
Returns: $this
Throws: InvalidArgumentException if the input parameter is not a number

Swader\Diffbot\Api\Crawl::setOnlyProcessIfNew($int = 1)
Parameters:
  • $int (int) – [Optional] A boolean flag represented as an integer
Returns: $this

By default repeat crawls will only process new (previously unprocessed) pages. Set to 0 to process all content on repeat crawls.

Swader\Diffbot\Api\Crawl::setMaxRounds($input = 0)
Parameters:
  • $input (int) – [Optional] Maximum number of crawl repeats; 0 (default) repeats indefinitely
Returns: $this

Specify the maximum number of crawl repeats. By default (maxRounds=0) repeating crawls will continue indefinitely.
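A sketch combining the repeat-related settings (the values are illustrative; it assumes these setters return $this for chaining, like the other setters):

// ...
$crawljob
    ->setRepeat(7)            // repeat the crawl every 7 days
    ->setOnlyProcessIfNew(1)  // only process previously unprocessed pages
    ->setMaxRounds(10);       // stop after 10 rounds
// ...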

Swader\Diffbot\Api\Crawl::setObeyRobots($bool = true)
Parameters:
  • $bool (bool) – [Optional] Either true or false
Returns: $this

If set to false, the crawljob will ignore robots.txt directives.

Swader\Diffbot\Api\Crawl::roundStart($commit = true)
Parameters:
  • $commit (bool) – [Optional] Either true or false
Returns: $this | Swader\Diffbot\Entity\EntityIterator

Force the start of a new crawl “round” (manually repeat the crawl). If onlyProcessIfNew is set to 1 (default), only newly-created pages will be processed. The method returns the result of the call if $commit is truthy, or the current instance of the API class if a falsy value is passed in.

Swader\Diffbot\Api\Crawl::pause($commit = true)
Parameters:
  • $commit (bool) – [Optional] Either true or false
Returns: $this | Swader\Diffbot\Entity\EntityIterator

Pause a crawljob. The method returns the result of the call if $commit is truthy, or the current instance of the API class if a falsy value is passed in.

Swader\Diffbot\Api\Crawl::unpause($commit = true)
Parameters:
  • $commit (bool) – [Optional] Either true or false
Returns: $this | Swader\Diffbot\Entity\EntityIterator

Unpause a crawljob. The method returns the result of the call if $commit is truthy, or the current instance of the API class if a falsy value is passed in.
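A sketch pausing and later resuming an existing crawljob (the job name is illustrative):

// ...
$diffbot->crawl('myCrawlJob')->pause();
// ... some time later
$diffbot->crawl('myCrawlJob')->unpause();
// ...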

Swader\Diffbot\Api\Crawl::restart($commit = true)
Parameters:
  • $commit (bool) – [Optional] Either true or false
Returns: $this | Swader\Diffbot\Entity\EntityIterator

Restart a crawljob. The method returns the result of the call if $commit is truthy, or the current instance of the API class if a falsy value is passed in.

Swader\Diffbot\Api\Crawl::delete($commit = true)
Parameters:
  • $commit (bool) – [Optional] Either true or false
Returns: $this | Swader\Diffbot\Entity\EntityIterator

Delete a crawljob. The method returns the result of the call if $commit is truthy, or the current instance of the API class if a falsy value is passed in.

Swader\Diffbot\Api\Crawl::buildUrl()
Returns: string

This method is called automatically when Swader\Diffbot\Api\Crawl::call is called. It builds the URL which will be requested by the HTTP client set via Swader\Diffbot\Diffbot::setHttpClient, and returns it. The method can also be used to get the URL for testing purposes in third party API clients like Postman.

Usage:

// ... set up API
$myUrl = $api->buildUrl();

Swader\Diffbot\Api\Crawl::call()
Returns: Swader\Diffbot\Entity\EntityIterator

When the API instance has been fully configured, this method executes the call. If all went well, it will return a collection of Swader\Diffbot\Entity\JobCrawl objects, each with information about a job under the current Diffbot token. How many are returned depends on the action that was performed (see below).

JobCrawl Class

The JobCrawl class is a container of information about a crawljob. If a crawljob is created with the Crawl API, the Crawl API returns a single instance of JobCrawl with information about the created job. If the Crawl API is called without settings, it returns all of the token’s crawljobs, each in a separate instance. If a crawljob is deleted, restarted, paused, etc., only the instance pertaining to that crawljob is returned.
class Swader\Diffbot\Entity\JobCrawl

Swader\Diffbot\Entity\JobCrawl::getMaxToCrawl()
Returns: int

Maximum number of pages to crawl with this crawljob

Swader\Diffbot\Entity\JobCrawl::getMaxToProcess()
Returns: int

Maximum number of pages to process with this crawljob

Swader\Diffbot\Entity\JobCrawl::getOnlyProcessIfNew()
Returns: bool

Whether or not the job was set to only process newly found links, ignoring old but potentially updated ones

Swader\Diffbot\Entity\JobCrawl::getSeeds()
Returns: array

Seeds as given to the crawljob on creation. Returned as an array, suitable for direct insertion into a new crawljob via Swader\Diffbot\Api\Crawl::setSeeds
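For example, assuming $existingJob is a JobCrawl entity fetched earlier, its seeds can be reused for a new, illustratively named crawljob:

// ...
$newJob = $diffbot->crawl('newJob', $diffbot->createArticleApi('crawl'));
$newJob->setSeeds($existingJob->getSeeds());
$newJob->call();
// ...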