Crawl API¶
Diffbot can crawl entire domains and process all crawled pages. For the difference between crawling and processing, see here.
To programmatically create or update crawljobs, use this API.
A full tutorial on using this API can be found here, and a working app powered by it at http://search.sitepoint.tools.
The Crawl API is also known as the Crawlbot.
Crawl API Class¶
- class Swader\Diffbot\Api\Crawl¶
The Crawl API is used to create new crawljobs or modify existing ones. It is atypical, and as such does not extend Swader\Diffbot\Abstracts\Api, unlike the more entity-specific APIs.
Note that everything you can do with the Crawl API can also be done in the Diffbot UI.
__construct¶
Swader\Diffbot\Api\Crawl::__construct($name = null, $api = null)¶
Parameters:
- $name (string) – [Optional] The name of the crawljob to be created or modified.
- $api (Swader\Diffbot\Interfaces\Api) – [Optional] The API to use while processing the crawled links.
The $name argument is optional. If omitted, the second argument is ignored and Swader\Diffbot\Api\Crawl::call will return a list of all crawljobs on a given Diffbot token, with their information, in a Swader\Diffbot\Entity\EntityIterator collection of Swader\Diffbot\Entity\JobCrawl instances.
The $api argument is also optional, but must be an instance of Swader\Diffbot\Interfaces\Api if provided:
<?php
// ... set up Diffbot
$api = $diffbot->createArticleApi('crawl');
$crawljob = $diffbot->crawl('myCrawlJob', $api);
// ... crawljob setup
// $crawljob->setSeeds( ... )
$crawljob->call();
getName¶
Swader\Diffbot\Api\Crawl::getName()¶
Returns: string – The unique name of the crawljob. This name is later used to download datasets, or to modify the job.
setApi¶
Swader\Diffbot\Api\Crawl::setApi($api)¶
Parameters:
- $api (Swader\Diffbot\Interfaces\Api) – An instance of Swader\Diffbot\Interfaces\Api to process all crawled links.
Returns: $this
The API cannot be modified after a crawljob has been created, so this method has no effect on existing crawljobs (see https://www.diffbot.com/dev/docs/crawl/api.jsp).
The $api passed into this class will be used on Diffbot's end to process all the pages the crawljob provides. For example, if you set http://sitepoint.com as the seed URL (see Swader\Diffbot\Api\Crawl::setSeeds) and an instance of the Swader\Diffbot\Api\Article API as the $api argument, all pages found on http://sitepoint.com will be processed with the Article API. The results won't be returned - rather, they'll be saved on Diffbot's servers for searching later (see Swader\Diffbot\Api\Search).
The other APIs require a URL parameter in their constructor, but when crawling, it is Crawlbot that provides the URLs. To get around this requirement, use the string "crawl" instead of a URL when instantiating a new API for use with the Crawl API:
// ...
$api = $diffbot->createArticleApi('crawl');
// ...
setSeeds¶
Swader\Diffbot\Api\Crawl::setSeeds(array $seeds)¶
Parameters:
- $seeds (array) – An array of URLs (seeds) to crawl for matching links
Returns: $this
By default, Crawlbot will restrict spidering to the entire seed domain ("http://blog.diffbot.com" will include URLs at "http://www.diffbot.com"):
// ...
$crawljob->setSeeds(['http://sitepoint.com', 'http://blog.diffbot.com']);
// ...
setUrlCrawlPatterns¶
Swader\Diffbot\Api\Crawl::setUrlCrawlPatterns(array $pattern = null)¶
Parameters:
- $pattern (array) – [Optional] Array of strings limiting crawled pages to those whose URLs contain any of the given strings.
Returns: $this
You can use the exclamation point to specify a negative string, e.g. !product to exclude URLs containing the string "product", and the ^ and $ characters to limit matches to the beginning or end of the URL.
Using a urlCrawlPattern allows Crawlbot to spider outside of the seed domain(s); it will follow all matching URLs regardless of domain:
// ...
$crawljob->setUrlCrawlPatterns(['!author', '!page']);
// ...
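To make the pattern rules concrete, here is a hypothetical helper (not part of the Diffbot library, and not Diffbot's actual implementation) that approximates the documented semantics: plain substring matching, ! for exclusion, and ^/$ to anchor a match to the start or end of the URL:

```php
<?php
// Approximation of urlCrawlPatterns semantics for illustration only.
function urlMatchesPatterns(string $url, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        $negative = str_starts_with($pattern, '!');
        if ($negative) {
            $pattern = substr($pattern, 1);
        }
        if (str_starts_with($pattern, '^')) {
            $found = str_starts_with($url, substr($pattern, 1));
        } elseif (str_ends_with($pattern, '$')) {
            $found = str_ends_with($url, substr($pattern, 0, -1));
        } else {
            $found = str_contains($url, $pattern);
        }
        if ($negative && $found) {
            return false; // URL contains an excluded string
        }
        if (!$negative && $found) {
            return true;  // URL contains a required string
        }
    }
    // If only negative patterns were given, a URL passing them all matches.
    return !array_filter($patterns, fn($p) => !str_starts_with($p, '!'));
}
```

With the example above, ['!author', '!page'] would reject any URL containing "author" or "page" and follow everything else.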
setUrlCrawlRegex¶
Swader\Diffbot\Api\Crawl::setUrlCrawlRegex($regex)¶
Parameters:
- $regex (string) – A regular expression string
Returns: $this
Specify a regular expression to limit pages crawled to those URLs that match your expression. This will override any urlCrawlPattern value.
The use of a urlCrawlRegEx will allow Crawlbot to spider outside of the seed domain; it will follow all matching URLs regardless of domain.
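As a self-contained sketch of the effect (the candidate URLs and pattern below are invented for the example; Diffbot applies the expression server-side), a urlCrawlRegEx narrows the followed links to those matching the expression:

```php
<?php
// Hypothetical candidate links discovered while spidering.
$candidates = [
    'http://example.com/2015/05/some-article',
    'http://example.com/author/jdoe',
    'http://example.com/2014/11/another-article',
];

// Follow only date-based article URLs, e.g. /2015/05/...
$regex = '~/\d{4}/\d{2}/~';

$followed = array_values(array_filter(
    $candidates,
    fn($url) => preg_match($regex, $url) === 1
));
// $followed now contains only the two date-based article URLs.
```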
setUrlProcessPatterns¶
Swader\Diffbot\Api\Crawl::setUrlProcessPatterns(array $pattern = null)¶
Parameters:
- $pattern (array) – [Optional] Array of strings to search for in URLs
Returns: $this
Only URLs containing one or more of the strings specified will be processed by Diffbot. You can use the exclamation point to specify a negative string, e.g. !/category to exclude URLs containing the string “/category,” and the ^ and $ characters to limit matches to the beginning or end of the URL.
setUrlProcessRegex¶
Swader\Diffbot\Api\Crawl::setUrlProcessRegex($regex)¶
Parameters:
- $regex (string) – A regular expression string
Returns: $this
Specify a regular expression to limit pages processed to those URLs that match your expression. This will override any urlProcessPattern value.
setPageProcessPatterns¶
Swader\Diffbot\Api\Crawl::setPageProcessPatterns(array $pattern = null)¶
Parameters:
- $pattern (array) – [Optional] Array of strings
Returns: $this
Specify strings to look for in the HTML of the crawled pages. Only pages containing one or more of those strings will be processed by the designated API. This is very useful for limiting processing to pages with a certain class present (e.g. class=article) to further narrow down the processing scope and reduce expenses (fewer API calls).
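The check this setting performs can be sketched as a simple substring test over the page HTML (an illustrative approximation, not the library's or Diffbot's actual code; the HTML snippets are made up):

```php
<?php
// A page qualifies for processing if its HTML contains at least
// one of the configured pageProcessPatterns strings.
function pageShouldBeProcessed(string $html, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (str_contains($html, $pattern)) {
            return true;
        }
    }
    return false;
}
```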
setMaxHops¶
Swader\Diffbot\Api\Crawl::setMaxHops($input = -1)¶
Parameters:
- $input (int) – [Optional] Maximum number of hops
Returns: $this
Specify the depth of your crawl. A maxHops=0 will limit processing to the seed URL(s) only – no other links will be processed; maxHops=1 will process all (otherwise matching) pages whose links appear on seed URL(s); maxHops=2 will process pages whose links appear on those pages; and so on. By default, Crawlbot will crawl and process links at any depth.
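The hop limit behaves like a depth bound on a breadth-first crawl. The following minimal simulation over a hypothetical in-memory link graph (nothing to do with the library's internals) shows how maxHops = 0 keeps only the seeds, maxHops = 1 adds pages linked from the seeds, and -1 means unlimited depth:

```php
<?php
// Simulate a depth-limited crawl over a link graph: url => [linked urls].
function crawlToDepth(array $graph, array $seeds, int $maxHops): array
{
    $visited = [];
    $queue = [];
    foreach ($seeds as $seed) {
        $queue[] = [$seed, 0];
        $visited[$seed] = true;
    }
    while ($queue) {
        [$url, $depth] = array_shift($queue);
        if ($maxHops >= 0 && $depth >= $maxHops) {
            continue; // don't follow links beyond the hop limit
        }
        foreach ($graph[$url] ?? [] as $link) {
            if (!isset($visited[$link])) {
                $visited[$link] = true;
                $queue[] = [$link, $depth + 1];
            }
        }
    }
    return array_keys($visited);
}

$graph = ['a' => ['b'], 'b' => ['c'], 'c' => []];
```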
setMaxToCrawl¶
Swader\Diffbot\Api\Crawl::setMaxToCrawl($input = 100000)¶
Parameters:
- $input (int) – [Optional] Maximum number of URLs to spider
Returns: $this
Note that spidering (crawling) does not affect the API quota; reducing this limit only shortens the crawljob (it finishes sooner once the limit is reached). For the difference between crawling and processing, see here.
setMaxToProcess¶
Swader\Diffbot\Api\Crawl::setMaxToProcess($input)¶
Parameters:
- $input (int) – Maximum number of pages to process
Returns: $this
Limits the number of crawled pages that will be processed by the designated API.
notify¶
Swader\Diffbot\Api\Crawl::notify($string)¶
Parameters:
- $string (string) – Email or URL
Returns: $this
Throws: InvalidArgumentException if the input parameter is neither a valid email address nor a valid URL
If the input is an email address, a message is sent to that address when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes.
If the input is a URL, you will receive a POST with X-Crawl-Name and X-Crawl-Status in the headers, and the full JSON response in the POST body.
This method can be called once with an email and another time with a URL in order to define both an email notification hook and a URL notification hook.
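The email-vs-URL distinction the method has to make can be sketched with PHP's built-in validators (this mirrors the documented behavior but is not the library's actual implementation):

```php
<?php
// Classify a notify() argument as an email hook, a webhook, or invalid.
function notificationKind(string $input): string
{
    if (filter_var($input, FILTER_VALIDATE_EMAIL) !== false) {
        return 'email'; // message sent when limits are hit or the crawl completes
    }
    if (filter_var($input, FILTER_VALIDATE_URL) !== false) {
        return 'webhook'; // POST with X-Crawl-Name / X-Crawl-Status headers
    }
    throw new InvalidArgumentException('Not a valid email address or URL');
}
```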
setCrawlDelay¶
Swader\Diffbot\Api\Crawl::setCrawlDelay($input = 0.25)¶
Parameters:
- $input (float) – [Optional] Delay between individual URL crawls, in floating-point seconds. Defaults to 0.25 seconds.
Returns: $this
Throws: InvalidArgumentException if the input parameter is not a number
Wait this many seconds between each URL crawled from a single IP address. Specify the number of seconds as an integer or floating-point number.
setRepeat¶
Swader\Diffbot\Api\Crawl::setRepeat($input)¶
Parameters:
- $input (float) – The wait period between crawljob restarts, expressed in floating-point days. E.g. 0.5 is 12 hours, 7 is a week, 14.5 is 2 weeks and 12 hours, etc. By default, crawls will not be repeated.
Returns: $this
Throws: InvalidArgumentException if the input parameter is not a number
setOnlyProcessIfNew¶
Swader\Diffbot\Api\Crawl::setOnlyProcessIfNew($int = 1)¶
Parameters:
- $int (int) – [Optional] A boolean flag represented as an integer
Returns: $this
By default repeat crawls will only process new (previously unprocessed) pages. Set to 0 to process all content on repeat crawls.
setMaxRounds¶
Swader\Diffbot\Api\Crawl::setMaxRounds($input = 0)¶
Parameters:
- $input (int) – [Optional] Maximum number of crawl repeats. Defaults to 0 (no limit).
Returns: $this
Specify the maximum number of crawl repeats. By default (maxRounds=0) repeating crawls will continue indefinitely.
setObeyRobots¶
Swader\Diffbot\Api\Crawl::setObeyRobots($bool = true)¶
Parameters:
- $bool (bool) – [Optional] Either true or false
Returns: $this
Ignores robots.txt directives if set to false.
roundStart¶
Swader\Diffbot\Api\Crawl::roundStart($commit = true)¶
Parameters:
- $commit (bool) – [Optional] Either true or false
Returns: $this, or the result of the call
Forces the start of a new crawl "round" (manually repeats the crawl). If onlyProcessIfNew is set to 1 (the default), only newly-created pages will be processed. The method returns the result of the call if $commit is truthy, or the current instance of the API class if it is falsy.
pause¶
Swader\Diffbot\Api\Crawl::pause($commit = true)¶
Parameters:
- $commit (bool) – [Optional] Either true or false
Returns: $this, or the result of the call
Pauses a crawljob. The method returns the result of the call if $commit is truthy, or the current instance of the API class if it is falsy.
unpause¶
Swader\Diffbot\Api\Crawl::unpause($commit = true)¶
Parameters:
- $commit (bool) – [Optional] Either true or false
Returns: $this, or the result of the call
Unpauses a crawljob. The method returns the result of the call if $commit is truthy, or the current instance of the API class if it is falsy.
restart¶
Swader\Diffbot\Api\Crawl::restart($commit = true)¶
Parameters:
- $commit (bool) – [Optional] Either true or false
Returns: $this, or the result of the call
Restarts a crawljob. The method returns the result of the call if $commit is truthy, or the current instance of the API class if it is falsy.
delete¶
Swader\Diffbot\Api\Crawl::delete($commit = true)¶
Parameters:
- $commit (bool) – [Optional] Either true or false
Returns: $this, or the result of the call
Deletes a crawljob. The method returns the result of the call if $commit is truthy, or the current instance of the API class if it is falsy.
buildUrl¶
Swader\Diffbot\Api\Crawl::buildUrl()¶
Returns: string
This method is called automatically when Swader\Diffbot\Abstracts\Api::call is called. It builds the URL which will be called by the HTTP client set via Swader\Diffbot\Diffbot::setHttpClient, and returns it. This method can be used to get the URL for the purposes of testing in third-party API clients like Postman.
Usage:
// ... set up API
$myUrl = $api->buildUrl();
call¶
Swader\Diffbot\Api\Crawl::call()¶
Returns: Swader\Diffbot\Entity\EntityIterator
When the API instance has been fully configured, this method executes the call. If all went well, it will return a collection of Swader\Diffbot\Entity\JobCrawl objects, each with information about a job under the current Diffbot token. How many are returned depends on the action that was performed - see below.
JobCrawl Class¶
The JobCrawl class is a container of information about a crawljob. If a crawljob is created with the Crawl API, the Crawl API returns a single instance of JobCrawl with information about the created job. If the Crawl API is called without settings, it returns all of the token's crawljobs, each in a separate instance. If a crawljob is deleted, restarted, paused, etc., only the instance pertaining to that crawljob is returned.
- class Swader\Diffbot\Entity\JobCrawl¶
getMaxToCrawl¶
Swader\Diffbot\Entity\JobCrawl::getMaxToCrawl()¶
Returns: int – Maximum number of pages to crawl with this crawljob
getMaxToProcess¶
Swader\Diffbot\Entity\JobCrawl::getMaxToProcess()¶
Returns: int – Maximum number of pages to process with this crawljob
getOnlyProcessIfNew¶
Swader\Diffbot\Entity\JobCrawl::getOnlyProcessIfNew()¶
Returns: bool – Whether the job was set to only process newly found links, ignoring old but potentially updated ones
getSeeds¶
Swader\Diffbot\Entity\JobCrawl::getSeeds()¶
Returns: array – Seeds as given to the crawljob on creation. Returned as an array, suitable for direct insertion into a new crawljob via Swader\Diffbot\Api\Crawl::setSeeds