Diffbot Class

The Diffbot class is the first object a developer must create when using the client. It serves as a container for global settings and as a factory for the various API endpoint classes.

class Swader\Diffbot\Diffbot

The Diffbot class takes a single optional argument, the $token, which can be obtained here. Instantiate like so:

$diffbot = new Diffbot("my_token");

Alternatively, set the token globally, and instantiate without passing in the parameter:

Diffbot::setToken("my_token");
$diffbot = new Diffbot();

Note that if no global token is set and you don’t pass one in while instantiating either, a Swader\Diffbot\Exceptions\DiffbotException is thrown.
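The failure case described above can be handled with an ordinary try/catch. The following is a minimal sketch, assuming the client was installed through Composer (so that vendor/autoload.php exists) and that no global token was set earlier in the same process; the $caught flag is only there for illustration.

```php
<?php
// Assumption: the client was installed with Composer, so the autoloader lives in vendor/.
require 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;
use Swader\Diffbot\Exceptions\DiffbotException;

$caught = false; // illustrative flag, not part of the library

try {
    // No global token has been set and none is passed in here.
    $diffbot = new Diffbot();
} catch (DiffbotException $e) {
    $caught = true;
    echo 'Could not create the Diffbot client: ', $e->getMessage(), PHP_EOL;
}
```

Catching the exception at the point of construction keeps a missing token from surfacing later as a harder-to-trace failure.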

static Swader\Diffbot\Diffbot::setToken($token)
Parameters:
  • $token (string) – The token.
Returns:

void, or throws an \InvalidArgumentException if the token is invalid

Useful for setting a default token for all future instances.

Usage:

Diffbot::setToken("my_token");

Swader\Diffbot\Diffbot::getToken()
Returns: null or string

Returns either the instance token or the globally defined one, or null if neither is defined.

Usage:

echo $diffbot->getToken(); // "my_token"
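The interplay between the global and the instance token can be sketched as follows. This example assumes a Composer install (vendor/autoload.php) and assumes that an instance token takes precedence over the global one, which the order of the description suggests but does not state outright. The token strings are placeholders.

```php
<?php
// Assumption: the client was installed with Composer, so the autoloader lives in vendor/.
require 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

// Global default, picked up by instances created without their own token.
Diffbot::setToken('global_token');

$usesGlobal = new Diffbot();                 // falls back to the global token
$usesOwn    = new Diffbot('instance_token'); // carries its own token

echo $usesGlobal->getToken(), PHP_EOL; // the globally defined token
echo $usesOwn->getToken(), PHP_EOL;    // assumed: the instance token wins over the global one
```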

Swader\Diffbot\Diffbot::setHttpClient(GuzzleHttp\Client $client)
Parameters:
  • $client (GuzzleHttp\Client) – The HTTP client.
Returns:

$this

Allows changing the HTTP client used to send requests to the Diffbot API. Generally useful only during testing, though some edge cases may arise. This method does not need to be called for Diffbot to be usable: by default, a new instance of the regular GuzzleHttp\Client is used.

Usage:

$client = new GuzzleHttp\Client();
$diffbot->setHttpClient($client);

Swader\Diffbot\Diffbot::getHttpClient()

Returns the currently set HTTP client. Can be changed via Swader\Diffbot\Diffbot::setHttpClient.

Returns: GuzzleHttp\Client
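In a test setup, the setter and getter can be combined to confirm that a specific client instance is in use. This is a minimal sketch, assuming a Composer install (vendor/autoload.php) and the Guzzle-based version of the client documented here; it relies on setHttpClient returning $this, as stated above, to chain the call.

```php
<?php
// Assumption: the client and Guzzle were installed with Composer.
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Swader\Diffbot\Diffbot;

$client = new Client();

// setHttpClient returns $this, so the call can be chained onto the constructor.
$diffbot = (new Diffbot('my_token'))->setHttpClient($client);

// Compare the client handed back by the getter with the one that was injected.
var_dump($diffbot->getHttpClient() === $client);
```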

Swader\Diffbot\Diffbot::setEntityFactory($factory)
Parameters:
  • $factory (Swader\Diffbot\Interfaces\EntityFactory) – The entity factory to use.
Returns:

$this

Allows for changing the entity factory in use when returning and processing Diffbot-provided data. A custom Entity Factory might, for example, return Author entities (also custom) for all calls to a custom API set up in a user’s Diffbot account. This helps with getting fully consumable custom data right from the API source, rather than requiring additional processing.

If not explicitly set, defaults to the built-in Swader\Diffbot\Factory\Entity.

Usage:

$newEntityFactory = new \My\Custom\EntityFactory();

$diffbot = new Diffbot('my_token');
$diffbot->setEntityFactory($newEntityFactory);

// @todo: Full tutorial about a custom Entity and EntityFactory

Swader\Diffbot\Diffbot::getEntityFactory()
Returns: Swader\Diffbot\Interfaces\EntityFactory

Returns the currently defined Swader\Diffbot\Interfaces\EntityFactory instance. This method generally isn’t needed outside of testing scenarios. See above for usage of the setter.

Swader\Diffbot\Diffbot::createProductApi($url)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
Returns:

Swader\Diffbot\Api\Product

The product API turns web shops, catalogs, etc. into structured JSON (think eBay, Amazon...). This method creates an instance of the Swader\Diffbot\Api\Product class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). For a detailed directory of available methods and in-depth usage examples, see the Swader\Diffbot\Api\Product documentation.

Usage:

$api = $diffbot->createProductApi("http://www.amazon.com/Oh-The-Places-Youll-Go/dp/0679805273/");
$result = $api->call();

echo $result->offerPrice; // $11.99
echo $result->getIsbn(); // 0679805273

Swader\Diffbot\Diffbot::createArticleApi($url)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
Returns:

Swader\Diffbot\Api\Article

The article API turns online news posts, blog articles, etc. into structured JSON. This method creates an instance of the Swader\Diffbot\Api\Article class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). For a detailed directory of available methods and in-depth usage examples, see the Swader\Diffbot\Api\Article documentation.

Usage:

$api = $diffbot->createArticleApi("http://techcrunch.com/2012/05/31/diffbot-raises-2-million-seed-round-for-web-content-extraction-technology/");
$result = $api->call();

echo $result->publisherCountry; // United States
echo $result->getAuthor(); // Sarah Perez

Swader\Diffbot\Diffbot::createImageApi($url)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
Returns:

Swader\Diffbot\Api\Image

The image API finds images in a post and returns them as JSON. This method creates an instance of the Swader\Diffbot\Api\Image class. The method accepts a single string as a parameter: either a URL to process for images, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). For a detailed directory of available methods and in-depth usage examples, see the Swader\Diffbot\Api\Image documentation. Note that unlike Product and Article, the Image API can return several Image entities (see usage below). If not iterated through, the result refers to the first image only.

Usage:

$api = $diffbot->createImageApi("http://smittenkitchen.com/blog/2012/01/buckwheat-baby-with-salted-caramel-syrup/");
$result = $api->call();

echo $result->naturalHeight; // 333

foreach ($result as $image) {
    // Read from the loop variable, not from $result, which refers to the first image only
    echo $image->title;
    echo $image->getXPath();
}

Swader\Diffbot\Diffbot::createAnalyzeApi($url)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
Returns:

Swader\Diffbot\Api\Analyze

The analyze API tries to autodetect the content it’s dealing with (image, product, article...) and extracts it into structured JSON. This method creates an instance of the Swader\Diffbot\Api\Analyze class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). The Analyze API is the default API used in Swader\Diffbot\Diffbot::crawl mode.

Usage:

$api = $diffbot->createAnalyzeApi("http://techcrunch.com/2012/05/31/diffbot-raises-2-million-seed-round-for-web-content-extraction-technology/");
$result = $api->call();

echo $result->publisherCountry; // United States
echo $result->getAuthor(); // Sarah Perez

Swader\Diffbot\Diffbot::createDiscussionApi($url)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
Returns:

Swader\Diffbot\Api\Discussion

The discussion API turns online comments, forum topics or pages of reviews into structured JSON. Think Amazon review sections, YouTube comments, Disqus comments on articles, etc. This method creates an instance of the Swader\Diffbot\Api\Discussion class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). Like the Image API above, this one can also return several entities per call (the individual posts), if available, along with other data; see usage below.

Usage:

$api = $diffbot->createDiscussionApi("http://boards.straightdope.com/sdmb/showthread.php?t=740315");
$result = $api->call();

echo $result->numPosts; // 43
echo $result->getParticipants(); // 23

foreach ($result as $post) {
    echo $post->getAuthor();
    echo $post->votes;
}

Swader\Diffbot\Diffbot::createCustomApi($url, $name)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
  • $name (string) – Name of the custom API as defined in the Diffbot UI
Returns:

Swader\Diffbot\Api\Custom

Diffbot customers can define Custom APIs. For a tutorial on doing this, see here. In short, you can tell Diffbot how to recognize certain areas of a web page and have it translate them into JSON for you if none of the standard APIs do the trick. This allows for much more lightweight and specific calls, resulting in a quicker turnaround and (usually) more precise data. This method creates an instance of the Swader\Diffbot\Api\Custom class. The method accepts two parameters: first, either a URL to process or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below); second, the name of the custom API to use. Unlike other APIs, this one has no specific entity to return and instead returns a Swader\Diffbot\Entity\Wildcard entity, which matches anything.

Usage:

$api = $diffbot->createCustomApi("http://sitepoint.com/author/bskvorc", "AuthorFolio");
$result = $api->call();

echo $result->bio; // Bruno is a coder from Croatia with Master's Degrees in...

Swader\Diffbot\Diffbot::crawl($name = null, Swader\Diffbot\Api $api = null)
Parameters:
  • $name (string) – Name of the new crawljob. If omitted, activates read-only mode and returns joint data about all defined crawljobs for the current Diffbot token.
  • $api (Swader\Diffbot\Api) – Instance of the API to process the crawled URLs. If omitted, defaults to Swader\Diffbot\Api\Analyze.
Returns:

Swader\Diffbot\Api\Crawl

The crawl method is used to create a new Crawlbot job (crawljob). To find out more about Crawlbot and what, how and why it does what it does, see here. I also recommend reading the Crawlbot API docs and the Crawlbot support topics so you can dive right in without being too confused by the code below.

In a nutshell, Crawlbot crawls a set of seed URLs for links (even if a subdomain is passed to it as a seed URL, it still looks through the entire main domain and all other subdomains it can find) and then processes all the pages it finds using the API you define (or the Analyze API by default). The result of the call is a collection of Swader\Diffbot\Entity\JobCrawl objects, each with details about a defined job. To actually get the data obtained by crawling and processing, use the Swader\Diffbot\Diffbot::search API.

Here’s how you can create a crawljob (see the detailed Swader\Diffbot\Api\Search documentation for a step-by-step guide with explanations):

$url = 'crawl';
$articleApi = $diffbot->createArticleApi($url)->setDiscussion(false);

$crawl = $diffbot->crawl('mycrawl_01', $articleApi);

$crawl->setSeeds(['http://sitepoint.com']);

$job = $crawl->call();

// See JobCrawl class to find out which getters are available
dump($job->getDownloadUrl("json")); // outputs download URL to JSON dataset of the job's result
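The read-only mode mentioned in the parameter list (calling crawl without a name) can be sketched as follows. This is a minimal example, assuming a Composer install (vendor/autoload.php), a valid token in place of the placeholder, and network access to the Diffbot API; it only uses the getDownloadUrl getter already shown above.

```php
<?php
// Assumption: the client was installed with Composer, so the autoloader lives in vendor/.
require 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token'); // placeholder token

// No name passed in: read-only mode, so no new crawljob is created.
$jobs = $diffbot->crawl()->call();

// Each item describes one crawljob defined for the current token.
foreach ($jobs as $job) {
    echo $job->getDownloadUrl('json'), PHP_EOL;
}
```

Listing the existing jobs first is a cheap way to check whether a crawljob name is already taken before creating a new one.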