Website Metadata Agent Creates events Receives events Dry runs huginn_website_metadata_agent

The Website Metadata Agent extracts metadata from HTML. It supports schema.org microdata, embedded JSON-LD, and the common meta tag attributes.

data HTML to use in the extraction process; use Liquid formatting to select data from incoming events.

url optionally sets the source URL of the provided HTML (without a URL, schema.org links cannot be extracted properly).

result_key sets the key which will contain the extracted information.

merge set to true to retain the received payload and update it with the extracted result

Liquid formatting can be used in all options.
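A hypothetical options hash, assuming the incoming event carries the page HTML in a body key and its address in a url key (both key names are illustrative):

```json
{
  "data": "{{ body }}",
  "url": "{{ url }}",
  "result_key": "metadata",
  "merge": "true"
}
```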


Readability Agent Creates events Receives events Dry runs huginn_readability_agent

The Readability Agent extracts the primary readable content of a website.

data HTML to use in the extraction process; use Liquid formatting to select data from incoming events.

tags comma separated list of HTML tags to sanitize

remove_empty_nodes remove <p> tags that have no text content; also removes <p> tags that contain only images

attributes comma separated whitelist of allowed HTML tag attributes

blacklist CSS selector of elements to explicitly remove

whitelist CSS selector of elements to explicitly scope to

result_key sets the key which will contain the extracted information.

merge set to true to retain the received payload and update it with the extracted result

clean_output removes \t characters and duplicate newlines from the output when set to true.

Liquid formatting can be used in all options.
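A hypothetical options hash (the tag list, selector, and key names are illustrative):

```json
{
  "data": "{{ body }}",
  "tags": "div,p,a,img",
  "remove_empty_nodes": "true",
  "blacklist": ".sidebar, .advertisement",
  "result_key": "content",
  "merge": "true",
  "clean_output": "true"
}
```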


Withings Agent Receives events Dry runs huginn_withings_agent

Adds Activities from Events to your Withings Account

This agent will create new activities in your Withings account from events. This is useful if you own multiple devices (e.g., a Fitbit and a Withings) and want to consolidate your calorie consumption in one place.

Withings has an OAuth v2 API, but it only allows reading activities. To add activities, this agent simulates Withings’ website and uses their private API, which is why it requires a username and password. The password is not stored in plain text.

You can retrieve your user_id from the website: log in with your credentials, then look at the URL. The user_id is the number at the beginning.

Example Event

The agent expects an event with the following fields:

  {
    'activity_name'   => 'Walking',               # o/w defaults to `other`
    'timezone'        => 'America/Los_Angeles',   # o/w uses `options`
    'start_time'      => 1578401100,              # epoch
    'end_time'        => 1578404700,              # epoch
    'calories'        => 500,                     # kcal
    'distance'        => 1000,                    # in meters
    'intensity'       => 40,                      # defaults to 50
  }

If end_time is not available, one can specify duration in seconds; similarly, if the subcategory is known, it can be specified as subcategory instead of activity_name.

Valid activity_name values are walking (1), running (2), hiking (3), bicycling (6), swimming (7), tennis (12), weights (16), class (17), elliptical (18), basketball (20), soccer (21), volleyball (24) and yoga (28).
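The mapping of names to subcategory IDs can be written as a simple lookup; this is a sketch of the fallback behaviour described above, not the agent's actual code:

```ruby
# Subcategory IDs for each valid activity_name, as listed above.
ACTIVITY_SUBCATEGORIES = {
  'walking'    => 1,  'running'    => 2,  'hiking'     => 3,
  'bicycling'  => 6,  'swimming'   => 7,  'tennis'     => 12,
  'weights'    => 16, 'class'      => 17, 'elliptical' => 18,
  'basketball' => 20, 'soccer'     => 21, 'volleyball' => 24,
  'yoga'       => 28
}.freeze

# Resolve an event's activity_name to its subcategory; names that
# are not in the list fall back to the `other` activity.
def subcategory_for(activity_name)
  ACTIVITY_SUBCATEGORIES.fetch(activity_name.to_s.downcase, :other)
end
```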


Phantom Js Cloud Agent Creates events Receives events Dry runs

This Agent generates PhantomJs Cloud URLs that can be used to render JavaScript-heavy webpages for content extraction.

URLs generated by this Agent are formulated in accordance with the PhantomJs Cloud API. The generated URLs can then be supplied to a Website Agent to fetch and parse the content.

Sign up to get an API key and add it to your Huginn credentials.

Please see the Huginn Wiki for more info.

Options:

  • Api key - PhantomJs Cloud API Key credential stored in Huginn
  • Url - The url to render
  • Mode - Create a new clean event or merge old payload with new values (default: clean)
  • Render type - Render as html, plain text without html tags, or jpg as screenshot of the page (default: html)
  • Output as json - Return the page contents and metadata as a JSON object (default: false)
  • Ignore images - Skip loading of inlined images (default: false)
  • User agent - A custom User-Agent name (default: Huginn - https://github.com/huginn/huginn)
  • Wait interval - Milliseconds to delay rendering after the last resource is finished loading. This is useful in case there are any AJAX requests or animations that need to finish up. This can safely be set to 0 if you know there are no AJAX or animations you need to wait for (default: 1000ms)

As this agent only provides a limited subset of the most commonly used options, you can follow this guide to make full use of additional options PhantomJsCloud provides.


Post Agent Creates events Receives events Consumes file pointer Dry runs

A Post Agent receives events from other agents (or runs periodically), merges those events with the Liquid-interpolated contents of payload, and sends the results as POST (or GET) requests to a specified url. To skip merging in the incoming event, but still send the interpolated payload, set no_merge to true.

The post_url field must specify where you would like to send requests. Please include the URI scheme (http or https).

The method used can be any of get, post, put, patch, and delete.

By default, non-GETs will be sent with form encoding (application/x-www-form-urlencoded).

Change content_type to json to send JSON instead.

Change content_type to xml to send XML, where the name of the root element may be specified using xml_root, defaulting to post.

When content_type contains a MIME type and payload is a string, its interpolated value will be sent as a string in the HTTP request’s body, and the request’s Content-Type HTTP header will be set to content_type. When payload is a string, no_merge has to be set to true.
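For example, a hypothetical configuration that sends a JSON body built from the incoming event (the URL and field names are illustrative):

```json
{
  "post_url": "https://example.com/api/items",
  "method": "post",
  "content_type": "json",
  "payload": {
    "title": "{{ title }}",
    "source": "huginn"
  },
  "emit_events": "true"
}
```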

If emit_events is set to true, the server response will be emitted as an Event and can be fed to a WebsiteAgent for parsing (using its data_from_event and type options). No data processing will be attempted by this Agent, so the Event’s “body” value will always be raw text. The Event will also have a “headers” hash and a “status” integer value.

If output_mode is set to merge, the emitted Event will be merged into the original contents of the received Event.

Set event_headers to a list of header names, either as an array of strings or as a comma-separated string, to include only some of the header values.

Set event_headers_style to one of the following values to normalize the keys of “headers” for downstream agents’ convenience:

  • capitalized (default) - Header names are capitalized; e.g. “Content-Type”
  • downcased - Header names are downcased; e.g. “content-type”
  • snakecased - Header names are snakecased; e.g. “content_type”
  • raw - Backward compatibility option to leave them unmodified from what the underlying HTTP library returns.
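The three normalizing styles can be sketched as plain string transformations (an illustration of the naming rules, not the agent's implementation):

```ruby
# Normalize an HTTP header name according to event_headers_style.
def normalize_header(name, style)
  case style
  when 'capitalized' then name.split('-').map(&:capitalize).join('-')
  when 'downcased'   then name.downcase
  when 'snakecased'  then name.downcase.tr('-', '_')
  else name # 'raw' leaves the name as the HTTP library returned it
  end
end
```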

Other Options:

  • headers - When present, it should be a hash of headers to send with the request.
  • basic_auth - Specify HTTP basic auth parameters: "username:password", or ["username", "password"].
  • disable_ssl_verification - Set to true to disable ssl verification.
  • user_agent - A custom User-Agent name (default: “Faraday v0.12.1”).

This agent can consume a ‘file pointer’ event from the following agents with no additional configuration: FtpsiteAgent, S3Agent, LocalFileAgent. Read more about the concept in the wiki.

When receiving a file_pointer, the request will be sent with multipart encoding (multipart/form-data) and content_type is ignored. upload_key can be used to specify the parameter in which the file will be sent; it defaults to file.


Rss Agent Creates events Dry runs

The RSS Agent consumes RSS feeds and emits events when they change.

This agent, using Feedjira as a base, can parse various types of RSS and Atom feeds and has some special handlers for FeedBurner, iTunes RSS, and so on. However, supported fields are limited by its general and abstract nature. For complex feeds with additional field types, we recommend using a WebsiteAgent. See this example.

If you want to output an RSS feed, use the DataOutputAgent.

Options:

  • url - The URL of the RSS feed (an array of URLs can also be used; items with identical guids across feeds will be considered duplicates).
  • include_feed_info - Set to true to include feed information in each event.
  • clean - Set to true to sanitize description and content as HTML fragments, removing unknown/unsafe elements and attributes.
  • expected_update_period_in_days - How often you expect this RSS feed to change. If more than this amount of time passes without an update, the Agent will mark itself as not working.
  • headers - When present, it should be a hash of headers to send with the request.
  • basic_auth - Specify HTTP basic auth parameters: "username:password", or ["username", "password"].
  • disable_ssl_verification - Set to true to disable ssl verification.
  • disable_url_encoding - Set to true to disable url encoding.
  • force_encoding - Set force_encoding to an encoding name if the website is known to respond with a missing, invalid or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
  • user_agent - A custom User-Agent name (default: “Faraday v0.12.1”).
  • max_events_per_run - Limit number of events created (items parsed) per run for feed.
  • remembered_id_count - Number of IDs to keep track of and avoid re-emitting (default: 500).
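A hypothetical configuration (the feed URL and limits are illustrative):

```json
{
  "url": "https://example.com/feed.xml",
  "expected_update_period_in_days": "5",
  "clean": "false",
  "max_events_per_run": "20",
  "remembered_id_count": "500"
}
```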

Ordering Events

To specify the order of events created in each run, set events_order to an array of sort keys, each of which looks like either expression or [expression, type, descending], as described below:

  • expression is a Liquid template to generate a string to be used as sort key.

  • type (optional) is one of string (default), number and time, which specifies how to evaluate expression for comparison.

  • descending (optional) is a boolean value to determine if comparison should be done in descending (reverse) order, which defaults to false.

Sort keys listed earlier take precedence over ones listed later. For example, if you want to sort articles by the date and then by the author, specify [["{{date}}", "time"], "{{author}}"].

Sorting is done stably, so even if all events have the same set of sort key values the original order is retained. Also, a special Liquid variable _index_ is provided, which contains the zero-based index number of each event, which means you can exactly reverse the order of events by specifying [["{{_index_}}", "number", true]].
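These rules can be sketched in plain Ruby (an illustration of the comparison semantics, not Huginn's implementation; each expression is reduced to a plain payload key for brevity):

```ruby
require 'time'

# Stable multi-key sort per the rules above: earlier keys take
# precedence, `type` controls evaluation, `descending` reverses a key.
def order_events(events, sort_keys)
  events.each_with_index.sort_by { |event, index|
    keys = sort_keys.map { |expr, type, descending|
      value =
        case type
        when 'number' then event[expr].to_f
        when 'time'   then Time.parse(event[expr].to_s).to_f
        else               event[expr].to_s
        end
      # This sketch negates only numeric values for descending order;
      # descending string keys would need a comparator instead.
      (descending && value.is_a?(Float)) ? -value : value
    }
    keys << index # the original index keeps the sort stable
  }.map(&:first)
end
```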

If the include_sort_info option is set, each created event will have a sort_info key whose value is a hash containing the following keys:

  • position: 1-based index of each event after the sort
  • count: Total number of events sorted

In this Agent, the default value for events_order is [["{{date_published}}","time"],["{{last_updated}}","time"]].


Website Agent Creates events Receives events Dry runs

The Website Agent scrapes a website, XML document, or JSON feed and creates Events based on the results.

Specify a url and select a mode for when to create Events based on the scraped data, either all, on_change, or merge (if fetching based on an Event, see below).

The url option can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape).

The WebsiteAgent can also scrape based on incoming events.

  • Set the url_from_event option to a Liquid template to generate the url to access based on the Event. (To fetch the url in the Event’s url key, for example, set url_from_event to {{ url }}.)
  • Alternatively, set data_from_event to a Liquid template to use data directly without fetching any URL. (For example, set it to {{ html }} to use HTML contained in the html key of the incoming Event.)
  • If you specify merge for the mode option, Huginn will retain the old payload and update it with new values.

Supported Document Types

The type value can be xml, html, json, or text.

To tell the Agent how to parse the content, specify extract as a hash with keys naming the extractions and values of hashes.

Note that for all of the formats, whatever you extract MUST have the same number of matches for each extractor except when it has repeat set to true. E.g., if you’re extracting rows, all extractors must match all rows. For generating CSS selectors, something like SelectorGadget may be helpful.

For extractors with hidden set to true, they will be excluded from the payloads of events created by the Agent, but can be used and interpolated in the template option explained below.

For extractors with repeat set to true, their first matches will be included in all extracts. This is useful, for example, when you want to include the title of a page in all events created from the page.

Scraping HTML and XML

When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in css or an XPath expression in xpath. It then evaluates an XPath expression in value (default: .) on each node in the node set, converting the result into a string. Here’s an example:

"extract": {
  "url": { "css": "#comic img", "value": "@src" },
  "title": { "css": "#comic img", "value": "@title" },
  "body_text": { "css": "div.main", "value": "string(.)" },
  "page_title": { "css": "title", "value": "string(.)", "repeat": true }
}

or

"extract": {
  "url": { "xpath": "//*[@class='blog-item']/a/@href", "value": "." },
  "title": { "xpath": "//*[@class='blog-item']/a", "value": "normalize-space(.)" },
  "description": { "xpath": "//*[@class='blog-item']/div[1]", "value": "string(.)" }
}

“@attr” is the XPath expression to extract the value of an attribute named attr from a node (such as “@href” from a hyperlink), and string(.) gives a string with all the enclosed text nodes concatenated without entity escaping (such as &amp;). To extract the innerHTML, use ./node(); to extract the outer HTML, use a single dot (.).

You can also use XPath functions like normalize-space to strip and squeeze whitespace, substring-after to extract part of a text, and translate to remove commas from formatted numbers, etc. Instead of passing string(.) to these functions, you can just pass . like normalize-space(.) and translate(., ',', '').
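For example, a hypothetical extractor that uses translate to strip the thousands separators from a formatted price (the selector and key name are illustrative):

```json
"extract": {
  "price": { "css": "span.price", "value": "translate(., ',', '')" }
}
```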

Beware that when parsing an XML document (i.e. type is xml) using xpath expressions, all namespaces are stripped from the document unless the top-level option use_namespaces is set to true.

For extraction with array set to true, all matches will be extracted into an array. This is useful when extracting list elements or multiple parts of a website that can only be matched with the same selector.

Scraping JSON

When parsing JSON, these sub-hashes specify JSONPaths to the values that you care about.

Sample incoming event:

{ "results": {
    "data": [
      {
        "title": "Lorem ipsum 1",
        "description": "Aliquam pharetra leo ipsum.",
        "price": 8.95
      },
      {
        "title": "Lorem ipsum 2",
        "description": "Suspendisse a pulvinar lacus.",
        "price": 12.99
      },
      {
        "title": "Lorem ipsum 3",
        "description": "Praesent ac arcu tellus.",
        "price": 8.99
      }
    ]
  }
}

Sample rule:

"extract": {
  "title": { "path": "results.data[*].title" },
  "description": { "path": "results.data[*].description" }
}

In this example the * wildcard character makes the parser iterate through all items of the data array. Three events will be created as a result.

Sample outgoing events:

[
  {
    "title": "Lorem ipsum 1",
    "description": "Aliquam pharetra leo ipsum."
  },
  {
    "title": "Lorem ipsum 2",
    "description": "Suspendisse a pulvinar lacus."
  },
  {
    "title": "Lorem ipsum 3",
    "description": "Praesent ac arcu tellus."
  }
]

The extract option can be skipped for the JSON type, causing the full JSON response to be returned.
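The wildcard iteration above can be sketched in plain Ruby (an illustration of the semantics, not the agent's implementation):

```ruby
require 'json'

doc = JSON.parse(<<~JSON)
  { "results": { "data": [
    { "title": "Lorem ipsum 1", "description": "Aliquam pharetra leo ipsum.", "price": 8.95 },
    { "title": "Lorem ipsum 2", "description": "Suspendisse a pulvinar lacus.", "price": 12.99 },
    { "title": "Lorem ipsum 3", "description": "Praesent ac arcu tellus.", "price": 8.99 }
  ] } }
JSON

# "results.data[*]" selects every element of the data array;
# each extraction key then yields one value per element, so the
# agent emits one event per element.
events = doc.dig("results", "data").map do |item|
  { "title" => item["title"], "description" => item["description"] }
end
```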

Scraping Text

When parsing text, each sub-hash should contain a regexp and index. Output text is matched against the regular expression repeatedly from the beginning through to the end, collecting a captured group specified by index in each match. Each index should be either an integer or a string name which corresponds to (?<name>...). For example, to parse lines of word: definition, the following should work:

"extract": {
  "word": { "regexp": "^(.+?): (.+)$", "index": 1 },
  "definition": { "regexp": "^(.+?): (.+)$", "index": 2 }
}

Or if you prefer names to numbers for index:

"extract": {
  "word": { "regexp": "^(?<word>.+?): (?<definition>.+)$", "index": "word" },
  "definition": { "regexp": "^(?<word>.+?): (?<definition>.+)$", "index": "definition" }
}

To extract the whole content as one event:

"extract": {
  "content": { "regexp": "\\A(?m:.)*\\z", "index": 0 }
}

Beware that . does not match the newline character (LF) unless the m flag is in effect, and ^/$ match the beginning/end of every line. See this document to learn the regular expression variant used in this service.
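The repeated-matching behaviour can be reproduced with String#scan (a sketch of the semantics, not the agent's code):

```ruby
text = "apple: a fruit\nruby: a language"

# Match the regexp repeatedly from the beginning to the end,
# collecting the capture group each extractor's index refers to.
regexp = /^(?<word>.+?): (?<definition>.+)$/
words       = text.scan(regexp).map { |m| m[0] } # index 1 / "word"
definitions = text.scan(regexp).map { |m| m[1] } # index 2 / "definition"
```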

General Options

Can be configured to use HTTP basic auth by including the basic_auth parameter with "username:password", or ["username", "password"].

Set expected_update_period_in_days to the maximum amount of time that you’d expect to pass between Events being created by this Agent. This is only used to set the “working” status.

Set uniqueness_look_back to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of 200 or 3x the number of detected received results.

Set force_encoding to an encoding name (such as UTF-8 and ISO-8859-1) if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Below are the steps used by Huginn to detect the encoding of fetched content:

  1. If force_encoding is given, that value is used.
  2. If the Content-Type header contains a charset parameter, that value is used.
  3. When type is html or xml, Huginn checks for the presence of a BOM, XML declaration with attribute “encoding”, or an HTML meta tag with charset information, and uses that if found.
  4. Huginn falls back to UTF-8 (not ISO-8859-1).
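Steps 1, 2, and 4 can be sketched as follows (step 3's content sniffing for a BOM, XML declaration, or meta tag is omitted; this is an illustration, not Huginn's code):

```ruby
# Pick an encoding per the detection order above: an explicit
# force_encoding wins, then the Content-Type charset parameter,
# then the UTF-8 fallback (not ISO-8859-1).
def detect_encoding(force_encoding, content_type_header)
  return force_encoding if force_encoding
  if content_type_header.to_s =~ /charset=["']?([^"';\s]+)/i
    return Regexp.last_match(1)
  end
  'UTF-8'
end
```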

Set user_agent to a custom User-Agent name if the website does not like the default value (Huginn - https://github.com/huginn/huginn).

The headers field is optional. When present, it should be a hash of headers to send with the request.

Set disable_ssl_verification to true to disable ssl verification.

Set unzip to gzip to inflate the resource using gzip.

Set http_success_codes to an array of status codes (e.g., [404, 422]) to treat HTTP response codes beyond 200 as successes.

If a template option is given, its value must be a hash, whose key-value pairs are interpolated after extraction for each iteration and merged with the payload. In the template, keys of extracted data can be interpolated, and some additional variables are also available as explained in the next section. For example:

"template": {
  "url": "{{ url | to_uri: _response_.url }}",
  "description": "{{ body_text }}",
  "last_modified": "{{ _response_.headers.Last-Modified | date: '%FT%T' }}"
}

In the on_change mode, change is detected based on the resulting event payload after applying this option. If you want to add some keys to each event but ignore any change in them, set mode to all and put a DeDuplicationAgent downstream.

Liquid Templating

In Liquid templating, the following variables are available:

  • _url_: The URL specified to fetch the content from. When parsing data_from_event, this is not set.

  • _response_: A response object with the following keys:

    • status: HTTP status as integer (almost always 200). When parsing data_from_event, this is set to the value of the status key in the incoming Event, if it is a number or a string convertible to an integer.

    • headers: Response headers; for example, {{ _response_.headers.Content-Type }} expands to the value of the Content-Type header. Keys are insensitive to cases and -/_. When parsing data_from_event, this is constructed from the value of the headers key in the incoming Event, if it is a hash.

    • url: The final URL of the fetched page, following redirects. When parsing data_from_event, this is set to the value of the url key in the incoming Event. Using this in the template option, you can resolve relative URLs extracted from a document like {{ link | to_uri: _response_.url }} and {{ content | rebase_hrefs: _response_.url }}.

Ordering Events

To specify the order of events created in each run, set events_order to an array of sort keys, each of which looks like either expression or [expression, type, descending], as described below:

  • expression is a Liquid template to generate a string to be used as sort key.

  • type (optional) is one of string (default), number and time, which specifies how to evaluate expression for comparison.

  • descending (optional) is a boolean value to determine if comparison should be done in descending (reverse) order, which defaults to false.

Sort keys listed earlier take precedence over ones listed later. For example, if you want to sort articles by the date and then by the author, specify [["{{date}}", "time"], "{{author}}"].

Sorting is done stably, so even if all events have the same set of sort key values the original order is retained. Also, a special Liquid variable _index_ is provided, which contains the zero-based index number of each event, which means you can exactly reverse the order of events by specifying [["{{_index_}}", "number", true]].

If the include_sort_info option is set, each created event will have a sort_info key whose value is a hash containing the following keys:

  • position: 1-based index of each event after the sort
  • count: Total number of events sorted