To fit a crawl to speed and resource constraints, the Web Crawler connector offers detailed performance configuration options.
Connector Properties
The following settings are controlled by properties in the individual Web Crawler connector configuration.
Fetch Timeout
A Web Crawler connector's Fetch Timeout property sets the maximum amount of time, in seconds, that the connector will wait for a requested address to finish responding. If a web page takes longer than this value to finish downloading, the connection is dropped and that page is skipped. You can use this property to prevent the Web Crawler connector from stalling on a URL that is taking too long to finish responding.
You can set this property's value in the Web Crawler connector editor's Crawling Options tab, in the Performance subsection. The default value is 60 seconds.
Crawl Delay
The crawl delay is the amount of time, in seconds, that the Web Crawler connector waits between successive requests to the same server. If you are concerned about overwhelming the server with Web Crawler requests, you can use a custom crawl delay to decrease the request rate.
There are three configurable properties which control the crawl delay:
- Default Crawl Delay - the default number of seconds to wait.
- Minimum Crawl Delay - minimum amount of time to wait when the connector's Threads Per Host property (see below) is set to a value greater than 1.
- Maximum Crawl Delay - maximum amount of time to wait if the crawl delay set in the site's robots.txt file is of greater value.
You can set these properties' values in the Web Crawler connector editor's Crawling Options tab, in the Performance subsection. The default values are 0 for Default Crawl Delay and Minimum Crawl Delay, and -1 (no maximum delay) for Maximum Crawl Delay.
Fetch Threads
Multiple threads can send and process Web Crawler connector web requests simultaneously.
There are two properties which control the connector's thread counts:
- Fetch Threads Total - The total number of threads polling URL queues and making requests.
- Threads Per Host - The maximum number of threads allowed to request from a single host at a given time. Specifying a value greater than 1 for this property will cause the Web Crawler connector to ignore any crawl delay specified in the target site's robots.txt file and to wait for the number of seconds given by the Minimum Crawl Delay property (see above) instead.
You can set these properties' values in the Web Crawler connector editor's Crawling Options tab, in the Performance subsection. The default value is 10 for both properties.
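The way the crawl delay properties interact with Threads Per Host, as described above, can be sketched as a small function. This is only an illustrative reading of the rules in this section, not the connector's actual implementation: the function name `effective_crawl_delay` and its parameters are hypothetical, and the sketch assumes that the Default Crawl Delay applies whenever robots.txt specifies no delay and that any negative Maximum Crawl Delay means "no maximum".

```python
from typing import Optional


def effective_crawl_delay(
    default_delay: float = 0,
    minimum_delay: float = 0,
    maximum_delay: float = -1,
    threads_per_host: int = 10,
    robots_delay: Optional[float] = None,
) -> float:
    """Seconds to wait between requests to one host (hypothetical helper)."""
    # With more than one thread per host, any robots.txt delay is ignored
    # and the Minimum Crawl Delay is used instead.
    if threads_per_host > 1:
        return minimum_delay
    # Assumption: with no robots.txt delay, the Default Crawl Delay applies.
    if robots_delay is None:
        return default_delay
    # Assumption: a negative maximum (the default is -1) means "no maximum".
    if maximum_delay < 0:
        return robots_delay
    # Otherwise a larger robots.txt delay is capped at the Maximum Crawl Delay.
    return min(robots_delay, maximum_delay)
```

For example, under these assumptions a site that requests a 30-second delay in robots.txt would be crawled with a 5-second delay if Threads Per Host is 1 and Maximum Crawl Delay is 5.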
Process Timeout and Inactivity Timeout
(Available in Web Crawler 1.4.9 and later releases.)
Long-running crawls can consume excessive memory while waiting for ingestion to complete. To prevent this, a Web Crawler connector will terminate itself under either of two conditions:
- If the Web Crawler connector's run time exceeds the period in milliseconds specified in its Process Timeout property value.
- If the Web Crawler connector has been inactive (i.e., has not audited any ingestion action) for a time exceeding the period in milliseconds specified in its Inactivity Timeout property value.
You can set these properties' values in the Web Crawler connector editor's Advanced tab. The default values are 604,800,000 milliseconds (7 days) for the Process Timeout and -1 (no limit) for the Inactivity Timeout. Setting a value of 0 or less for the Inactivity Timeout disables self-termination based on audit inactivity.
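Because both timeouts are expressed in milliseconds, it can help to compute the values rather than type them by hand. The following is a minimal sketch of the two termination conditions described above; the names `days_to_ms` and `should_terminate` are hypothetical and are not part of the connector, and the sketch assumes the Process Timeout is always a positive value.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in one day


def days_to_ms(days: float) -> int:
    """Convert a number of days to milliseconds."""
    return int(days * MS_PER_DAY)


def should_terminate(
    run_ms: int,
    idle_ms: int,
    process_timeout_ms: int = days_to_ms(7),  # default: 604,800,000 ms
    inactivity_timeout_ms: int = -1,          # default: no limit
) -> bool:
    """Return True if either self-termination condition is met (illustrative only)."""
    # Condition 1: total run time exceeds the Process Timeout.
    if run_ms > process_timeout_ms:
        return True
    # Condition 2: time since the last audited ingestion action exceeds the
    # Inactivity Timeout. A value of 0 or less disables this check.
    if inactivity_timeout_ms > 0 and idle_ms > inactivity_timeout_ms:
        return True
    return False
```

Here `days_to_ms(7)` reproduces the documented Process Timeout default of 604,800,000 milliseconds.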
Max Memory
Because the Web Crawler runs in a separate JVM from other Attivio processes, it can be configured with an independent memory cap. By default, Web Crawler processes are capped at 2,048 MB of memory to avoid resource competition with the node and other processes on the connector host. The connector's Max Memory property can be used to change this memory limit.
You can set this property's value in the Web Crawler connector editor's Advanced tab. The default value is 2,048 MB (2 GB).
When setting this value, always bear in mind the amount of memory being used by other processes on the connector hosts (e.g., Attivio nodes, store and Perfmon servers, the Attivio Platform Agent service, etc.).
System Properties
The following settings are controlled by Attivio properties and are therefore global (affecting all Web Crawler connectors defined in the project).
HBase Mini-Cluster Memory
(This property only applies to unclustered Attivio projects.)
In unclustered Attivio projects, the Web Crawler connector uses an HBase (database) mini-cluster as a content store. By default, Attivio allocates 1536 MB (1.5 GB) of memory for this mini-cluster.
For unclustered Attivio projects which run multiple Web Crawler connectors concurrently, it may be necessary to increase the HBase mini-cluster's memory allocation. You can control this allocation by setting a value (expressed in MB) for the webcrawler.hbase.minicluster.memory.mb Attivio property. For instance, setting a value of "2048" will allocate 2048 MB (2 GB).
Increasing the HBase mini-cluster's memory allocation will increase Attivio's overall memory usage on hosts which run Web Crawler connectors.
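When planning host memory for an unclustered project, a rough budget can be estimated from the Max Memory and mini-cluster values. The sketch below is a back-of-the-envelope calculation only. It assumes that each concurrently running Web Crawler connector uses its own JVM capped at Max Memory and that a single HBase mini-cluster is shared by all of them, which this page does not state explicitly; the function name `crawler_memory_mb` is hypothetical.

```python
def crawler_memory_mb(
    concurrent_connectors: int,
    max_memory_mb: int = 2048,         # connector Max Memory default
    hbase_minicluster_mb: int = 1536,  # HBase mini-cluster default
) -> int:
    """Estimate the upper bound on crawler-related memory on one host, in MB."""
    # Assumption: one JVM per running connector plus one shared mini-cluster.
    return concurrent_connectors * max_memory_mb + hbase_minicluster_mb


# Three concurrent connectors with default settings.
print(crawler_memory_mb(3))                             # 7680
# Same crawl with the mini-cluster raised to 2048 MB.
print(crawler_memory_mb(3, hbase_minicluster_mb=2048))  # 8192
```

The result should be compared against the memory left on the host after accounting for Attivio nodes and other services.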
HBase Mini-Cluster Heap Region Size
(This property is available in Web Crawler 1.4.10 and later releases, and only applies to unclustered Attivio projects.)
In unclustered Attivio projects, the Web Crawler connector uses an HBase (database) mini-cluster as a content store. By default, Attivio sets the region size in this HBase mini-cluster's heap to 8 MB. This is to ensure that no retrieved document exceeds the heap region size, since this causes heap fragmentation, increases garbage-collection activity, and tends to reduce Web Crawler connector performance.
For unclustered Attivio projects which ingest very large documents via Web Crawler connector, it may be necessary to increase the HBase mini-cluster's heap region size to maintain optimal performance. You can control this heap region size by setting a value for the webcrawler.hbase.minicluster.g1heapregionsize Attivio property. This property accepts numeric values with suffixes "m" for MB, "g" for GB, and so on, in the same format as used for the Java -Xmx and -Xms command-line flags. As an example, setting a value of "9m" will set the HBase mini-cluster's heap region size to 9 MB.
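To decide whether the default region size is large enough for your content, you can convert a size string to bytes and compare it with your largest expected document. The following is a minimal sketch, not Attivio code: `parse_jvm_size` and `fits_in_region` are hypothetical helpers, and the parser handles only the "k", "m", and "g" suffixes (upper or lower case) or a bare byte count.

```python
import re

# Multipliers for the size suffixes handled by this sketch.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_jvm_size(value: str) -> int:
    """Convert a size string such as "9m" or "2g" to a number of bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", value.strip())
    if match is None:
        raise ValueError(f"Unrecognized size value: {value!r}")
    number, suffix = match.groups()
    return int(number) * _UNITS[suffix.lower()]


def fits_in_region(document_bytes: int, region_size: str = "8m") -> bool:
    """Return True if a document is no larger than the heap region size."""
    return document_bytes <= parse_jvm_size(region_size)
```

For instance, a document one byte larger than 8 MB fails `fits_in_region` with the default "8m" but passes with "9m".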