Following are the configurable properties of the Web Crawler connector, listed by the tab on which they appear in the Admin UI.

Scanner

Property | Type | Default | Required | Description
Connector Name | string | <none> | Yes | The name of the connector.
Node Set | string | (use default) | | The name of the nodeset where this connector runs. In simple examples, this is the local node.
Seed URIs | List of strings | <none> | Yes | The list of URIs where the crawl will begin.
Include Domains | List of strings | seed domains | | If the URI belongs to any of these domains, it is fetched. By default, this filter is populated with the host names/IP addresses of the seeds, which prevents the crawl frontier from spreading indefinitely (see the example after this table).
Crawl Depth | integer | 3 | | The number of links the Web Crawler connector follows from the seed before stopping. If set to 0, only the seed pages are crawled. Note that the size of the task grows geometrically as this number increases; for example, at an average of 10 links per page, each additional level of depth can multiply the number of pages fetched by roughly 10. Starting small is recommended.
Commit After Each Batch | boolean | true | | If true, the crawler commits the index after it finishes fetching, parsing, deduplicating, and indexing each batch of URLs.
Max Batch Size | integer | 1000 | | The maximum number of pages to fetch in each iteration before updating the crawl database and indexing the pages.
Document ID Prefix | string | <none> | | Prefix to add to each document ID.
Ingest Workflow | string | fileIngest | | The workflow to feed crawled page documents through.

Incremental

Incremental | boolean | false | | Indicates whether to crawl the site incrementally. If false, the crawler re-fetches all previously crawled pages every time the connector runs.
Sitemap URIs | List of strings | <none> | | The list of sitemap URIs to use for crawl optimization (see https://www.sitemaps.org/).
Default Fetch Interval | integer | 2592000 (30 days) | | The default number of seconds between re-fetches of a page. This value is overridden by metadata in sitemap/robots files. If set to a very low value, the Web Crawler connector may continually re-fetch the same pages and never complete.
Maximum Fetch Interval | integer | 7776000 (90 days) | | The maximum number of seconds between re-fetches of a page. After this period, every page in the crawler store is retried regardless of individual page status.
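
For illustration only (the host names below are placeholders, not shipped defaults), a connector seeded with two URIs would by default limit the crawl frontier to those two hosts unless additional Include Domains are added:

    Seed URIs:        https://www.example.com/
                      https://docs.example.com/start.html
    Include Domains:  www.example.com
                      docs.example.com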

Advanced

See the Advanced Tab description here for common properties.

Property | Type | Default | Description
Namespace ID | string | <none> | This is an internal field (display only).
Max Memory | long | <none> | The limit on the amount of memory, in MB, allocated to the remote Web Crawler process.
Extra Arguments | List of strings | <none> | Any arguments to supply to the Java command that starts the remote Web Crawler process (see the example after this table).
Process Timeout | long | 604800000 (7 days) | (Available in Web Crawler 1.4.9 and later releases.) The amount of time (in milliseconds) to allow the Web Crawler process to run before killing it to prevent memory leaks and performance decline.
Inactivity Timeout | long | -1 | (Available in Web Crawler 1.4.9 and later releases.) The maximum amount of time (in milliseconds) that the Web Crawler process can remain inactive before it shuts itself down. A value of 0 or less disables this shutdown.
Debug Port | integer | 0 | The port to listen on for a Java debugger connection. (No debug listening if 0 or below.)
Default Http Client | boolean | false | Whether to reset the HttpClient to factory defaults before requesting resources. Set this to true only if you are concerned that an HttpClient setting or request header is preventing connection to the server you are trying to crawl.
Crawl | boolean | true | Check if you want to fetch pages, parse the content, and fetch the outbound links.
Index | boolean | true | Check if you want to index the crawled pages.
Remote | boolean | true | Check if you want to crawl from a remote process. If this is set to false, the connector will not be stoppable.
Crawler Store Namespace | string | <none> | The HBase namespace prefix for stored crawl content. Set it to the name of the Web Crawler connector whose database this connector should use, or leave it blank to use the connector's own namespace.
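
As an example, the Extra Arguments list might carry standard JVM options for the remote Web Crawler process; the flags below are illustrative, not defaults:

    -XX:+UseG1GC
    -Duser.timezone=UTC
    -verbose:gc
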
Javascript

Execute Javascript | boolean | false | (Available in Web Crawler 1.3.0 and later releases.) Set to true to fully fetch dynamically generated content, such as single-page applications built with modern frameworks like Angular and React. When true, the crawler processes HTML resources as they appear after any Javascript on the page has been executed. Javascript execution can be costly; when it is enabled, limit the number of fetch threads to at most the number of CPUs available on the machine by adjusting both the "Fetch Threads Total" and "Threads Per Host" properties in the "Performance" subsection of the "Crawling Options" tab.
Javascript Wait Interval | long | 5000 (5 seconds) | If "Execute Javascript" is set to true, the interval to wait (in milliseconds) for any Javascript promises to resolve before considering the page completely rendered.
Chrome Binary | string | {install-dir}/bin/chromium/chrome | The path to the Chrome binary executable used for headless-browser Javascript execution.
Chrome Profile Directory | string | {local-data-directory}/chrome-profile | The path to the Chrome profile data directory used by the headless browser for temporary storage.
Javascript Mime Types | List of strings | text/html, application/x-html+xml, text/aspdotnet, text/asp, text/php, text/x-php, application/php, application/x-php, text/x-jsp | The list of MIME types that may contain dynamic Javascript content to be rendered via the headless browser.
Sandbox Javascript | boolean | true | (Available in Web Crawler 1.4.0 and later releases.) Execute Javascript in a sandbox environment so that the machine running the crawler cannot be affected by malicious code. Some kernels do not support sandboxing; if yours does not, you can switch this off as a workaround, but executing Javascript from unknown/external sources without sandboxing is dangerous.

Notes

See the Notes Tab description.

Scheduler

See the Schedule Tab description.

Crawling Options

The Crawling Options tab controls let you specify which URIs are followed and which are ignored so that the crawler can be more productive.

Property | Type | Default | Description
User Agent | string | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36 (this simulates a browser's user agent) | When the Web Crawler connector visits a web server, it identifies itself to the server by this User-Agent identifier, which is sent as part of the HTTP request.
Ignore Robots | boolean | false | If true, the crawler does not consult the robots.txt file before it starts fetching pages.
Ignore Robots Crawl Delay | boolean | true | If true, the crawler specifically ignores the crawl-delay interval when parsing the robots.txt file for a host.
Deleted Page Delay | integer | 1 | The number of days to wait between receiving a 404 HTTP status code for a previously successful page and deleting that page from the index.
Follow Redirects | boolean | true | If true, redirects are followed automatically at request time. Otherwise, redirects are queued one depth below their origins to avoid loops and traps.
Ingest Destination Url | boolean | false | If true, use the final destination address (after any redirects) as the 'uri' field of the ingest document fed to the index by the Web Crawler connector. Otherwise, the originally requested address is used.
Enable If-Modified-Since | boolean | true | If true, the Web Crawler connector sends an "If-Modified-Since" request header with the previous fetch time to avoid re-requesting pages that have not changed.
'Magic' Mimetype Detection | boolean | false | (Available in Web Crawler 1.4.0 and later releases.) If true, use Tika to extract highly specific content types from pages for more granular filtering and handling.
Includes

URI Include Patterns

List of strings<none>

If any of these regular expressions match the URI, it gets crawled.

Excludes

URI Exclude Patterns | List of strings | <none> | If any of these regular expressions match the URI, it is not crawled.
URI Exclude Extensions | List of strings | .jpg, .jpeg, .ico, .tif, .png, .bmp, .gif, .wmf, .avi, .mpg, .wmv, .wma, .ram, .asx, .asf, .mp3, .mp4, .wav, .ogg, .vmarc, .tar, .iso, .img, .rpm, .cab, .rar, ace, .exe, .java, .jar, .prz, .wrl, .midr, .ps, .ttf, .mso, .class, .dll, .so, .bin, .biz, .cgi, .com, dcr, .fb, .flv, .fm, .ics, .mov, .svg, .css, .swf, .js, .min, .zip, .tar, .tar.gz, .ova | The list of URI extensions to exclude from the crawl. If the URI has one of these extensions, it is not crawled.
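
Both pattern lists hold ordinary regular expressions that are matched against the URI. The following Java sketch (the patterns and URI are hypothetical, and it only demonstrates the matching itself, not the connector's internal precedence logic) can help when testing candidate include/exclude patterns:

    import java.util.List;
    import java.util.regex.Pattern;

    public class UriFilterSketch {
        public static void main(String[] args) {
            // Hypothetical patterns, written as they might be entered in the
            // URI Include Patterns and URI Exclude Patterns lists.
            List<Pattern> includes = List.of(
                    Pattern.compile("^https?://www\\.example\\.com/docs/.*"));
            List<Pattern> excludes = List.of(
                    Pattern.compile(".*\\?print=true.*"));

            String uri = "https://www.example.com/docs/getting-started?print=true";

            // Full-string matching is shown here; write patterns broad enough
            // to cover the entire URI (or anchor them explicitly).
            boolean matchesInclude = includes.stream().anyMatch(p -> p.matcher(uri).matches());
            boolean matchesExclude = excludes.stream().anyMatch(p -> p.matcher(uri).matches());

            System.out.println("matches include pattern: " + matchesInclude); // true
            System.out.println("matches exclude pattern: " + matchesExclude); // true
        }
    }
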
URI Normalization

URI Parameter Sorting | boolean | true | If true, normalize URLs by sorting their query parameters before crawling. This helps reduce duplicate page crawls for links such as http://www.example.com/page?a=1&b=2 and http://www.example.com/page?b=2&a=1.
Doc ID Rewrite Rules | Map of strings | <none> | Regular-expression mappings used to normalize the unique doc IDs created from URIs (see the example after this table).
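
Each rewrite rule maps a regular expression to a replacement string. As an illustration (the rule below is hypothetical, and the connector applies its rules internally), a rule that strips a session-ID path parameter behaves like Java's String.replaceAll:

    public class DocIdRewriteSketch {
        public static void main(String[] args) {
            // Hypothetical rule, as it might be entered in the Doc ID Rewrite Rules map:
            // the key is a regular expression, the value is its replacement.
            String regex = ";jsessionid=[^?#]*";
            String replacement = "";

            String docId = "http://www.example.com/page;jsessionid=1A2B3C?lang=en";
            System.out.println(docId.replaceAll(regex, replacement));
            // prints: http://www.example.com/page?lang=en
        }
    }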

Proxy

HTTP Proxy Environment Variable | string | <none> | The value to set for the 'http_proxy' environment variable. This is used by the headless browser and should be applied when using a proxy in combination with Javascript execution. It is not used for non-Javascript proxy configuration (see the example after this table).
HTTPS Proxy Environment Variable | string | <none> | The value to set for the 'https_proxy' environment variable. This is used by the headless browser and should be applied when using a proxy in combination with Javascript execution. It is not used for non-Javascript proxy configuration.
HTTP Proxy Host | string | <none> | Proxy host for all HTTP(S) traffic for this scanner. When crawling via a proxy, this property must be applied even if the proxy environment variables above are set.
HTTP Proxy Port | integer | -1 | Proxy port for all HTTP(S) traffic for this scanner. When crawling via a proxy, this property must be applied even if the proxy environment variables above are set. A value of -1 will use the proxy for all ports.
HTTP Proxy User | string | <none> | Proxy user name for all HTTP(S) traffic for this scanner. When crawling via a proxy, this property must be applied even if the proxy environment variables above are set.
HTTP Proxy Password | string | <none> | Proxy password for all HTTP(S) traffic for this scanner. When crawling via a proxy, this property must be applied even if the proxy environment variables above are set.
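
For example (the host name and port are placeholders), a crawl that executes Javascript behind a proxy would typically set both the environment-variable properties and the host/port properties:

    HTTP Proxy Environment Variable:   http://proxy.example.com:3128
    HTTPS Proxy Environment Variable:  http://proxy.example.com:3128
    HTTP Proxy Host:                   proxy.example.com
    HTTP Proxy Port:                   3128
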
Performance

Fetch Timeout | integer | 60 (1 minute) | The maximum time, in seconds, to wait for a connection, fetch, or document download.
Parse Timeout | integer | 300 (5 minutes) | The maximum time to wait for the signature and outlinks to be parsed from a web page's contents.
Default Crawl Delay | double | 0.0 | The default number of seconds to wait between successive requests to the same server. Note that this is overridden if a site's robots.txt file specifies a crawl delay (see the robots.txt example after this table).
Minimum Crawl Delay | integer | 0 | The minimum number of seconds to wait between successive requests to the same server. This value applies only if Threads Per Host is greater than 1 (i.e., host blocking is turned off).
Maximum Crawl Delay | integer | -1 | The maximum number of seconds to wait between successive requests to the same server. If the Crawl-Delay in robots.txt is greater than this value (in seconds), the crawler skips the site and generates an error report. Set to -1 to prevent this behavior.
Fetch Threads Total | integer | 10 | The total number of fetch threads to spread across URL queues for making requests.
Threads Per Host | integer | 10 | The maximum number of threads allowed to access a single host at one time. Setting this to a value greater than 1 causes the Crawl Delay value from robots.txt to be ignored; the Minimum Crawl Delay set here is used as the delay between successive requests to the same host instead.
Balance HTML Tags | boolean | true | Specifies whether the NekoHTML parser should attempt to balance the tags in the parsed document. Balancing the tags fixes many common mistakes by adding missing parent elements, automatically closing elements with optional end tags, and correcting unbalanced inline element tags. This feature is optional as a performance enhancement for applications that only care about the appearance of specific elements, attributes, and/or content regardless of a document's ill-formed structure. Turn it off only if the Web Crawler connector is encountering StackOverflowExceptions.
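
For reference, a robots.txt entry such as the following (illustrative only) overrides the Default Crawl Delay for that site; if the Crawl-delay value exceeded the Maximum Crawl Delay, the crawler would skip the site unless Maximum Crawl Delay is -1:

    User-agent: *
    Crawl-delay: 10
    Disallow: /private/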

Security

To crawl websites which require a username and password, configure the parameters on this tab.

Property | Type | Default | Description

Basic Authentication

Host | string | <none> | The host to authenticate to. If left blank, authentication applies to any requested host.
Port | integer | -1 | The port to authenticate to. If left at -1, authentication applies to any requested port.
Realm | string | <none> | The realm used for basic authentication. If left blank, authentication applies to any realm.
Username | string | <none> | The username of the principal the crawler should authenticate as.
Password | string | <none> | The password of the principal the crawler should authenticate as.

Digest Authentication

Host | string | <none> | The host to authenticate to. If left blank, authentication applies to any requested host.
Port | integer | -1 | The port to authenticate to. If left at -1, authentication applies to any requested port.
Realm | string | <none> | The realm used for digest authentication. If left blank, authentication applies to any realm.
Username | string | <none> | The username of the principal the crawler should authenticate as.
Password | string | <none> | The password of the principal the crawler should authenticate as.

NTLM Authentication

Host | string | <none> | The host to authenticate to. If left blank, authentication applies to any requested host.
Port | integer | -1 | The port to authenticate to. If left at -1, authentication applies to any requested port.
Realm | string | <none> | The realm used for NTLM authentication. If left blank, authentication applies to any realm.
Username | string | <none> | The username of the principal the crawler should authenticate as.
Password | string | <none> | The password of the principal the crawler should authenticate as.
Domain | | | The Windows domain for NTLM authentication.

Form Based Authentication

Need To Login Regex | string | <none> | A regular expression that helps determine which pages of a site require login first.
Method | string | <none> | The HTTP request method used to submit the form contents (PUT or POST). If using Javascript authentication, this field should instead contain the CSS selector for the form element.
Authentication URI | string | <none> | The URI that the login form should be submitted to.
Parameters | Map of strings | <none> | A map of the parameters (username, password, etc.) needed to authenticate. If using Javascript authentication, the keys should be CSS selectors for the input elements (see the example after this section).
Timeout | integer | 0 | The time in seconds to wait between form authentication requests. Set to 0 or less to make the crawler re-authenticate for every URI that requires form authentication.
Execute Javascript | boolean | false | Whether to execute Javascript when submitting form authentication credentials.
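
As a hedged illustration (the regular expression, selectors, URI, and credentials below are placeholders), a Javascript-based form login might be configured as follows; a plain HTTP form login would instead use POST as the Method and the form's field names as the Parameters keys:

    Need To Login Regex:  .*/login.*
    Method:               form#login-form
    Authentication URI:   https://www.example.com/login
    Parameters:           input#username -> jsmith
                          input#password -> secret
    Execute Javascript:   true
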
SAML Authentication

Login Page Regex | string | <none> | The authentication steps are followed only when a page redirects to a URL that matches this regular expression.
Parameters | Map of strings | <none> | A map of the parameters (username, password, etc.) needed to authenticate.

Custom Authentication

Default Cookie | Map of strings | <none> | Any default cookie name/value pairs to initialize before fetching resources.
Authentication JSON | string | <none> | A MultiStepAuthenticator can be defined in a JSON file and assigned to a Web Crawler connector here.
Custom Authenticator Class | string | <none> | The com.attivio.webcrawler.sdk.Authentication class can be extended to perform a fully customized authentication procedure. Specify the fully qualified Java class name of the custom Authenticator you want the crawler to use.

Index Options

Sometimes there are web pages that you want to traverse (as a bridge to other pages), but do not want to index. The following configuration parameters let you exclude crawled pages from the index.

Property | Type | Default | Description
Index Advanced Fields | boolean | true | (Available in Web Crawler 1.4.0 and later releases.) Check if you want to track duplicates, link score, and anchor text in the index. Keeping these fields up to date has a performance cost.
Use Realtime Fields | boolean | false | (Available in Web Crawler 1.4.0 and later releases.) Check if you want to treat the duplicates, linkfactor, linkcount, and anchortext fields as real-time fields in the index and send partial updates to complete crawls. This may require some schema configuration changes if you are upgrading a project from an earlier Web Crawler version, but it confers an ingestion performance improvement.
Index Canonical Urls | boolean | true | (Available in Web Crawler 1.3.0 and later releases.) Check if you want to use the rel=canonical link address as the uri field of the ingest document when it is available.
Drop "noindex" Pages | boolean | true | (Available in Web Crawler 1.4.0 and later releases.) If true, HTML pages containing the "noindex" meta directive are dropped before reaching the index.
Delete "noindex" Pages | boolean | true | Check if you want previously indexed HTML pages that now contain the "noindex" meta directive to be deleted from the index.
Index Mime Types | List of strings | text/xml, text/html, text/plain, application/pdf, application/xhtml+xml, text/aspdotnet, text/x-jsp | If the MIME type of a document is in this list, the document is indexed. If left blank, documents are indexed regardless of MIME type.
Index Exclude Patterns | List of strings | <none> | If any of these regular expressions match the URI, it is not indexed.
Index Exclude Extensions | List of strings | <none> | The list of extensions that, when matched against the URI, cause the document not to be stored/indexed. These are typically file types that you want crawled because they may contain outgoing links, but that you do not want indexed.

Deduplication

Drop Duplicates | boolean | true | (Available in Web Crawler 1.4.0 and later releases.) If true, pages with identical content are dropped before reaching the index. After changing this value, reset the scanner.
Prefer First Duplicate | boolean | false | (Available in Web Crawler 1.4.0 and later releases.) If true, the representative URL chosen from a set of duplicates is the one that was fetched first.
Duplicate Preference Rules | List of strings | <none> | (Available in Web Crawler 1.4.0 and later releases.) Regular expressions that potential representative URLs are matched against when choosing among duplicates. The duplicate URL that matches the most expressions is chosen over the others (see the example after this table).
Mode | string | WEAK | (Available in Web Crawler 1.4.0 and later releases.) If "STRONG", pages are considered duplicates only if their full raw content is identical; otherwise, extracted text is compared. "STRONG" mode makes the deduplication process err on the side of false negatives, whereas "WEAK" indicates a preference toward false positives. False positives generally occur much less often than false negatives, but they are a larger problem and harder to resolve when they do occur.
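
For example (the patterns are illustrative), the following Duplicate Preference Rules favor the HTTPS, www-prefixed variant of a page when choosing a representative from a duplicate set, because that variant matches more of the expressions than its duplicates do:

    ^https://.*
    ^https://www\..*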

Field Mappings

See the Field Mappings Tab description.

All HTTP response headers are added to the Ingest Document by default and can be mapped to schema fields in the connector configuration to be made searchable.

To ingest a header such as Last-Modified, map the "Last-Modified" field to a field such as "last_modified_t" on the Field Mappings tab, then add the field name and any additional formats, such as "EEE, dd MMM yyyy HH:mm:ss zzz", to the dateParser component on the Palette page of the Admin UI. A sketch of how that format string behaves appears below.
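
The format string in that example follows standard Java date/time pattern syntax. The following sketch (it parses a sample header value directly and is not the dateParser component itself) shows what "EEE, dd MMM yyyy HH:mm:ss zzz" accepts:

    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.Locale;

    public class LastModifiedParseSketch {
        public static void main(String[] args) {
            // The same pattern you would add to the dateParser component.
            DateTimeFormatter httpDate =
                    DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.ENGLISH);

            // A typical Last-Modified response header value.
            ZonedDateTime lastModified =
                    ZonedDateTime.parse("Wed, 21 Oct 2015 07:28:00 GMT", httpDate);

            System.out.println(lastModified); // 2015-10-21T07:28Z[GMT]
        }
    }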


 
