Overview
You can load RSS content using the RSS Connector included in AIE's standard installation. This document describes how to configure the RSS Connector.
Creating an RSS Connector
To create an RSS connector, do the following:
- Download and install the Site Harvester module.
- Ensure that the AIE Agent is running, and start AIE as described in Starting and Stopping AIE.
- Open the Connectors page in the AIE Administrator by entering the following URL in a browser:
http://localhost:17000/admin/connectors
The AIE Administrator opens on the Connectors page.
- Click New. The New Connector screen appears.
- Select RSS Feed Connector, and click OK. The New Connector screen appears on the Scanner tab.
- Enter a Connector Name.
- Provide one or more Seed RSS URIs.
- Click the Crawling Options tab and expand the Proxy and Authentication areas. Complete the fields as follows:
  - Authentication Needed for RSS Feed - Set to true if the feed requires basic authentication.
  - Authentication Needed for RSS Feed-Entry - Set to true if the page links in the RSS feed require basic authentication.
- Click the Other tab. The Other options appear.
- If incremental checking is required (set to true), set the Crawler Store value to crawlerStore.
- Click Save to save the connector.
RSS Connector Parameters
These are the fields offered in the RSS Connector. (This is one of the Connector editors available in the AIE Administrator.)
Scanner Tab
Parameter | Description | Default Value |
---|---|---|
Connector Name | Enter a name for the connector. (required) | None |
Node Set | In a multi-node system, this is the ingestion node where this connector is available. | local |
Seed RSS URIs | Web-page addresses where the crawl will begin. (required) | None |
Document ID Prefix | Prefix to add to each document ID. | None |
Ingest Workflow | The ingestion workflow that receives the AttivioDocuments generated by the scanner. Always set Ingest Workflow to fileIngest. | fileIngest |
Notes Tab
See the Notes Tab description here.
Scheduler Tab
See the Schedule Tab description here.
Crawling Options Tab
The controls on the Crawling Options tab let you specify which URIs are followed and which are ignored, so that you can create more productive crawls.
Parameter | Description | Default Value |
---|---|---|
HTTP Proxy Host | Proxy host for all HTTP(S) traffic for this scanner. | None |
HTTP Proxy Port | Proxy port for all HTTP(S) traffic for this scanner. | -1 |
Authentication Methods | Select an authentication-method bean from the drop-down list if one is configured. (See Authentication, below, for more on authentication-method beans.) | None |
Authentication Needed for RSS Feed | Set to true if the feed requires basic authentication. | False |
Authentication Needed for RSS Feed-Entry | Set to true if the page links in the RSS feed require basic authentication. | False |
Other Tab
Parameter | Description | Default Value |
---|---|---|
Crawler Store | When Incremental checking is required (set to true on the Scanner tab), you must set the Crawler Store value to crawlerStore. | None |
Crawl Name | Enter a name for the crawl. | None |
Authentication Methods | The Spring bean that lists the AuthenticationMethod beans to use for authentication. | None |
Field Mappings Tab
See the Field Mappings Tab description here.
Advanced Tab
See the Advanced Tab description here.
Authentication
The RSS Connector uses the Site Harvester's basic authentication methods. These work with some web sites but not with all web sites. If you experience difficulty, please contact the AIE Support department.
Some web sites require authentication before the RSS Connector can crawl their pages. To configure the RSS Connector to authenticate with such sites, first create the appropriate Spring beans in the project's configuration, and then select those beans in the connector editor for the RSS Connector. The supported authentication bean is BasicOrDigestMethod. Multiple beans are grouped together in an AuthenticationList. Once you have created the beans, start the AIE project and select the authentication list bean with the Authentication Methods option on the connector editor's Other tab.
Basic Authentication
The following example shows how to configure the RSS Connector to use Basic Authentication via the AIE BasicOrDigestMethod bean.
You must create this bean in a new XML file and then redeploy your project. When using multiple authentication methods, you can specify realm and/or host and/or port to let Site Harvester know when to use which credential. For example, for a website that is on realm "realm3", create the following bean:
<beans xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns="http://www.springframework.org/schema/beans"
       xmlns:util="http://www.springframework.org/schema/util"
       xmlns:sec="http://www.springframework.org/schema/security"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util.xsd
         http://www.springframework.org/schema/security http://www.springframework.org/schema/security/spring-security-3.1.xsd">
  <bean id="myBasicAuthList" class="com.attivio.siteharvester.connector.AuthenticationList">
    <meta key="classloader" value="module-siteharvester" />
    <property name="methods">
      <list>
        <bean name="realm3login" class="org.archive.modules.credential.BasicOrDigestMethod">
          <meta key="classloader" value="module-siteharvester" />
          <property name="realm" value="realm3" />
          <!-- optional
          <property name="host" value="machine1" />
          <property name="port" value="8000" />
          -->
          <property name="login" value="jbrown" />
          <property name="password" value="*{0RMSwa1iKXI=}" /> <!-- See the note below about password encryption. -->
        </bean>
        <!-- other realm beans can go here -->
      </list>
    </property>
  </bean>
</beans>
Edit the connector in the AIE Administrator. Turn on authentication on the Crawling Options tab, and then point to the new authentication list on the Other tab.
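Before entering credentials in the connector, it can help to confirm outside AIE that the feed accepts them. The following Python sketch is not part of AIE. It uses only the standard library to register a login for a realm, much as the BasicOrDigestMethod bean does, and to report the HTTP status that a feed URL returns. The helper names, the base URI, and the plain-text password "secret" are hypothetical values for illustration; substitute your own.

```python
import urllib.error
import urllib.request


def build_opener_for_realm(realm, base_uri, login, password):
    """Register one login for a realm and return (password manager, opener)."""
    mgr = urllib.request.HTTPPasswordMgr()
    # Credentials apply to base_uri and everything beneath it, for this realm only.
    mgr.add_password(realm, base_uri, login, password)
    handler = urllib.request.HTTPBasicAuthHandler(mgr)
    return mgr, urllib.request.build_opener(handler)


def fetch_status(opener, url, timeout=10):
    """Return the HTTP status code for url, or None if the host is unreachable."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # A 401 here means the server rejected or did not receive the credentials.
        return err.code
    except urllib.error.URLError:
        return None


# Hypothetical values: realm and login mirror the bean example, the password is a placeholder.
mgr, opener = build_opener_for_realm(
    "realm3", "http://localhost:17030/", "jbrown", "secret"
)

# Example call (requires the server to be running):
# print(fetch_status(opener, "http://localhost:17030/basic3/"))
```

A 200 status suggests the login and realm are correct, while a 401 suggests a wrong password or a realm name that does not match the one the server announces.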
Running the RSS Connector
Erasing the Index
While testing a new connector, you will frequently need to empty the index and try again. Four methods of deleting the index are described here.
To run the RSS Connector, do the following:
- Open the Connectors page in the AIE Administrator by entering the following URL in a browser:
http://localhost:17000/admin/connectors
The AIE Administrator opens on the Connectors page.
- Right-click the desired RSS Connector and select Start, then click the Start button when prompted again. The RSS crawl begins.
Content Encoding
This connector extracts content in binary form, and thus has no encoding limitations. However, the ingest workflow (typically fileIngest), which generates field values from the binary content, may not support all content encoding schemes.
Debugging
To debug login problems, try using a proxy, or look for errors in the log similar to the following:
2013-08-05 11:35:25,073 DEBUG AttivioFetchHttp - Crawl of http://localhost:17030/basic3/ returned status code 401
2013-08-05 11:35:25,073 WARN AttivioFetchHttp - ATTIVIO-PLATFORM-94 : BASIC or DIGEST credentials required but not found for URI http://localhost:17030/basic3/ on realm realm3
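When a log is long, a small script can list every URI and realm for which credentials were missing. The following Python sketch is a convenience outside AIE, not an AIE tool. The pattern is derived only from the sample warning above, and it assumes that neither the URI nor the realm name contains whitespace; the function name is hypothetical.

```python
import re

# Pattern based on the sample ATTIVIO-PLATFORM-94 warning; assumes the URI and
# the realm name contain no whitespace.
CREDENTIAL_WARNING = re.compile(
    r"ATTIVIO-PLATFORM-94 : BASIC or DIGEST credentials required "
    r"but not found for URI (?P<uri>\S+) on realm (?P<realm>\S+)"
)


def find_missing_credentials(log_text):
    """Return a list of (uri, realm) pairs, one per matching warning."""
    return [
        (match.group("uri"), match.group("realm"))
        for match in CREDENTIAL_WARNING.finditer(log_text)
    ]


sample_log = (
    "2013-08-05 11:35:25,073 DEBUG AttivioFetchHttp - Crawl of "
    "http://localhost:17030/basic3/ returned status code 401\n"
    "2013-08-05 11:35:25,073 WARN AttivioFetchHttp - ATTIVIO-PLATFORM-94 : "
    "BASIC or DIGEST credentials required but not found for URI "
    "http://localhost:17030/basic3/ on realm realm3\n"
)

print(find_missing_credentials(sample_log))
```

Each realm reported this way needs a matching BasicOrDigestMethod bean in the authentication list that the connector uses.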
Before you use a password in a connector definition, you can encrypt it as described in the Encrypting Property Values guide.
RSS XML Mappings
The following table shows how the child elements of an RSS feed's 'item' element are mapped to AttivioDocument fields. The mapping assumes the RSS 2.0 specification.
XML Element | AttivioDocument field | Notes |
---|---|---|
title | title | |
link | .id | Attivio removes line breaks before storing in the document |
author | author | |
pubDate | date | |
description | text | |
category | cat | If there are multiple categories, they are joined into a single string delimited by the ',' character |
guid | (none) | |
comments | (none) | |
enclosure | (none) | |
source | (none) | |
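The mapping in the table can be illustrated with a short Python sketch. This is not the connector's implementation; it only mimics the table using the standard library so that you can predict which fields a given feed item would produce. The function name, the plain dict used in place of an AttivioDocument, and the sample feed are assumptions, and the pubDate value is copied unchanged because the table does not say how the connector converts it.

```python
import xml.etree.ElementTree as ET

# RSS element name -> document field name, for the one-to-one rows of the table.
SIMPLE_FIELDS = {
    "title": "title",
    "author": "author",
    "pubDate": "date",
    "description": "text",
}


def map_item(item):
    """Map one RSS 2.0 <item> element to a dict of document fields."""
    doc = {}
    link = item.findtext("link")
    if link is not None:
        # The table notes that line breaks are removed from the link.
        doc[".id"] = link.replace("\r", "").replace("\n", "")
    for element_name, field_name in SIMPLE_FIELDS.items():
        value = item.findtext(element_name)
        if value is not None:
            doc[field_name] = value
    categories = [c.text or "" for c in item.findall("category")]
    if categories:
        # Multiple categories are joined into one comma-delimited string.
        doc["cat"] = ",".join(categories)
    # guid, comments, enclosure, and source are not mapped.
    return doc


# Hypothetical sample feed for illustration only.
SAMPLE = """<rss version="2.0"><channel><item>
<title>Story one</title>
<link>http://example.com/
story-1</link>
<pubDate>Mon, 05 Aug 2013 11:35:25 GMT</pubDate>
<description>Body text</description>
<category>news</category>
<category>tech</category>
<guid>abc-123</guid>
</item></channel></rss>"""

docs = [map_item(item) for item in ET.fromstring(SAMPLE).iter("item")]
print(docs)
```

In this sample the two category elements end up in a single cat value, and the guid element is dropped, as the table describes.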