
 

Overview

You can load RSS content using the RSS Connector included in AIE's standard installation.  This document describes how to configure the RSS Connector.


 

Creating an RSS Connector

To create an RSS connector, do the following:

  1. Download and install the Site Harvester module.
  2. Ensure that the AIE Agent is running and start AIE as described in Starting and Stopping AIE.

  3. Open the Connectors page in the AIE Administrator by entering the following URL in a browser:

    http://localhost:17000/admin/connectors

    The AIE Administrator appears on the Connectors page.

  4. Click New. The New Connector screen appears.
  5. Select RSS Feed Connector, and click OK. The New Connector screen appears on the Scanner tab.
  6. Enter a Connector Name.
  7. Provide one or more Seed RSS URIs.
  8. Click the Crawling Options tab and expand the Proxy and Authentication areas.
  9. Complete the fields as follows:

    • Authentication Needed for RSS Feed - Set to true if the feed requires basic authentication.
    • Authentication Needed for RSS Feed-Entry - Set to true if the page links in the RSS feed require basic authentication.
  10. Click the Other tab. The Other options appear.
  11. If Incremental checking is set to true, you must set the Crawler Store value to crawlerStore.
  12. Click Save to save the connector.

RSS Connector Parameters

These are the fields offered in the RSS Connector. (This is one of the Connector editors available in the AIE Administrator.)

Scanner Tab

 

| Parameter | Description | Default Value |
|---|---|---|
| Connector Name | Enter a name for the connector. (required) | None |
| Node Set | In a multi-node system, this is the ingestion node where this connector is available. | local |
| Seed RSS URIs | Web-page addresses where the crawl will begin. (required) | None |
| Document ID Prefix | Prefix to add to each document ID. | None |
| Ingest Workflow | The ingestion workflow that receives the AttivioDocuments generated by the scanner. Always set Ingest Workflow to fileIngest. | fileIngest |

Notes Tab

See the Notes Tab description here.

Scheduler Tab

See the Schedule Tab description here.

Crawling Options Tab

The Crawling Options tab controls let you specify which URIs are followed and which are ignored, so that you can create more productive crawls.

| Parameter | Description | Default Value |
|---|---|---|
| HTTP Proxy Host | Proxy host for all HTTP(S) traffic for this scanner. | None |
| HTTP Proxy Port | Proxy port for all HTTP(S) traffic for this scanner. | -1 |
| Authentication Methods | Select an authentication-method bean from the drop-down list if one is configured. (See Authentication, below, for more on authentication-method beans.) | None |
| Authentication Needed for RSS Feed | Set to true if the feed requires basic authentication. | False |
| Authentication Needed for RSS Feed-Entry | Set to true if the page links in the RSS feed require basic authentication. | False |

Other Tab

| Parameter | Description | Default Value |
|---|---|---|
| Crawler Store | If Incremental checking is set to true on the Scanner tab, you must set the Crawler Store value to crawlerStore. | None |
| Crawl Name | Enter a name for the crawl. | None |
| Authentication Methods | List of AuthenticationMethod Spring beans to use for authentication. | None |

Field Mappings Tab

See the Field Mappings Tab description here.

Advanced Tab

See the Advanced Tab description here.

Authentication

The RSS Connector uses the Site Harvester's basic authentication methods. These work with some web sites but not with all web sites. If you experience difficulty, please contact the AIE Support department.

Some web sites require authentication before the RSS Connector can crawl their pages. To configure the RSS Connector to authenticate with web sites, first create the appropriate Spring beans in the project's configuration, and then select those beans in the connector editor for the RSS Connector. The supported authentication bean is BasicOrDigestMethod. Multiple beans are grouped together in an AuthenticationList. After you create the bean, start the AIE project and select the authentication list bean with the Authentication Methods option on the connector editor's Other tab.

Basic Authentication

The following example shows how to configure the RSS Connector to use Basic Authentication via the AIE BasicOrDigestMethod bean.

You must create this bean in a new XML file and then redeploy your project. When using multiple authentication methods, you can specify realm and/or host and/or port to let Site Harvester know when to use which credential. For example, for a website that is on realm "realm3", create the following bean:

<project-dir>\conf\bean\myBasicAuthList.xml
<beans xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns="http://www.springframework.org/schema/beans" 
xmlns:util="http://www.springframework.org/schema/util" 
xmlns:sec="http://www.springframework.org/schema/security" 
xsi:schemaLocation=" http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util.xsd http://www.springframework.org/schema/security http://www.springframework.org/schema/security/spring-security-3.1.xsd">
  <bean id="myBasicAuthList" class="com.attivio.siteharvester.connector.AuthenticationList">
    <meta key="classloader" value="module-siteharvester" />
    <property name="methods">
      <list>
        <bean name="realm3login" class="org.archive.modules.credential.BasicOrDigestMethod">
          <meta key="classloader" value="module-siteharvester" />
          <property name="realm" value="realm3" />
          <!-- optional
          <property name="host" value="machine1" />
          <property name="port" value="8000" />
          -->
          <property name="login" value="jbrown" />
          <property name="password" value="*{0RMSwa1iKXI=}" /> <!-- See the note below about password encryption. -->
        </bean>
        <!-- other realm beans can go here -->
      </list>
    </property>
  </bean>
</beans>

Edit the connector in the AIE Administrator. Turn on authentication on the Crawling Options tab, and then point to the new authentication list on the Other tab.
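
As a sketch of the multiple-method case mentioned above, the methods list could hold one BasicOrDigestMethod bean per realm. The example is illustrative only: the file name, bean names, second realm, host, port, and logins are hypothetical, and the password values are placeholders for your own (preferably encrypted) values. It uses only the classes and properties that appear in the example above, and it assumes the realm, host, and port properties narrow where each credential applies, as that description suggests.

```xml
<!-- Hypothetical file: <project-dir>\conf\bean\myMultiRealmAuthList.xml -->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">
  <bean id="myMultiRealmAuthList" class="com.attivio.siteharvester.connector.AuthenticationList">
    <meta key="classloader" value="module-siteharvester" />
    <property name="methods">
      <list>
        <!-- Credential identified by realm only -->
        <bean name="realm3login" class="org.archive.modules.credential.BasicOrDigestMethod">
          <meta key="classloader" value="module-siteharvester" />
          <property name="realm" value="realm3" />
          <property name="login" value="jbrown" />
          <property name="password" value="REPLACE_WITH_PASSWORD" />
        </bean>
        <!-- Second credential, further narrowed by host and port (all values hypothetical) -->
        <bean name="realm4login" class="org.archive.modules.credential.BasicOrDigestMethod">
          <meta key="classloader" value="module-siteharvester" />
          <property name="realm" value="realm4" />
          <property name="host" value="machine1" />
          <property name="port" value="8000" />
          <property name="login" value="asmith" />
          <property name="password" value="REPLACE_WITH_PASSWORD" />
        </bean>
      </list>
    </property>
  </bean>
</beans>
```

A list like this would be selected in the connector editor in the same way as the single-realm list, by its bean id (here myMultiRealmAuthList).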

 

Running the RSS Connector

Erasing the Index

While testing a new connector, you will frequently need to empty the index and try again. Four methods of deleting the index are described here.

To run the RSS Connector, do the following:

  1. Open the Connectors page in the AIE Administrator by entering the following URL in a browser:

    http://localhost:17000/admin/connectors

    The AIE Administrator appears on the Connectors page.

  2. Right-click the desired RSS Connector and select Start, then click the Start button when prompted to confirm. The RSS crawl begins.

 

Content Encoding

  This connector extracts content in binary form, and thus has no encoding limitations. However, the ingest workflow (typically fileIngest), which generates field values from the binary content, may not support all content encoding schemes.

Debugging

 

To debug login problems, try using a proxy, or look for errors in the log similar to the following:
2013-08-05 11:35:25,073 DEBUG AttivioFetchHttp - Crawl of http://localhost:17030/basic3/ returned status code 401
2013-08-05 11:35:25,073 WARN  AttivioFetchHttp - ATTIVIO-PLATFORM-94 : BASIC or DIGEST credentials required but not found for URI http://localhost:17030/basic3/ on realm realm3 

 

 

 

Password encryption: Before you use a user password in a connector definition, you can encrypt it as described in the Encrypting Property Values guide.

RSS XML Mappings

The following table shows how the child elements of an 'item' element in an RSS feed are mapped to AttivioDocument fields (assuming the RSS 2.0 specification).

| XML Element | AttivioDocument field | Notes |
|---|---|---|
| title | title | |
| link | .id | Attivio removes line breaks before storing in the document |
| author | author | |
| pubDate | date | |
| description | text | |
| category | cat | If multiple categories, they are joined into a single string delimited with the ',' character |
| guid | (none) | |
| comments | (none) | |
| enclosure | (none) | |
| source | (none) | |
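
To make the mapping concrete, here is a small hand-written RSS 2.0 feed with a single item. Each element is annotated with the AttivioDocument field it would populate according to the table above. The feed content, URLs, and names are invented for illustration and do not come from a real feed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed used only for illustration</description>
    <item>
      <!-- mapped to the title field -->
      <title>First post</title>
      <!-- mapped to the .id field, with line breaks removed -->
      <link>http://www.example.com/posts/1</link>
      <!-- mapped to the author field -->
      <author>jbrown@example.com (J. Brown)</author>
      <!-- mapped to the date field -->
      <pubDate>Mon, 05 Aug 2013 11:35:25 GMT</pubDate>
      <!-- mapped to the text field -->
      <description>Summary of the first post.</description>
      <!-- both categories are joined into one cat value: news,search -->
      <category>news</category>
      <category>search</category>
      <!-- guid, comments, enclosure, and source are not mapped -->
      <guid>http://www.example.com/posts/1</guid>
      <comments>http://www.example.com/posts/1/comments</comments>
    </item>
  </channel>
</rss>
```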