This example is a simple Gatherer that uses the default customizations. The only work that the user does to configure this Gatherer is to specify the list of URLs from which to gather (see Section 4).
To run this example, type:
% cd $HARVEST_HOME/gatherers/example-1
% ./RunGatherer
To view the configuration file for this Gatherer, look at example-1.cf. The first few lines are variables that specify some local information about the Gatherer (see Section 4.5). For example, each content summary will contain the name of the Gatherer (Gatherer-Name) that generated it. The port number (Gatherer-Port) that will be used to export the indexing information is also specified, as is the directory that contains the Gatherer (Top-Directory). Notice that there is one RootNode URL and one LeafNode URL.
After the Gatherer has finished, it will start up the Gatherer daemon, which will export the content summaries. To view the content summaries, type:
% gather localhost 9111 | more
The following SOIF object should look similar to those that this Gatherer generates.
@FILE { http://harvest.cs.colorado.edu/~schwartz/IRTF.html
Time-to-Live{7}: 9676800
Last-Modification-Time{1}: 0
Refresh-Rate{7}: 2419200
Gatherer-Name{25}: Example Gatherer Number 1
Gatherer-Host{22}: powell.cs.colorado.edu
Gatherer-Version{3}: 0.4
Update-Time{9}: 781478043
Type{4}: HTML
File-Size{4}: 2099
MD5{32}: c2fa35fd44a47634f39086652e879170
Partial-Text{151}: research problems
Mic Bowman
Peter Danzig
Udi Manber
Michael Schwartz
Darren Hardy
talk
talk
Harvest
talk
Advanced
Research Projects Agency
URL-References{625}:
ftp://ftp.cs.colorado.edu/pub/techreports/schwartz/RD.ResearchProblems.Jour.ps.Z
ftp://grand.central.org/afs/transarc.com/public/mic/html/Bio.html
http://excalibur.usc.edu/people/danzig.html
http://glimpse.cs.arizona.edu:1994/udi.html
http://harvest.cs.colorado.edu/~schwartz/Home.html
http://harvest.cs.colorado.edu/~hardy/Home.html
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPCC94.Slides.ps.Z
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPC94.Slides.ps.Z
http://harvest.cs.colorado.edu/harvest/Home.html
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/IETF.Jul94.Slides.ps.Z
http://ftp.arpa.mil/ResearchAreas/NETS/Internet.html
Title{84}: IRTF Research Group on Resource Discovery
IRTF Research Group on Resource Discovery
Keywords{121}: advanced agency bowman danzig darren hardy harvest manber mic
michael peter problems projects research schwartz talk udi
}
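Each attribute in the object above has the form Name{N}: value, where N is the size of the value in bytes. Because the size is given explicitly, a value such as Partial-Text or URL-References can span several lines without any escaping. The following Python sketch reads one such object. It is an illustration only, not part of the Harvest distribution: the function name parse_soif and the sample data are hypothetical, and the sketch assumes a single space or tab after the colon and one attribute per name.

```python
import re

# Attribute header: name, byte count in braces, a colon, then one space or tab.
ATTR_HEADER = re.compile(rb"([^\s{}]+)\{(\d+)\}:[ \t]")


def parse_soif(data: bytes):
    """Parse one SOIF object and return (template_type, url, attributes)."""
    header = re.match(rb"\s*@(\S+)\s*\{\s*(\S+)[ \t]*\n", data)
    if header is None:
        raise ValueError("missing '@TYPE { URL' header line")
    template = header.group(1).decode()
    url = header.group(2).decode()
    pos = header.end()
    attrs = {}
    while True:
        # Skip the newline(s) that separate one attribute from the next.
        while data[pos:pos + 1] in (b"\n", b"\r"):
            pos += 1
        if data[pos:pos + 1] == b"}":
            break  # closing brace ends the object
        match = ATTR_HEADER.match(data, pos)
        if match is None:
            raise ValueError(f"bad attribute header at byte {pos}")
        size = int(match.group(2))
        start = match.end()
        value = data[start:start + size]
        if len(value) != size:
            raise ValueError("object ends before the declared value size")
        # Read exactly `size` bytes, so embedded newlines stay in the value.
        attrs[match.group(1).decode()] = value.decode("latin-1")
        pos = start + size
    return template, url, attrs


# Hypothetical sample, shortened from the object shown above.
sample = (
    b"@FILE { http://harvest.cs.colorado.edu/~schwartz/IRTF.html\n"
    b"Time-to-Live{7}: 9676800\n"
    b"Gatherer-Name{25}: Example Gatherer Number 1\n"
    b"Type{4}: HTML\n"
    b"Keywords{9}: mic\nhardy\n"
    b"}\n"
)

template, url, attrs = parse_soif(sample)
print(template, url)
print(attrs["Keywords"].split())
```

The parser trusts the byte count rather than scanning for a line ending, which is why the two-line Keywords value in the sample comes back intact.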
Notice that although the Gatherer configuration file lists only two URLs (one in the RootNode section and one in the LeafNode section), there are more than two content summaries in the Gatherer's database. The Gatherer expanded the RootNode URL into dozens of LeafNode URLs by recursively extracting the links from the HTML file at the RootNode http://harvest.cs.colorado.edu/. Then, for each LeafNode, it generated a content summary like the example above for http://harvest.cs.colorado.edu/~schwartz/IRTF.html.
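The expansion of a RootNode into LeafNodes can be pictured as a breadth-first walk over hyperlinks. The Python sketch below shows the general idea only. It is not the Gatherer's own enumeration code, which applies limits and filters that are not shown here. The names expand_root, extract_links, and max_depth are hypothetical, and the pages are served from an in-memory table so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for the links in one page."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [urldefrag(urljoin(base_url, href)).url for href in collector.links]


def expand_root(root_url, fetch, max_depth=1):
    """Breadth-first expansion of a root URL.

    fetch(url) must return the page's HTML text, or None if the page
    is unavailable or is not HTML.
    """
    seen = {root_url}
    order = [root_url]
    queue = deque([(root_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        html = fetch(url)
        if html is None:
            continue
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical pages standing in for a real web server.
PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="/b.html#top">B</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="c.html">C</a>',
}

leaves = expand_root("http://example.org/", PAGES.get, max_depth=2)
for leaf in leaves:
    print(leaf)
```

The seen set keeps a page that links back to the root from being visited twice, and the depth limit bounds how far the walk strays from the starting URL.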
The HTML summarizer extracts structured information about the Author and Title of the file. It also extracts any URL links into the URL-References attribute, and the text of any anchor tags into the Partial-Text attribute. Other information about the HTML file, such as its MD5 [21] and its size in bytes (File-Size), is also added to the content summary.
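As a rough illustration of the attributes described above, the following Python sketch builds a small summary for one HTML document. It is not the HTML summarizer shipped with Harvest. The names SummaryParser and summarize_html and the sample document are hypothetical, the Author attribute is omitted, and the sketch assumes that anchor text and href values are taken in document order.

```python
import hashlib
from html.parser import HTMLParser


class SummaryParser(HTMLParser):
    """Collect the title, the anchor text, and the href values of a page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.anchor_texts = []
        self.hrefs = []
        self._in_title = False
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._in_anchor = True
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title_parts.append(text)
        elif self._in_anchor:
            self.anchor_texts.append(text)


def summarize_html(raw: bytes):
    """Return a dict of summary attributes for one HTML document."""
    parser = SummaryParser()
    parser.feed(raw.decode("latin-1"))
    parser.close()
    return {
        "Type": "HTML",
        "File-Size": str(len(raw)),              # size in bytes
        "MD5": hashlib.md5(raw).hexdigest(),     # 32 hexadecimal digits
        "Title": " ".join(parser.title_parts),
        "Partial-Text": "\n".join(parser.anchor_texts),
        "URL-References": "\n".join(parser.hrefs),
    }


# Hypothetical document used only to exercise the sketch.
doc = (
    b"<html><head><title>Test Page</title></head><body>"
    b'<a href="http://example.org/x">talk</a> plain text'
    b"</body></html>"
)

summary = summarize_html(doc)
for name, value in summary.items():
    print(f"{name}{{{len(value.encode('latin-1'))}}}: {value}")
```

The final loop prints each attribute with its byte count in braces, mirroring the layout of the SOIF object shown earlier. The MD5 and File-Size values are computed from the raw bytes, so they do not depend on how the HTML is parsed.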