Sunday, November 26, 2006

Trying Out Nutch Segments

Segments

The crawl created three segments in timestamped subdirectories in the segments directory, one for each generate/fetch/update cycle. The segread tool gives a useful summary of all of the segments:

bin/nutch segread -list -dir crawl-tinysite/segments/

giving the following tabular output (slightly reformatted to fit this page):

PARSED? STARTED           FINISHED          COUNT DIR NAME
true    20051025-12:13:35 20051025-12:13:35 1     crawl-tinysite/segments/20051025121334
true    20051025-12:13:37 20051025-12:13:37 1     crawl-tinysite/segments/20051025121337
true    20051025-12:13:39 20051025-12:13:39 2     crawl-tinysite/segments/20051025121339
TOTAL: 4 entries in 3 segments.


The PARSED? column is always true when using the crawl tool. It becomes useful when you run fetchers with parsing turned off, so that parsing can be run later as a separate process. The STARTED and FINISHED columns indicate when fetching started and finished; this information is invaluable for bigger crawls, when you are tracking down why a crawl is taking so long. The COUNT column shows the number of fetched pages in the segment. The last segment, for example, has two entries, corresponding to pages C and C-duplicate.
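Because segread emits plain whitespace-separated columns, standard text tools can summarize it further. Here is a minimal sketch that recomputes the TOTAL entry count from the tabular output above; the sample data is inlined into a variable so the snippet is self-contained, and the awk filter matches data rows by their leading "true" and sums the fourth (COUNT) column:

```shell
# Sample segread -list output, mirroring the table above.
segread_output='PARSED? STARTED           FINISHED          COUNT DIR NAME
true 20051025-12:13:35 20051025-12:13:35 1 crawl-tinysite/segments/20051025121334
true 20051025-12:13:37 20051025-12:13:37 1 crawl-tinysite/segments/20051025121337
true 20051025-12:13:39 20051025-12:13:39 2 crawl-tinysite/segments/20051025121339
TOTAL: 4 entries in 3 segments.'

# Skip the header and TOTAL lines (their first field is not "true"),
# then sum the COUNT column.
total=$(printf '%s\n' "$segread_output" | awk '$1 == "true" { sum += $4 } END { print sum }')
echo "$total"
```

In practice you would pipe `bin/nutch segread -list -dir crawl-tinysite/segments/` directly into the awk command instead of a canned variable.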

Sometimes it is necessary to find out in more detail what is in a particular segment. This is done using the -dump option for segread. Here we dump the first segment (again, slightly reformatted to fit this page):

s=`ls -d crawl-tinysite/segments/* | head -1`
bin/nutch segread -dump $s


Recno:: 0
FetcherOutput::
FetchListEntry: version: 2
fetch: true
page: Version: 4
URL: http://keaton/tinysite/A.html
ID: 6cf980375ed1312a0ef1d77fd1760a3e
Next fetch: Tue Nov 01 11:13:34 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0

anchors: 1
anchor: A
Fetch Result:
MD5Hash: fb8b9f0792e449cda72a9670b4ce833a
ProtocolStatus: success(1), lastModified=0
FetchDate: Tue Oct 25 12:13:35 BST 2005

Content::
url: http://keaton/tinysite/A.html
base: http://keaton/tinysite/A.html
contentType: text/html
metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT, Server=Apache-Coyote/1.1,
Connection=close, Content-Type=text/html, ETag=W/"1106-1130238131000",
Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, Content-Length=1106}
Content:
<html>
<head>
<title>'A' is for Alligator</title>
</head>
<body>
<p>
Alligators live in freshwater environments such as ponds,
marshes, rivers and swamps. Although alligators have
heavy bodies and slow metabolisms, they are capable of
short bursts of speed that can exceed 30 miles per hour.
Alligators' main prey are smaller animals that they can kill
and eat with a single bite. Alligators may kill larger prey
by grabbing it and dragging it in the water to drown.
Food items that can't be eaten in one bite are either allowed
to rot or are rendered by biting and then spinning or
convulsing wildly until bite size pieces are torn off.
(From <a href="http://en.wikipedia.org/wiki/Alligator">
the Wikipedia entry for Alligator</a>.)
</p>
<p>
<a href="B.html">B</a>
</p>
</body>
</html>

ParseData::
Status: success(1,0)
Title: 'A' is for Alligator
Outlinks: 2
outlink: toUrl: http://en.wikipedia.org/wiki/Alligator
anchor: the Wikipedia entry for Alligator
outlink: toUrl: http://keaton/tinysite/B.html anchor: B
Metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT,
CharEncodingForConversion=windows-1252, Server=Apache-Coyote/1.1,
Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, ETag=W/"1106-1130238131000",
Content-Type=text/html, Connection=close, Content-Length=1106}

ParseText::
'A' is for Alligator Alligators live in freshwater environments such
as ponds, marshes, rivers and swamps. Although alligators have heavy
bodies and slow metabolisms, they are capable of short bursts of
speed that can exceed 30 miles per hour. Alligators' main prey are
smaller animals that they can kill and eat with a single bite.
Alligators may kill larger prey by grabbing it and dragging it in
the water to drown. Food items that can't be eaten in one bite are
either allowed to rot or are rendered by biting and then spinning or
convulsing wildly until bite size pieces are torn off.
(From the Wikipedia entry for Alligator .) B


There's a lot of data for each entry--remember this is just a single entry, for page A--but it breaks down into the following categories: fetch data, raw content, and parsed content. The fetch data, indicated by the FetcherOutput section, is data gathered by the fetcher to be propagated back to the WebDB during the update part of the generate/fetch/update cycle.

The raw content, indicated by the Content section, contains the page contents as retrieved by the fetcher, including HTTP headers and other metadata. (By default, the protocol-httpclient plugin is used to do this work.) This content is what Nutch returns when you ask its search interface for a cached copy of a page. You can see the HTML for page A in this example.
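Individual fields of the dump are easy to pull out with text tools as well. A small sketch, using a few inlined lines that mirror the Content:: section above (a real run would pipe the output of `bin/nutch segread -dump` instead); awk splits on ": " and prints the value of the contentType line:

```shell
# Hypothetical excerpt of a segread -dump, matching the Content:: section.
dump='url: http://keaton/tinysite/A.html
base: http://keaton/tinysite/A.html
contentType: text/html'

# Split each line on ": " and print the value whose key is contentType.
ctype=$(printf '%s\n' "$dump" | awk -F': ' '$1 == "contentType" { print $2 }')
echo "$ctype"
```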

Finally, the raw content is parsed using an appropriate parser plugin--determined by looking at the content type and then the file extension. In this case, parse-html was used, since the content type is text/html. The parsed content (indicated by the ParseData and ParseText sections) is used by the indexer to create the segment index.
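The dispatch from content type to parser plugin can be pictured as a simple lookup. The sketch below is purely illustrative: the `pick_parser` function is hypothetical (the real mapping lives in Nutch's plugin configuration), though parse-html, parse-text, and parse-pdf are genuine Nutch parser plugin names:

```shell
# Hypothetical sketch of content-type-to-plugin dispatch; the real logic
# is handled by Nutch's plugin system, not a shell function.
pick_parser() {
  case "$1" in
    text/html)       echo parse-html ;;
    text/plain)      echo parse-text ;;
    application/pdf) echo parse-pdf ;;
    *)               echo parse-text ;;  # assumed fallback for illustration
  esac
}

pick_parser text/html
```

For page A the content type is text/html, so the dispatch selects parse-html, matching the ParseData shown above.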
