Segments
The crawl created three segments in timestamped subdirectories of the segments directory, one for each generate/fetch/update cycle. The segread tool gives a useful summary of all of the segments:
bin/nutch segread -list -dir crawl-tinysite/segments/
giving the following tabular output (slightly reformatted to fit this page):
PARSED?  STARTED            FINISHED           COUNT  DIR NAME
true     20051025-12:13:35  20051025-12:13:35  1      crawl-tinysite/segments/20051025121334
true     20051025-12:13:37  20051025-12:13:37  1      crawl-tinysite/segments/20051025121337
true     20051025-12:13:39  20051025-12:13:39  2      crawl-tinysite/segments/20051025121339

TOTAL: 4 entries in 3 segments.
The PARSED? column is always true when using the crawl tool. It becomes useful when running fetchers with parsing turned off, so that parsing can be run later as a separate process. The STARTED and FINISHED columns record the times when fetching started and finished. This information is invaluable for bigger crawls, when tracking down why crawling is taking a long time. The COUNT column shows the number of fetched pages in the segment. The last segment, for example, has two entries, corresponding to pages C and C-duplicate.
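The COUNT column can be aggregated with ordinary text tools. As a small sketch, the following sums the column from a copy of the listing above (`segread_list` is a hypothetical helper wrapping a here-doc; in a real crawl you would pipe the output of `bin/nutch segread -list` instead):

```shell
# Sum the COUNT column of a segread -list report to get the total number
# of fetched pages. segread_list (hypothetical) stands in for the real
# bin/nutch segread -list output shown above.
segread_list() {
cat <<'EOF'
PARSED? STARTED FINISHED COUNT DIR NAME
true 20051025-12:13:35 20051025-12:13:35 1 crawl-tinysite/segments/20051025121334
true 20051025-12:13:37 20051025-12:13:37 1 crawl-tinysite/segments/20051025121337
true 20051025-12:13:39 20051025-12:13:39 2 crawl-tinysite/segments/20051025121339
EOF
}
# Skip the header row (NR > 1) and sum the fourth field.
total=$(segread_list | awk 'NR > 1 { sum += $4 } END { print sum }')
echo "total fetched pages: $total"
```

The result agrees with the "4 entries in 3 segments" summary line that segread prints itself.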
Sometimes it is necessary to find out in more detail what is in a particular segment. This is done using the -dump option of segread. Here we dump the first segment (again, slightly reformatted to fit this page):
s=`ls -d crawl-tinysite/segments/* | head -1`
bin/nutch segread -dump $s
Recno:: 0
FetcherOutput::
FetchListEntry: version: 2
fetch: true
page: Version: 4
URL: http://keaton/tinysite/A.html
ID: 6cf980375ed1312a0ef1d77fd1760a3e
Next fetch: Tue Nov 01 11:13:34 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0
anchors: 1
anchor: A
Fetch Result:
MD5Hash: fb8b9f0792e449cda72a9670b4ce833a
ProtocolStatus: success(1), lastModified=0
FetchDate: Tue Oct 25 12:13:35 BST 2005
Content::
url: http://keaton/tinysite/A.html
base: http://keaton/tinysite/A.html
contentType: text/html
metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT, Server=Apache-Coyote/1.1,
Connection=close, Content-Type=text/html, ETag=W/"1106-1130238131000",
Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, Content-Length=1106}
Content:
Alligators live in freshwater environments such as ponds,
marshes, rivers and swamps. Although alligators have
heavy bodies and slow metabolisms, they are capable of
short bursts of speed that can exceed 30 miles per hour.
Alligators' main prey are smaller animals that they can kill
and eat with a single bite. Alligators may kill larger prey
by grabbing it and dragging it in the water to drown.
Food items that can't be eaten in one bite are either allowed
to rot or are rendered by biting and then spinning or
convulsing wildly until bite size pieces are torn off.
(From the Wikipedia entry for Alligator.)
ParseData::
Status: success(1,0)
Title: 'A' is for Alligator
Outlinks: 2
outlink: toUrl: http://en.wikipedia.org/wiki/Alligator
anchor: the Wikipedia entry for Alligator
outlink: toUrl: http://keaton/tinysite/B.html anchor: B
Metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT,
CharEncodingForConversion=windows-1252, Server=Apache-Coyote/1.1,
Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, ETag=W/"1106-1130238131000",
Content-Type=text/html, Connection=close, Content-Length=1106}
ParseText::
'A' is for Alligator Alligators live in freshwater environments such
as ponds, marshes, rivers and swamps. Although alligators have heavy
bodies and slow metabolisms, they are capable of short bursts of
speed that can exceed 30 miles per hour. Alligators' main prey are
smaller animals that they can kill and eat with a single bite.
Alligators may kill larger prey by grabbing it and dragging it in
the water to drown. Food items that can't be eaten in one bite are
either allowed to rot or are rendered by biting and then spinning or
convulsing wildly until bite size pieces are torn off.
(From the Wikipedia entry for Alligator .) B
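The `head -1` trick used to pick the first segment works because segment directories are named by fetch timestamp (YYYYMMDDHHMMSS), so lexicographic order is also chronological order. A minimal sketch with stand-in directories (no Nutch installation needed; the names are copied from the crawl above):

```shell
# Segment directory names are fetch timestamps, so a plain sorted listing
# is chronological. Simulate the layout in a temporary directory.
tmp=$(mktemp -d)
mkdir -p "$tmp"/segments/20051025121334 \
         "$tmp"/segments/20051025121337 \
         "$tmp"/segments/20051025121339
oldest=$(ls -d "$tmp"/segments/* | head -1)   # first generate/fetch/update cycle
newest=$(ls -d "$tmp"/segments/* | tail -1)   # most recent cycle
echo "oldest: $(basename "$oldest")"
echo "newest: $(basename "$newest")"
```

The same `tail -1` variant is handy when you want to inspect only the latest fetch.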
There's a lot of data for each entry--remember this is just a single entry, for page A--but it breaks down into the following categories: fetch data, raw content, and parsed content. The fetch data, indicated by the FetcherOutput section, is data gathered by the fetcher to be propagated back to the WebDB during the update part of the generate/fetch/update cycle.
The raw content, indicated by the Content section, contains the page contents as retrieved by the fetcher, including the HTTP headers and other metadata. (By default, the protocol-httpclient plugin is used to do this work.) This content is returned when you ask Nutch's search for a cached copy of the page. You can see the HTML of page A in this example.
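The HTTP headers in the metadata field are stored as key=value pairs, so individual headers can be pulled out with standard text tools. For example, extracting the Content-Type (`metadata_line` is a hypothetical helper holding a copy of the metadata line shown above; against a real segment you would pipe `bin/nutch segread -dump` output instead):

```shell
# Extract the Content-Type header from a dump's metadata field.
# metadata_line (hypothetical) stands in for real segread -dump output.
metadata_line() {
cat <<'EOF'
metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT, Server=Apache-Coyote/1.1, Connection=close, Content-Type=text/html, ETag=W/"1106-1130238131000", Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, Content-Length=1106}
EOF
}
# Capture everything after "Content-Type=" up to the next comma or brace.
ctype=$(metadata_line | sed -n 's/.*Content-Type=\([^,}]*\).*/\1/p')
echo "content type: $ctype"
```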
Finally, the raw content is parsed using an appropriate parser plugin--determined by looking at the content type and then the file extension. In this case, parse-html was used, since the content type is text/html. The parsed content, indicated by the ParseData and ParseText sections, is used by the indexer to create the segment index.
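The outlink lines in the ParseData section follow a regular format, so the extracted link graph can be recovered with a one-line awk filter. A sketch against a here-doc copy of the outlinks shown above (`dump_fragment` is a hypothetical helper; with a real crawl you would pipe `bin/nutch segread -dump` output instead):

```shell
# List outlink target URLs from a segment dump's ParseData section.
# dump_fragment (hypothetical) stands in for real segread -dump output.
dump_fragment() {
cat <<'EOF'
Outlinks: 2
outlink: toUrl: http://en.wikipedia.org/wiki/Alligator
anchor: the Wikipedia entry for Alligator
outlink: toUrl: http://keaton/tinysite/B.html anchor: B
EOF
}
# Each outlink line is "outlink: toUrl: <url> ..."; print the third field.
urls=$(dump_fragment | awk '/^outlink: toUrl:/ { print $3 }')
echo "$urls"
```

For page A this yields the two outlinks reported by the parser: the Wikipedia article and page B.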