I created a contrived example with just four pages to understand the steps involved in the crawl process. Figure 1 illustrates the links between pages. C and C-dup (C-duplicate) have identical content.
Before we run the crawler, we create a file called urls containing the root URLs from which to populate the initial fetchlist. In this case, we start from page A.
echo 'http://keaton/tinysite/A.html' > urls
The crawl tool uses a filter to decide which URLs go into the WebDB (in steps 2 and 5 of the breakdown of crawl above). The filter can be used to restrict the crawl to URLs that match a given pattern, specified by regular expressions. Here, we simply restrict the crawl to the server on my intranet (keaton) by changing the line in the configuration file conf/crawl-urlfilter.txt from
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to
+^http://keaton/
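The leading + marks a pattern whose matches are accepted into the crawl. As a minimal standalone sketch (plain java.util.regex, not Nutch's own filter classes), this is the accept test that the pattern expresses for the URLs in our example; the class and method names here are just illustrative:

import java.util.regex.Pattern;

// Illustrative sketch of the accept test implied by the +^http://keaton/ rule:
// URLs matching the regular expression are kept, everything else is dropped.
public class UrlFilterSketch {

    // The regular expression from conf/crawl-urlfilter.txt, without the '+' prefix.
    private static final Pattern ACCEPT = Pattern.compile("^http://keaton/");

    static boolean accept(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accept("http://keaton/tinysite/A.html")); // true: goes into the WebDB
        System.out.println(accept("http://en.wikipedia.org/"));      // false: filtered out
    }
}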
Now we are ready to crawl, which we do with a single command:
bin/nutch crawl urls -dir crawl-tinysite -depth 3 >& crawl.log
This command uses the root URLs in urls to start the crawl and puts the results in the directory crawl-tinysite. The crawler logs its activity to crawl.log. The -depth flag tells the crawler how many generate/fetch/update cycles to carry out to get full page coverage. Three is enough to reach all of the pages in this example, but for real sites it is best to start with five (the default) and increase it if you find that some pages aren't being reached.
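To make the effect of -depth concrete, here is a small self-contained simulation (plain Java, not Nutch code) of the generate/fetch/update cycles over the link structure of the four-page example from Figure 1. Each cycle fetches the pages one more link away from the root, which is why a depth of three covers A, B, C, and C-duplicate:

import java.util.*;

// Simulates how successive generate/fetch/update cycles expand outwards from
// the root URL, using the link graph of the tiny example site.
public class DepthDemo {
    public static void main(String[] args) {
        // Link graph from Figure 1: A links to B; B links to A, C and C-duplicate.
        // C and C-duplicate link only to Wikipedia, which the URL filter excludes.
        Map<String, List<String>> outlinks = new HashMap<>();
        outlinks.put("A", List.of("B"));
        outlinks.put("B", List.of("A", "C", "C-duplicate"));
        outlinks.put("C", List.of());
        outlinks.put("C-duplicate", List.of());

        Set<String> fetched = new LinkedHashSet<>();
        Set<String> fetchlist = new LinkedHashSet<>(List.of("A")); // root URLs
        int depth = 3;
        for (int cycle = 1; cycle <= depth && !fetchlist.isEmpty(); cycle++) {
            Set<String> discovered = new LinkedHashSet<>();
            for (String page : fetchlist) {            // "fetch" this cycle's pages
                fetched.add(page);
                discovered.addAll(outlinks.get(page)); // collect their outlinks
            }
            discovered.removeAll(fetched);             // only unfetched pages go on
            System.out.println("cycle " + cycle + ": fetched " + fetchlist);
            fetchlist = discovered;                    // next cycle's fetchlist
        }
        // Prints: cycle 1: [A], cycle 2: [B], cycle 3: [C, C-duplicate]
    }
}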
We shall now look in some detail at the data structures crawl has produced.
The first thing to look at is the number of pages and links in the database. This is a useful sanity check, giving us some confidence that the crawler did indeed crawl the site, and showing how much of it was covered. The readdb tool parses the WebDB and displays portions of it in human-readable form. We use the -stats option here:
bin/nutch readdb crawl-tinysite/db -stats
which displays:
Number of pages: 4
Number of links: 4
As expected, there are four pages in the WebDB (A, B, C, and C-duplicate) and four links between them. The links to Wikipedia are not in the WebDB, since they did not match the pattern in the URL filter file. Both C and C-duplicate are in the WebDB because the WebDB de-duplicates pages only by URL, not by content (which is why A doesn't appear twice). Next, we can dump all of the pages by using a different option for readdb:
bin/nutch readdb crawl-tinysite/db -dumppageurl
which gives:
Page 1: Version: 4
URL: http://keaton/tinysite/A.html
ID: fb8b9f0792e449cda72a9670b4ce833a
Next fetch: Thu Nov 24 11:13:35 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 1
Score: 1.0
NextScore: 1.0
Page 2: Version: 4
URL: http://keaton/tinysite/B.html
ID: 404db2bd139307b0e1b696d3a1a772b4
Next fetch: Thu Nov 24 11:13:37 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 3
Score: 1.0
NextScore: 1.0
Page 3: Version: 4
URL: http://keaton/tinysite/C-duplicate.html
ID: be7e0a5c7ad9d98dd3a518838afd5276
Next fetch: Thu Nov 24 11:13:39 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0
Page 4: Version: 4
URL: http://keaton/tinysite/C.html
ID: be7e0a5c7ad9d98dd3a518838afd5276
Next fetch: Thu Nov 24 11:13:40 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0
Each page appears in a separate block, with one field per line. The ID field is the MD5 hash of the page contents: note that C and C-duplicate have the same ID (a short illustration of this follows the link dump below). There is also information about when each page should next be fetched (the default interval is 30 days), and the page scores. It is easy to dump the structure of the web graph, too:
bin/nutch readdb crawl-tinysite/db -dumplinks
which produces:
from http://keaton/tinysite/B.html
to http://keaton/tinysite/A.html
to http://keaton/tinysite/C-duplicate.html
to http://keaton/tinysite/C.html
from http://keaton/tinysite/A.html
to http://keaton/tinysite/B.html
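Returning to the ID field in the page dump: because it is an MD5 hash of the page contents, any two pages with byte-for-byte identical content, such as C and C-duplicate, end up with the same ID. Here is a small standalone illustration (plain Java, with made-up page content, not the exact bytes Nutch hashes):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Identical content hashes to the same MD5 value, which is why C and
// C-duplicate share an ID in the WebDB while A and B do not.
public class Md5Demo {
    static String md5Hex(String content) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(content.getBytes(StandardCharsets.UTF_8));
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        String pageC    = "<html><body>shared content</body></html>";
        String pageCdup = "<html><body>shared content</body></html>"; // identical content
        System.out.println(md5Hex(pageC));    // same 32-hex-digit ID...
        System.out.println(md5Hex(pageCdup)); // ...as this one
        System.out.println(md5Hex("<html><body>different</body></html>")); // different ID
    }
}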
For sites larger than a few pages, it is less useful to dump the WebDB in full using these verbose formats. The readdb tool also supports extraction of an individual page or link by URL or MD5 hash. For example, to examine the links to page B, issue the command:
bin/nutch readdb crawl-tinysite/db -linkurl http://keaton/tinysite/B.html
to get:
Found 1 links.
Link 0: Version: 5
ID: fb8b9f0792e449cda72a9670b4ce833a
DomainID: 3625484895915226548
URL: http://keaton/tinysite/B.html
AnchorText: B
targetHasOutlink: true
Notice that the ID is the MD5 hash of the source page A.
There are other ways to inspect the WebDB. The admin tool can produce a dump of the whole database in plain-text tabular form, with one entry per line, using the -textdump option. This format is handy for processing with scripts. The most flexible way of reading the WebDB is through the Java interface; see the Nutch source code and API documentation for more details. A good starting point is org.apache.nutch.db.WebDBReader, which is the Java class that implements the functionality of the readdb tool (readdb is actually just a synonym for org.apache.nutch.db.WebDBReader).
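As a rough sketch, walking the pages in the WebDB through WebDBReader might look like the following. The constructor and the pages(), numPages(), and numLinks() calls are written from memory of the 0.7-era API, so treat them as assumptions and check the javadoc for the exact signatures in your Nutch version:

import java.io.File;
import java.util.Enumeration;

import org.apache.nutch.db.Page;
import org.apache.nutch.db.WebDBReader;
import org.apache.nutch.fs.NutchFileSystem;

// Sketch only: method names are assumed from the 0.7-era API; verify against
// the WebDBReader javadoc before relying on them.
public class DumpPages {
    public static void main(String[] args) throws Exception {
        NutchFileSystem fs = NutchFileSystem.get();           // assumed factory method
        WebDBReader reader = new WebDBReader(fs, new File("crawl-tinysite/db"));
        try {
            System.out.println("pages: " + reader.numPages()
                             + ", links: " + reader.numLinks());
            for (Enumeration e = reader.pages(); e.hasMoreElements(); ) {
                Page page = (Page) e.nextElement();
                System.out.println(page.getURL() + "\t" + page.getMD5());
            }
        } finally {
            reader.close();
        }
    }
}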