Sunday, November 26, 2006

Nutch Page Analysis

In Nutch, each page is one row in the page database of the WebDB. A page record contains the following fields:

Page 1: Version: 4
URL: http://keaton/tinysite/A.html
ID: fb8b9f0792e449cda72a9670b4ce833a
Next fetch: Thu Nov 24 11:13:35 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 1
Score: 1.0
NextScore: 1.0

Of these:

The ID field is the MD5 hash of the page contents, so two pages with identical contents have the same ID.
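
To illustrate the point about the ID, here is a minimal sketch in Python. It assumes the ID is simply the hex-encoded MD5 digest of the raw content bytes. The helper name page_id and the sample pages are made up for this example and are not part of Nutch.

```python
import hashlib


def page_id(content: bytes) -> str:
    """Return the hex MD5 digest of the raw page bytes (hypothetical helper)."""
    return hashlib.md5(content).hexdigest()


# Sample contents, invented for illustration only.
page_a = b"<html><body>Hello</body></html>"
page_b = b"<html><body>Hello</body></html>"   # same bytes as page_a
page_c = b"<html><body>Hello!</body></html>"  # differs by one character

print(page_id(page_a) == page_id(page_b))  # identical contents, identical ID
print(page_id(page_a) == page_id(page_c))  # different contents, different ID
print(len(page_id(page_a)))                # 32 hex characters, like the ID above
```

Because the hash depends only on the contents, the same document reachable under two different URLs would get the same ID under this assumption.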
