In Nutch, each page is one row of the page database in the webdb. A page entry contains the following:
Page 1: Version: 4
URL: http://keaton/tinysite/A.html
ID: fb8b9f0792e449cda72a9670b4ce833a
Next fetch: Thu Nov 24 11:13:35 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 1
Score: 1.0
NextScore: 1.0
Where:
The ID field is the MD5 hash of the page contents, so pages with identical contents have the same ID.
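To illustrate why identical contents produce identical IDs, here is a minimal sketch in Python. It is not Nutch's own code (Nutch is written in Java); the helper name `page_id` and the sample HTML strings are made up for illustration. It only assumes that the ID is the MD5 digest of the raw page bytes, printed as 32 hexadecimal characters like the ID shown in the dump above.

```python
import hashlib


def page_id(content: bytes) -> str:
    """Hypothetical helper: MD5 digest of the raw page bytes as 32 hex characters."""
    return hashlib.md5(content).hexdigest()


# Two fetches that return byte-for-byte identical contents.
first = page_id(b"<html><body>alpha</body></html>")
second = page_id(b"<html><body>alpha</body></html>")

# A page with different contents.
other = page_id(b"<html><body>beta</body></html>")

print(first == second)  # identical contents give the same ID
print(first == other)   # different contents give a different ID
print(len(first))       # an MD5 hex digest is always 32 characters long
```

Because the hash depends only on the contents, two different URLs that serve exactly the same bytes would get the same ID under this assumption.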