Sunday, November 26, 2006

A Rough Walkthrough of Nutch's CrawlTool Code

CrawlTool.java is the entry point for all of Nutch, so starting the analysis there gives a good macro-level view of Nutch's structure. The code below has been trimmed for readability.
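A few variables the listing uses (dir, rootUrlFile, threads, depth, and the fs/nameserver pair handed to prependFileSystem) are set up by argument-parsing code that the trimming removed. As a rough sketch of what that setup might look like; only the identifiers are taken from the listing, the parsing itself is illustrative:

// Illustrative setup only; the real CrawlTool reads these from args.
String rootUrlFile = args[0];  // plain text file of seed URLs (see below)
String dir = "crawl-" + new java.text.SimpleDateFormat("yyyyMMddHHmmss")
    .format(new java.util.Date());  // output directory
int threads = 10;  // fetcher threads (assumed default)
int depth = 5;     // generate/fetch/update rounds (assumed default)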

// resolve the WebDB and segments directories under the crawl directory
String db = new File(dir + "/db").getCanonicalPath();
String segments = new File(dir + "/segments").getCanonicalPath();
NutchFileSystem nfs = NutchFileSystem.parseArgs(args, 0);

// create a new, empty WebDB
WebDBAdminTool.main(prependFileSystem(fs, nameserver, new String[] { db, "-create"}));

// inject the root URLs that seed the crawl into the WebDB
WebDBInjector.main(prependFileSystem(fs, nameserver, new String[] { db, "-urlfile", rootUrlFile }));
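The rootUrlFile passed to WebDBInjector is just a plain text file listing one seed URL per line, for example (the URLs here are only placeholders):

http://lucene.apache.org/nutch/
http://www.example.com/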

for (int i = 0; i < depth; i++) {
  // generate a fetchlist from the WebDB into a new segment
  FetchListTool.main(prependFileSystem(fs, nameserver, new String[] { db, segments } ));
  String segment = getLatestSegment(nfs, segments); // sketched after the loop

  // fetch the content of the pages named in the fetchlist
  Fetcher.main(prependFileSystem(fs, nameserver, new String[] { "-threads", ""+threads, segment } ));

  // update the WebDB with the link URLs found in the fetched pages
  UpdateDatabaseTool.main(prependFileSystem(fs, nameserver, new String[] { db, segment } ));
}
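getLatestSegment, whose body was also trimmed, only has to pick the newest directory under segments. Nutch names segment directories after their creation timestamp (e.g. 20061126153000), so the newest segment has the lexicographically largest name. A minimal sketch against the local filesystem, ignoring the NutchFileSystem indirection the real method goes through (assumes import java.io.File):

private static String getLatestSegment(File segmentsDir) {
  File[] dirs = segmentsDir.listFiles();
  File latest = null;
  for (int i = 0; i < dirs.length; i++) {
    // timestamp names sort chronologically, so a plain string compare suffices
    if (dirs[i].isDirectory()
        && (latest == null || dirs[i].getName().compareTo(latest.getName()) > 0)) {
      latest = dirs[i];
    }
  }
  return latest == null ? null : latest.getPath();
}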

// update the segments with the page scores computed in the WebDB
UpdateSegmentsFromDb updater = new UpdateSegmentsFromDb(nfs, db, segments, dir);
updater.run();

File workDir = new File(dir, "workdir"); // scratch space for the indexing steps
File[] segmentDirs = nfs.listFiles(new File(segments));

// index the fetched pages; each segment gets its own separate index
for (int i = 0; i < segmentDirs.length; i++) {
  IndexSegment.main(prependFileSystem(fs, nameserver, new String[] { segmentDirs[i].toString(), "-dir", workDir.getPath() } ));
}

// remove duplicate content and duplicate URLs from the segment indexes
DeleteDuplicates.main(prependFileSystem(fs, nameserver, new String[] { segments }));

// merge the per-segment indexes into one big index that serves searches
IndexMerger merger = new IndexMerger(nfs, segmentDirs, new File(dir + "/index"), workDir);
merger.merge();
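Once merge() completes, dir/index holds the merged index that the Nutch search front end runs against. As a hedged sketch of the 0.7-era search API (class and method names below are from memory and may differ in your version), querying the finished crawl programmatically looks roughly like this:

import java.io.File;
import net.nutch.searcher.Hits;
import net.nutch.searcher.NutchBean;
import net.nutch.searcher.Query;

public class SearchDemo {
  public static void main(String[] args) throws Exception {
    // point the bean at the crawl directory CrawlTool produced
    NutchBean bean = new NutchBean(new File("crawl-dir"));
    Hits hits = bean.search(Query.parse("nutch"), 10); // top 10 hits
    for (int i = 0; i < hits.getLength(); i++) {
      System.out.println(bean.getDetails(hits.getHit(i)));
    }
  }
}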
