不死虫的古堡: [develop] First Experience with Nutch: Crawling a Corporate Intranet

Reposted from my javaeye blog: http://xusulong.javaeye.com/blog/663411

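A while ago I was thinking about setting up a search engine. Writing one myself would cost too much: I have written crawlers before, but indexing and ranking would probably be far more trouble.

Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, and it is an application that builds its search functionality on top of Lucene.

Having settled on Nutch, I started learning to use it. My English is not yet good enough, so I could only skim Nutch's simple tutorial; as my actual guide I chose a Chinese one. Once the first search is running, I can move on to the English documentation for a deeper understanding.

The tutorial I chose is nutch入门学习 (an introductory Nutch tutorial).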

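Preparation:

My system is Ubuntu 9.10, with java -version 1.6.0_20-b02, nutch 1.0, and tomcat 6.0.26.

Anyone who has done Java and web development will usually have the JDK and Tomcat installed already, so I won't go into that. A few points to watch out for:
1. Add JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.20 to Tomcat's bin/catalina.sh. This one bit me badly. At first I had not set it, and running bin/nutch crawl kept reporting "JAVA_HOME is not set". I was sure I had set the Java environment variables, and java -version worked fine. I googled at length and tried every place where JAVA_HOME can be set, all to no avail. Finally, in some obscure corner, I found that JAVA_HOME can be added to this file; I did so, ran the crawl again, and it worked. I still don't understand why: running the Nutch crawler should not depend on Tomcat, which is only used for searching. I have not figured this one out.
With Tomcat and the JDK sorted out, next comes Nutch. I put Nutch directly in the nutch directory under my user's home directory, then copied its nutch.war into Tomcat's webapps directory, replacing ROOT (unpack it and rename the directory).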
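
One possible explanation for the JAVA_HOME puzzle, offered as an assumption rather than something verified on the setup above: bin/nutch is a shell script and reads JAVA_HOME from the environment of the shell that launches it, so a variable that was assigned without export, or set only in a file that a different shell reads, would not be visible to it. The sketch below is a minimal check that can be run in the same shell before the crawl. The function name check_java_home is hypothetical and is not part of Nutch or Tomcat.

```shell
# Hypothetical helper (not part of Nutch or Tomcat): report whether the
# JAVA_HOME seen by the current shell points at a usable JDK.
check_java_home() {
  # An unset or empty variable is what produces "JAVA_HOME is not set".
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  # The directory should contain an executable bin/java.
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "no executable bin/java under $JAVA_HOME"
    return 1
  fi
  echo "JAVA_HOME looks usable: $JAVA_HOME"
}
```

For example, running `export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.20` and then `check_java_home` in the terminal that will run bin/nutch shows whether the variable is actually exported there.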

Configuring Nutch:

This follows nutch入门学习; I only describe the places I changed.

Add the pages to crawl (using www.163.com as an example)
1. [root@localhost nutch]# mkdir urls
2. [root@localhost nutch]# echo http://www.163.com/>>urls/163
3. Enter http://news.163.com/ in the 163 file
Edit conf/crawl-urlfilter.txt to specify which URLs may be crawled.
[root@localhost nutch]# vi conf/crawl-urlfilter.txt
Change MY.DOMAIN.NAME to:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*163.com/
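
To see which URLs the accept rule lets through, the pattern can be tried outside Nutch. The sketch below uses grep -E as a stand-in; this is an assumption, since Nutch evaluates the rule with Java regular expressions, but for a pattern this simple the two should agree. The helper name matches_filter and the example.com URL are made up for illustration.

```shell
# The accept rule from conf/crawl-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*163.com/'

# Hypothetical helper: succeed if the URL matches the rule.
matches_filter() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try the two seed URLs and one URL from an unrelated host.
for url in http://www.163.com/ http://news.163.com/ http://www.example.com/; do
  if matches_filter "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The two 163.com URLs are accepted and the example.com URL is rejected. Note that the dot in 163.com is unescaped, so it matches any character; writing 163\.com would make the rule stricter.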
Edit conf/nutch-site.xml to add the agent properties and fill in the corresponding values:
```
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your
  organization.
  NOTE: You should also check other related properties:
  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version
  and set their values appropriately.
  </description>
</property>
<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent
  name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
```
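nutch入门学习 says it does not matter if these are left unmodified. The settings exist because Nutch follows the robots protocol: when it requests pages, it submits information about itself to the site being crawled so that it can be identified. With the values left empty, however, I got an error saying that http.agent.name must be set. Setting its value to xusulong* (remember the *) fixed it. The other properties can be left unset.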
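
Since an empty http.agent.name only shows up as an error once the crawl starts, a quick check beforehand can save a run. The sketch below is a rough text-based test, not an XML parser: it assumes the file is laid out with the <value> element on the line directly after the <name> element, and the function name agent_name_is_set is hypothetical.

```shell
# Hypothetical helper: succeed if http.agent.name has a non-blank value.
# Assumes <value> sits on the line directly after <name> in the file.
agent_name_is_set() {
  # Take the <name> line plus the one after it, then require at least one
  # non-blank character between <value> and </value>.
  grep -F -A1 '<name>http.agent.name</name>' "$1" \
    | grep -Eq '<value>[^<[:space:]][^<]*</value>'
}
```

A call such as `agent_name_is_set conf/nutch-site.xml || echo "http.agent.name is empty"` would flag the problem before bin/nutch crawl is started.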

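Configuring Tomcat:

Set the search directory
(this is needed because the default segment path does not match our actual path)
[root@localhost nutch]# cd ~/tomcat
[root@localhost tomcat]# vi webapps/ROOT/WEB-INF/classes/nutch-site.xml
Add four lines so that the file reads: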
```
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/home/whu/nutch/crawl.demo</value>
  </property>
</configuration>
```
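Here /home/whu/nutch is my Nutch path. The crawler's data will be placed in crawl.demo, which the program creates; in other words, it is the directory where the pages fetched by Nutch are stored.
Nutch's support for Chinese is not yet complete, so conf/server.xml under the Tomcat directory needs to be modified.
[root@localhost tomcat]# vi conf/server.xml
Add two attributes (URIEncoding and useBodyEncodingForURI) so that the Connector reads: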
<Connector port="8080"
maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
enableLookups="false" redirectPort="8443" acceptCount="100"
connectionTimeout="20000" disableUploadTimeout="true"
URIEncoding="UTF-8" useBodyEncodingForURI="true" />

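Crawling pages:

whu@leopard:~/nutch$ bin/nutch crawl urls -dir crawl.demo -depth 2 -threads 4 -topN 5 >& crawl.log

The parameters are explained in nutch入门学习 and on the official Nutch website. In brief, -dir is the output directory, -depth is the link depth to follow from the seed URLs, -threads is the number of fetcher threads, and -topN is the maximum number of pages fetched at each depth level. Here only a small number of pages is crawled.

crawl.log records information about the crawl as it runs. Along the way I ran into

the following errors:

http.agent.name must be set.
Input path does not exist. Retrying with the path a few times gets past this, as long as crawl.demo here matches the path configured in Tomcat. Remember to delete the directory left behind by a failed run, otherwise the next run will fail again.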
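
The clean-up before a retry can be scripted. The sketch below is one way to do it, under the assumption that the whole output directory from the failed run should be discarded; the function name reset_crawl_dir is hypothetical.

```shell
# Hypothetical helper: delete the output directory left by a failed crawl.
reset_crawl_dir() {
  dir=${1:-}
  # Refuse to run without an explicit directory, to avoid deleting by accident.
  if [ -z "$dir" ]; then
    echo "usage: reset_crawl_dir DIR" >&2
    return 2
  fi
  # Only remove the directory if it exists; a clean state is not an error.
  if [ -d "$dir" ]; then
    rm -rf -- "$dir"
    echo "removed stale $dir"
  fi
}
```

Running `reset_crawl_dir crawl.demo` from the Nutch directory before repeating the crawl command gives the next run a clean start.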

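Test results:

Start Tomcat, open the home page, and search for 网易 (NetEase) to see the results.

This took me an afternoon and an evening and left me in tears. There were other errors along the way that I no longer remember clearly, but the serious ones are listed above. Read carefully how the system reports an error, google it, and track the problem down carefully: that is the way to go.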

Monday, May 10, 2010
