
Wednesday, November 30, 2011

programming languages and their statements

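I recently read Hackers & Painters, and one takeaway is that programming languages differ in power. Setting aside the question of which one is the best, it never hurts to know the strengths and weaknesses of each. I happened to come across the article "programming languages and their statements" on Hacker News. It covers a range of languages together with statements people have made about them, and it also lets you compare languages with one another. It lists 50 languages in alphabetical order, as follows: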

programming languages

Agda

Assembler

Clojure

Cobol

Common Lisp

Delphi

Eiffel

ELisp

Erlang

Factor

Forth

Fortran

Groovy

Haskell

Haxe

Java

Javascript

Mathematica

Matlab

Mozart-Oz

Objective C

O'Caml

Pascal

Perl

Prolog

Python

REBOL

Ruby

Scala

Scheme

Shell

Smalltalk

Standard ML

Visual Basic

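The page for each language looks roughly like this:

It states how many people responded about the language in this survey

It lists the statements on which the language ranks highest and lowest, 10 of each; from these statements it is easy to see what the language is and is not suited for

It then lists the 5 languages most similar to and most dissimilar from it, for easy comparison

It offers a comparison feature: select another language to compare the two

It lists all statements about the language (the higher the rank, the more people agree), which gives a deeper picture of the language. Of course, a better way to understand a language is to use it, but not everyone is able to do that

Take Python as an example: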

Python

Based on 87822 responses from 7295 people, we've built up the following picture of Python

Ranked highly in

I would use this language for casual scripting

This language would be good for teaching children to write software

This language is good for beginners

Code written in this language is very readable

I find this language easy to prototype in

I would use this language as a scripting language embedded inside a larger application

It is easy to tell at a glance what code in this language does

This language excels at text processing

I would use this language to write a command-line app

This language is well suited for an agile development approach using short iterations.

Ranked low in

This language is unusually bad for beginners

Writing code in this language is a lot of work

I often get angry when writing code in this language

There is a lot of accidental complexity when writing code in this language

This language has an annoying syntax

This language makes it easy to shoot yourself in the foot

I often feel like I am not smart enough to write this language

This language has a niche outside of which I would not use it

Developers who primarily use this language often burn out after a few years

This is a low level language

Most similar to

Ruby

Clojure

Groovy

Haxe

Scala

Most dissimilar from

Assembler

Fortran

Cobol

All statements

I would use this language for casual scripting

This language would be good for teaching children to write software

This language is good for beginners

Code written in this language is very readable

I find this language easy to prototype in

I would use this language as a scripting language embedded inside a larger application

It is easy to tell at a glance what code in this language does

This language excels at text processing

I would use this language to write a command-line app

This language is well suited for an agile development approach using short iterations.

I would use this language for a web project

This language has a good community

This language has a good library distribution mechanism.

This language encourages writing code that is easy to maintain.

Libraries in this language tend to be well documented.

This language has a wide variety of agreed-upon conventions, which are generally adhered to reasonably well, and which increase my productivity

This language is best for very small projects

I would use this language for a desktop GUI project

The resources for learning this language are of high quality

This is a high level language

This language is expressive

I find code written in this language very elegant

This language is good for scientific computing

Third-party libraries are readily available, well-documented, and of high quality

I often write things in this language with the intent of rewriting them in something else later

This language encourages writing reusable code.

I use this language out of choice

I usually use this language on solo projects

I enjoy using this language

This language has well-organized libraries with consistent, carefully thought-out interfaces

I can imagine this will be a popular language in twenty years time

This language is very flexible

I can imagine using this language in my day job

I rarely have difficulty abstracting patterns I find in my code

This language is well documented

I regularly use this language

There are many good open-source tools for this language

I would use this language for writing server programs

There is a wide variety of open source code written in this language

I would like to write more of this language than I currently do

It is easy to debug programs written in this language when it goes wrong

This language excels at symbolic manipulation

This language has unusual features that I often miss when using other languages

Programs written in this language tend to play well with others

When I run into problems my colleagues can provide me with immediate help with this language

Code written in this language tends to be terse

I still discover new features of this language on a fairly regular basis

This language has a very coherent design

Code written in this language will usually run in all the major implementations if it runs in one of them.

I would list this language on my resume

I would recommend most programmers learn this language, regardless of whether they have a specific need for it

This language is good for numeric computing

I usually use this language on projects with many other members

This language is best for very large projects

This language has a very rigid idea of how things should be done

This language has a very dogmatic community

This language is likely to have a strong influence on future languages

When I write code in this language I can be very sure it is correct

Learning this language significantly changed how I use other languages.

I know many other people who use this language

I know this language well

Code written in this language tends to be very reliable

There are many good tools for this language

Learning this language improved my ability as a programmer

This language is easier to use for its problem domain by removing unneeded expressiveness (such as not being Turing complete).

This language is good for distributed computing

I would use this language for mobile applications

This language has a high quality implementation

This language is likely to be a passing fad

I use many applications written in this language

If this language didn't exist, I would have trouble finding a satisfactory replacement

This language is large

This language matches its problem domain particularly well.

It's unusual for me to discover unfamiliar features

This is a mainstream language

This language has many features which feel "tacked on"

I find it easy to write efficient code in this language

This language is built on a small core of orthogonal features

This language is likely to be around for a very long time

The semantics of this language are much different than other languages I know.

This language excels at concurrency

This language is minimal

There are many good commercial tools for this language

I use a lot of code written in this language which I really don't want to have to make changes to

I would use this language for writing embedded programs

This language is frequently used for applications it isn't suitable for

If my code in this language successfully compiles, there is a good chance my code is correct.

I would use this language for writing programs for an embedded hardware platform

This language allows me to write programs where I know exactly what they are doing under the hood

I enjoy playing with this language but would never use it for "real code"

This language has a niche in which it is great

Programs written in this language will usually work in future versions of the language

It is too easy to write code in this language that looks like it does one thing but actually does something else

I am sometimes embarrassed to admit to my peers that I know this language

Code written in this language tends to be verbose

Programs written in this language tend to be efficient

This language is suitable for real-time applications

I am reluctant to admit to knowing this language

I learned this language early in my career as a programmer

The thought that I may still be using this language in twenty years time fills me with dread

This language has a strong static type system

This is a low level language

Developers who primarily use this language often burn out after a few years

This language has a niche outside of which I would not use it

I often feel like I am not smart enough to write this language

This language makes it easy to shoot yourself in the foot

This language has an annoying syntax

There is a lot of accidental complexity when writing code in this language

I often get angry when writing code in this language

Writing code in this language is a lot of work

This language is unusually bad for beginners

Saturday, November 26, 2011

Could Not Find HelloAndroid.apk! win7

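I've started tinkering with Android, and the first problem I ran into naturally deserves a note.

Following the official tutorial step by step, Hello World unexpectedly failed to run, with this error:

Could Not Find HelloAndroid.apk!

I found descriptions of the same problem here and here, but their fix did not work for me because I don't have the path they refer to. Fortunately, part of their method did the job.

I set my time zone to a US time zone, ran Hello World, and it worked. Afterwards the time zone can be changed back :)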

Thursday, November 3, 2011

Bash conditional expressions: 'test', '[' and '[['

I had never looked closely at test, [ and [[. Today my code got picked apart on exactly these points, so I decided to study them properly. I read a few blogs and the Bash Reference; the text of this post is mostly taken from the Bash Reference, plus some explanations and examples of my own.

Start with section 6.4, Bash Conditional Expressions.

Bash Conditional Expressions

It opens by saying that conditional expressions are used by the [[ compound command and the test and [ builtin commands.

Expressions may be unary or binary. Unary expressions are often used to examine the status of a file. There are string operators and numeric comparison operators as well (so expressions cover file status checks, string operations and arithmetic comparison). If the file argument to one of the primaries is of the form /dev/fd/N, then file descriptor N is checked. If the file argument to one of the primaries is one of /dev/stdin, /dev/stdout, or /dev/stderr, file descriptor 0, 1, or 2, respectively, is checked.

When used with ‘[[’, the ‘<’ and ‘>’ operators sort lexicographically using the current locale (i.e. in dictionary order).

Unless otherwise specified, primaries that operate on files follow symbolic links and operate on the target of the link, rather than the link itself.

File tests
-a file: True if file exists.
-b file: True if file exists and is a block special file.
-c file: True if file exists and is a character special file.
-d file: True if file exists and is a directory.
-e file: True if file exists.
-f file: True if file exists and is a regular file.
-g file: True if file exists and its set-group-id bit is set.
-h file: True if file exists and is a symbolic link.
-k file: True if file exists and its "sticky" bit is set.
-p file: True if file exists and is a named pipe (FIFO).
-r file: True if file exists and is readable.
-s file: True if file exists and has a size greater than zero.
-t fd: True if file descriptor fd is open and refers to a terminal.
-u file: True if file exists and its set-user-id bit is set.
-w file: True if file exists and is writable.
-x file: True if file exists and is executable.
-O file: True if file exists and is owned by the effective user id.
-G file: True if file exists and is owned by the effective group id.
-L file: True if file exists and is a symbolic link.
-S file: True if file exists and is a socket.
-N file: True if file exists and has been modified since it was last read.
file1 -nt file2: True if file1 is newer (according to modification date) than file2, or if file1 exists and file2 does not.
file1 -ot file2: True if file1 is older than file2, or if file2 exists and file1 does not.
file1 -ef file2: True if file1 and file2 refer to the same device and inode numbers.
Shell options
-o optname: True if shell option optname is enabled. The list of options appears in the description of the -o option to the set builtin (see The Set Builtin).
Strings
-z string: True if the length of string is zero.
-n string
string: True if the length of string is non-zero.
string1 == string2
string1 = string2: True if the strings are equal. ‘=’ should be used with the test command for POSIX conformance.
string1 != string2: True if the strings are not equal.
string1 < string2: True if string1 sorts before string2 lexicographically.
string1 > string2: True if string1 sorts after string2 lexicographically.
Arithmetic
arg1 OP arg2: OP is one of ‘-eq’, ‘-ne’, ‘-lt’, ‘-le’, ‘-gt’, or ‘-ge’. These arithmetic binary operators return true if arg1 is equal to, not equal to, less than, less than or equal to, greater than, or greater than or equal to arg2, respectively. Arg1 and arg2 may be positive or negative integers (they must be integers).

Now let's look at '[[' and then 'test' and '[' in turn.

First, '[['.

It is described in section 3.2.4.2, Conditional Constructs.

[[...]]

          [[ expression ]]

Return a status of 0 or 1 depending on the evaluation of the conditional expression expression. Expressions are composed of the primaries described below in Bash Conditional Expressions. Word splitting and filename expansion are not performed on the words between the ‘[[’ and ‘]]’; tilde expansion, parameter and variable expansion, arithmetic expansion, command substitution, process substitution, and quote removal are performed. Conditional operators such as ‘-f’ must be unquoted to be recognized as primaries.

When used with ‘[[’, the ‘<’ and ‘>’ operators sort lexicographically using the current locale.

When the ‘==’ and ‘!=’ operators are used, the string to the right of the operator is considered a pattern and matched according to the rules described below in Pattern Matching. If the shell option nocasematch (see the description of shopt in The Shopt Builtin) is enabled, the match is performed without regard to the case of alphabetic characters. The return value is 0 if the string matches (‘==’) or does not match (‘!=’) the pattern, and 1 otherwise. Any part of the pattern may be quoted to force it to be matched as a string.

An additional binary operator, ‘=~’, is available, with the same precedence as ‘==’ and ‘!=’. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise. If the regular expression is syntactically incorrect, the conditional expression's return value is 2. If the shell option nocasematch (see the description of shopt in The Shopt Builtin) is enabled, the match is performed without regard to the case of alphabetic characters. Any part of the pattern may be quoted to force it to be matched as a string. Substrings matched by parenthesized subexpressions within the regular expression are saved in the array variable BASH_REMATCH. The element of BASH_REMATCH with index 0 is the portion of the string matching the entire regular expression. The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression.

Expressions may be combined using the following operators, listed in decreasing order of precedence:

( expression ): Returns the value of expression. This may be used to override the normal precedence of operators.
! expression: True if expression is false.
expression1 && expression2: True if both expression1 and expression2 are true.
expression1 || expression2: True if either expression1 or expression2 is true.

The && and || operators do not evaluate expression2 if the value of expression1 is sufficient to determine the return value of the entire conditional expression.

Next, 'test' and '['.

They are described in section 4.1, Bourne Shell Builtins.

test and [ are equivalent but are written differently: if test expression versus if [ expression ].

Evaluate a conditional expression expr. Each operator and operand must be a separate argument. Expressions are composed of the primaries described below in Bash Conditional Expressions. test does not accept any options, nor does it accept and ignore an argument of -- as signifying the end of options. When the [ form is used, the last argument to the command must be a ]. Expressions may be combined using the following operators, listed in decreasing order of precedence. The evaluation depends on the number of arguments; see below.

! expr: True if expr is false.
( expr ): Returns the value of expr. This may be used to override the normal precedence of operators.
expr1 -a expr2: True if both expr1 and expr2 are true.
expr1 -o expr2: True if either expr1 or expr2 is true.

The test and [ builtins evaluate conditional expressions using a set of rules based on the number of arguments. In other words, test behaves like a function that decides its result according to how many arguments it receives.

0 arguments: The expression is false.
1 argument: The expression is true if and only if the argument is not null.
2 arguments: If the first argument is ‘!’, the expression is true if and only if the second argument is null. If the first argument is one of the unary conditional operators (see Bash Conditional Expressions), the expression is true if the unary test is true. If the first argument is not a valid unary operator, the expression is false.
3 arguments: If the second argument is one of the binary conditional operators (see Bash Conditional Expressions), the result of the expression is the result of the binary test using the first and third arguments as operands. The ‘-a’ and ‘-o’ operators are considered binary operators when there are three arguments. If the first argument is ‘!’, the value is the negation of the two-argument test using the second and third arguments. If the first argument is exactly ‘(’ and the third argument is exactly ‘)’, the result is the one-argument test of the second argument. Otherwise, the expression is false.
4 arguments: If the first argument is ‘!’, the result is the negation of the three-argument expression composed of the remaining arguments. Otherwise, the expression is parsed and evaluated according to precedence using the rules listed above.
5 or more arguments: The expression is parsed and evaluated according to precedence using the rules listed above.

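Differences between [ ] and [[ ]]

Most of their functionality is the same, but [[ ]] is more complete than [ ].

1. [[ has extended features: wildcard pattern matching (== and !=) and regular-expression matching (=~). Each has its own paragraph in the description of [[ above.

For == and !=, the right-hand operand is treated as a pattern and matched as described in section 3.5.8.1, Pattern Matching.

So if the right-hand string contains special characters such as * that should be compared literally, it has to be enclosed in double quotes.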

[shell]
$ [[ "GFW" == G*W ]]
$ echo $?
0
$ [[ "GFW" == "G*W" ]]
$ echo $?
1
[/shell]

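For =~, the right-hand operand is treated as an extended regular expression. Quoting matters here: as the reference text above says, any quoted part of the pattern is matched as a literal string. That is the behavior since bash 3.2; older versions, or shopt -s compat31, treat a quoted pattern as a regex and return 0 in all three cases. In a current bash the quoted forms therefore fail, while the unquoted G*W succeeds, because as a regex it means "zero or more G followed by W", which matches the W in GFW (the regex counterpart of the glob G*W would be ^G.*W$).

[shell]
$ [[ "GFW" =~ "G*W" ]]
$ echo $?
1
$ [[ "GFW" =~ 'G*W' ]]
$ echo $?
1
$ [[ "GFW" =~ G*W ]]
$ echo $?
0
[/shell]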

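2. With [, the right-hand operand of a binary operator (usually a variable) must not expand to nothing (null or unset) when it is unquoted. Quoting the operand solves the problem.

First, using [: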

[shell]
$ [ "GFW" == $fuck ]
bash: [: GFW: unary operator expected
[/shell]

This reports an error. With quotes there is no problem, and the variable is treated as an empty string:

[shell]
$ [ "GFW" == "$fuck" ]
$ echo $?
1
[/shell]

With [[ the problem does not exist; it works with or without quotes:

[shell]
$ [[ "GFW" == $fuck ]]
$ echo $?
1
$ [[ "GFW" == "$fuck" ]]
$ echo $?
1
[/shell]

There is also (( )).

((...))

          (( expression ))

The arithmetic expression is evaluated according to the rules described below (see Shell Arithmetic). If the value of the expression is non-zero, the return status is 0; otherwise the return status is 1. This is exactly equivalent to

          let "expression"

See Bash Builtins, for a full description of the let builtin.

Shell Arithmetic is described as follows:

The shell allows arithmetic expressions to be evaluated, as one of the shell expansions or by the let and the -i option to the declare builtins.

Evaluation is done in fixed-width integers with no check for overflow, though division by 0 is trapped and flagged as an error. The operators and their precedence, associativity, and values are the same as in the C language. The following list of operators is grouped into levels of equal-precedence operators. The levels are listed in order of decreasing precedence.

id++ id--: variable post-increment and post-decrement
++id --id: variable pre-increment and pre-decrement
- +: unary minus and plus
! ~: logical and bitwise negation
**: exponentiation
* / %: multiplication, division, remainder
+ -: addition, subtraction
<< >>: left and right bitwise shifts
<= >= < >: comparison
== !=: equality and inequality
&: bitwise AND
^: bitwise exclusive OR
|: bitwise OR
&&: logical AND
||: logical OR
expr ? expr : expr: conditional operator
= *= /= %= += -= <<= >>= &= ^= |=: assignment
expr1 , expr2: comma

Shell variables are allowed as operands; parameter expansion is performed before the expression is evaluated. Within an expression, shell variables may also be referenced by name without using the parameter expansion syntax. A shell variable that is null or unset evaluates to 0 when referenced by name without using the parameter expansion syntax. The value of a variable is evaluated as an arithmetic expression when it is referenced, or when a variable which has been given the integer attribute using ‘declare -i’ is assigned a value. A null value evaluates to 0. A shell variable need not have its integer attribute turned on to be used in an expression.

Constants with a leading 0 are interpreted as octal numbers. A leading ‘0x’ or ‘0X’ denotes hexadecimal. Otherwise, numbers take the form [base#]n, where base is a decimal number between 2 and 64 representing the arithmetic base, and n is a number in that base. If base# is omitted, then base 10 is used. The digits greater than 9 are represented by the lowercase letters, the uppercase letters, ‘@’, and ‘_’, in that order. If base is less than or equal to 36, lowercase and uppercase letters may be used interchangeably to represent numbers between 10 and 35.

Operators are evaluated in order of precedence. Sub-expressions in parentheses are evaluated first and may override the precedence rules above.

Sunday, June 5, 2011

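Building searchable, Web-based Google charts

I came across an introduction to Google Chart Tools on Poynter: How to make searchable, Web-based Google charts. Following that article, I tried out Google Chart Tools myself.

Visualizing large amounts of data usually requires specialist knowledge or a great deal of time, effort and resources. Google's Visualization API (web) makes the job simple, whether you are a designer, developer, Web producer or hobbyist.

Without further ado, here is an example. To get some hands-on experience I built my own example instead of reusing the one from the original article.

First, open the Google Visualization API homepage and select Bar Chart.

In the figure below, click the text "Google Visualization API playground".

A new page opens. The left side lists the APIs and code samples, the right side shows the code for the current sample, and below it is the corresponding bar chart. Edit the code, click "run code", and the bar chart changes accordingly.

The original code is as follows: [javascript]function drawVisualization() {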
// Create and populate the data table.
var data = new google.visualization.DataTable();
var raw_data = [['Austria', 1336060, 1538156, 1576579, 1600652, 1968113, 1901067],
['Bulgaria', 400361, 366849, 440514, 434552, 393032, 517206],
['Denmark', 1001582, 1119450, 993360, 1004163, 979198, 916965],
['Greece', 997974, 941795, 930593, 897127, 1080887, 1056036]];

var years = [2003, 2004, 2005, 2006, 2007, 2008];

data.addColumn('string', 'Year');
for (var i = 0; i < raw_data.length; ++i) {
data.addColumn('number', raw_data[i][0]);
}

data.addRows(years.length);

for (var j = 0; j < years.length; ++j) {
data.setValue(j, 0, years[j].toString());
}
for (var i = 0; i < raw_data.length; ++i) {
for (var j = 1; j < raw_data[i].length; ++j) {
data.setValue(j-1, i+1, raw_data[i][j]);
}
}

// Create and draw the visualization.
new google.visualization.BarChart(document.getElementById('visualization')).
draw(data,
{title:"Yearly Coffee Consumption by Country",
width:600, height:400,
vAxis: {title: "Year"},
hAxis: {title: "Cups"}}
);
}[/javascript]

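Usually only the following need to be changed:
1. The contents of var raw_data on line 4
2. The contents of var years on line 9
3. The Y-axis label on lines 11 and 32
4. The X-axis label on line 33
5. The chart title on line 30
6. For other detailed settings, see the Configuration Options section of the documentation

My modified code and the resulting chart are shown below: [javascript]function drawVisualization() {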
// Create and populate the data table.
var data = new google.visualization.DataTable();
var raw_data = [['econsh', 40000, 55381],
['mushi', 20000, 43816],
['wods', 30000, 5816]];

var years =['article num', 'time'];

data.addColumn('string', 'properties');
for (var i = 0; i < raw_data.length; ++i) {
data.addColumn('number', raw_data[i][0]);
}

data.addRows(years.length);

for (var j = 0; j < years.length; ++j) {
data.setValue(j, 0, years[j].toString());
}
for (var i = 0; i < raw_data.length; ++i) {
for (var j = 1; j < raw_data[i].length; ++j) {
data.setValue(j-1, i+1, raw_data[i][j]);
}
}

// Create and draw the visualization.
new google.visualization.BarChart(document.getElementById('visualization')).
draw(data,
{title:"SBBSert statictics",
width:600, height:400,
vAxis: {title: "properties"},
hAxis: {title: "nums"}}
);
}[/javascript]

Note: this bar chart apparently does not support Chinese. With Chinese text in it, it stops working properly. I have reported this to Google.

Wednesday, June 1, 2011

Machine Learning Demos

This is the work of Dr. Basilio Noris. The existing source code for machine-learning algorithms such as classification, clustering and regression is often not easy to use or understand, so he built an interactive GUI that combines several libraries and examples to visualize and compare these algorithms more effectively. The GUI supports Windows, Linux and Mac, so users can install the version that matches their machine and try it out. For detailed usage and an introduction, see Machine Learning Demos.

The interface looks like this:

The implemented methods are:

Classification: Support Vector Machine (SVM) (C, nu, Pegasos); Relevance Vector Machine (RVM); Gaussian Mixture Models (GMM); Multi-Layer Perceptron + BackPropagation; Gentle AdaBoost + Naive Bayes; Approximate K-Nearest Neighbors (KNN)
Regression: Support Vector Regression (SVR); Relevance Vector Regression (RVR); Gaussian Mixture Regression (GMR); MLP + BackProp; Approximate KNN; Sparse Optimized Gaussian Processes (SOGP); Locally Weighed Projection Regression (LWPR)
Dynamical Systems: GMM+GMR; LWPR; SVR; SEDS; SOGP (Slow!); MLP; KNN
Clustering: K-Means; Soft K-Means; Kernel K-Means; GMM; One Class SVM
Projections: Principal Component Analysis (PCA); Kernel PCA; Independent Component Analysis (ICA); Linear Discriminant Analysis (LDA); Fisher Linear Discriminant; EigenFaces to 2D (using PCA)

Friday, May 20, 2011

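A CSS builder for shadows, backgrounds, and borders

The importance of CSS to Web pages goes without saying, yet users who don't know CSS, and even less experienced Web developers, often struggle with CSS shadows, borders and the like, especially when the result has to look good and have its own style. Here is a great tool I recommend.

Layer Styles is an online CSS builder for shadows, backgrounds, and borders. The styles it supports include:

Drop shadow

Inner shadow

Background

Border

Border radius

Each style has several properties (for example, Background has Opacity, Gradient, and the gradient's style and angle). When a property changes, the div on the page updates in real time and the CSS code below it changes accordingly, so we can write CSS that suits us without being CSS experts. IE, of course, is a lost cause.

The screenshot below is something I doodled:

The corresponding CSS code:

[css]

border: 1px solid black;
border-radius: 13px;
background-image: -moz-linear-gradient(top, white, black);
background-image: -webkit-gradient(linear, center top, center bottom, from(white), to(black));
box-shadow: 0 1px 5px 6px rgba(14,232,50,0.75), inset 0 1px 1px 1px #ffb812;

[/css]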

Sunday, May 15, 2011

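Activated Google Storage

I just activated Google Storage (see here for how to activate it). A word of warning about one annoying catch: activating Google Storage requires turning on billing, as follows: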
Turn on billing.

Before you can use Google Storage, you need to enable billing for your project. To do so, click the Billing tab and enable billing. Enabling billing does not necessarily mean you will be charged. See Pricing and Terms for more information.

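I followed the steps and used Google Checkout, which showed a charge of USD 0. The list of billing countries had no option for China, so I chose Hong Kong and entered an address somewhere in Kowloon. When the transaction went through, I received a text message saying that 8.00 in local currency had been deducted. This appears to be a transaction fee rather than the price of the purchase, something like a handling charge, so it is not truly free. It may also have happened because my card is from mainland China while I selected Hong Kong as the transaction location.

Once activated, the 5 GB of space is ready to use. There are two ways to access it: Google Storage Manager and GSUtil. Since I had only just activated it, I briefly browsed the former. Its features basically cover creating buckets and folders, uploading folders and files, plus deleting and sharing; there is no code-viewing feature as I had imagined (since it is "storage for developers"). GSUtil is considerably more powerful and offers operations in the style of the Linux command line, such as "gsutil cp" and "gsutil cat".

The other features await further exploration.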

Wednesday, October 20, 2010

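Type conversion: converting between base and derived classes

For built-in types, conversions are fairly obvious and I have dealt with them a lot. For user-defined types, and especially between base and derived classes, I was still fuzzy about which conversions are possible. I went through the book, tried things out myself, and summarized the results below (corrections are welcome if anything is wrong):
1. Converting a derived class to a base class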
#include <iostream>
using namespace std;
class A {
    public:
        void display()
        {
            cout << "in A" << endl;
        }
};

class B : public A {
    public:
        void display()
        {
            cout << "in B" << endl;
        }
};

int main(int argc, char *argv[])
{
    A a;
    a.display();          // prints "in A"
    B b;
    // a = b;             implicit conversion
    // two old-style casts:
    // a = A(b);          function-style cast
    // a = (A)b;          C-style cast
    // recommended:
    b.display();          // prints "in B"
    a = static_cast<A>(b);
    a.display();          // prints "in A"
}
Of course, once converted, only the A part of b is copied into a, so the result behaves as an A.
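The copy made by the conversion above keeps only the base-class part of the object, which is commonly called slicing. The following is a minimal sketch of how that differs from referring to the same object through a base reference. The classes Base and Derived, the virtual name() method and both helper functions are hypothetical and are not part of the original example.

```cpp
#include <cassert>
#include <string>

// Hypothetical classes for illustration. name() is virtual so that the
// difference between a sliced copy and a reference becomes visible.
struct Base {
    virtual ~Base() = default;
    virtual std::string name() const { return "Base"; }
};

struct Derived : Base {
    std::string name() const override { return "Derived"; }
};

// Takes its argument by value: a Derived passed here is copied into a Base,
// so only the Base part survives.
inline std::string name_by_value(Base b) { return b.name(); }

// Takes its argument by reference: the object keeps its dynamic type.
inline std::string name_by_reference(const Base& b) { return b.name(); }
```

Calling name_by_value with a Derived object reports the base class, while name_by_reference reports the derived class, which is why conversions by value are usually reserved for cases where the base part is all that is wanted.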

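2. Converting a base class to a derived class
If you follow the approach above and try to convert a to b, none of the 4 methods work. So under what circumstances can a base class be converted to a derived class?
I consulted the explanation of the dynamic_cast operator in C++ Primer:
The dynamic_cast operator can be used to convert a reference or pointer to an object of base type into a reference or pointer to another type in the same hierarchy. The pointer used with dynamic_cast must be valid: it must be 0 or point to an object.
Note: dynamic_cast involves a run-time type check. If the object bound to the reference or pointer is not an object of the target type, the dynamic_cast fails (as I understand it, the cast works only when the object actually pointed to is of the target type and merely the pointer is of base type). If a dynamic_cast to a pointer type fails, the result is 0; if a dynamic_cast to a reference type fails, an exception of type bad_cast is thrown.
In addition, the base class must have at least one virtual function (I believe this is because of the run-time type check, similar to polymorphism).

Examples:

2.1 The target type does not match the run-time type: the result of dynamic_cast is 0

#include <iostream>
using namespace std;

class A {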
    public:
    virtual void test()
    {
        cout << "A" << endl;
    }
};

class B : public A {
    public:
    void test()
    {
        cout << "B" << endl;
    }
};

int main(int argc, char *argv[])
{
    A *a = new A();
    B *b = dynamic_cast<B*>(a); // can't
    a->test();
    if (b != NULL ) {
        b->test();
    }
}

output:

A

2.2 No virtual function: error

#include <iostream>
using namespace std;

class A {
    public:
    void test()
    {
        cout << "A" << endl;
    }
};

class B : public A {
    public:
    void test()
    {
        cout << "B" << endl;
    }
};

int main(int argc, char *argv[])
{
    A *a = new B();
    B *b = dynamic_cast<B*>(a); // does not compile: A has no virtual function
    a->test();
    if (b != NULL ) {
        b->test();
    }
}

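compiler error: cannot dynamic_cast ‘a’ (of type ‘class A*’) to type ‘class B*’ (source type is not polymorphic)

If the target type matches the run-time type and the base class has a virtual function, the cast succeeds.

For example:

#include <iostream>
using namespace std;

class A {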
    public:
    virtual void test()
    {
        cout << "A" << endl;
    }
};

class B : public A {
    public:
    void test()
    {
        cout << "B" << endl;
    }
};

int main(int argc, char *argv[])
{
    A *a = new B();
    B *b = dynamic_cast<B*>(a); // succeeds: a actually points to a B
    a->test();
    if (b != NULL ) {
        b->test();
    }
}

output:

B
B
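The examples above use the pointer form only. As a rough sketch of the reference form mentioned in the C++ Primer excerpt, the block below puts the two side by side. Shape, Circle, Square, is_circle and radius_or_zero are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <typeinfo>  // std::bad_cast

// Hypothetical polymorphic hierarchy: the virtual destructor is what makes
// dynamic_cast usable on Shape pointers and references.
struct Shape {
    virtual ~Shape() = default;
};
struct Circle : Shape {
    double radius = 1.0;
};
struct Square : Shape {
    double side = 1.0;
};

// Pointer form: a failed cast (or a null input) yields a null pointer.
inline bool is_circle(Shape* s) {
    return dynamic_cast<Circle*>(s) != nullptr;
}

// Reference form: there is no null reference, so a failed cast throws
// std::bad_cast instead.
inline double radius_or_zero(Shape& s) {
    try {
        return dynamic_cast<Circle&>(s).radius;
    } catch (const std::bad_cast&) {
        return 0.0;
    }
}
```

The pointer form tends to suit "is it this type?" checks, while the reference form suits code that treats a mismatch as an error.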

Monday, October 18, 2010

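Expert C Programming: thinking about linking

This chapter is mainly about how linking works, including the compilation process, compiler options, dynamic linking, static linking and so on. It also warns against interpositioning (writing a function with the same name as a library function).
Here is a diagram illustrating the components of a compiler:

Static linking: a copy of the library is physically part of the executable. Static libraries end in .a
    Dynamic linking: the executable only contains the file name, so that the loader can find the libraries the program needs at run time, i.e. just-in-time (JIT) linking. Dynamic libraries end in .so
    Intuitively, a statically linked executable is larger than a dynamically linked one.
    The advantages of dynamic linking are smaller executables, libraries that can be shared, and easier library upgrades
    For more on linking, see also how Makefiles are written; the write-up on chinaunix is a very enjoyable read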

Wednesday, October 13, 2010

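Questions about C-style strings

I have been reading C++ Primer again recently and am getting much more out of it than when I read it as an undergraduate; back then I was really just skimming.
When I got to C-style strings, a few questions came up:
Program 1:

#include <iostream>
#include <cstring>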

using namespace std;
int main(int argc, char *argv[])
{
    const char ca[] = {'h', 'e', 'l', 'l', 'o'};
    cout << strlen(ca) << endl;
    int i = 0;
    while (ca[i] != '\0') {
        cout << ca[i++] << endl;
    }
    cout << strlen(ca) << endl;
}

Result with MinGW on Windows XP:
D:\c_c_plus>a.exe
5
h
e
l
l
o

6

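I was baffled that the length of ca changed (5, then 6). I tried several times with the same result and could not work it out, so I ran the same program on Linux, with this result: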

sulong@sulong-desktop:~/Documents/c_c_plus$ ./a.out 
8
h
e
l
l
o

6

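Here the length of ca changed again (8, then 6). My first reaction was that strlen somehow does not work on ca. After reading the book carefully, I worked out the cause:

the standard library functions such as strlen take C-style strings as arguments (C++ Primer says that the pointer passed to these functions must be nonzero and must point to the first element of a null-terminated character array);
a C-style string must end with a null character: char ca1[] = {'1', '2'} is not one, while char ca2[] = {'1', '2', '\0'} is. String literals are also instances of C-style strings;
strlen always assumes that its argument is null-terminated. When it is called, it scans forward from the memory the argument points to until it hits a null character, and returns the number of characters in that stretch of memory;
when the argument is not a C-style string, this value is unpredictable;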
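When the size of the array is known at compile time, the scan can be limited to the array itself so that it never runs past the end. The block below is a minimal sketch of that idea. bounded_length is a hypothetical helper, not a standard library function; it relies only on std::memchr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns the string length if a '\0' is found within
// the N bytes of the array, or N if the array is not null-terminated.
// It never reads outside the array, unlike strlen on an unterminated array.
template <std::size_t N>
std::size_t bounded_length(const char (&arr)[N]) {
    const void* end = std::memchr(arr, '\0', N);
    if (end == nullptr) {
        return N;  // no terminator inside the array
    }
    return static_cast<std::size_t>(static_cast<const char*>(end) - arr);
}
```

For the unterminated array from program 1 this returns 5 on every platform, because the search stops at the array boundary instead of wandering into neighbouring memory.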

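Once I understood the cause, I modified the program. Program 2 and its result are below (the result is correct on both Windows and Linux; only the Linux output is listed here):

#include <iostream>
#include <cstring>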

using namespace std;
int main(int argc, char *argv[])
{
    const char ca[] = {'h', 'e', 'l', 'l', 'o', '\0'};
    cout << strlen(ca) << endl;
    int i = 0;
    while (ca[i] != '\0') {
        cout << ca[i++] << endl;
    }
    cout << strlen(ca) << endl;
}

output:
5
h
e
l
l
o
5

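One thing I still don't quite understand: in the first program, why did the length of ca change after ca was dereferenced? Windows (5, 6), Linux (8, 6). My guess is that C++ allows computing the address one past the end of an array but does not allow dereferencing it, otherwise the result is undefined. Since ca is not null-terminated, every strlen call reads past the end of the array, so the two calls are not required to return the same value.
Besides strlen, I also tested other library functions from cstring. Their arguments must strictly follow the documented requirements and recommendations, otherwise the result is likewise undefined, as in program 3:

#include <iostream>
#include <cstring>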

using namespace std;

int main(int argc, char* argv[])
{
    const char *c1 = "hello";
    const char *c2 = "world";
    char pc[5 + 5 + 1];
    strncpy(pc, c1, 5);
    
    cout << pc << endl;
}

output:
hello6

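This happens because the 5 in strncpy(pc, c1, 5) is only enough to hold hello, and the null character needs space too. When using these functions, always remember to count the terminating null: the count has to be changed to 6 or more (without exceeding the size of pc, of course) for the output to be hello.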
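A common way to avoid this trap is to terminate the destination explicitly after the copy, independent of the count. The block below is a small sketch of that pattern. copy_terminated is a hypothetical wrapper, not part of the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical wrapper around std::strncpy: copies at most dst_size - 1
// characters of src into dst and always writes a terminating '\0'.
inline void copy_terminated(char* dst, std::size_t dst_size, const char* src) {
    if (dst_size == 0) {
        return;  // no room even for the terminator
    }
    std::strncpy(dst, src, dst_size - 1);  // may leave dst unterminated...
    dst[dst_size - 1] = '\0';              // ...so terminate it explicitly
}
```

With this wrapper the buffer from program 3 can be filled by passing sizeof pc as the size. A source that is too long is truncated instead of leaving the destination without a terminator.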

Two takeaways:
1. Use C-style strings sparingly; use string instead
2. Write plenty of small programs to test the details

Sunday, October 10, 2010

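Expert C Programming: the love-hate relationship between arrays and pointers

This post combines chapters 4, 9 and 10. It is mainly about how arrays and pointers are used, and when they are the same and when they differ.
Arrays and pointers are usually clear enough when each is used on its own, so only the easily confused points are mentioned here.

When arrays and pointers are the same
The C standard states the following:
Rule 1. An array name in an expression (as opposed to a declaration) is treated by the compiler as a pointer to the first element of the array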

int a[10], *p, i = 2;
p = a;
p[i];
p = a;
*(p + i);
p = a + i;
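*p;
The three forms above all access the same element, a[i], and are equivalent.
Note that there are a few exceptions, in which a reference to an array is not replaced by a pointer to its first element:
the array is the operand of sizeof: sizeof(array) is the size of the whole array, whereas sizeof(pointer) is the size of the pointer
& is used to take the address of the array
the array is a string (or wide-string) literal used as an initializer

Rule 2. A subscript is always equivalent to an offset from a pointer

The fundamental reason C rewrites array subscripts as pointer offsets is that pointers and offsets are the basic model used by the underlying hardware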
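Rules 1 and 2 and the three exceptions can be checked by the compiler itself. The block below is a minimal C++ sketch of such checks. element_count, same_element and array_exceptions are hypothetical helpers, and the array sizes are arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Number of elements of a real array. Passing a pointer does not compile,
// because a pointer carries no length information.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// Rules 1 and 2: p[i] and *(p + i) name the same element.
inline bool same_element(const int* p, std::size_t i) {
    return &p[i] == p + i && p[i] == *(p + i);
}

// The three exceptions, checked at compile time.
inline void array_exceptions() {
    int a[10] = {};
    int* p = a;        // ordinary expression: a decays to &a[0]
    char s[] = "hi";   // string literal initializer: s holds 'h', 'i', '\0'

    static_assert(sizeof(a) == 10 * sizeof(int), "sizeof sees the whole array");
    static_assert(sizeof(p) == sizeof(int*), "sizeof sees only the pointer");
    static_assert(std::is_same<decltype(&a), int (*)[10]>::value,
                  "&a is a pointer to the whole array");
    static_assert(sizeof(s) == 3, "the literal is copied into the array");
    (void)p;
    (void)s;
}
```

If any of the static_assert lines were wrong, the translation unit would fail to compile, so a successful build already confirms the exceptions listed above.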

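Rule 3. In the declaration of a function parameter, an array name is treated by the compiler as a pointer to the first element of the array

Treating array parameters as pointers is mainly a matter of efficiency: passing a whole array by value would be very expensive, whereas passing a pointer is not
The recommendation here is to declare such parameters as pointers

In all other cases, definitions and declarations must match: if something is defined as an array, it must be declared as an array in other files, and likewise for pointers

Also note that rewriting an array name into a pointer parameter is not applied recursively: an array of arrays is rewritten as a pointer to an array, not as a pointer to a pointer. What was originally an array is rewritten, while what was originally a pointer is left unchanged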

Other topics, such as multidimensional arrays and arrays of strings, are not covered here; see the book for the details :)
--
BlogSpot： http://xusulong.blogspot.com Twitter： http://twitter.com/econsh

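Expert C Programming: unscrambling declarations in C

This chapter explains how declarations are built, how to read them, and which declarations are illegal, including a detailed look at typedef

1. Which declarations are legal
a. A function cannot return a function, as in f()(), but it can return a pointer to a function, as in int (*fun())();
b. A function cannot return an array, as in f()[], but it can return a pointer to an array, as in int (*foo())[];
c. An array cannot hold functions, as in a[](), but it can hold pointers to functions, as in int (*a[])(), and it can hold other arrays, as in int a[][]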

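2. The precedence rule for C declarations

        A Read a declaration starting from its name, then continue in order of precedence;

        B Precedence, from high to low:

            B.1 parentheses grouping parts of the declaration together;

            B.2 the postfix operators: parentheses () indicating a function, and square brackets [] indicating an array;

            B.3 the prefix operator: the asterisk *, meaning "pointer to ...";

        C If a const and/or volatile keyword is next to a type specifier (such as int or long), it applies to the type specifier; otherwise it applies to the pointer asterisk to its immediate left.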

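Example: char * const *(*next)();

        A      next              : next is the name being declared

        B.1    (*next)           : next is a pointer to ...

        B.2    (*next)()         : next is a pointer to a function

        B.3    *(*next)()        : next is a pointer to a function that returns a pointer to ...

        C      char * const      : a const pointer to char

So char * const *(*next)(); means: next is a pointer to a function that returns a pointer to a const pointer to char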

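3. Parsing C declarations with a diagram

The book's diagram is another way to parse a C declaration, but the A/B/C method in section 2 is simpler and clearer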

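Expert C Programming: it's not a bug, it's a language feature

This chapter uses the parts of C that look like defects to point out topics that need extra care

1 Fall-through in switch. This one is straightforward: remember to put a break after each case, otherwise execution runs on into the following cases

2 Adjacent string literals are concatenated automatically, for example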

#include <stdio.h>

int main(int argc, char* argv[])
{
    printf("hello"
            "world \n");
}

This prints helloworld. With that in mind, look at the following example:

char * a[] = {
    "one",
    "two"
    "three"
};

Because the comma after "two" is missing, this becomes an array of two strings, "one" and "twothree"

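3 Precedence and operator overloading (for example, * is both multiplication and pointer dereference). Expressions do not always mean what seems natural, so the precedence rules need to be understood and mastered

4 Local variables are allocated on the stack, and their memory is reclaimed when the function returns. Returning a pointer to one can be avoided by using a global variable, using a static variable, allocating memory explicitly, or having the caller supply the memory (passing in a pointer to an already allocated buffer)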

5 lint should never have been separated out from the compiler. The main point is that code needs more checking

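Expert C Programming: C through the mists of time

This is the first part of the book, covering the history of C. A few points worth noting:

1 K&R C, named after Brian Kernighan and Dennis Ritchie

2 ANSI C. The book is quite funny here: ANSI C should really be called ISO C, since what ANSI adopted is the ISO C standard, but the name ANSI C was already in wide use before that.

3 Portable code: a strictly conforming program should

3.1 use only specified features (unspecified behavior is something that is correct but for which the standard does not say how it must be done, such as the order in which arguments are evaluated)

3.2 not exceed any implementation-defined limit

3.3 not produce output that depends on any implementation-defined, unspecified, or undefined feature

4 Read the ANSI C standard often. It describes the details very clearly, together with the constraints that apply

4.1 The book uses the incompatibility between a const char **p parameter and a char **a argument as an example of when an assignment is legal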

Friday, October 8, 2010

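First read of Expert C Programming

    I bought Expert C Programming just before the National Day holiday and like it a lot. I had plenty of chores back home over the break, but I still read it with relish and got through 10 chapters at a quick pass. I did not do the exercises; I will try them on the second pass, and having the overall picture first is no bad thing.
Expert C Programming is actually a rather old book, from 1994, but because ANSI C has not changed much since, and because the material is so representative and entertaining, it has kept selling well.
    The style is unlike dry, dogmatic textbooks: it is full of stories that make things click and often make you laugh. It pokes fun at Sun many times, and also touches on Apple and quite a few other well-known IT companies and people.
    For each topic, the book usually covers the following

States the point
Lays out the underlying principle, including why ANSI C set the rule that way, with a close analysis of the rule
Compares similar or easily confused topics
Examples
Diagrams
Programming challenges
Light relief: anecdotes about past bugs caused by the topic

Together these gradually deepen your understanding of a topic until you know why things are the way they are
Also, this book is not a detailed tutorial on C syntax and usage; it is a careful dissection of the key and difficult points, so it needs a little C background. I recommend The C Programming Language, the classic by Brian W. Kernighan and Dennis M. Ritchie, in the Chinese translation by Xu Baowen (徐宝文).
I will review the key points of each chapter in follow-up posts :)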

Friday, June 4, 2010

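Socrates' largest-ear-of-wheat model

Socrates' ear-of-wheat principle

I suddenly remembered a question I was once asked in an interview, the Socrates ear-of-wheat problem. Socrates says: walk through this wheat field and bring back the largest, most golden ear of wheat, but there is a rule: you may not turn back, and you may pick only once.

The solution I gave at the time resembled what Socrates' third disciple did. He divided the field into three parts. Walking the first third, he only looked without picking, sorting the ears into large, medium and small. In the second third he checked whether that classification held up. Then he picked a fine ear from among the large ones.

The problem came back to me over a meal just now. I searched online and got some hints: it is essentially the same as the on-line hiring problem in Introduction to Algorithms (section 5.4.4, pages 66-68 of the Chinese edition). To save effort I will just paste the English text: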

As a final example, we consider a variant of the hiring problem. Suppose now that we do not wish to interview all the candidates in order to find the best one. We also do not wish to hire and fire as we find better and better applicants. Instead, we are willing to settle for a candidate who is close to the best, in exchange for hiring exactly once. We must obey one company requirement: after each interview we must either immediately offer the position to the applicant or must tell them that they will not receive the job. What is the trade-off between minimizing the amount of interviewing and maximizing the quality of the candidate hired?

We can model this problem in the following way. After meeting an applicant, we are able to give each one a score; let score(i) denote the score given to the ith applicant, and assume that no two applicants receive the same score. After we have seen j applicants, we know which of the j has the highest score, but we do not know if any of the remaining n - j applicants will have a higher score. We decide to adopt the strategy of selecting a positive integer k < n, interviewing and then rejecting the first k applicants, and hiring the first applicant thereafter who has a higher score than all preceding applicants. If it turns out that the best-qualified applicant was among the first k interviewed, then we will hire the nth applicant. This strategy is formalized in the procedure ON-LINE-MAXIMUM(k, n), which appears below. Procedure ON-LINE-MAXIMUM returns the index of the candidate we wish to hire.

ON-LINE-MAXIMUM(k, n)
    bestscore ← -∞
    for i ← 1 to k
        do if score(i) > bestscore
              then bestscore ← score(i)
    for i ← k + 1 to n
        do if score(i) > bestscore
              then return i
    return n

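The book gives the proof, which is fairly long, so I will not reproduce it here. Interested readers can refer to http://mitpress.mit.edu/catalog/item/default.asp?tid=8570&ttype=2

Here is just the conclusion: if we run this strategy with k = n/e, we hire the best-qualified applicant with probability at least 1/e. This is a good strategy: for the classical form of the problem (known as the secretary problem), no strategy does better than a success probability of about 1/e as n grows.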

The Socrates ear-of-wheat model in real life, and finding its optimal solution

Socrates on love

Tuesday, May 25, 2010

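Installing and using PostgreSQL, and importing data from MusicBrainz

    My research needs MusicBrainz data. MusicBrainz provides it as an object-relational (PostgreSQL) database, whereas I need RDF data. It does provide a tutorial, although the tutorial is hard to follow and scattered across several pages. The first step is installing PostgreSQL, which takes only a few commands, and then you can start using it. The default PostgreSQL account is postgres, with no password; one can be set. The most basic commands
=> psql: a terminal-based front-end to PostgreSQL.
=> createuser: adds a new user to a PostgreSQL database cluster.
=> createdb: creates a new database.
psql database connects to the given database
psql -U musicbrainz_user musicbrainz_db connects to the musicbrainz_db database as user musicbrainz_user
createdb -O musicbrainz_user musicbrainz_db creates the musicbrainz_db database owned by user musicbrainz_user
    A quick man on these commands is enough to understand them. Sadly I forgot about the help at first and spent ages figuring them out: a lesson learned the hard way
Configuration files:
    Two very important configuration files control whether remote access is allowed and how permissions are set: /etc/postgres/postgresql.conf and pg_hba.conf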
Remote access:
In postgresql.conf, set listen_addresses to '*'.
In pg_hba.conf:
# TYPE DATABASE    USER        CIDR-ADDRESS          METHOD
host all all 0.0.0.0/0 md5
Other settings:
In pg_hba.conf, the entries to pay attention to are the following; for details see http://developer.postgresql.org/pgdocs/postgres/auth-pg-hba-conf.html
# Database administrative login by UNIX sockets
local   all      all    trust
# TYPE DATABASE    USER        CIDR-ADDRESS          METHOD
# "local" is for Unix domain socket connections only
local   all         all                 md5
# IPv4 local connections:
host    all         all         0.0.0.0/0         md5
# IPv6 local connections:
host    all         all         ::1/128               md5

With all this in place, I followed the MusicBrainz method and finally got it done
References:
http://defindit.com/readme_files/postgres_utilities.html
http://developer.postgresql.org/pgdocs/postgres/index.html

Tuesday, May 11, 2010

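[develop] First experience with Nutch: crawling a company intranet

Reposted from my javaeye blog: http://xusulong.javaeye.com/blog/663411

A while ago I was thinking about setting up a search engine. Writing one myself would cost too much: I have written a crawler before, but indexing and ranking would likely be far more trouble

nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine. It is an application that builds search on top of Lucene.

Having settled on nutch, I started learning to use it. My English is not yet good enough, so I could only skim nutch's simple tutorial. For a real tutorial I chose a Chinese one; once the first search is running I can move on to the English documentation for a deeper understanding.

The tutorial I chose is "nutch入门学习" (Getting started with nutch)

Preparation:

My system is Ubuntu 9.10, with Java 1.6.0_20-b02 (as reported by java -version), nutch 1.0, and tomcat 6.0.26

Anyone who has done Java and web development usually has the JDK and tomcat installed already, so I will not go into that. A few points need attention
1. Add JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.20 to tomcat's bin/catalina.sh. This one hurt me badly. At first I had not set it, and running bin/nutch crawl kept reporting JAVA_HOME is not set. I was sure I had set the Java environment variables, and java -version worked fine. I googled extensively and tried every place where JAVA_HOME can be set, to no avail. In the end I found, in some corner, that JAVA_HOME can be added to this file, and after that it ran. What I still do not understand is why: the nutch crawler should not depend on tomcat, which is only used for searching. I have not figured this out.
With tomcat and the JDK sorted, next comes nutch. I put nutch directly in a nutch directory under my home directory, then copied its nutch.war into tomcat's webapps directory to replace ROOT (unpack it and rename the directory)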

Configuring nutch:

Following "nutch入门学习", I describe the places I changed.

Add the pages to crawl (using www.163.com as the example)
1. [root@localhost nutch]#mkdir urls
2. [root@localhost nutch]#echo http://www.163.com/>>urls/163
3. Enter http://news.163.com/ in the 163 file
Edit conf/crawl-urlfilter.txt to specify which URLs to crawl.
[root@localhost nutch]#vi conf/crawl-urlfilter.txt
Replace MY.DOMAIN.NAME so that it reads:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*163.com/
Edit conf/nutch-site.xml, add the agent properties, and fill in their values
```
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your
  organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>
<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent
  name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
```
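"nutch入门学习" says it does not matter if this is left unmodified: these settings exist because nutch follows the robots protocol and sends information about itself to the crawled site with its requests so that it can be identified. With the values left empty, however, I got an error saying that http.agent.name must be set. Setting its value to xusulong* (note the *) was enough; the others can be left unset.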

Configuring tomcat:

Set the search directory
(needed because the default segment path does not match our actual path)
[root@localhost nutch]#cd ~/tomcat
[root@localhost tomcat]#vi webapps/ROOT/WEB-INF/classes/nutch-site.xml
Add four lines so that the file reads
```
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/home/whu/nutch/crawl.demo</value>
  </property>
</configuration>
```
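Here /home/whu/nutch/crawl.demo is under my nutch directory: the crawler's data will be placed in crawl.demo, which the program creates, i.e. the directory where nutch stores the pages it fetches.
nutch's support for Chinese is still incomplete, so conf/server.xml in the tomcat folder needs to be modified
[root@localhost tomcat]#vi conf/server.xml
Add two attributes (URIEncoding and useBodyEncodingForURI) so that the Connector reads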
<Connector port="8080"
maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
enableLookups="false" redirectPort="8443" acceptCount="100"
connectionTimeout="20000" disableUploadTimeout="true"
URIEncoding="UTF-8" useBodyEncodingForURI="true" />

Crawling:

whu@leopard:~/nutch$ bin/nutch crawl urls -dir crawl.demo -depth 2 -threads 4 -topN 5 >& crawl.log

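The parameters are explained in "nutch入门学习", and also on the official nutch website. Only a small number of pages are crawled here.

crawl.log now records the crawl. I ran into the following errors along the way:

http.agent.name must be set
Input path does not exist: trying the path a few times sorts this out, as long as crawl.demo here matches the path configured for tomcat. Remember to delete the directory left by a failed run, otherwise the next run fails again.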

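Testing the result:

Start tomcat, open the home page, and search for 网易 (NetEase) to see the results.

It took me a whole afternoon and evening, nearly to the point of tears. There were other errors along the way that I no longer remember clearly, but the serious ones are listed above. Read carefully how the system reports an error, google it, and track the error down carefully: that is the way to go.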