Crawl and Index... Nutch / Elasticsearch – Partners in the making


Hi

In the internet era, there is an old tech saying: “Content is King” (inspired by the old jungle sayings from Phantom.. 🙂)

One of the common challenges in a content management system is extracting the latest information.  In the WWW world, this is commonly known as crawling.  The king of the crawler world is Apache Nutch.

Elasticsearch (no longer just the new kid in town) has already established itself as one of the top search platforms.  It is only natural that companies are looking at using both platforms together to build a better content management system, specifically for the acquire, analyze, publish and search phases.

Here’s a quick and dirty guide to get them up and running.

1. Download Nutch
2. Set NUTCH_HOME
NUTCH_HOME=/Users/madheshr/tools/apache-nutch-2.2.1
export NUTCH_HOME
3. Clean build
ant clean
ant
4. Verify that a new local deploy was created under NUTCH_HOME/runtime
/Users/madheshr/tools/apache-nutch-2.2.1/runtime/local
5. Under the bin subdirectory of local, create a new directory called urls
6. In urls, create a new file called nutch.txt. Edit the file to add the URLs to crawl, one per line
7. Enable the crawler in conf/nutch-site.xml by adding the lines below within the configuration tags
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
8. Note the value and enter the same in conf/nutch-default.xml as the
value for <name>http.agent.name</name>
9. Test by running the command below in local/bin

./nutch crawl urls -dir /tmp -depth 2
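
Steps 5 and 6 are easy to script. Here is a minimal shell sketch, assuming the paths used in this post. make_seed_list is a made-up helper name (not a Nutch command) and the URL in the usage comment is only a placeholder.

```shell
#!/bin/sh
# Sketch: create the seed list described in steps 5 and 6.
# make_seed_list is a hypothetical helper, not part of Nutch.
make_seed_list() {
    # Need a target directory plus at least one URL
    [ "$#" -ge 2 ] || { echo "usage: make_seed_list DIR URL..." >&2; return 1; }
    seed_dir="$1"
    shift
    mkdir -p "$seed_dir" || return 1
    # Write one URL per line into nutch.txt
    printf '%s\n' "$@" > "$seed_dir/nutch.txt"
}

# Example usage (paths follow the ones used in this post):
# make_seed_list "$NUTCH_HOME/runtime/local/bin/urls" "http://nutch.apache.org/"
```

After that, the crawl command from step 9 can be run from local/bin as before.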

Integrate Nutch and ES

1. Activate elasticsearch indexer plugin
Edit conf/nutch-site.xml

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
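
To see what that expression lets through, here is a rough shell sketch. It assumes Nutch treats plugin.includes as a regular expression that has to match the whole plugin directory name, which is how the description above reads. grep -Ex is only a stand-in for Nutch's own Java regex matching, and plugin_included is a made-up helper name.

```shell
#!/bin/sh
# Same value as plugin.includes in conf/nutch-site.xml
PLUGIN_INCLUDES='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)'

# Succeeds when the whole plugin name matches the expression
# (hypothetical helper, approximating Nutch's matching with grep).
plugin_included() {
    printf '%s\n' "$1" | grep -Eqx -- "$PLUGIN_INCLUDES"
}

# Try a few plugin directory names against the expression
for plugin in protocol-http parse-tika indexer-elastic indexer-solr protocol-httpclient; do
    if plugin_included "$plugin"; then
        echo "$plugin: included"
    else
        echo "$plugin: excluded"
    fi
done
```

With this value, indexer-solr and protocol-httpclient come out as excluded, so HTTPS crawling would need protocol-httpclient added to the expression, as the description says.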

2. Verify and add ES-specific properties to nutch-site.xml

<!-- Elasticsearch properties -->

<property>
<name>elastic.host</name>
<value>localhost</value>
<description>The hostname to send documents to using TransportClient. Either host
and port must be defined or cluster.</description>
</property>

<property>
<name>elastic.port</name>
<value>9300</value>
<description>The port to connect to using TransportClient.</description>
</property>

<property>
<name>elastic.cluster</name>
<value>elasticsearch</value>
<description>The cluster name to discover. Either host and port must be defined
or cluster.</description>
</property>

<property>
<name>elastic.index</name>
<value>nutch</value>
<description>Default index to send documents to.</description>
</property>

<property>
<name>elastic.max.bulk.docs</name>
<value>250</value>
<description>Maximum size of the bulk in number of documents.</description>
</property>

<property>
<name>elastic.max.bulk.size</name>
<value>2500500</value>
<description>Maximum size of the bulk in bytes.</description>
</property>
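
A misspelled property name is easy to miss, so a quick check can help. The sketch below only greps for the six property names listed above. missing_elastic_props is a made-up helper name, and the conf path in the usage comment assumes the runtime/local layout from earlier in this post.

```shell
#!/bin/sh
# Print the name of every expected elastic.* property that is
# absent from the given Nutch config file (hypothetical helper).
missing_elastic_props() {
    conf_file="$1"
    [ -r "$conf_file" ] || { echo "cannot read $conf_file" >&2; return 2; }
    for prop in elastic.host elastic.port elastic.cluster elastic.index \
                elastic.max.bulk.docs elastic.max.bulk.size; do
        # -F: match the name literally, dots included
        grep -Fq "<name>$prop</name>" "$conf_file" || echo "$prop"
    done
}

# Example usage:
# missing_elastic_props "$NUTCH_HOME/runtime/local/conf/nutch-site.xml"
```

No output means all six names were found. It does not validate the values themselves.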

3. Create a new index in ES if it is not there already. The index name must match the elastic.index value above (nutch)

curl -XPUT 'http://localhost:9200/nutch/'
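
The "if it is not there already" part can be scripted too. Here is a minimal sketch, assuming Elasticsearch answers over HTTP on localhost:9200 and that a HEAD request on the index URL returns 200 when the index exists and 404 when it does not. index_url and ensure_index are made-up helper names.

```shell
#!/bin/sh
# Build the REST URL for an index from host, HTTP port and index name.
index_url() {
    printf 'http://%s:%s/%s/\n' "$1" "$2" "$3"
}

# Create the index only when a HEAD request reports it missing
# (hypothetical helper; needs curl and a running Elasticsearch).
ensure_index() {
    url=$(index_url "$1" "$2" "$3")
    http_code=$(curl -s -o /dev/null -w '%{http_code}' --head "$url")
    case "$http_code" in
        200) echo "index already exists: $url" ;;
        404) curl -s -XPUT "$url" && echo ;;
        *)   echo "unexpected HTTP status $http_code from $url" >&2; return 1 ;;
    esac
}

# Example usage:
# ensure_index localhost 9200 nutch
```

Note that 9200 is the HTTP port used by curl, while the 9300 in elastic.port is the transport port used by Nutch.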

Java Code Quality


Hi

Most tech leads will readily admit that in large development projects, a major chunk of their effort goes towards ensuring good code quality.  As the number of developers grows, there is a greater need to standardize the code, which is enforced through adherence to agreed code quality standards.  As a programming language, Java is fortunate to have several coding conventions published by several companies, including Oracle (Sun).  However, every company, and even individual projects within a company, often supplements the general standard with its own set of custom guidelines, rules and conventions.

Just like everything else in life, it is a simple matter to define standards and guidelines.  However, it is an entirely different ball game to follow them.  For architects and tech leads, it is a question of ensuring adherence, so we are constantly on the lookout for efficient ways to accomplish this.  One of our favourite tools is Sonar, the Java static code analysis tool from SonarQube.

Here are some quick steps on how to get up and running with Sonar on a Mac system.  Hope you find it useful.

Set up Sonar

Here’s a great link that I found
http://docs.codehaus.org/display/SONAR/Installing

1. Download Sonar into some directory
For example, /Users/madheshr/tools/sonar-3.7

2. Create the sonar schema on MySQL

3. Edit sonar.properties in the conf directory and make the changes below

– Specify the DB parameters
– Web server host and port: the default is 127.0.0.1:9000

4. Create a startup script to start sonar
/Users/madheshr/tools/sonar-3.7/bin/macosx-universal-64/sonar.sh start &

Analyzing a project using sonar-runner

1. Download sonar-runner and extract it
/Users/madheshr/tools/sonar-runner-2.3

2. Edit conf/sonar-runner.properties to set the web server name and DB name

Note: The default script has mismatched sonar schema names.

3. In the project home, create a file named sonar-project.properties. Note that the file name is case-sensitive.
Also confirm the directory where the Java code starts; it may not be the top-level src directory itself.

# required metadata
sonar.projectKey=my:iReconAdmin
sonar.projectName=iReconAdmin
sonar.projectVersion=1.0

# optional description
sonar.projectDescription=Admin utility for iRecon

# path to source directories (required)
sonar.sources=src

# The value of the property must be the key of the language.
sonar.language=java

# Encoding of the source code
sonar.sourceEncoding=UTF-8
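
Before running the analysis, it can save time to sanity-check the file. The sketch below assumes sonar.sources names a single directory relative to the project home (as in the file above) and simply looks for the four required keys and at least one .java file. check_sonar_project is a made-up helper name, not part of Sonar.

```shell
#!/bin/sh
# Sanity-check sonar-project.properties in a project home
# (hypothetical helper; assumes a single sonar.sources directory).
check_sonar_project() {
    project_home="$1"
    props="$project_home/sonar-project.properties"
    [ -f "$props" ] || { echo "missing $props" >&2; return 1; }
    rc=0
    # Each required key must be present with a non-empty value
    for key in sonar.projectKey sonar.projectName sonar.projectVersion sonar.sources; do
        if ! grep -Eq "^$key=.+" "$props"; then
            echo "missing property: $key" >&2
            rc=1
        fi
    done
    # The source directory should actually contain Java code
    src=$(sed -n 's/^sonar\.sources=//p' "$props" | head -n 1)
    if [ -n "$src" ] && \
       [ -z "$(find "$project_home/$src" -type f -name '*.java' 2>/dev/null)" ]; then
        echo "no .java files under $project_home/$src" >&2
        rc=1
    fi
    return $rc
}

# Example usage, from the project home:
# check_sonar_project .
```

A non-zero exit status means something in the file probably needs another look.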

4. Run the analysis from the project home using the sonar-runner command, installed under
/Users/madheshr/tools/sonar-runner-2.3/

As always, all the mistakes are mine and all the credit goes to the open source community.

Cheers..