Web Crawling using Apache Nutch + MySQL + Solr
If you have come here, you probably know what Nutch is and are looking to test its integration with MySQL and Solr or Elasticsearch. So, let's get started.
I am not going to walk you through the steps myself; instead, here is the awesome link that helped me run a test crawl in a couple of hours. Technically, it should not have taken more than half an hour, but I spent extra time figuring out the issues described below on my Mac (El Capitan).
First, follow the steps in this link: https://anil.io/post/92/apache-nutch-2-2-mysql-and-solr-5-2-1-tutorial
While getting things to work, you may come across the following issues. I have provided a solution for each below.
1) Don't ignore this step [STEP 2 in the link above]
export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
export NUTCH_JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
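A quick sanity check after setting these variables can save a failed build later. The sketch below assumes a POSIX shell; `check_java_home` is a hypothetical helper name of my own, not something Nutch provides.

```shell
# Hypothetical helper: succeeds only if the given directory looks like a
# Java home, i.e. it contains an executable bin/java.
check_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Report whether each variable points at a usable Java install.
for var in JAVA_HOME NUTCH_JAVA_HOME; do
  eval "value=\${$var:-}"
  if check_java_home "$value"; then
    echo "$var OK: $value"
  else
    echo "$var is unset or does not contain bin/java"
  fi
done
```

If either variable is reported as unset, re-run the two export lines in the same terminal session you use to build and run Nutch.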
2) Issue when creating the database nutch in MySQL. [STEP 3.2 in the link above]
Install MySQL, log in as the root user, and create the database locally on your machine. When you attempt to create the DB, you will bump into one of the two errors below
The maximum column size is 767 bytes
Specified key was too long; max key length is 767 bytes
Please note that the DB itself is created when you run the query; the error comes from the table creation, not from the database creation. Next time you run it, remove the Create Database .... statement from the query and run only the part that creates the table until it succeeds. You may not see the new DB in MySQL Workbench right away, even though the error log shows it was created. Simply close MySQL Workbench and open it again.
To fix either of these issues
1) Right click on the nutch database
2) Click 'Alter Schema'
3) Update Default Collation to -> latin1 - default collation
Hopefully the DB and table are now created and ready for use.
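If you prefer the command line to MySQL Workbench, the same collation change can be sketched as a single SQL statement. This is a minimal sketch, assuming the schema is named nutch and that latin1's default collation is latin1_swedish_ci (the stock MySQL default); `collation_sql` is a hypothetical helper name.

```shell
# Hypothetical helper: print the SQL that switches a schema's default
# character set to latin1 with its default collation (latin1_swedish_ci).
collation_sql() {
  printf 'ALTER DATABASE `%s` CHARACTER SET latin1 COLLATE latin1_swedish_ci;\n' "$1"
}

# Print the statement for the nutch database.
collation_sql nutch

# To apply it, pipe the statement to the mysql client (it prompts for the password):
#   collation_sql nutch | mysql -u root -p
```

The likely reason this helps: the 767-byte limit applies to index keys, and a multi-byte character set such as utf8 needs up to 3 bytes per character, while latin1 needs only 1. The new default applies only to tables created afterwards, so run the table creation again once the schema has been altered.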
3) Updating gora.properties [STEP 3.4 in the link above]
Make sure you enter the correct username and password for localhost. Also, make sure the port number is correct. Mine was set to 3307, but the number in the example is 3306. You can find this in MySQL Workbench too.
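To make the port and credentials concrete, here is a rough sketch of the MySQL-related lines. The property names are the ones I recall from Nutch 2.x setups and should be checked against the gora.properties that ships with your version; `gora_mysql_props`, `nutch_user`, and `secret` are hypothetical placeholders.

```shell
# Hypothetical helper: print the MySQL-related gora.properties lines for a
# given host, port, database, user, and password.
# The gora.sqlstore.* names are an assumption; verify them in your own file.
gora_mysql_props() {
  host=$1
  port=$2
  db=$3
  user=$4
  pass=$5
  printf 'gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver\n'
  printf 'gora.sqlstore.jdbc.url=jdbc:mysql://%s:%s/%s\n' "$host" "$port" "$db"
  printf 'gora.sqlstore.jdbc.user=%s\n' "$user"
  printf 'gora.sqlstore.jdbc.password=%s\n' "$pass"
}

# Example with the non-default port from my setup and placeholder credentials.
gora_mysql_props localhost 3307 nutch nutch_user 'secret'
```

If you are unsure which port your server uses, running `SHOW VARIABLES LIKE 'port';` in a MySQL session reports it.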
4) Disable MySQL remote access.
If you are testing this on a locally installed MySQL, disable remote access as follows.
If /etc/my.cnf does not exist, create it with the command sudo touch /etc/my.cnf. Open it and add one of the following.
[mysqld]
bind-address=127.0.0.1
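To completely disable networking (keep in mind that Nutch reaches MySQL over JDBC, which connects via TCP/IP by default, so with this option the crawl itself will not be able to connect),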
[mysqld]
skip-networking
Restart MySQL after doing this. To restart on your respective OS follow this link
On El Capitan, I had to use the two commands below (copied from the link above) to stop and then start MySQL
sudo launchctl unload -F /Library/LaunchDaemons/com.oracle.oss.mysql.mysqld.plist
sudo launchctl load -F /Library/LaunchDaemons/com.oracle.oss.mysql.mysqld.plist
5) Permission denied error for local/log/hadoop.log
Just give the required permissions by executing the command
6) Finally, you may get a connection refused error.
This is due to connection issues with MySQL. In my case, it was the issues mentioned in step 3. I had also forgotten to recompile after editing the gora.properties file, which is done by running ant runtime.
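When chasing a connection refused error, it helps to confirm that something is actually listening on the port Nutch is configured to use. The sketch below assumes the JDBC URL lives under a `gora.sqlstore.jdbc.url` key (an assumption; check your own file) and uses a hypothetical helper named `jdbc_port`.

```shell
# Hypothetical helper: extract the port from the jdbc:mysql URL in a
# gora.properties file. Prints nothing if no matching line is found.
jdbc_port() {
  sed -n 's|^gora\.sqlstore\.jdbc\.url=jdbc:mysql://[^:/]*:\([0-9][0-9]*\).*|\1|p' "$1"
}

# Usage (run from the Nutch source directory; the conf/ path is an assumption):
#   port="$(jdbc_port conf/gora.properties)"
#   nc -z localhost "$port" && echo "MySQL reachable on $port" || echo "nothing listening on $port"
#   ant runtime   # rebuild so the runtime picks up the edited file
```

If nothing is listening on that port, compare it with the port MySQL really uses before suspecting anything else.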
That's it. Happy crawling.