A
Ajax
Aspect-Oriented
 
B
Bloggers
Build Systems
Business Intelligence
ByteCode
 
C
CMS
Cache Solutions
Charting & Reporting
Chat Servers
Code Analyzers
Code Beautifiers
Code Coverage
Collections
Command Line
Connection Pools
Crawlers
 
D
Databases
 
E
EJB Servers
ERP & CRM
ESB
Eclipse Plugins
Expression Languages
 
F
Financial Soft
Forum Soft
 
G
General Purpose
Geospatial
Groupware
 
H
HTML Parsers
 
I
IDEs
Installers
Inversion of Control
Issue Tracking
 
J
J2EE Frameworks
JDBC
JMS
JMX
JSP Tag Libraries
Job Schedulers
 
L
Localization
Logging Tools
 
M
Mail Clients
 
N
Network Clients
Network Servers
NoSQL Databases
 
O
Obfuscators
 
P
PDF Libraries
Parser Generators
Persistence
Portals
Profilers
Project Management
 
R
RSS & RDF Tools
Rule Engines
 
S
SQL Clients
Scripting Languages
Search Engines
Security
Source Control
Swing
 
T
Template Engines
Testing Tools
Text Processing
 
U
UML & Modeling
 
V
Validation
 
W
Web Frameworks
Web Mail
Web Servers
Web Services
Web Testing
Wiki Engines
Workflow Engines
 
X
XML Parsers
XML UI Toolkits
 

Open Source Crawlers in Java

Heritrix

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Go To Heritrix

WebSPHINX

WebSPHINX ( Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for Web crawlers that browse and process Web pages automatically.

Go To WebSPHINX

JSpider

A highly configurable and customizable Web Spider engine, Developed under the LGPL Open Source license, In 100% pure Java.

Go To JSpider

WebEater

A 100% pure Java program for web site retrieval and offline viewing.

Go To WebEater

Java Web Crawler

Java Web Crawler is a simple Web crawling utility written in Java. It supports the robots exclusion standard.

Go To Java Web Crawler

WebLech

WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as possible. WebLech is multithreaded and will feature a GUI console.

Go To WebLech

Arachnid

Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by sub-classing Arachnid and adding a few lines of code called after each page of a Web site is parsed.

Go To Arachnid

JoBo

JoBo is a simple program to download complete websites to your local computer. Internally it is basically a web spider. The main advantage to other download tools is that it can automatically fill out forms (e.g. for automated login) and also use cookies for session handling. Compared to other products the GUI seems to be very simple, but the internal features matters ! Do you know any download tool that allows it to login to a web server and download content if that server uses a web forms for login and cookies for session handling ? It also features very flexible rules to limit downloads by URL, size and/or MIME type.

Go To JoBo

Web-Harvest

Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.

Go To Web-Harvest

Crawler4j

Crawler4j is a Java library which provides a simple interface for crawling the web. Using it, you can setup a multi-threaded web crawler in 5 minutes! It is also very efficient, it has been able to download and parse 200 pages per second on a Quad core PC with cable connection.

Go To Crawler4j

Ex-Crawler

Ex-Crawler is divided into three subprojects. Ex-Crawler server daemon is a highly configurable, flexible (Web-) Crawler, including distributed grid / volunteer computing features written in Java. Crawled informations are stored in MySQL, MSSQL or PostgreSQL database. It supports plugins through multiple Plugin Interfaces. It comes with it's own socket server, where you can configure it, add urls and much more. Including user accounts and user levels, which are shared with the webfrontend search engine. With the Ex-Crawler distributed crawling graphical client, other people / computers can crawl and analyse websites, images and more for the crawler. The third part of the project is the webfrontend search engine.

Go To Ex-Crawler

Bixo

Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications that are optimized for a particular use case.

Go To Bixo






Java is a trademark or registered trademark of Sun Microsystems, Inc. in the United States and other countries. This site is independent of Sun Microsystems, Inc.