Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications that are optimized for a particular use case.
DEiXTo is a powerful web data extraction tool that is based on the W3C Document Object Model (DOM). It allows users to create highly accurate extraction rules (wrappers) that describe what pieces of data to scrape from a website. DEiXTo consists of three separate components:
- GUI DEiXTo, an MS Windows application with a friendly graphical user interface for managing extraction rules (build, test, fine-tune, save and modify). This is all that you need for small-scale extraction tasks.
- DEiXToBot, a Perl module implementing a flexible and efficient Mechanize agent (essentially a browser emulator) capable of extracting data of interest using patterns generated by GUI DEiXTo. It builds on best-of-breed Perl technology and allows extensive customization, which facilitates tailor-made solutions.
- DEiXTo CLE (Command Line Executor), a stand-alone, cross-platform utility based on DEiXToBot that can apply an extraction rule to many target pages in bulk and produce structured output in a variety of formats.
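To make the idea of a DOM-based extraction rule more concrete, here is a minimal sketch in Python using only the standard library. It does not use DEiXTo's actual rule format or API. The sample page, the `RULE` dictionary and the `apply_rule` helper are all hypothetical, and the markup is assumed to be well-formed so that `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; kept well-formed so the XML parser can build a tree.
PAGE = """
<html><body>
  <div class="item"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="item"><h2>Gadget</h2><span class="price">19.50</span></div>
  <div class="ad"><h2>Sponsored</h2></div>
</body></html>
"""

# Hypothetical rule (not DEiXTo's format): which nodes are records,
# and which child nodes hold each field of a record.
RULE = {
    "record": ".//div[@class='item']",
    "fields": {"name": "h2", "price": "span[@class='price']"},
}


def apply_rule(markup, rule):
    """Return one dict per node in the tree that matches the record pattern."""
    root = ET.fromstring(markup.strip())
    records = []
    for node in root.findall(rule["record"]):
        record = {}
        for field, path in rule["fields"].items():
            match = node.find(path)
            has_text = match is not None and match.text
            record[field] = match.text.strip() if has_text else None
        records.append(record)
    return records


rows = apply_rule(PAGE, RULE)
for row in rows:
    print(row)
```

The rule is kept as data, separate from the code that applies it, so the same `apply_rule` function could in principle be run over many pages. That mirrors the split between building a rule and executing it in bulk described above.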
Pattern is a web mining module for the Python programming language. It has tools for data mining (Google, Twitter and Wikipedia APIs, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization.
Scrapy is an open source, collaborative framework for extracting the data you need from websites. It is written in Python and runs on Linux, Windows, Mac and BSD. Users write the rules to extract the data and let Scrapy do the rest.