Research papers and articles on article extraction from HTML pages
- Boilerplate Detection using Shallow Text Features
- Extracting Article Text from the Web with Maximum Subsequence Segmentation
- Text Extraction from the Web via Text-to-Tag Ratio
- Web Content Extraction Through Histogram Clustering (another version)
- VIPS: a Vision-based Page Segmentation Algorithm
- Automatic Web News Extraction Using Tree Edit Distance: This algorithm uses a tree comparison metric analogous to Levenshtein distance to detect relevant content in a set of HTML documents.
- Discovering Informative Content Blocks from Web Documents: This paper employs entropy as a threshold metric to predict informative blocks of content.
- Web Page Cleaning with Conditional Random Fields: This paper presents an algorithm, reported as the best performing, that uses conditional random fields (CRF) to label blocks of content as text or noise based on block-level features.
- Hierarchical wrapper induction for semistructured information sources
- Template detection for large scale search engines
- Web Page Cleaning for Web Mining through Feature Weighting
- Eliminating noisy information in Web pages for data mining
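The tree edit distance entry above describes its metric as analogous to Levenshtein distance. As a rough illustration of that analogy only, here is a minimal sketch of the classic sequence version applied to flat lists of tag names. It is not the algorithm from the paper, which compares trees rather than sequences, and the two tag sequences are made-up examples.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y, or match
        prev = curr
    return prev[-1]


# Hypothetical tag sequences from two pages that share a template.
page_a = ["html", "body", "div", "h1", "p", "p"]
page_b = ["html", "body", "div", "h1", "p", "ul"]

print(levenshtein(page_a, page_b))       # 1
print(levenshtein("kitten", "sitting"))  # 3
```

A small distance between two pages suggests shared template structure; a tree version applies the same insert, delete and substitute operations to nodes instead of sequence elements.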
Some good blog articles:
The Easy Way to Extract Useful Text from Arbitrary HTML. The author uses examples written in Python to apply a technique fairly similar to the one described in the text-to-tag ratio paper listed above. The original link is dead; here is a copy: http://www.cnblogs.com/loveyakamoz/archive/2011/08/18/2143965.html
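To give a feel for the text-to-tag ratio idea, here is a minimal sketch that scores each line of HTML by the number of text characters per tag and keeps the high-scoring lines. It is not the code from the blog post or the paper: the regex-based tag matching, the `threshold` value, and the function names are assumptions chosen for illustration, and real pages would need a proper parser and some smoothing across neighbouring lines.

```python
import re

# Naive tag matcher; an assumption for this sketch, not a full HTML parser.
TAG = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html):
    """Return (line, ratio) pairs, one per non-empty line of html."""
    results = []
    for line in html.splitlines():
        if not line.strip():
            continue
        tags = len(TAG.findall(line))
        text_chars = len(TAG.sub("", line).strip())
        # With no tags on the line, use the text length itself as the ratio.
        ratio = text_chars / tags if tags else float(text_chars)
        results.append((line, ratio))
    return results


def extract_text(html, threshold=10.0):
    """Keep the tag-stripped text of lines whose ratio meets the threshold."""
    kept = [TAG.sub("", line).strip()
            for line, ratio in text_to_tag_ratios(html)
            if ratio >= threshold]
    return "\n".join(kept)


# Made-up sample page: navigation, one article paragraph, a share bar.
sample = """<div><a href="/">Home</a> <a href="/about">About</a></div>
<p>Article extraction keeps the main text of a page and drops the navigation.</p>
<div><span>Share</span><span>Print</span></div>"""

print(extract_text(sample))
```

On the sample, the navigation and share lines have many tags and little text, so only the article paragraph survives the threshold.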
Software for article extraction from HTML pages
- Boilerpipe library: an open source Java library. It is the official implementation of the overall algorithm presented in the paper by Kohlschütter et al. listed above (Boilerplate Detection using Shallow Text Features).
- Readability bookmarklet by arc90labs, which is open source. Originally written in JavaScript, it has also been ported to other languages:
  - python-readability, using BeautifulSoup (slow)
  - a fork of python-readability that uses lxml for faster parsing
  - ruby-readability
  - PHP port
  - jReadability
  - C# port
- Project Goose by Gravity labs
- Perl module HTML::Feature
- Webstemmer is a web crawler and page layout analyzer with a text extraction utility
- Demo of VIPS packaged in a .dll (its use is limited to research purposes)
The boilerpipe code is here:
http://code.google.com/p/ boilerp…
It has also been integrated into Apache Tika.
Demo Web Service: http://boilerpipe-web.appspot.com/
Java library: http://code.google.com/p/ boilerp…
Research presentation (WSDM 2010): http://videolectures.net/ wsdm201…