Technology Knowledge: November 2016

Research papers and articles on article extraction from HTML pages

Automatic Web News Extraction Using Tree Edit Distance: This algorithm uses a tree comparison metric analogous to Levenshtein distance to detect relevant content in a set of HTML documents.
Discovering Informative Content Blocks from Web Documents: This approach uses entropy as a threshold metric to predict informative blocks of content.
Web Page Cleaning with Conditional Random Fields: This paper presents the best-performing algorithm, which uses conditional random fields (CRFs) to label blocks of content as text or noise based on block-level features.

Some good blog articles:

The Easy Way to Extract Useful Text from Arbitrary HTML. The author uses Python examples to apply a technique fairly similar to the one described in the text-to-tag ratio paper. The original link is dead; here is a copy: http://www.cnblogs.com/loveyakamoz/archive/2011/08/18/2143965.html

Software for article extraction from HTML pages

Boilerpipe library: an open source Java library. The library itself is the official implementation of the overall algorithm presented in the previously mentioned paper by Kohlschütter et al.
Readability bookmarklet by arc90labs is open source. Originally written in JavaScript, it has also been ported to other languages:
- python-readability, using BeautifulSoup (slow)
- fork of python-readability employing lxml for faster parsing
- ruby-readability
- PHP port
- jReadability
- C# port
Project Goose by Gravity labs
Perl module HTML::Feature
Webstemmer is a web crawler and page layout analyzer with a text extraction utility
Demo of VIPS packaged in a .dll (its use is limited to research purposes only)

The Boilerpipe code is here:

http://code.google.com/p/boilerp…
It has also been integrated into Apache Tika.

Demo Web Service: http://boilerpipe-web.appspot.com/
Java library: http://code.google.com/p/boilerp…
Research presentation (WSDM 2010): http://videolectures.net/wsdm201…

Extracting Article Text from the Web with Maximum Subsequence Segmentation

This algorithm (MSS) transforms the problem of detecting article text in HTML documents into a maximum subsequence optimization problem. A local token-level classifier outputs a sequence of scores, one per token, and maximum subsequence optimization is applied to that sequence. The start and end indexes of the resulting subsequence select the tokens that make up the extracted text.

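Given a sequence of numbers, the task is to find a contiguous subsequence whose elements have the maximal sum. This is trivial if all elements of the sequence are non-negative, because the whole sequence is then the answer; the problem only becomes interesting when positive and negative values are mixed. In other words: given scores s_1, ..., s_n, find indexes a <= b that maximize s_a + s_(a+1) + ... + s_b.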
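The maximum subsequence problem described above can be solved in a single pass. The sketch below uses Kadane's algorithm, a standard technique for this problem; it is not taken from the MSS paper, and the function name `max_subsequence` is only illustrative.

```python
from typing import Sequence, Tuple


def max_subsequence(scores: Sequence[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) of the contiguous run with the largest sum.

    `end` is exclusive, so the winning run is scores[start:end].
    """
    if not scores:
        raise ValueError("scores must not be empty")

    best_start, best_end, best_sum = 0, 1, scores[0]
    run_start, run_sum = 0, scores[0]

    for i in range(1, len(scores)):
        if run_sum < 0:
            # A negative running total can only hurt, so start a fresh run at i.
            run_start, run_sum = i, scores[i]
        else:
            # Otherwise extend the current run with the element at i.
            run_sum += scores[i]
        if run_sum > best_sum:
            best_start, best_end, best_sum = run_start, i + 1, run_sum

    return best_start, best_end, best_sum


print(max_subsequence([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # (3, 7, 6)
```

With all non-negative input the function simply returns the whole sequence, which matches the observation that this case is easy.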

In MSS, a document is tokenized using the following steps:

- discard everything inside <script> and <style> elements
- break up the HTML into a list of tags, words and numbers
- apply Porter stemming to all words
- generalize numeric tokens
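A rough sketch of these tokenization steps, using only regular expressions. Several details are assumptions rather than the paper's exact rules: tags are reduced to their names, every number is replaced by the hypothetical placeholder `#NUM`, and `str.lower` stands in for a real Porter stemmer (a stemmer such as `nltk.stem.PorterStemmer().stem` could be passed as the `stem` argument instead).

```python
import re
from typing import Callable, List

# Matches <script>...</script> and <style>...</style> including their contents.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# A token is a whole tag, a number, or a word.
_TOKEN = re.compile(r"<[^>]+>|\d+(?:[.,]\d+)*|[A-Za-z]+")
# Captures the optional closing slash and the tag name.
_TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")

NUM = "#NUM"  # hypothetical placeholder for any numeric token


def tokenize(html: str, stem: Callable[[str], str] = str.lower) -> List[str]:
    """Turn raw HTML into a flat list of tag, word and number tokens."""
    cleaned = _SCRIPT_STYLE.sub(" ", html)
    tokens: List[str] = []
    for match in _TOKEN.finditer(cleaned):
        text = match.group(0)
        if text.startswith("<"):
            tag = _TAG_NAME.match(text)
            if tag:  # comments and doctype declarations are skipped
                tokens.append(f"<{tag.group(1)}{tag.group(2).lower()}>")
        elif text[0].isdigit():
            tokens.append(NUM)
        else:
            tokens.append(stem(text))
    return tokens


print(tokenize("<p>Price: 42 dollars</p><script>var x = 1;</script>"))
# ['<p>', 'price', '#NUM', 'dollars', '</p>']
```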

To apply maximum subsequence optimization, local token-level classifiers are used to compute a score for each token of the document. A negative score indicates that the token is unlikely to be content, and vice versa for tokens with positive scores. In the experiments, a Naive Bayes model is trained with two types of features for every labeled token in the document:

- trigram of the token: the token itself and its two successors
- parent tag of the token in the DOM tree (this can easily be implemented by maintaining a stack of tags while passing through the token array)
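The two features above could be computed in one pass over the token array, as in the sketch below. It assumes tokens where tags look like `<p>` and `</p>`; the padding symbol `_END_`, the `ROOT` label and the short list of void tags are illustrative choices, not details from the paper.

```python
from typing import Dict, List

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed, so never pushed
PAD = "_END_"  # hypothetical padding for trigrams that run past the last token


def token_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Return a trigram and a parent-tag feature for every token."""
    stack: List[str] = []
    features: List[Dict[str, str]] = []
    padded = tokens + [PAD, PAD]

    for i, tok in enumerate(tokens):
        is_close = tok.startswith("</")
        is_open = tok.startswith("<") and not is_close

        if is_close:
            name = tok[2:-1]
            if name in stack:
                # Pop back to the matching opening tag, tolerating unclosed tags.
                while stack and stack.pop() != name:
                    pass

        # The parent is whatever tag currently encloses this token.
        parent = stack[-1] if stack else "ROOT"
        features.append({"trigram": " ".join(padded[i:i + 3]), "parent": parent})

        if is_open and tok[1:-1] not in VOID_TAGS:
            stack.append(tok[1:-1])

    return features


feats = token_features(["<div>", "<p>", "hello", "world", "</p>", "</div>"])
print(feats[2])  # {'trigram': 'hello world </p>', 'parent': 'p'}
```

These dictionaries would then be fed to whatever Naive Bayes implementation is used for training.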

Each probability p produced by the NB classifier is then transformed with f(p) = p - 0.5 to obtain a sequence of scores in the range [-0.5, 0.5], so that tokens the classifier considers unlikely to be content receive negative scores.
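To show how the shifted scores select the article text, here is a small end-to-end sketch. The probabilities are made up for illustration, and the search uses a prefix-sum formulation of maximum subsequence (best span ending at j = prefix sum at j minus the smallest earlier prefix sum); the function name `extract_span` is hypothetical.

```python
from typing import List, Sequence


def extract_span(tokens: Sequence[str], probs: Sequence[float]) -> List[str]:
    """Shift probabilities by 0.5 and return the tokens of the best-scoring span."""
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")

    scores = [p - 0.5 for p in probs]  # f(p) = p - 0.5

    best_sum, best_start, best_end = 0.0, 0, 0  # empty span if nothing is positive
    prefix = 0.0
    min_prefix, min_index = 0.0, 0  # smallest prefix sum so far and its position

    for j, score in enumerate(scores, start=1):
        prefix += score
        if prefix - min_prefix > best_sum:
            best_sum, best_start, best_end = prefix - min_prefix, min_index, j
        if prefix < min_prefix:
            min_prefix, min_index = prefix, j

    return list(tokens[best_start:best_end])


tokens = ["<nav>", "home", "</nav>", "<p>", "the", "article", "text", "</p>", "<footer>", "copyright"]
probs = [0.1, 0.2, 0.1, 0.6, 0.9, 0.9, 0.8, 0.6, 0.1, 0.3]  # made-up classifier output
print(extract_span(tokens, probs))
# ['<p>', 'the', 'article', 'text', '</p>']
```

Without the shift, every score would be non-negative and the whole document would always be selected.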

Text Extraction from the Web via Text-to-Tag Ratio

Whereas MSS transforms the problem of article text extraction into maximum subsequence optimization, this approach transforms it into histogram clustering.

A clustering algorithm is applied to the text-to-tag ratio array (TTRArray), which is generated using the following steps:

- delete all empty lines and everything inside <script> elements from the original document
- initialize the TTRArray
- for each line in the HTML document:
1. let x be the number of non-tag ASCII characters in that line
2. let y be the number of tags in that line
3. if there are no tags in the current line, then TTRArray[ current line ] = x
4. otherwise TTRArray[ current line ] = x/y

The resulting TTRArray holds a text-to-tag ratio for every line in the filtered HTML document.
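The steps above can be sketched in a few lines of Python. This is a minimal approximation: tags are found with a simple regular expression rather than a real HTML parser, and counting every ASCII character (spaces included) left after stripping tags and surrounding whitespace is an assumption about what "non-tag ASCII characters" means.

```python
import re
from typing import List

# Matches <script>...</script> including the script body.
_SCRIPT = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html: str) -> List[float]:
    """Build the TTRArray: one text-to-tag ratio per non-empty line."""
    html = _SCRIPT.sub("", html)
    ratios: List[float] = []
    for line in html.splitlines():
        if not line.strip():
            continue  # empty lines are dropped before counting
        tag_count = len(_TAG.findall(line))                 # y
        text = _TAG.sub("", line).strip()
        text_count = sum(1 for ch in text if ch.isascii())  # x
        ratios.append(float(text_count) if tag_count == 0 else text_count / tag_count)
    return ratios


page = (
    "<html>\n"
    "<div><a href='/'>Home</a></div>\n"
    "\n"
    "<p>A long paragraph of article text.</p>\n"
    "plain text line\n"
    "<script>\nvar x = 1;\n</script>\n"
    "</html>"
)
print(text_to_tag_ratios(page))  # [0.0, 1.0, 16.5, 15.0, 0.0]
```

Navigation lines, which are mostly markup, end up with low ratios, while paragraph lines score high.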

Clustering can be applied to the TTRArray to obtain the lines of text that represent the article, based on the following heuristic:

For each line k in the TTRArray, the higher the TTR of k relative to the mean TTR of the entire array, the more likely it is that k is a line of content text within the HTML page.

[Figure: TTRArray (image courtesy of Weninger & Hsu: Text Extraction from the Web via Text-to-Tag Ratio)]

Prior to clustering, the TTRArray is passed through a smoothing function to prevent the loss of short paragraph lines at the edges of the article that might still be part of the content.
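The source does not say which smoothing function is used, so the sketch below assumes a plain moving average over a window of `radius` lines on each side; both the function name and the default radius are illustrative.

```python
from typing import List, Sequence


def smooth(ratios: Sequence[float], radius: int = 2) -> List[float]:
    """Replace each ratio with the mean of its neighbourhood.

    The window is truncated at both ends of the array.
    """
    n = len(ratios)
    smoothed: List[float] = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = ratios[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


print(smooth([0, 0, 30, 0, 0], radius=1))  # [0.0, 10.0, 10.0, 10.0, 0.0]
```

A short line next to a long paragraph inherits part of its neighbour's score, which is what keeps it from being discarded later.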

Weninger & Hsu propose the following clustering techniques to obtain article text:

- K-means
- Expectation Maximization
- Farthest First
- Threshold clustering, using the standard deviation as a cutoff threshold
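Of the techniques listed, threshold clustering is the simplest to illustrate. The sketch below keeps every line whose ratio exceeds the standard deviation of the array; using the population standard deviation and the optional `scale` factor are assumptions made for this example.

```python
from statistics import pstdev
from typing import List, Sequence


def content_line_indices(ratios: Sequence[float], scale: float = 1.0) -> List[int]:
    """Return indexes of lines whose ratio is above scale * standard deviation."""
    if not ratios:
        return []
    cutoff = scale * pstdev(ratios)
    return [i for i, ratio in enumerate(ratios) if ratio > cutoff]


ratios = [0.0, 1.0, 16.5, 15.0, 0.0]
print(content_line_indices(ratios))  # [2, 3]
```

In practice the function would be called on the smoothed array rather than on the raw ratios.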

VIPS: a Vision based Page Segmentation Algorithm

Semantic tree structure (source http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html)

This algorithm differs from the others in that it makes full use of visual page layout features.

It first extracts blocks from the HTML DOM tree and assigns each block a value that indicates its coherence based on visual perception. A hierarchical semantic structure is then used to represent the structure of the web page. The algorithm applies vertical and horizontal separators to the block layout to construct this semantic structure.

The constructed structure is then used to find the parts of the web page that represent article text and other common building blocks.

Readability

Readability is the most popular article text extraction tool. It relies solely on common HTML coding practices to extract the article text, title, corresponding images and even the next-page link if the article appears to be split across several pages.

Implementations are available in both Python and Ruby.

link: http://www.learn4master.com/machine-learning/resources-for-article-extraction-from-html-pages