Monday, November 28, 2016

Resources for article extraction from HTML pages

Research papers and Articles for article extraction from HTML pages

Some good blog articles:
 The Easy Way to Extract Useful Text from Arbitrary HTML. The author uses examples written in Python to implement a technique fairly similar to the one described in the text-to-tag ratio paper listed above. The original link is dead; here is a copy: http://www.cnblogs.com/loveyakamoz/archive/2011/08/18/2143965.html
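
To make the idea concrete, here is a minimal sketch of a per-line text-to-tag ratio heuristic. It is not the code from the blog article or the paper: the function names `text_to_tag_ratios` and `extract_text`, the regex-based tag matching, and the threshold of 20 are all assumptions chosen for illustration. The underlying intuition is that lines with a lot of text and few tags are more likely to be article content than navigation or markup.

```python
import re

# Naive tag matcher (an assumption for this sketch; a real HTML parser is more robust).
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html):
    """Return a list of (ratio, text) pairs, one per non-blank line of html.

    ratio = number of non-tag characters / number of tags on the line.
    A line with no tags is treated as having one tag to avoid dividing by zero.
    """
    results = []
    for line in html.splitlines():
        if not line.strip():
            continue
        tag_count = len(TAG_RE.findall(line))
        text = TAG_RE.sub("", line).strip()
        ratio = len(text) / max(tag_count, 1)
        results.append((ratio, text))
    return results


def extract_text(html, threshold=20.0):
    """Keep only lines whose ratio meets the (hypothetical) threshold."""
    kept = [text for ratio, text in text_to_tag_ratios(html) if ratio >= threshold]
    return "\n".join(kept)


sample = (
    '<div><a href="/">Home</a> <a href="/about">About</a></div>\n'
    "<p>This is a long paragraph of article text that should be kept by the heuristic.</p>\n"
)
print(extract_text(sample))
```

A fixed threshold is the crudest possible choice; smoothing the ratios across neighbouring lines or clustering them would likely behave better on real pages.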

Software for article extraction from HTML pages

The boilerpipe code is here:
http://code.google.com/p/boilerp…
It has also been integrated into Apache Tika.
Demo Web Service: http://boilerpipe-web.appspot.com/
Java library: http://code.google.com/p/boilerp…
Research presentation (WSDM 2010): http://videolectures.net/wsdm201…

