Monday, November 28, 2016

Resources for article extraction from HTML pages

Research papers and articles for article extraction from HTML pages

Some good blog articles:
 The Easy Way to Extract Useful Text from Arbitrary HTML. The author uses Python examples to apply a technique fairly similar to the one described in the text-to-tag ratio paper listed above. The original link is dead; a copy is available here: http://www.cnblogs.com/loveyakamoz/archive/2011/08/18/2143965.html
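
To make the text-to-tag ratio idea concrete, here is a minimal Python sketch. It is not the code from the blog article or from the paper: the function names `text_to_tag_ratios` and `extract_text`, the sample HTML, and the fixed threshold of 10 are all assumptions made for illustration. The idea is that lines with a lot of text and few tags are probably article content, while lines dominated by tags are probably navigation or other boilerplate.

```python
import re

# Matches any single HTML tag, e.g. <p>, </div>, <a href="...">.
TAG_RE = re.compile(r"<[^>]*>")
# Matches whole <script>...</script> and <style>...</style> blocks.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def text_to_tag_ratios(html):
    """Return a list of (line, ratio) pairs, one per non-empty line."""
    # Assumption: script and style blocks never contain article text.
    cleaned = SCRIPT_STYLE_RE.sub("", html)
    results = []
    for line in cleaned.splitlines():
        if not line.strip():
            continue
        tag_count = len(TAG_RE.findall(line))
        text_length = len(TAG_RE.sub("", line).strip())
        # If the line has no tags, use the text length itself as the ratio.
        ratio = text_length / tag_count if tag_count else float(text_length)
        results.append((line, ratio))
    return results


def extract_text(html, threshold=10.0):
    """Keep the text of lines whose ratio meets the (hypothetical) threshold."""
    kept = []
    for line, ratio in text_to_tag_ratios(html):
        if ratio >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return "\n".join(kept)


# Hypothetical sample page: one navigation line, one article line, one footer.
SAMPLE = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<p>This is a long paragraph of article text that should survive the filter.</p>
<div class="footer"><span>(c)</span> <a href="/x">x</a></div>
</body></html>"""

if __name__ == "__main__":
    print(extract_text(SAMPLE))
```

On the sample page only the paragraph line passes the cutoff; the navigation and footer lines have many tags and little text. A fuller version would probably smooth the ratios across neighbouring lines instead of using a hard per-line cutoff, since real pages rarely put one paragraph on one line.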

Software for article extraction from HTML pages

The boilerpipe code is here:
http://code.google.com/p/boilerp…
It has also been integrated into Apache Tika.
Demo Web Service: http://boilerpipe-web.appspot.com/
Java library: http://code.google.com/p/boilerp…
Research presentation (WSDM 2010): http://videolectures.net/wsdm201…

