IntelliJ does not show 'Class' when we right click and select 'New' (java files not recognized in IntelliJ)
Posted by Văn Hoàng at 12/20/2016 7:02 PM

The directory, or one of its parent directories, must be marked as a Source Root (in that case it appears in blue). If that is not the case, right-click your root source directory -> Mark As -> Source Root.
Resources for article extraction from HTML pages
Posted by Văn Hoàng at 11/28/2016 8:37 AM

Research papers and articles on article extraction from HTML pages:
- Boilerplate Detection using Shallow Text Features
- Extracting Article Text from the Web with Maximum Subsequence Segmentation
- Text Extraction from the Web via Text-to-Tag Ratio
- Web Content Extraction Through Histogram Clustering (another version)
- VIPS: a Vision-based Page Segmentation Algorithm
- Automatic Web News Extraction Using Tree Edit Distance: This algorithm uses a tree comparison metric analogous to Levenshtein distance to detect relevant content in a set of HTML documents.
- Discovering Informative Content Blocks from Web Documents: employs entropy as a threshold metric to predict informative blocks of content.
- Web Page Cleaning with Conditional Random Fields: presents the best-performing algorithm, which uses a CRF to label blocks of content as text or noise based on block-level features.
- Hierarchical wrapper induction for semistructured information sources
- Template detection for large scale search engines
- Web Page Cleaning for Web Mining through Feature Weighting
- Eliminating noisy information in Web pages for data mining
Some good blog articles:
The Easy Way to Extract Useful Text from Arbitrary HTML. The author uses Python examples to apply a technique quite similar to the one described in the text-to-tag ratio paper listed above. The original link is dead; here is a copy: http://www.cnblogs.com/loveyakamoz/archive/2011/08/18/2143965.html
Software for article extraction from HTML pages
- Boilerpipe library: an open source Java library. The library itself is the official implementation of the overall algorithm presented in the previously mentioned paper by Kohlschütter et al.
- Readability bookmarklet by arc90labs is open source. Originally written in JavaScript, it has also been ported to other languages:
- python-readability – using BeautifulSoup (slow)
- fork of python-readability employing lxml for faster parsing
- ruby-readability
- PHP port
- jReadability
- C# port
- Project Goose by Gravity labs
- Perl module HTML::Feature
- Webstemmer is a web crawler and page layout analyzer with a text extraction utility
- Demo of VIPS packaged in a .dll (its use is limited to research purposes only)
Boilerpipe library
Boilerpipe is one of the best open source packages for full article text extraction using machine learning techniques. Both text and structural features of the document are used to train a classifier to predict whether the observed part of the document belongs to the article text or not.
Boilerpipe uses the following features of HTML documents:
- text frequency in the whole corpus: used to identify boilerplate phrases that are not part of the main article text.
- particular tags that enclose a block of text: <h#> headline, <p> paragraph, <a> anchor and <div> division
- shallow text features: average word length, average sentence length, absolute number of words in the segment
- local context of text: absolute and relative position of the text block
- heuristic features: number of words that start with an uppercase letter, number of words written in all-caps, number of date and time tokens, link density and certain ratios of those previously listed
- density of text blocks: number of words in a wrapped fixed column width text block divided by number of lines of the same block
The original HTML document is split into atomic text blocks, which are annotated with the features listed above and manually labeled with a content or boilerplate class.
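As a rough illustration (not Boilerpipe's actual code), a few of the shallow and heuristic features above can be computed per text block; the anchor-word count is assumed to come from the HTML parsing step:

```python
import re

def block_features(text, num_anchor_words):
    """Compute a few shallow/heuristic features for one text block.
    num_anchor_words = number of words inside <a> tags (assumed known
    from parsing); link density is anchor words over all words."""
    words = re.findall(r"\S+", text)
    num_words = len(words)
    return {
        "num_words": num_words,
        "avg_word_len": (sum(len(w) for w in words) / num_words) if num_words else 0.0,
        "uppercase_starts": sum(1 for w in words if w[:1].isupper()),
        "all_caps": sum(1 for w in words if w.isalpha() and w.isupper()),
        "link_density": (num_anchor_words / num_words) if num_words else 0.0,
    }
```

A high link density or a high share of all-caps words is typical of navigation and boilerplate blocks rather than article text.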
The code is here: http://code.google.com/p/boilerp…
It has been integrated into Apache Tika as well.
Demo web service: http://boilerpipe-web.appspot.com/
Java library: http://code.google.com/p/boilerp…
Research presentation (WSDM 2010): http://videolectures.net/wsdm201…
Extracting Article Text from the Web with Maximum Subsequence Segmentation
This algorithm (MSS) transforms the problem of detecting article text in HTML documents into maximum subsequence optimization. A local token-level classifier outputs a sequence of scores, maximum subsequence optimization is applied to that sequence, and the indices of the resulting subsequence select the tokens that make up the extracted text.
Given a sequence of numbers, the task is to find a contiguous subsequence whose element sum is maximal. (If the elements are all non-negative this is trivial: take the whole sequence.) In other words: find indices i <= j that maximize a_i + a_(i+1) + ... + a_j.
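The maximal contiguous subsequence, along with its start and end indices, can be found in linear time with Kadane's algorithm; a minimal sketch (function and variable names are mine, not from the paper):

```python
def max_subsequence(scores):
    """Kadane's algorithm: return (best_sum, start, end) such that
    scores[start:end] is a contiguous run with maximal sum."""
    best_sum, best_start, best_end = float("-inf"), 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:          # a fresh run starting at i beats extending
            cur_sum, cur_start = s, i
        else:                     # extending the current run is better
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_sum, best_start, best_end
```

The running maximum is enough because any optimal subsequence never begins right after a prefix with positive sum would be discarded.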
In MSS, a document is tokenized using the following steps:
- discard everything between <script> and <style> tags
- break up HTML into a list of tags, words and numbers
- apply porter stemming to all words
- generalize numeric tokens
To apply maximum subsequence optimization, local token-level classifiers are used to score each token of the document. A negative score indicates that the token is unlikely to be content, and vice versa for positive scores. In the experiments, a Naive Bayes model is trained with two types of features for every labeled token in the document:
- trigram of token: the token itself and its 2 successors
- parent tag of token in the DOM tree (this can easily be implemented by maintaining a stack of tags when passing through the token array)
Each probability produced by the NB classifier is then transformed with f(p) = p - 0.5 to obtain a sequence of scores in the range [-0.5, 0.5].
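Putting the pieces together, here is a sketch of the scoring-plus-segmentation stage; the probabilities passed in are illustrative stand-ins for real Naive Bayes outputs:

```python
def extract_content_tokens(tokens, probs):
    """Shift each P(content) by f(p) = p - 0.5, then keep the tokens
    inside the maximum-sum contiguous run of scores (Kadane's algorithm)."""
    scores = [p - 0.5 for p in probs]
    best, cur, start, best_span = float("-inf"), 0.0, 0, (0, 0)
    for i, s in enumerate(scores):
        if cur <= 0:                      # start a fresh run at token i
            cur, start = s, i
        else:                             # extend the current run
            cur += s
        if cur > best:
            best, best_span = cur, (start, i + 1)
    i, j = best_span
    return tokens[i:j]
```

For example, extract_content_tokens(["ads", "Article", "body", "text", "footer"], [0.1, 0.9, 0.8, 0.7, 0.2]) keeps the three middle tokens, since only they have positive shifted scores.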
Text Extraction from the Web via Text-to-Tag Ratio
Where MSS transforms the problem of article text extraction into maximum subsequence optimization, this approach transforms it into histogram clustering.
A clustering algorithm is applied to the text-to-tag ratio array (TTRArray), which is generated using the following steps:
- delete all empty lines and the contents between <script> tags from the original document
- initialize the TTRArray
- for each line in the HTML document:
  - let x be the number of non-tag ASCII characters
  - let y be the number of tags in that line
  - if there are no tags in the current line, then TTRArray[current line] = x
  - otherwise TTRArray[current line] = x/y
The resulting TTRArray holds a text-to-tag ratio for every line in the filtered HTML document.
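The steps above can be sketched in Python (a simplified illustration; details such as the exact character counting differ from the paper):

```python
import re

TAG_RE = re.compile(r"<[^>]+>")

def build_ttr_array(html):
    """Per-line text-to-tag ratio: strip <script> sections, skip empty
    lines, then divide non-tag character count by tag count per line."""
    html = re.sub(r"<script.*?</script>", "", html, flags=re.S | re.I)
    ttr = []
    for line in html.splitlines():
        if not line.strip():
            continue                      # empty lines are deleted
        tags = TAG_RE.findall(line)
        text_len = len(TAG_RE.sub("", line))
        ttr.append(text_len if not tags else text_len / len(tags))
    return ttr
```

A tag-free line keeps its raw character count, so long plain-text lines (likely article content) stand out sharply against markup-heavy lines.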
Clustering can be applied to the TTRArray to obtain lines of text that represent the article by considering the following heuristic:
TTRArray (image courtesy of Weninger & Hsu: Text Extraction from the Web via Text-to-Tag Ratio)
For each element k in the TTRArray, the higher its TTR is relative to the mean TTR of the entire array, the more likely it is that k represents a line of content text within the HTML page.
Prior to clustering, the TTRArray is passed through a smoothing function to prevent the loss of short paragraph lines at the edges of the article that might still be part of the content.
Weninger & Hsu propose the following clustering techniques to obtain article text:
- K-means
- Expectation Maximization
- Farthest First
- Threshold clustering, using the standard deviation as a cut-off threshold
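A sketch of the smoothing pass plus the threshold variant; the moving-average window and the use of the array's standard deviation as the cut-off are my reading of the heuristic, not the paper's exact procedure:

```python
from statistics import mean, stdev

def smooth(ttr, radius=1):
    """Moving-average smoothing so short content lines near the
    article edges are not lost (window radius is an assumed parameter)."""
    n = len(ttr)
    return [mean(ttr[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]

def threshold_cluster(ttr):
    """Flag lines whose smoothed TTR exceeds the standard deviation
    of the smoothed array as content."""
    s = smooth(ttr)
    cutoff = stdev(s)
    return [v > cutoff for v in s]
```

On a toy array, a run of high-TTR lines in the middle is flagged as content while the low-TTR boilerplate lines around it are not.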
VIPS: a Vision-based Page Segmentation Algorithm
Semantic tree structure (source http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html)
This algorithm differs from the others in that it makes full use of visual page layout features.
It first extracts blocks from the HTML DOM tree and assigns each a value that indicates the coherency of the block based on visual perception. A hierarchical semantic structure is then employed to represent the structure of the website. The algorithm applies vertical and horizontal separators to the block layout to construct this semantic structure.
The constructed structure is then used to find the parts of the website that represent the article text and other common building blocks.
Readability
Readability is the most popular article text extraction tool. It relies solely on common HTML coding practices to extract the article text, title, corresponding images, and even the "next" button when the article is split across several pages.
Should we use Array.forEach or a for loop in JavaScript?
Posted by Văn Hoàng at 11/06/2016 7:27 PM

In principle, either one works. In terms of speed, however, the plain for loop is rated faster, and running a benchmark on https://jsperf.com confirms this. The two variants compared:
arr.forEach(function (item) {
someFn(item);
})
for (var i = 0, len = arr.length; i < len; i++) {
someFn(arr[i]);
}
The benchmark results show the for loop comes out ahead.
For more, see:
https://coderwall.com/p/kvzbpa/don-t-use-array-foreach-use-for-instead
Fixing the WordPress 404 error when setting up pretty permalinks
Posted by Văn Hoàng at 10/28/2016 9:00 PM

To set up pretty (static) permalinks for a WordPress site, go to Settings -> Permalinks.
Next, enable the rewrite module in WampServer: left-click the WampServer icon in the system tray, then choose Apache -> Apache modules -> check rewrite_module.
Some more detail on the settings:
- Common Settings: the commonly used permalink structures.
- Plain: the default (dynamic) URL structure.
- Day and name: a URL structure showing the full post date and the post name.
- Month and name: a URL structure showing the month, year, and post name.
- Numeric: a URL structure showing the post ID instead of its name.
- Post name: only the post name appears in the URL.
- Custom Structure: a structure you define yourself using structure tags (wrapped in % characters).
- Optional: settings that are not required.
- Category base: the parent path for category page URLs. By default it is http://domain/category/category-name/; if you enter "chuyen-muc" here, it becomes http://domain/chuyen-muc/category-name.
- Tag base: the parent path for tag page URLs. By default it is http://domain/tag/tag-name/; if you enter "the" here, it becomes http://domain/the/tag-name.
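If 404 errors persist even with rewrite_module enabled, it is worth checking that WordPress's standard rewrite rules are present in the .htaccess file at the site root; this is the stock block WordPress generates:

```apache
# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress
```

Note that Apache must also allow .htaccess overrides for the site directory (AllowOverride), or these rules are silently ignored.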
OK, that completes the setup.