Qi Ning: Search engine knowledge: web page deduplication technology

Filed under: SEO Optimization — Wrote by Lees on Friday, December 26th, 2008 @ 9:58 am

To a search engine, duplicate web page content is very harmful. When duplicate pages exist, the search engine has to process the same content several times. Worse, when the search engine builds its index, two identical pages may both end up in the index. When somebody runs a query, links to the duplicate pages then appear in the search results. Whether judged by search experience or by system efficiency, duplicate pages hurt quality.

Web page deduplication traces back to copy detection technology, that is, techniques for judging whether the content of one document was plagiarized or copied from one or more other documents.

In 1993, Manber of the University of Arizona (now a Google vice president and engineer) released a tool called Sif for finding similar documents. In 1995, Brin of Stanford University (Sergey Brin, one of Google's founders), together with Garcia-Molina and others, first proposed a text copy detection mechanism, the COPS (Copy Protection System) system, and its corresponding algorithm as part of the "Digital Library" project [Sergey Brin et al. 1995]. This kind of copy detection was later applied to search engines. Its core technique remains the same: comparing similarity.

A web page is different from a plain document: a page has both content and format as attributes. Combining content similarity with format similarity gives four kinds of page similarity. 1. The two pages are identical in both content and format. 2. The two pages have identical content but different formats. 3. The two pages share part of their content and have the same format. 4. The two pages share part of their important content but have different formats.

Implementation:

To deduplicate web pages, each page is first reduced to a document with a title and a body, which makes deduplication easier. For this reason web page deduplication is also called "document deduplication". Document deduplication is generally divided into three steps:

1. Feature extraction.

2. Similarity computation and evaluation.

3. Duplicate elimination.

1. Feature extraction. When we judge similarity, we generally compare features that do not change, so the first step of document deduplication is feature extraction. The document content is broken down and represented as a set of features that make up the document. This step prepares for the later feature comparison that computes similarity. There are many feature extraction methods. Here we mainly cover two classical ones, the I-Match algorithm and the Shingle algorithm. The I-Match algorithm does not rely on a complete analysis of the content. Instead, it uses statistical properties of the document collection to extract the main features of a document and discards the features that are not important. The Shingle algorithm extracts multiple feature terms and deduplicates documents by comparing how similar two feature sets are.

2. Similarity computation and evaluation. Once feature extraction is finished, the features have to be compared, so the second step of web page deduplication is similarity computation and evaluation. The I-Match algorithm produces only a single feature per document. Given an input document, it filters out a few key features according to the IDF values of its terms (IDF is short for Inverse Document Frequency). The reasoning is that terms with especially high or especially low frequency often fail to reflect what an article is really about. The algorithm therefore removes the high-frequency and low-frequency terms from the document and computes a single hash value for what remains. (Put simply, a hash maps a data value to an address: a data value goes in as input, and an address value comes out after computation.) Documents with the same hash value are duplicates.
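The I-Match idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation any search engine uses: the function names `idf_table` and `imatch_fingerprint`, the `low` and `high` IDF thresholds, and the choice of SHA-1 as the hash are all made up for the example.

```python
import hashlib
import math
from collections import Counter


def idf_table(corpus):
    """Compute IDF = log(N / document frequency) for every term in the corpus.

    corpus is a list of documents, each a list of tokens.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def imatch_fingerprint(tokens, idf, low, high):
    """Hash the terms whose IDF lies in [low, high].

    The thresholds are assumptions for this sketch. Terms outside the
    range (too common or too rare) are dropped before hashing.
    """
    kept = sorted({t for t in tokens if low <= idf.get(t, 0.0) <= high})
    if not kept:
        return None  # nothing survived the filter, so no fingerprint
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()
```

Two documents whose filtered term sets are equal get the same fingerprint, regardless of word order or of any very common or very rare words they contain. Sorting the kept terms before hashing is what makes the fingerprint independent of word order.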

The Shingle algorithm extracts multiple features for comparison, so the processing is somewhat more complex. The comparison starts from the number of shingles that are exactly identical in both documents. That count is then divided by the total number of shingles in the two documents minus the number of identical shingles. The value computed this way is the "Jaccard coefficient", which measures how similar two sets are. The Jaccard coefficient is the size of the intersection of the sets divided by the size of their union.
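As a rough illustration of the shingle comparison, the sketch below builds word-level shingles and computes the Jaccard coefficient exactly as stated: shared shingles divided by total shingles minus shared shingles. The function names `shingles` and `jaccard`, the window size `k=3`, and the use of whitespace-separated words as the unit are assumptions for this example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical here
    shared = len(a & b)
    # |union| = |a| + |b| - |intersection|
    return shared / (len(a) + len(b) - shared)


s1 = shingles("a b c d e")
s2 = shingles("a b c d f")
print(jaccard(s1, s2))  # 2 shared shingles out of 4 distinct ones: 0.5
```

A deduplication system would then compare this value against some threshold to decide whether two pages count as near-duplicates. The text does not give a threshold, so none is assumed here.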

3. Duplicate elimination. When deleting duplicate content, a search engine has many indexing factors to weigh, so it uses the simplest and most practical method: keep the page that the crawler fetched first. To a large extent this also ensures that the original page is the one retained.
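The "keep the first crawled page" rule can be sketched as follows. This assumes each page already carries a fingerprint where equal fingerprints mean duplicates (for instance an I-Match style hash). The function name `keep_first_crawled` and the `(url, crawl_time, fingerprint)` tuple layout are hypothetical.

```python
def keep_first_crawled(pages):
    """Keep one URL per fingerprint: the one with the earliest crawl time.

    pages is an iterable of (url, crawl_time, fingerprint) tuples.
    Returns the kept URLs in crawl order.
    """
    kept = {}
    for url, crawl_time, fingerprint in sorted(pages, key=lambda p: p[1]):
        # setdefault only stores the first URL seen for each fingerprint.
        kept.setdefault(fingerprint, url)
    return list(kept.values())
```

Because the pages are sorted by crawl time before the loop, a later copy of an already seen fingerprint is simply ignored.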

Web page deduplication is an indispensable part of the system. With duplicate pages deleted, the other stages of the search engine are spared a lot of needless trouble: index storage space is saved, query cost is reduced, and PageRank computation becomes more efficient. It also makes things more convenient for search engine users.




Copyright © 2007 Google Adsense College.
Powered by GoogleSchool. All Rights Reserved.