Research on Finding Near-replicas of Documents on the Web


Electrical Engineering and Automation

Electrical Engineering and Automation is an international, comprehensive academic journal published by Ivy Publisher. It covers the development of electrical theory and automation, with emphasis on the combination of electrical theory and modern industrial technology. The journal mainly publishes academic papers and commentary on the latest theoretical and technical research in power electronics in the fields of natural science, engineering technology, economy, and science, together with reports of the latest research results. It aims to provide a platform where professionals, scholars, and researchers in the field can share and discuss theoretical and technical developments in electrical theory, to reflect the academic frontier, to promote academic exchange, and to foster the rapid expansion of electrical theory and automation application technology.

The journal accepts manuscripts written in Chinese or English. Chinese papers must include the following items in English: paper title, author(s), author affiliation(s), abstract, and keywords. If this is your first contribution to the journal, please format your manuscript according to the sample paper and then submit it through the online submission system. Accepted papers appear online immediately, followed by printed hard copies distributed globally by Ivy Publisher. Contributions must therefore not contain confidential material. Authors take sole responsibility for their views.

ISSN Print: 2326-876X

ISSN Online: 2326-8778

Email: eea@ivypub.org

Website: http://www.ivypub.org/eea/



Paper Information



Authors: Jian Chen, Youqun Shi, Ran Tao

Abstract: The large number of near-replica documents on the web has become the biggest obstacle to rapid access to useful information. To deal with the many approximate mirror pages on the network, researchers have proposed a variety of deduplication algorithms for such pages, but the resistance of these algorithms to web page noise is not satisfactory. To address this, the paper proposes a Simhash-based deduplication algorithm for approximate mirror pages that extracts the long sentences of a document, which avoids the adverse effect of web noise and weakens its impact on the algorithm. Deduplication experiments on web page data suggest that the improved algorithm effectively weakens the effect of noise and achieves high accuracy and recall.

Keywords: Near-replica Web Pages; Simhash; Long Sentence Extraction; Avoiding Noise
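
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sentence-splitting rule, the `min_len` threshold, the use of whitespace-separated words as features, and the MD5-based 64-bit hash are all assumptions made for illustration. The sketch keeps only long sentences (short fragments such as menus and footers are treated as noise), builds a Simhash fingerprint from the words of those sentences, and compares fingerprints by Hamming distance.

```python
import hashlib
import re


def long_sentences(text, min_len=30):
    """Split on sentence-ending punctuation and keep only long sentences.

    min_len is a hypothetical threshold; short fragments (navigation
    bars, copyright lines) are assumed to be web noise and dropped.
    """
    parts = re.split(r"[.!?。！？\n]+", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_len]


def hash64(token):
    """Map a token to a 64-bit integer (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features, bits=64):
    """Standard unweighted Simhash: each feature votes +1/-1 per bit."""
    votes = [0] * bits
    for feature in features:
        h = hash64(feature)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def fingerprint(text, min_len=30):
    """Simhash computed only from the words of the long sentences."""
    features = []
    for sentence in long_sentences(text, min_len):
        features.extend(sentence.lower().split())
    return simhash(features)
```

Two pages would then be treated as near-replicas when `hamming(fingerprint(a), fingerprint(b))` falls below a chosen threshold; the threshold value is a tuning parameter that this sketch does not fix. Whitespace tokenization only suits space-delimited languages; Chinese text would need a word segmenter or character n-grams instead.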
