Research on Finding Near-replicas of Documents on the Web


Electrical Engineering and Automation

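Electrical Engineering and Automation is an international journal from IVY Publisher that follows developments in electrical theory and automation. It is a comprehensive academic journal combining electrical theory with modern industrial technology. It mainly publishes academic papers and review articles on the latest research progress in power electronics and its applications across natural science, engineering technology, economics, and society. It aims to give experts, scholars, and researchers in the field a platform for disseminating, sharing, and discussing advances in electrical theory, to reflect the academic state of the art, to promote academic exchange, and to advance electrical theory and applied automation technology.
The journal accepts manuscripts in Chinese and English. Chinese manuscripts must include a detailed English title, authors, affiliations, abstract, and keywords. For a first submission, authors should format the manuscript with the journal template and submit it online. Manuscripts go through rigorous and impartial peer review; accepted manuscripts are published first in the journal's electronic edition and then issued in high-quality print. The journal solicits and distributes papers worldwide; submissions must not contain classified material, and authors bear responsibility for their content.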

ISSN Print:2326-876X

ISSN Online:2326-8778

Email:eea@ivypub.org

Website: http://www.ivypub.org/eea/


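News

All articles submitted to this journal are checked by the Academic Misconduct Literature Check system for scientific journals (AMLC).

This journal is indexed in full text by CNKI, described as China's most authoritative academic resource database.

This journal is indexed in full text by the Open J-Gate database.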

Paper Information

Research on Finding Near-replicas of Documents on the Web


Authors: Jian Chen, Youqun Shi, Ran Tao

Abstract: The large number of near-replica documents on the web has become the biggest obstacle to rapid access to useful information. To deal with the many near-duplicate (approximate mirror) pages on the network, researchers have proposed a variety of deduplication algorithms, but the resistance of these algorithms to web page noise is unsatisfactory. This paper proposes a Simhash-based deduplication algorithm for near-duplicate pages that relies on long sentence extraction: long sentences are extracted from the document so that page noise has less influence on the result. Experiments on web page deduplication suggest that the improved algorithm effectively weakens the effect of noise and achieves high accuracy and recall rates.

Keywords: Near-replicas Web Pages; Simhash; Long Sentence Extraction; Avoiding Noise
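
The abstract names the technique but this page does not give its details, so the following Python sketch only illustrates the general idea of computing a Simhash fingerprint from long sentences. Everything specific in it is an assumption made for illustration, not the authors' method: the 20-character sentence threshold, the whitespace tokenizer (which would need replacing with a word segmenter for Chinese text), the MD5-derived 64-bit token hash, and the helper names `long_sentences`, `simhash`, `hamming`, and `page_fingerprint`.

```python
import hashlib
import re

MIN_SENTENCE_CHARS = 20  # assumed cutoff for a "long" sentence
FINGERPRINT_BITS = 64    # assumed fingerprint width


def long_sentences(text, min_chars=MIN_SENTENCE_CHARS):
    """Split on sentence punctuation and newlines; keep only long pieces."""
    parts = re.split(r"[.!?。！？\n]+", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]


def hash64(token):
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features, bits=FINGERPRINT_BITS):
    """Simhash: each feature votes +1/-1 per bit; the sign gives the bit."""
    counts = [0] * bits
    for feature in features:
        h = hash64(feature)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def page_fingerprint(text):
    """Fingerprint a page using only the words of its long sentences."""
    features = []
    for sentence in long_sentences(text):
        features.extend(sentence.lower().split())
    return simhash(features)


if __name__ == "__main__":
    body = (
        "Near-duplicate pages waste crawler bandwidth and index space. "
        "Fingerprints computed from long sentences ignore short navigation fragments."
    )
    # Same body text, different short "noise" lines around it.
    page_a = "Home\nNews\nLogin\n" + body + "\nCopyright 2016"
    page_b = "Sign up\nContact us\n" + body + "\nAll rights reserved"
    print(hamming(page_fingerprint(page_a), page_fingerprint(page_b)))
```

In this sketch, two pages that differ only in fragments shorter than the threshold (menus, footers) yield identical fingerprints, because those fragments never reach the hash. Pages would then be treated as near-replicas when the Hamming distance between fingerprints falls below some small threshold, which would have to be tuned on real data.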

