Paper Information
Research on Finding Near-replicas of Documents on the Web
Authors: Jian Chen, Youqun Shi, Ran Tao
Abstract: The large number of near-replicas of documents on the web has become the biggest obstacle to rapid access to useful information. Researchers have proposed a variety of algorithms for removing near-duplicate (approximate mirror) pages, but the resistance of these algorithms to web page noise is not satisfactory. To address this, the paper proposes a near-duplicate page removal algorithm based on Simhash and long sentence extraction: it extracts the long sentences of a document so that noise on the page has less effect on the algorithm. Experiments on deduplicating web page information suggest that the improved algorithm effectively weakens the effect of noise and achieves a high accuracy rate and recall rate.
Keywords: Near-replicas Web Pages; Simhash; Long Sentence Extraction; Avoiding Noise
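The idea in the abstract, fingerprinting only the long sentences of a page so that short noise such as navigation text is ignored, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the 20-character length threshold, the MD5-based 64-bit feature hash, the unweighted features, and the Hamming-distance threshold of 3 are all assumptions made for the example.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed; taken from the first 8 bytes of MD5)


def long_sentences(text, min_len=20):
    """Split on sentence punctuation and newlines; keep pieces of at least min_len characters."""
    parts = re.split(r"[.!?。！？\n]+", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_len]


def simhash(features):
    """Unweighted Simhash: each feature votes +1 or -1 on every bit position."""
    counts = [0] * BITS
    for feat in features:
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def page_fingerprint(text, min_len=20):
    """Fingerprint a page using only its long sentences as features."""
    return simhash(long_sentences(text, min_len))


def is_near_replica(text_a, text_b, max_distance=3, min_len=20):
    """Treat two pages as near-replicas if their fingerprints are close (threshold is assumed)."""
    fa = page_fingerprint(text_a, min_len)
    fb = page_fingerprint(text_b, min_len)
    return hamming(fa, fb) <= max_distance


if __name__ == "__main__":
    body = "The presence of near-replicas on the web slows access to useful information."
    page_a = "Home\nLogin\n" + body + "\nCopyright 2016"
    page_b = "News | Sports\n" + body + "\nContact us"
    print(is_near_replica(page_a, page_b))
```

In this sketch the two sample pages share one long sentence and differ only in short surrounding lines, so the short lines never enter the fingerprint. A page with no long sentences at all gets the all-zero fingerprint, which a real system would need to handle separately.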
References:
[1]China Internet Network Information Center. The 37th China Internet Development Statistics Report [EB/OL]. http://www.cnnic.net.cn/hlwfzyj/hlwxzbg/201601/P020160122469130059846.pdf, 2016-01-22.
[2]Fan Yong, Zheng Jia-heng. Research on elimination of similar web pages[J]. Computer Engineering and Applications, 2009, 12:141-143+183.
[3]HUANG Ren, FENG Sheng, YANG Ji-yun, LIU Yu, AO Min. Detection and elimination of similar Web pages based on text structure and extraction of long sentences[J]. Application Research of Computers, 2010, 27(7):2489-2491.
[4]Chowdhury A, Frieder O, Grossman D, et al. Collection statistics for fast duplicate document detection[J]. ACM Transactions on Information Systems, 2002, 20(2):171-191.
[5]Broder A Z, Glassman S C, Manasse M S, et al. Syntactic clustering of the Web[J]. Computer Networks & ISDN Systems, 1997, 29(8-13):1157-1166.
[6]Theobald, Martin, Siddharth, et al. SpotSigs: robust and efficient near duplicate detection in large web collections[J]. 2008.
[7]Charikar M S. Similarity estimation techniques from rounding algorithms[J]. Applied & Computational Harmonic Analysis, 2002:380-388.
[8]WEI Li-xia, ZHENG Jia-heng. Detection and elimination of similar Web pages based on text structure[J]. Computer Applications, 2007, 27(11):2854-2856.
[9]Gionis A, Indyk P, Motwani R. Similarity Search in High Dimensions via Hashing[C]// International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc. 2000:518-529.
[10]Andoni A, Indyk P. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions[J]. Foundations of Computer Science Annual Symposium on, 2006, 51(1):459-468.
[11]Mikolajczyk K, Schmid C. A performance evaluation of local descriptors[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2003:257.
[12]Witten I H, Moffat A, Bell T C. Managing gigabytes (2nd ed.): compressing and indexing documents and images[J]. 1999, 41(6):79-80.