Notes on web spam papers

petermao — Sun, 14 Apr 2013 02:08:04 +0000

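I've recently become interested in web spam and dug up some papers to read. I haven't finished all of them yet, so this is a first summary that I'll extend as I read more.
Most of the content comes from the papers listed in the references.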

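1 Spam basics
Any behavior that misleads a search engine's ranking, or that extracts undeserved benefit from a search engine, can be called spam. Whether something actually counts as spam depends on the search engine's own criteria.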

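Put simply, "optimizations" that a search engine explicitly permits, or that would still be done if search engines did not exist, are not spam. For example, optimizing for a particular client or device is not spam. Likewise, links posted on a junk site should not count as spam against their targets: the owner of the destination has no control over those links, so the junk site itself should be penalized, not the pages it points to.
Anything that would still be done if search engines did not exist, or anything that a search engine has given written permission to do.[2]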

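English-language spam is estimated at around 15% of pages; the share can also be broken down by language, domain, and so on. The figure for Chinese is probably higher.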

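Spam exists mainly because search engines are the entry point to the web and they order pages by quality. Users mostly click only the top 10 results, and those clicks bring a lot of traffic, so the goal of spam is to rank near the top and capture that traffic.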

The main harms of spam are:
it degrades the user experience;
it wastes resources;
it is unfair to good sites.

2 Related theoretical models
This covers the ranking-related models in general; some of them I haven't had time to study in detail yet.

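tf-idf = sum(tf(t)*idf(t)), where t ranges over the terms shared by the query and the document
tf: the fraction of the document's words that are the given term
idf: the inverse of the fraction of documents in the collection that contain the term (usually log-scaled)
A site owner can change page content at will, whereas idf is a global statistic that a spammer is generally assumed unable to control, so spam mainly targets tf.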
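To make the point about tf concrete, here is a minimal sketch. It assumes the simple tf definition above and a log-scaled, smoothed idf; the function names and the toy corpus are made up for illustration. It shows that repeating a query term inside a page raises that page's score, while idf stays a property of the collection.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Term frequency: share of the document's tokens equal to `term`.
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    # Log-scaled inverse document frequency. The +1 terms are a smoothing
    # choice assumed here, not something the notes prescribe.
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + containing)) + 1.0

def tfidf_score(query_tokens, doc_tokens, corpus):
    # Sum tf*idf over the terms shared by the query and the document.
    shared = set(query_tokens) & set(doc_tokens)
    return sum(tf(t, doc_tokens) * idf(t, corpus) for t in shared)

# Toy collection (hypothetical data).
corpus = [
    "cheap flights to paris".split(),
    "paris travel guide and museums".split(),
    "history of the roman empire".split(),
]
honest = "cheap flights to paris".split()
stuffed = "cheap cheap cheap cheap flights to paris".split()

print(tfidf_score(["cheap"], honest, corpus))
print(tfidf_score(["cheap"], stuffed, corpus))  # higher: only tf changed
```

The stuffed page wins purely through its tf, which is the lever the page owner controls.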

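VSM:
Represent both the document and the query as vectors of term weights, and rank documents by their similarity to the query (cosine similarity is the usual choice).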
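A small sketch of vector-space ranking with cosine similarity follows. It assumes raw term counts as weights (real systems would typically use tf-idf weights), and the helper names are hypothetical.

```python
import math
from collections import Counter

def to_vector(tokens):
    # Raw term counts as weights; an assumption made to keep the sketch short.
    return Counter(tokens)

def cosine_similarity(a, b):
    # Dot product over shared terms divided by the product of vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query_tokens, docs):
    # Return document indices ordered from most to least similar to the query.
    q = to_vector(query_tokens)
    scored = [(cosine_similarity(q, to_vector(d)), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]

docs = [
    ["web", "spam", "detection"],
    ["cooking", "recipes"],
    ["spam", "spam", "email"],
]
print(rank(["web", "spam"], docs))
```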

pagerank:
Scores a page using its inlink information.
the importance of a certain page influences and is being influenced by the importance of some other pages.
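The circular definition can be sketched as a power iteration. This is a minimal illustration, not the formulation of any particular paper in the references: the damping factor 0.85, the uniform handling of dangling pages, and the toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its score evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: "target" has no inlinks at first.
base = {"a": ["b"], "b": ["a"], "target": ["a"]}
# The same graph plus five farm pages that all link to "target".
farmed = dict(base, **{f"farm{i}": ["target"] for i in range(5)})

before = pagerank(base)["target"]
after = pagerank(farmed)["target"]
print(before, after)
```

Because the score depends only on who links to a page, adding inlinks from pages the spammer controls is enough to raise it, which is why link spam targets this model.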

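hits:
Scores a page using both inlink and outlink information (the authority and hub scores, respectively).
According to the circular definition of HITS, important hub pages are those that point to many important authority pages, while important authority pages are those pointed to by many hubs.
Spam first manipulates the outlink information (hub scores) and, through it, the inlink information (authority scores).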
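A minimal sketch of the HITS iteration, assuming a fixed number of rounds and L2 normalization after each round; the graph and page names are hypothetical. The spam page copies the outlinks of good hubs to earn a hub score, then passes some of it to a page it wants to promote.

```python
import math

def hits(links, iterations=50):
    # links: dict mapping each page to the list of pages it links to.
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

graph = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    # The spam page links to the same authorities plus its own target.
    "spam": ["auth1", "auth2", "spam_target"],
}
hub, auth = hits(graph)
print(hub["spam"], auth["spam_target"])
```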

browserrank

query-click

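rank factor:
HITS, pagerank and the like are also ranking factors. According to [3], ranking factors fall into two main groups: on-the-page factors and off-the-page factors.
See [3] for details; it helps in understanding why spam arises.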

Evaluation metrics:
precision
recall
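For reference, the two metrics can be computed as follows. This is a generic sketch over sets of document ids; the ids are made up.

```python
def precision_recall(retrieved, relevant):
    # precision: share of retrieved items that are relevant.
    # recall: share of relevant items that were retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved={"a", "b", "c", "d"}, relevant={"a", "b", "e"})
print(p, r)
```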

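3 Classification
host/domain name spam:
Buying expired domains; host stuffing (for example, a single domain with a huge number of sub-hosts whose prefixes are usually random digits or common terms); domains built from popular keywords (for example, a very long domain name made of popular words separated by hyphens and the like).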
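Two of these patterns lend themselves to simple heuristics. The sketch below is only an illustration: the thresholds are arbitrary, the host names are invented, and it assumes the registered domain is the last two labels of the host name, which is wrong for suffixes such as .co.uk.

```python
from collections import defaultdict

def registered_domain(host):
    # Assumption: the registered domain is the last two labels of the host.
    return ".".join(host.lower().split(".")[-2:])

def host_signals(hosts, max_subhosts=3, min_hyphens=3):
    # Group hosts by domain and flag domains that match either pattern.
    by_domain = defaultdict(set)
    for h in hosts:
        by_domain[registered_domain(h)].add(h.lower())
    flagged = {}
    for domain, members in by_domain.items():
        reasons = []
        if len(members) > max_subhosts:
            reasons.append("many sub-hosts")
        if domain.count("-") >= min_hyphens:
            reasons.append("keyword-stuffed name")
        if reasons:
            flagged[domain] = reasons
    return flagged

hosts = [
    "123.example.com", "456.example.com", "789.example.com", "abc.example.com",
    "www.good.org",
    "cheap-flights-hotels-deals.com",
]
print(host_signals(hosts))
```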

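content spam:
Location: anywhere in the page. Common targets are the title, meta tags, body, anchor text (which affects the page being pointed to), and the URL.
The strategies fall into two groups: repetition (repeat) and hiding (hide).
Repetition: repeating a single word (Repetition)
dumping large numbers of words (Dumping)
weaving (Weaving): copy a good page and insert spam words or links into it
phrase stitching (Phrase stitching): pull words and phrases from different articles and glue them together
link stuffing
Hiding: text or links in the same color as the background
the visibility attribute set to false
text and links generated by script
an image whose click leads to a new link
text set at a tiny size (invisible to the user)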
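Some of the hiding tricks in this list can be spotted from inline styles alone. The following is a rough sketch under strong assumptions: it only looks at double-quoted inline `style` attributes, compares color values as plain strings, and ignores external stylesheets and scripts; the signal names are invented.

```python
import re

STYLE_RE = re.compile(r'style\s*=\s*"([^"]*)"', re.IGNORECASE)

def parse_style(style):
    # Turn "a: b; c: d" into {"a": "b", "c": "d"} with lowercase names and values.
    props = {}
    for decl in style.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    return props

def hidden_text_signals(html):
    signals = []
    for match in STYLE_RE.finditer(html):
        props = parse_style(match.group(1))
        if props.get("display") == "none" or props.get("visibility") == "hidden":
            signals.append("invisible element")
        color, background = props.get("color"), props.get("background-color")
        if color is not None and color == background:
            signals.append("text color equals background color")
        size = re.fullmatch(r"(\d+(?:\.\d+)?)px", props.get("font-size", ""))
        if size and float(size.group(1)) <= 1:
            signals.append("tiny font")
    return signals

page = (
    '<p style="color:#fff; background-color:#fff">cheap cheap</p>'
    '<div style="display:none">more keywords</div>'
    '<span style="font-size:1px">tiny</span>'
    '<p style="color:#000">normal text</p>'
)
print(hidden_text_signals(page))
```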

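link spam:
Location: outlinks (outlink) and inlinks (inlink)
Outlinks: cloning a good navigation site or directory (open dir clone), such as dmoz.org or http://dir.yahoo.com abroad, or hao123 in China
Inlinks: link farm: a group of pages whose interlinking is denser than some threshold
honey pot: good pages copied from other sites with some spam links embedded; open dir clones can also be placed in this category
posting junk links on authoritative sites (insert link at dir)
posting links on open platforms such as blogs, wikis, social sites, and message boards
link exchange
buying expired domains (expired domain): posting large numbers of links on the purchased abandoned domain, mainly because an abandoned domain keeps its ranking for a while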

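On hiding (hiding) and redirection (redirect) techniques:
Some papers treat hiding and redirection as a separate class of spam techniques. Hiding (cloaking) techniques include IP-based ones (collecting the IP addresses of particular crawlers) and ones based on the User-Agent HTTP header (which crawlers usually use to identify themselves); the content-based hiding techniques described earlier are also placed in this class. Redirection techniques include HTTP redirects, meta redirects, and JS redirects.

Leaving aside the content-based hiding techniques already covered under content spam, I personally don't find the rest very interesting. Hiding based on specific IPs or HTTP request headers is easy to detect: crawl the page several times and compare the content. For a large commercial crawler, though, this seems unfriendly to sites and also wastes resources. As for redirection, JS redirects for example are increasingly used by good pages too in special cases; I think that with a JS engine such as V8 the crawler can already determine the post-redirect URL, so this class of problem no longer needs special treatment.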
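The "crawl several times and compare" check can be sketched as a content comparison between two fetched copies of the same URL, for example one fetched with a browser-like User-Agent and one with the crawler's. The fetching itself is left out; the word-shingle Jaccard similarity and the 0.5 threshold are assumptions chosen for illustration.

```python
def shingles(text, k=3):
    # Set of k-word windows; short texts collapse to a single shingle.
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Size of the intersection over size of the union; two empty sets count as equal.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def looks_cloaked(browser_copy, crawler_copy, threshold=0.5):
    # Flag the page when the two copies share too little content.
    return jaccard(shingles(browser_copy), shingles(crawler_copy)) < threshold

browser_copy = "welcome to our shop buy shoes online today"
crawler_copy = "cheap pills cheap pills best casino bonus free"
print(looks_cloaked(browser_copy, crawler_copy))
print(looks_cloaked(browser_copy, browser_copy))
```

Pages with legitimately dynamic content (ads, timestamps) would also differ between fetches, so a real check would need a more tolerant comparison than this one.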

From the user's perspective:
based on browsing behavior,
based on query clicks.

4 Detection
content: rule-based and statistics-based
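As an illustration of combining the two, the sketch below computes two simple content statistics and applies fixed rules to them. The features (share of the most frequent word, zlib compression ratio) and both thresholds are assumptions picked for the example, not values taken from the referenced papers.

```python
import zlib
from collections import Counter

def content_features(text):
    words = text.lower().split()
    raw = text.encode("utf-8")
    counts = Counter(words)
    return {
        "word_count": len(words),
        # Share of all words taken by the single most frequent word.
        "top_word_fraction": counts.most_common(1)[0][1] / len(words) if words else 0.0,
        # Highly repetitive pages compress much better than ordinary prose.
        "compression_ratio": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
    }

def rule_based_is_spam(text, max_top_fraction=0.3, max_compression=4.0):
    # Hypothetical thresholds; in practice they would be tuned on labeled data.
    f = content_features(text)
    return f["top_word_fraction"] > max_top_fraction or f["compression_ratio"] > max_compression

normal = ("Web spam refers to pages that try to mislead search engines "
          "into ranking them higher than they deserve")
stuffed = "cheap flights " * 200
print(rule_based_is_spam(normal), rule_based_is_spam(stuffed))
```

A statistics-based detector would feed features like these into a trained classifier instead of hand-set thresholds.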

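link: link farm detection
low pagerank score
trustrank score
high spamrank score
Choosing the initial seeds: manually pick good and bad pages; or take pages whose in-link and out-link sets (counted across different domains) overlap by more than a threshold;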
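The trust-propagation idea can be sketched as a pagerank variant whose random jump always lands on hand-picked good seeds. This is a simplified illustration under assumptions: a non-empty seed set, damping 0.85, and dangling pages simply dropping their trust; the graph is made up.

```python
def trustrank(links, trusted_seeds, damping=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to.
    # trusted_seeds: non-empty collection of manually vetted good pages.
    seeds = set(trusted_seeds)
    pages = set(links) | {t for targets in links.values() for t in targets} | seeds
    seed_weight = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_weight)
    for _ in range(iterations):
        # The jump term feeds trust only into the seed pages.
        new_trust = {p: (1.0 - damping) * seed_weight[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for t in targets:
                # Trust flows along outlinks, split evenly among them.
                new_trust[t] += damping * trust[page] / len(targets)
        trust = new_trust
    return trust

graph = {
    "seed": ["good"],
    "good": ["seed"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
trust = trustrank(graph, trusted_seeds=["seed"])
print(trust)
```

Pages that no trusted page reaches keep a trust of zero no matter how densely they link to each other, which is what makes the score useful against link farms.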

Post-hoc analysis of user behavior:
click-model over the click-log
browser-model over the browse-log

Appendix: references
[1] Monika Henzinger, Rajeev Motwani, and Craig Silverstein. Challenges in web search engines. SIGIR Forum, 36(2), 2002.
[2] The Classification of Search Engine Spam http://www.silverdisc.co.uk/articles/spam-classification
[3] Web Spam Taxonomy
[4] Survey on web spam detection-principles and algorithms
[5] Alan Perkins. The classification of search engine spam. http://www.ebrandmanagement.com/whitepapers/spam-classification/.
[6] Zoltán Gyöngyi and Hector Garcia-Molina. Link spam alliances. Technical report, Stanford University, 2005.

rank factor
google ranking factor: http://www.vaughns-1-pagers.com/internet/google-ranking-factors.htm

pagerank
[1] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.
[2] Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM Transactions on Internet Technology, 5(1), 2005.
[3] T. Haveliwala. Efficient computation of PageRank. Tech. rep., Stanford University, 1999.
[4] T. Haveliwala. Topic-sensitive PageRank. In Proceedings of the Eleventh International Conference on World Wide Web, 2002.
[5] S. Kamvar, T. Haveliwala, C. Manning, and G. Golub. Extrapolation methods for accelerating PageRank computations. In Proceedings of the Twelfth International Conference on World Wide Web, 2003.
[6] A. Langville and C. Meyer. Deeper inside PageRank. Tech. rep., North Carolina State University, 2003.
[7] Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 1999.

anti-spam
[1] Dennis Fetterly, Mark Manasse, and Marc Najork. Spam, damn spam, and statistics. In Proceedings of the Seventh International Workshop on the Web and Databases (WebDB), 2004.
[2] Z. Gyöngyi and H. Garcia-Molina. Seed selection in TrustRank. Tech. rep., Stanford University, 2004.
[3] Pr0 – Google’s PageRank 0, http://pr.efactory.de/e-pr0.shtml. 2002.
[4] Combating Web Spam with TrustRank
[5] Detecting Spam Web Pages through Content Analysis
[6] Identifying Link Farm Spam Pages
[7] Fighting against Web Spam: A Novel Propagation Method based on Click-through Data
