web_page_classification_note
2016-01-28 13:22:33
  Web page classification papers: reading notes and comparison
  INTRODUCTION    
     Problem Definition    
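     Binary classification
     Multiclass, single-label (a single result per page)
     Multiclass, multi-label (several labels per page)
     Multiclass, multi-label, with a different weight for each label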
     Applications of Web Classification    
     Constructing and expanding Web directories
     Flat (single-level) classification and hierarchical (multi-level) classification
     Definition taken from https://www.dmoz.org/
     Improving the quality of search results
     Helping question answering systems
     Decision tree classifiers sort pages into four types:
     collection pages (containing a list of items)
     topic pages (representing an answer instance)
     relevant pages (supporting an answer instance)
     irrelevant pages
     Building Efficient Focused Crawlers or Vertical (Domain-Specific) Search Engines  
     Other Applications    
     Web content filtering  
     contextual advertising  
     ontology annotation  
     knowledge base construction  
     The Difference Between Web Classification and Text Classification    
     Traditional text classification is typically performed on structured documents
     Web pages are semistructured documents in HTML
     Hyperlinks between pages are a feature central to the definition of the Web
     Related Surveys  
     FEATURES    
     Using On-Page Features    
     Feature selection to make better use of the textual features
     n-gram representation: short word sequences (n-grams) serve as the units of the feature vector
     HTML tags: title, headings, metadata, and main text
     Good-quality document summarization can accurately represent the major topic of a Web page.
     Visual Analysis.  
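The on-page features listed above (n-grams and text grouped by HTML field) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the method of any particular paper. The names `FieldTextParser`, `word_ngrams`, and `page_features`, and the values in `FIELD_WEIGHTS`, are assumptions made for the example.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical field weights, assumed for illustration only.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "meta": 2.0, "body": 1.0}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}


class FieldTextParser(HTMLParser):
    """Collects page text grouped by the HTML field it appears in."""

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in FIELD_WEIGHTS}
        self._stack = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() in ("description", "keywords"):
                self.fields["meta"].append(attrs.get("content") or "")
            return  # <meta> has no closing tag, so it is never pushed
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        open_tags = set(self._stack)
        if not text or SKIPPED_TAGS & open_tags:
            return
        if "title" in open_tags:
            field = "title"
        elif HEADING_TAGS & open_tags:
            field = "heading"
        else:
            field = "body"
        self.fields[field].append(text)


def word_ngrams(text, n_max=2):
    """Yields all word n-grams of length 1..n_max from whitespace tokens."""
    tokens = text.lower().split()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def page_features(html, n_max=2):
    """Returns a weighted n-gram count vector (as a Counter) for one page."""
    parser = FieldTextParser()
    parser.feed(html)
    parser.close()
    features = Counter()
    for field, chunks in parser.fields.items():
        for chunk in chunks:
            for gram in word_ngrams(chunk, n_max):
                features[gram] += FIELD_WEIGHTS[field]
    return features
```

Weighting title and heading text above body text is one common way to use tag information. The specific weights would normally be tuned on held-out data.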
     Using Features of Neighbors    
     Motivation.    
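     When a page's own features are missing or misleading, its neighboring (related) pages can be used to classify it
     Underlying Assumptions    
     If pages Pa and Pb belong to the same category, their neighboring pages in the link graph are assumed to belong to that category as well
     Neighbor Selection    
     The parent page, or the anchor text and nearby content of hyperlinks pointing to the target page, carries more weight
     Features of Neighbors    
     Parent, child, sibling, and spouse pages are all useful, but sibling pages work best
     The features that have been used from neighbors include labels, partial content (anchor text, the surrounding text of anchor text, titles, headers), and full content (the same few features recur across the papers)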
     Utilizing Artificial Links.    
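The four neighbor types named above can be derived from a link graph as in the sketch below. It assumes the graph is given as a dict from each page to the set of pages it links to. The function name `neighbors` and the page names in the usage example are hypothetical.

```python
def neighbors(links, page):
    """Returns the parent, child, sibling, and spouse pages of `page`.

    links: dict mapping each page to the set of pages it links to.
    """
    children = set(links.get(page, ()))
    # Parents link to the target page.
    parents = {p for p, outs in links.items() if page in outs}
    # Siblings share at least one parent with the target page.
    siblings = {c for p in parents for c in links[p]} - {page}
    # Spouses share at least one child with the target page.
    spouses = {p for p, outs in links.items() if set(outs) & children} - {page}
    return {
        "parent": parents,
        "child": children,
        "sibling": siblings,
        "spouse": spouses,
    }
```

For example, with `links = {"hub": {"a", "b"}, "a": {"x"}, "c": {"x"}}`, page `a` has parent `hub`, child `x`, sibling `b`, and spouse `c`. Labels or partial content from these sets can then be added to the target page's own features.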
     Discussion: Features  
     ALGORITHMS    
     Dimension Reduction    
     Feature selection reduces the dimensionality of the feature space
     Summaries of news articles: using only the first fragment of each document offers fast and accurate classification of news articles
     Selection measures: information gain, mutual information, document frequency, and the χ2 test
     Latent semantic indexing
     Matrix factorization
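As one concrete case of the selection measures above, the χ2 statistic for a term and a category can be computed from a 2x2 contingency table of document counts. The sketch below assumes each document is a set of tokens. The function names `chi_square` and `select_terms` are made up for this example.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic between a term and a category.

    docs: list of token sets; labels: parallel list of category names.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document outside category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document outside category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is constant; no evidence either way
    return n * (a * d - c * b) ** 2 / denom


def select_terms(docs, labels, category, k):
    """Keeps the k terms with the highest chi-square score for `category`."""
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary,
        key=lambda t: (-chi_square(docs, labels, t, category), t),
    )
    return ranked[:k]
```

The statistic is two-sided, so a term that appears only outside the category scores as highly as one that appears only inside it. Both are informative for the classifier.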
     Relational Learning    
     Because Web pages are connected by hyperlinks, classification can be treated as a relational learning problem
     Relaxation labeling
     Loopy belief propagation and iterative classification
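A much simplified sketch of the iterative idea: unlabeled pages repeatedly take the majority label of their linked neighbors until nothing changes. A real iterative classifier would use a trained local model over content and neighbor-label features. Here a plain majority vote stands in for that model, and links are treated as undirected. Both are simplifying assumptions, and the function name is illustrative.

```python
from collections import Counter


def iterative_classification(edges, seed_labels, max_iters=10):
    """Propagates labels from labeled seed pages over a link graph.

    edges: iterable of (page, page) links, treated as undirected.
    seed_labels: dict of known labels; these are never overwritten.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    labels = dict(seed_labels)
    for _ in range(max_iters):
        changed = False
        for node in sorted(adjacency):
            if node in seed_labels:
                continue
            votes = Counter(
                labels[n] for n in adjacency[node] if n in labels
            )
            if not votes:
                continue  # no labeled neighbor yet
            # Highest vote count wins; ties break alphabetically.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if labels.get(node) != best:
                labels[node] = best
                changed = True
        if not changed:
            break  # labels are stable
    return labels
```

Pages with no labeled neighbor in the first pass pick up a label in a later pass, once a neighbor has been labeled.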
     Modifications to Traditional Algorithms    
     k-Nearest Neighbor classifiers  
     Binary classification scenario
     SVM classifier: can then be trained on the labeled positive examples and the filtered negative examples
     Hierarchical Classification    
     Hierarchical SVMs: only moderate results
     Combining Information from Multiple Sources    
     Voting and stacking
     Combining SVM kernels
     Moreover, the combination of two sources does not always perform better than each one separately
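Voting, the simplest of the combination schemes above, can be sketched as below. Each source (for example a content classifier and an anchor-text classifier) casts one vote, optionally weighted. The source names and weights in the usage example are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Combines per-source predictions into one label.

    predictions: dict mapping source name -> predicted label.
    weights: optional dict mapping source name -> vote weight
             (every source counts as 1.0 when omitted).
    """
    totals = defaultdict(float)
    for source, label in predictions.items():
        totals[label] += 1.0 if weights is None else weights[source]
    # Highest total wins; ties break alphabetically for determinism.
    return min(totals, key=lambda label: (-totals[label], label))
```

With `{"content": "sports", "anchor_text": "sports", "url": "news"}` the unweighted vote returns `sports`. Giving the `url` source a weight of 3 against 1 for the others flips the result to `news`, which illustrates why a combination is only as good as its weighting.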
     OTHER ISSUES    
     Web Page Content Preprocessing  
     Dataset Selection and Generation    
     supervised learning problem  
     Web Site Classification  
     Blog Classification    
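     The first category of research is binary classification of blog versus nonblog, that is, deciding whether a page is a blog.
     The second category of research includes identification of the topic, mood, or sentiment of blogs, judged from the mood conveyed by the words.
     The genre of blogs.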
     CONCLUSION    
     A supervised learning problem on the basis of subject, function, sentiment, genre, and more.