Knowledge Technologies
 COMP90049 Knowledge Technologies, University of Melbourne, 2018 Semester 1
  String Search    
     Exact String Search    
     Pattern Matching: RegEx  
     Naive Matching  
     Boyer-Moore matching 
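As an illustration, here is a minimal Python sketch of Boyer-Moore using only the bad-character rule (the full algorithm also adds the good-suffix rule):

```python
def boyer_moore(pattern, text):
    """Boyer-Moore search, bad-character rule only (simplified sketch)."""
    last = {c: i for i, c in enumerate(pattern)}  # rightmost position of each char
    m, n = len(pattern), len(text)
    matches, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:  # compare right to left
            j -= 1
        if j < 0:
            matches.append(s)          # full match at shift s
            s += 1
        else:
            # skip ahead so the mismatched text char aligns with its
            # rightmost occurrence in the pattern (or past it entirely)
            s += max(1, j - last.get(text[s + j], -1))
    return matches
```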
  
     Approximate String Search    
     Neighbourhood Search 
  
     Edit Distance     
     Global  
     Local  
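A minimal dynamic-programming sketch of global (Levenshtein) edit distance with unit costs; local edit distance differs mainly in allowing a match to start and end anywhere:

```python
def edit_distance(s, t):
    """Global edit distance: unit costs for insertion, deletion, replacement."""
    m, n = len(s), len(t)
    # dp[i][j] = distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / replacement
    return dp[m][n]
```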
     N-Gram Distance   
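A sketch of n-gram distance, computed as |G_s| + |G_t| - 2|G_s ∩ G_t| over multisets of n-grams; padding the ends with '#' so word boundaries contribute n-grams is one common variant, assumed here:

```python
from collections import Counter

def ngram_distance(s, t, n=2):
    """N-gram distance over multisets of (padded) n-grams."""
    def grams(w):
        w = "#" * (n - 1) + w + "#" * (n - 1)   # pad so ends are represented
        return Counter(w[i:i + n] for i in range(len(w) - n + 1))
    gs, gt = grams(s), grams(t)
    # Counter & Counter takes the minimum count per n-gram: multiset intersection
    return sum(gs.values()) + sum(gt.values()) - 2 * sum((gs & gt).values())
```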
     Phonetics (Soundex)  
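A simplified Soundex sketch; this variant treats h, w and y like vowels, whereas the full standard has extra adjacency rules for h and w between consonants:

```python
def soundex(word):
    """Simplified Soundex: initial letter plus three digits."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    code = {c: d for letters, d in groups.items() for c in letters}
    word = word.lower()
    digits = [code.get(c, "0") for c in word]     # vowels, h, w, y -> "0"
    out = [word[0].upper()]
    for prev, cur in zip(digits, digits[1:]):
        if cur != "0" and cur != prev:            # drop vowels and repeated codes
            out.append(cur)
    return "".join(out)[:4].ljust(4, "0")

# soundex("robert") -> "R163"; soundex("jackson") -> "J250"
```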
     Evaluation    
     Accuracy: fraction of correct responses   
     Precision: fraction of correct responses among attempted responses 
  
     Recall: proportion of words with a correct response (somewhere)   
     Information retrieval     
     Information Seeking    
     Information needs    
     Informational: global warming 
  
     Factoid: melting point of lead 
  
     Topic tracking: Trump administration 
  
     Navigational: university of Melbourne 
  
     Transactional: Macbook Air 
  
     Geospatial: carlton restaurants 
  
     Answers    
     should be relevant to the information need  
     Document Matching    
     Boolean Query    
     documents match if they contain the query terms and don't contain the NOT terms  
     Pros    
     repeatable, auditable and controllable. 
  
     allow expression of complex concepts.
  
     reduction in time spent reading 
  
     Cons    
     no ranking and no control over result set size  
     difficult to incorporate useful heuristics  
     Ranked Retrieval     
     Evidence of similarity 
    
     query terms in the title  
     query terms as phrases  
     recently created  
     translate between languages  
     authoritative, reliable documents 
  
     TF-IDF Model    
     More weight is given to a document where a query term appears many times (term frequency, TF).
     Less weight is given to a term that appears in many documents (inverse document frequency, IDF).
     Less weight is given to a document that has many terms.
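A sketch of one common TF-IDF weighting; the exact formula varies between systems, so the log-based TF and IDF below are assumptions:

```python
import math

def tfidf_weight(f_dt, N, f_t):
    """One common TF-IDF variant.
    f_dt: frequency of term t in document d
    N:    number of documents in the collection
    f_t:  number of documents containing t"""
    tf = 1 + math.log(f_dt) if f_dt > 0 else 0.0   # dampened term frequency
    idf = math.log(N / f_t)                        # rarer terms weigh more
    return tf * idf
```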
  
     Evaluation    
     Precision: number of returned relevant results/number of returned results 
  
     Recall: number of returned relevant results/total number of relevant results (often useless in an IR context, since the full set of relevant documents is generally unknown)
 Precision at k (P@k): number of returned relevant results in top k / k 
  
     (Recall at k usually not meaningful) 
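For instance, P@k can be computed directly from a ranked result list and a set of known-relevant documents:

```python
def precision_at_k(ranked, relevant, k):
    """P@k: fraction of the top-k returned documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

# precision_at_k(["d3", "d7", "d1", "d9"], {"d3", "d1"}, k=4) -> 0.5
```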
  
     Web Search    
     Crawling (gather data from the web)
     Basic challenge: there is no central index of URLs of interest. 
  
     Secondary challenges:    
     Some websites return the same content as a new URL at each visit.
  
     Some pages never return status 'done' on access.
  
     Some websites are not intended to be crawled. 
  
     Much web content is generated on-the-fly from databases, which can be costly for the content provider, so excessive numbers of visits to a site are unwelcome.
     Some content has a short lifespan. 
  
     Some regions and content providers have low bandwidth.   
     URL management: the crawler must ensure that    
     Every page is visited eventually  
     Synonym URLs are disregarded  
     Significant or dynamic pages are visited sufficiently frequently
  
     The crawler isn't cycling indefinitely in a single web site (caught in a crawler trap)  
     Parsing (translate data into a canonical form)
 The aim of parsing is to reduce a web page, or a query, to a sequence of tokens  
     Canonicalisation (having a single representation of a text)
     Tokenisation (decomposing the larger document into smaller information-bearing units)
     Normalisation (a form which is generally representative of other instances of the same idea)
 Tokenisation steps    
     Folding case    
     convert to lowercase  
     Stripping punctuation    
     remove non-alphabetic characters  
     Stemming    
     remove affixes  
     Splitting into tokens    
     split on whitespace  
     Zoning    
     segment documents into discrete zones, e.g. title, anchor text, headings  
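A minimal sketch of folding case, stripping punctuation and splitting into tokens (stemming and zoning are omitted, as they need more machinery):

```python
import re

def tokenise(text):
    """Case-fold, strip non-alphabetic characters, split on whitespace."""
    text = text.lower()                    # folding case
    text = re.sub(r"[^a-z\s]", " ", text)  # stripping punctuation
    return text.split()                    # splitting into tokens
```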
     Indexing (build an efficient data structure)
     inverted index: a collection of lists, one per term, recording the identifiers of the documents containing that term.
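A toy inverted-index builder over pre-tokenised documents; document ids are simply list positions here:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: list of token lists; returns term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for t in tokens:
            index[t].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}
```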
 Querying (process data in response to queries)
 Boolean Querying     
     fetch each term  
     use intersection for AND  
     use union for OR  
     take complement for NOT  
     ignore within-document frequencies  
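For example, AND can be evaluated by merging two sorted inverted lists; union (OR) and complement (NOT) follow the same merge pattern:

```python
def intersect(a, b):
    """Merge-intersect two sorted inverted lists, as used for AND."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```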
     Ranked Querying     
     accumulator (the partial sum of the dot product)
 for each query term t: calculate w(q,t) and fetch the inverted list for t  
     for each document d in the list: calculate w(d,t) and set A_d = A_d + w(q,t) * w(d,t)  
     normalise: set A_d = A_d / W_d  
     identify greatest A and return corresponding documents  
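Putting the steps together, a sketch of accumulator-based ranking; the index layout and the precomputed weights are assumptions for the sketch:

```python
from collections import defaultdict

def ranked_query(query_weights, index, W, k=10):
    """Accumulator-based ranked retrieval.
    query_weights: term t -> w(q,t)
    index:         term t -> list of (d, w(d,t)) pairs
    W:             document d -> normalisation weight W_d"""
    A = defaultdict(float)                  # one accumulator per candidate doc
    for t, w_qt in query_weights.items():
        for d, w_dt in index.get(t, []):
            A[d] += w_qt * w_dt             # A_d = A_d + w(q,t) * w(d,t)
    ranked = sorted(((a / W[d], d) for d, a in A.items()), reverse=True)
    return ranked[:k]                       # greatest A_d / W_d first
```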
     accumulator cost    
     limiting: process the most important terms first, namely the ones with highest IDF
     thresholding: only create an accumulator when the TF-IDF value is above some threshold
 Add-ons    
     Phrase Queries    
     returning only documents where the query terms appear in sequence  
     use a "positional index": an index that also records the positions (indices within the document) at which each term occurs (see the sketch below)  
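A toy positional index plus a two-term phrase check:

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> doc_id -> list of positions; docs is a list of token lists."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in enumerate(docs):
        for pos, t in enumerate(tokens):
            index[t][doc_id].append(pos)
    return index

def phrase_match(index, t1, t2):
    """Doc ids where t2 appears immediately after t1."""
    return [d for d, ps in index[t1].items()
            if any(p + 1 in index[t2].get(d, []) for p in ps)]
```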
     Link Analysis
    
     a weighting (or re-ranking) procedure   
     based on the link structure of the Web  
     assumption:  popular pages are more likely to be relevant   
     PageRank    
     we begin at all pages with equal probability  
     traversal     
     follow one of the outgoing links from this page, with even probability
  
     teleport    
     navigating via entering a URL, with even probability  
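A power-iteration sketch of PageRank; the damping factor d = 0.85 (so 1-d is the teleport probability) and the dangling-page handling are conventional assumptions:

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank.
    links: page -> list of pages it links to (every page appears as a key)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}         # begin at all pages with equal probability
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}  # teleport share
        for p, outs in links.items():
            targets = outs if outs else pages  # a dangling page teleports everywhere
            for q in targets:
                new[q] += d * rank[p] / len(targets)
        rank = new
    return rank
```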
     Machine Learning    
     Naive Bayes    
     assumption: each attribute is independent of all of the other attributes, given the class under consideration  
     maximum likelihood estimation (hedge using smoothing)  
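A categorical Naive Bayes sketch, with add-alpha (Laplace) smoothing as the hedge against zero counts:

```python
import math
from collections import Counter

def train_nb(X, y, alpha=1.0):
    """Categorical Naive Bayes with add-alpha smoothing.
    X: list of attribute tuples; y: parallel list of class labels."""
    class_counts = Counter(y)
    n, n_attrs = len(y), len(X[0])
    counts = Counter()                     # (class, attr index, value) -> count
    values = [set() for _ in range(n_attrs)]
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[(c, i, v)] += 1
            values[i].add(v)

    def predict(xs):
        best, best_lp = None, float("-inf")
        for c, nc in class_counts.items():
            lp = math.log(nc / n)          # log prior P(c)
            for i, v in enumerate(xs):     # independence assumption: product of P(x_i | c)
                lp += math.log((counts[(c, i, v)] + alpha)
                               / (nc + alpha * len(values[i])))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

    return predict
```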
     Decision Tree    
     GINI Index    
     1 - sum(p_i^2)  
     best: 0; worst: 0.5 (two classes)  
     Entropy    
     -sum(p_i log(p_i))  
     best: 0; worst: 1 (two classes, log base 2)  
     Information Gain: Entropy(parent) - weighted sum of the children's entropies
  
     Error    
     1 - max(p_i)  
     best: 0; worst: 0.5 (two classes)  
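The three impurity measures and information gain, written out for an arbitrary number of classes:

```python
import math

def gini(probs):
    """Gini index: 1 - sum(p_i^2)."""
    return 1 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy: -sum(p_i * log2(p_i)), taking 0*log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def error(probs):
    """Classification error: 1 - max(p_i)."""
    return 1 - max(probs)

def information_gain(parent_probs, children):
    """IG = Entropy(parent) - size-weighted mean of children's entropies.
    children: list of (probs, size) pairs, one per split branch."""
    total = sum(size for _, size in children)
    return entropy(parent_probs) - sum(
        size / total * entropy(probs) for probs, size in children)
```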
     Pros    
     inexpensive to construct  
     extremely fast at classifying  
     easy to interpret  
     good accuracy  
     Cons    
     restrictive  
     too specific and complex  
     Random Forests    
     Bagging    
     build a number of Decision Trees by re-sampling the data  
     randomly select (with replacement) N instances   
     build the tree as usual  
     classify the test instance by voting   
     for each node in the tree, we randomly select some subset of the possible attributes  
     Pros    
     deal with outlier instances in the original dataset  
     overcome the problem of irrelevant attributes   
     build many trees in a reasonable amount of time  
     K-Nearest Neighbour    
     Euclidean distance (p,q)    
     Minkowski distance (p,q,k)   
     Manhattan distance  
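The three distances collapse into one formula; Minkowski with k = 1 gives Manhattan and k = 2 gives Euclidean:

```python
def minkowski(p, q, k=2):
    """Minkowski distance of order k between two numeric vectors."""
    return sum(abs(a - b) ** k for a, b in zip(p, q)) ** (1 / k)

# minkowski((0, 0), (3, 4), k=2) -> 5.0 (Euclidean)
# minkowski((0, 0), (3, 4), k=1) -> 7.0 (Manhattan)
```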
     Support Vector Machine    
     best line(hyper-plane) that divides two classes  
     linear separability  
     maximum margin  
     soft margin    
     allow some misclassified points if that yields a larger margin  
     Kernel functions    
     map the data into a higher-dimensional space where it becomes linearly separable  
     "multi-class" classifier    
     build multiple models    
     one versus many  
     one versus one  
     Evaluation    
     Accuracy    
     (TP+TN)/(TP+TN+FP+FN)  
     Precision    
     TP/(TP+FP)  
     Recall (Sensitivity)    
     TP/(TP+FN)  
     Specificity    
     TN/(TN+FP)  
     F1 Score    
     2*Precision*Recall/(Precision+Recall)  
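The same metrics written as functions of the confusion-matrix counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Confusion-matrix metrics with explicit bracketing."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   precision,
        "recall":      recall,
        "specificity": tn / (tn + fp),
        "f1":          2 * precision * recall / (precision + recall),
    }
```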
     Data Mining    
     main modes    
     classification    
     e.g. choose best label  
     clustering    
     e.g. find group of something  
     regression    
     e.g. predict sea level  
     association rule mining    
     e.g. find purchases predictive of other purchases  
     sequential discovery    
     e.g. find weather patterns in a given city  
     outlier detection    
     e.g. find erroneous values in a collection of data  
     Clustering    
     unsupervised machine learning (absence of labelled instances)  
     partitional (flat) & hierarchical (nested)  
     fuzzy & non-fuzzy, partial & complete, heterogeneous & homogeneous etc.  
     well-separated, center-based, contiguous, density-based  
     K-means  
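A sketch of K-means (Lloyd's algorithm) on points represented as numeric tuples; random initialisation is one simple choice among many:

```python
import random

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        for j, cluster in enumerate(clusters):
            if cluster:
                centroids[j] = tuple(sum(d) / len(cluster) for d in zip(*cluster))
    return centroids, clusters
```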
     Association Rules    
     basic elements    
     Support count (σ) 
  
     Support  
     Confidence(c)  
     threshold minsup, minconf  
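Support and confidence written directly over a list of transactions (each a set of items):

```python
def support(itemset, transactions):
    """Support: fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs: support(lhs | rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# baskets = [{"bread", "milk"}, {"bread", "nappies", "beer"}, {"milk", "beer"}]
# confidence({"bread"}, {"beer"}, baskets) -> 0.5
```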
     Frequent Itemset Generation Strategies
 Reduce the number of candidates (M) 
    
     Complete search: M = 2^d (d = number of distinct items) 
  
     Use pruning techniques   
     Reduce the number of transactions (N) 
    
     Used by DHP (Direct Hashing and Pruning) and vertical-based mining algorithms
 Reduce the number of comparisons (NM) 
    
     Use efficient data structures to store candidates/transactions 
  
     No need to match every candidate against every transaction 
  
     Apriori Algorithm 
    
     Generate frequent itemsets of length k = 1
     Repeat until no new frequent itemsets are found:    
     Generate length (k+1) candidate itemsets from the length-k frequent itemsets  
     Prune candidates that contain an infrequent subset of length k  
     Count the support of each remaining candidate by scanning the DB 
     Eliminate candidates that are infrequent, leaving only the frequent (k+1)-itemsets  
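A compact Apriori sketch following these steps; generating candidates by unioning all pairs is a simplification of the usual prefix-based join:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Apriori sketch; transactions are sets of items, minsup a fraction."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_only(candidates):
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= minsup}

    items = {i for t in transactions for i in t}
    level = frequent_only(frozenset([i]) for i in items)   # k = 1
    all_frequent = set(level)
    k = 2
    while level:
        # generate: union pairs of frequent (k-1)-itemsets into k-candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # count support by scanning the DB, eliminate the infrequent
        level = frequent_only(candidates)
        all_frequent |= level
        k += 1
    return all_frequent
```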