Knowledge Technology
COMP90049 Knowledge Technologies University of Melbourne IT 2018 S1
KT
String Search
Exact String Search
Pattern Matching: RegEx
Naive Matching
Boyer-Moore matching
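Naive matching slides the pattern one position at a time; a minimal sketch (Boyer-Moore improves on this by scanning right-to-left and skipping ahead with bad-character/good-suffix heuristics, not shown here):

```python
def naive_match(text, pattern):
    """Return every start index where pattern occurs in text.

    Naive O(n*m) scan: try an alignment at each position and compare.
    """
    hits = []
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            hits.append(i)
    return hits
```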
Approximate String Search
Neighbourhood Search
Edit Distance
Global
Local
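Global edit distance (Levenshtein) can be sketched with the standard dynamic-programming table; unit costs for insert, delete, and replace are assumed (local edit distance differs only in allowing free starts/ends):

```python
def edit_distance(s, t):
    """Global (Levenshtein) edit distance with unit operation costs."""
    m, n = len(s), len(t)
    # d[i][j] = distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # match/replace
    return d[m][n]
```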
N-Gram Distance
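N-gram distance compares the multisets of character n-grams of two strings: |G_s| + |G_t| - 2|G_s ∩ G_t|. A sketch, assuming the common convention of padding each string with n-1 '#' symbols (conventions vary):

```python
def ngrams(s, n=2):
    """Character n-grams of s, padded with '#' so edges are counted."""
    s = "#" * (n - 1) + s + "#" * (n - 1)
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def ngram_distance(s, t, n=2):
    """|G_s| + |G_t| - 2|G_s intersect G_t|, with multiset intersection."""
    gs, gt = ngrams(s, n), ngrams(t, n)
    common = 0
    pool = list(gt)
    for g in gs:
        if g in pool:          # count each shared n-gram once per copy
            pool.remove(g)
            common += 1
    return len(gs) + len(gt) - 2 * common
```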
Phonetics (Soundex)
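Soundex maps a word to a letter plus three digits, so that similar-sounding words collide. A sketch of the classic rules (keep the first letter; drop vowels, h, w, y; collapse adjacent duplicate codes; h/w do not separate duplicates):

```python
# digit codes for consonant groups
CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
    "4": "l", "5": "mn", "6": "r"}.items() for c in letters}

def soundex(word):
    """Classic 4-character Soundex code, e.g. 'Robert' -> 'R163'."""
    word = word.lower()
    first = word[0].upper()
    encoded = []
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        if ch not in "hw":       # h/w are transparent between duplicates
            prev = code          # vowels reset prev, so repeats re-count
    return (first + "".join(encoded) + "000")[:4]
```

Note 'Robert' and 'Rupert' map to the same code, which is the point: a Soundex index retrieves phonetically plausible spelling variants.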
Evaluation
Accuracy: fraction of correct responses
Precision: fraction of correct responses among attempted responses
Recall: proportion of words with a correct response (somewhere)
Information retrieval
Information Seeking
Information needs
Informational: global warming
Factoid: melting point of lead
Topic tracking: Trump administration
Navigational: university of Melbourne
Transactional: Macbook Air
Geospatial: carlton restaurants
Answers
relevant
Document Matching
Boolean Query
Pros
allow expression of complex concepts.
reduction in time spent reading
Cons
no ranking and no control over result set size
difficult to incorporate useful heuristics
Ranked Retrieval
Evidence of similarity
query terms in the title
query terms as phrases
recently created
translate between languages
TF-IDF Model
More weight is given to a document where a query term appears many times. (Term frequency or TF.)
Less weight is given to a term that appears in many documents.(Inverse document frequency or IDF.)
Less weight is given to a document that has many terms.
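The three weighting principles above can be combined in a minimal scoring sketch. TF-IDF has many variants; the log-dampened TF and square-root length normalisation here are assumptions, not the only formulation:

```python
import math

def tfidf_score(query, doc, docs):
    """Score one document (a token list) against a query over collection docs."""
    N = len(docs)
    score = 0.0
    for term in query:
        tf = doc.count(term)                       # term frequency in this doc
        if tf == 0:
            continue
        df = sum(1 for d in docs if term in d)     # document frequency
        idf = math.log(N / df)                     # rarer term => more weight
        score += (1 + math.log(tf)) * idf
    return score / math.sqrt(len(doc))             # penalise long documents
```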
Precision: number of returned relevant results/number of returned results
Recall: number of returned relevant results/total number of relevant results(often useless in an IR context)
Precision at k (P@k):number of returned relevant results in top k/k
(Recall at k usually not meaningful)
Web Search
Crawling(gather data from web)
Basic challenge: there is no central index of URLs of interest.
Secondary challenges:
Some websites return the same content as a new URL at each visit.
Some pages never return status 'done' on access.
Some websites are not intended to be crawled.
Some content has a short lifespan.
Some regions and content providers have low bandwidth.
URLs:
Every page is visited eventually
Synonym URLs are disregarded
Significant or dynamic pages are visited sufficiently frequently
The crawler isn't cycling indefinitely in a single web site (caught in a crawler trap)
Parsing(translate data into a canonical form)
Canonicalisation(having a single representation of a text)
Tokenisation(decomposing the larger document into smaller information-bearing units)
Normalisation(a form which is generally representative of other instances of the same idea)
Tokenisation steps
Folding case
transfer to lowercase
Stripping punctuation
remove non-alphabetic characters
Stemming
remove affixes
Splitting into tokens
base on whitespace
Zoning
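The tokenisation steps above might look like this as a toy pipeline; the suffix-stripping "stemmer" is a crude stand-in for a real one such as the Porter stemmer:

```python
import re

def stem(token):
    """Toy stemmer: strip a few common English suffixes (affixes)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenise(text):
    """Case-fold, strip punctuation, split on whitespace, then stem."""
    text = text.lower()                      # folding case
    text = re.sub(r"[^a-z\s]", " ", text)    # stripping non-alphabetic chars
    tokens = text.split()                    # splitting on whitespace
    return [stem(t) for t in tokens]
```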
Indexing(Build efficient data structure)
Querying(process data in response to queries)
Boolean Querying
fetch each term
use intersection for AND
use union for OR
take complement for NOT
ignore within-document frequencies
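A toy inverted index makes the Boolean set operations concrete; the index contents and document IDs here are made-up sample data:

```python
# inverted index: term -> set of doc IDs (within-document frequency ignored)
index = {
    "cat": {1, 2, 5},
    "dog": {2, 3},
    "mat": {1, 5},
}
ALL_DOCS = {1, 2, 3, 4, 5}

def boolean_and(*terms):
    """AND: intersect the posting sets of every term."""
    return set.intersection(*(index.get(t, set()) for t in terms))

def boolean_or(*terms):
    """OR: union of the posting sets."""
    return set.union(*(index.get(t, set()) for t in terms))

def boolean_not(term):
    """NOT: complement against the full collection."""
    return ALL_DOCS - index.get(term, set())
```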
Ranked Querying
accumulator(the partial sum of the dot product)
normalise: set A_d = A_d / W_d (divide each accumulator by that document's weight)
identify greatest A and return corresponding documents
accumulator cost
thresholding: only create an accumulator when the TF-IDF value is above some threshold
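The accumulator scheme above can be sketched as follows; the TF-IDF variant and the per-contribution threshold semantics are assumptions:

```python
import math

def ranked_query(query, index, doc_weights, N, threshold=0.0):
    """Accumulator-based ranked retrieval.

    index: term -> {doc_id: term frequency}; doc_weights: doc_id -> W_d.
    """
    acc = {}                                     # partial dot-product sums
    for term in query:
        postings = index.get(term, {})
        idf = math.log(N / len(postings)) if postings else 0.0
        for doc_id, tf in postings.items():
            w = (1 + math.log(tf)) * idf
            if w > threshold:                    # thresholding: skip tiny weights
                acc[doc_id] = acc.get(doc_id, 0.0) + w
    for doc_id in acc:
        acc[doc_id] /= doc_weights[doc_id]       # A = A / W
    return sorted(acc.items(), key=lambda kv: -kv[1])
```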
Add-ons
Phrase Queries
returning only documents where the query terms appear in sequence
use a "positional index": an index that also records the positions of each term (indices within the document)
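With a positional index, a two-term phrase query reduces to merging position lists within each candidate document; a minimal sketch:

```python
def phrase_match(positions_a, positions_b):
    """Positions (within one document) where term A is immediately
    followed by term B, given each term's sorted position list."""
    b = set(positions_b)
    return [p for p in positions_a if p + 1 in b]
```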
Link Analysis
a weighting (or re-ranking) procedure
based on the link structure of the Web
assumption: popular pages are more likely to be relevant
PageRank
we begin at all pages with equal probability
traversal
teleport
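A minimal power-iteration sketch of PageRank, combining traversal along out-links with random teleport at probability 1 - d (pages with no out-links are simply ignored here, so this is a simplification of full PageRank):

```python
def pagerank(links, d=0.85, iters=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # begin at all pages equally
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}    # teleport component
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] += d * share              # traversal component
        rank = new
    return rank
```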
Machine Learning
Naive Bayes
maximum likelihood estimation (hedge using smoothing)
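A multinomial Naive Bayes sketch with Laplace (add-alpha) smoothing, which hedges the maximum-likelihood estimate against features unseen in a class:

```python
from collections import Counter, defaultdict
import math

def train_nb(instances, labels, alpha=1.0):
    """instances: lists of feature tokens; labels: class per instance."""
    counts = defaultdict(Counter)      # class -> feature counts
    class_counts = Counter(labels)
    vocab = set()
    for feats, y in zip(instances, labels):
        counts[y].update(feats)
        vocab.update(feats)
    return counts, class_counts, vocab, alpha

def predict_nb(model, feats):
    """argmax over classes of log prior + smoothed log likelihoods."""
    counts, class_counts, vocab, alpha = model
    n = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for y, cc in class_counts.items():
        lp = math.log(cc / n)
        total = sum(counts[y].values())
        for f in feats:
            lp += math.log((counts[y][f] + alpha) / (total + alpha * len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```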
Decision Tree
Gini Index
1 - sum(p_i^2)
best: 0; worst: 0.5 (for two classes)
Entropy
-sum(p_i log2(p_i))
best: 0; worst: 1 (for two classes)
Information Gain: Entropy(parent) - weighted sum of children's Entropy
Error
1 - max(p_i)
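The three impurity measures above, over a node's class-probability distribution:

```python
import math

def gini(ps):
    """Gini index: 0 for a pure node, 0.5 for a 50/50 binary split."""
    return 1 - sum(p * p for p in ps)

def entropy(ps):
    """Entropy in bits: 0 for a pure node, 1 for a 50/50 binary split."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def error(ps):
    """Classification error: 1 minus the majority-class probability."""
    return 1 - max(ps)
```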
inexpensive to construct
extremely fast at classifying
easy to interpret
good accuracy
restrictive
too specific and complex
Random Forests
Bagging
build a number of Decision Trees by re-sampling the data
randomly select (with replacement) N instances
build the tree as usual
classify the test instance by voting
deal with outlier instances in the original dataset
overcome the problem of irrelevant attributes
build many trees in a reasonable amount of time
K-Nearest Neighbour
Manhattan distance
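K-NN with Manhattan (L1) distance can be sketched as majority voting among the k nearest training instances:

```python
from collections import Counter

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=3):
    """train: list of (feature tuple, label); classify query by vote."""
    nearest = sorted(train, key=lambda inst: manhattan(inst[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```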
Support Vector Machine
best line (hyperplane) that divides two classes
linear separability
maximum margin
soft margin
Kernel functions
the data becomes linearly separable
"multi-class" classifier
build multiple models
one versus rest (one versus all)
one versus one
Accuracy
(TP+TN) / (TP+TN+FP+FN)
Precision
TP / (TP+FP)
Recall (Sensitivity)
TP / (TP+FN)
Specificity
TN / (TN+FP)
F1 Score
2 * Precision * Recall / (Precision + Recall)
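All five metrics follow directly from the confusion-matrix counts:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, specificity, F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1
```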
Data Mining
main modes
classification
e.g. choose best label
clustering
e.g. find group of something
regression
e.g. predict sea level
association rule mining
e.g. find purchase predictive of other purchase
sequential discovery
e.g. find weather patterns in a given city
outlier detection
e.g. find erroneous values in a collection of data
Clustering
unsupervised machine learning (absence of labelled instances)
partitional (flat) & hierarchical (nested)
K-means
Association Rules
basic elements
Support count (σ): number of transactions containing an itemset
Support (s): fraction of transactions containing an itemset
Confidence (c): fraction of transactions containing the rule's left-hand side that also contain its right-hand side
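The three basic quantities, taking transactions as sets of items:

```python
def support_count(transactions, itemset):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """s(X) = sigma(X) / N."""
    return support_count(transactions, itemset) / len(transactions)

def confidence(transactions, lhs, rhs):
    """c(X -> Y) = sigma(X union Y) / sigma(X)."""
    return support_count(transactions, lhs | rhs) / support_count(transactions, lhs)
```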
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
Complete search: M=2^d
Use pruning techniques
Reduce the number of transactions (N)
Used by DHP (Direct Hashing and Pruning) and vertical-based mining algorithms
Reduce the number of comparisons (NM)
Use efficient data structures to store candidates/transactions
No need to match every candidate against every transaction
Apriori Algorithm
Generate frequent itemsets of length k (k=1)
Generate length (k+1) candidate itemsets from length k frequent itemsets
Prune candidate (k+1)-itemsets that contain an infrequent k-subset
Count the support of each candidate by scanning the DB
Eliminate candidates that are infrequent
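The five steps above can be sketched as a level-wise loop; this is a minimal Apriori, using an absolute minimum support count:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Frequent itemsets (as frozensets) with support count >= minsup."""
    def freq(cands):
        # count support by scanning the DB; eliminate infrequent candidates
        return {c for c in cands
                if sum(1 for t in transactions if c <= t) >= minsup}

    items = {i for t in transactions for i in t}
    level = freq(frozenset([i]) for i in items)          # frequent 1-itemsets
    frequent = set(level)
    k = 1
    while level:
        # join: length-(k+1) candidates from length-k frequent itemsets
        cands = {a | b for a in level for b in level if len(a | b) == k + 1}
        # prune: every k-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k))}
        level = freq(cands)
        frequent |= level
        k += 1
    return frequent
```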