association
2015-11-12 13:32:32 0 举报
AI智能生成
登录查看完整内容
作者其他创作
大纲/内容
Association Rule Mining
concept
Itemset
A collection of one or more items\u00A0
k-itemset
Support count (σ)
Frequency of occurrence of an\u00A0itemset
Frequent Itemset
An itemset whose support is greater\u00A0than or equal to a minsup threshold
Association Rule
Rule Evaluation Metrics
Support (s)
Fraction of transactions that contain\u00A0both X and Y
Confidence (c)
Measures how often items in Yappear in transactions thatcontain X
association rule mining task
support = minsup threshold
confidence = minconf threshold
Brute-force approach
List all possible association ruls
Compute the support and confidence for each rule
Prune rules that fail the\u00A0minus and minconf
Disadvantage: Computionally prohibitive\u00A0
method
Basic Approach of Association Rule
Step 1:find all frequent itemsets\u00A0
Step 2:generate strong association rules from frequent itensets
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database
O(NWM) = M=2^d
Apriori Algorithm
Apriori property
all non-empty subsets of a frequent itemset must also be frequent
Apriori principle holds due to the following property of the support measure
support of an\u00A0items never exceeds the support of its subsets
notation
frequent k-itemset(denoted as Lk):satisfy mini support
candidate k-itemset(denoted as Ck):possible \u00A0frequent k-itemset
Level-wise approach
(k-1)-itemsets are used to explore k-itemsets
prune Ck by subset test
generate Lk by scanning transaction DB
How to Count Supports of candidates
Candidate itemisers are stored in a hash-tree
Leaf node of hash-tree contains a list of itemisers and counts
Interior node contains a hash table
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidats
improvement
reduce passes of transaction database scans
shrink number of candidates
facilitate support counting of candidates
DHP
Observation of performance in association rule mining
initial candidate set generation is key issue to improve
amount of transaction data that must be scanned
Major features of DHP
efficient generation for frequent itemsets
effective reduction on transaction database size
option of reducing #(database scan) required
Partitioning
Observation
Partition size is chosen to be resident in main memory
Observation: any potential frequent itemset appears as a frequent itemset in at least one of the partitions.
Transaction DB is divided into non- overlapping partitions
Twophasesscanning
Firstscan:generatesasetofallpotentially frequent itemsetsEach partition generates the local frequent itemsets
Secondscan:actualsupportismeasuredCollection of local frequent itemset = global candidate itemsetGlobal frequent itemsets are found by scan DB
Advantages:
Adapts to available main memoryEasily parallelizedMaximum number of database scans is two
Disadvantage
May have many candidates during second scan
Sampling
Sample the database and apply Apriori to the sample.
Advantages
Reduces number of database scans to onein the best case and two in worst n \u00A0Scales better
Disadvantages
Potentially large number of candidates in second pass
结束
要点1
要点2
要点3
0 条评论
回复 删除
下一页