首页  思维导图  详情



 



association

2015-11-12 13:32:32   0  举报





AI智能生成

”Association”是一个英文词汇，通常指的是两个或更多的事物、人或概念之间的联系或关联。这种联系可以是物理的，也可以是逻辑的，或者是情感上的。例如，我们可能会说，”苹果和水果有关联，因为苹果是一种水果”；或者”阅读和学习有关联，因为阅读可以帮助我们学习新的知识”。在社交环境中，”协会”这个词通常用来指代一个由共享相同兴趣或目标的人组成的团体，他们通过共同的努力来实现这些目标。例如，”环保协会”就是一个由关心环境保护的人组成的组织，他们通过各种活动来提高人们对环保问题的认识。总的来说，”association”这个概念强调的是联系和互动，无论是在物质世界还是在社会世界中。

模板推荐

作者其他创作

大纲/内容

concept

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Itemset

A collection of one or more items

k-itemset

Support count (σ)

Frequency of occurrence of an itemset

Frequent Itemset

An itemset whose support is greater than or equal to a minsup threshold

Association Rule

An implication expression of the form X → Y, where X and Y are itemsets

Rule Evaluation Metrics

Support (s)

Fraction of transactions that contain both X and Y

Confidence (c)

Measures how often items in Y appear in transactions that contain X

association rule mining task

given a set of transactions T, the goal of association rule mining is to find all rules having

support >= minsup threshold

confidence >= minconf threshold

Brute-force approach

List all possible association ruls

Compute the support and confidence for each rule

Prune rules that fail the minus and minconf

Disadvantage: Computionally prohibitive

method

Basic Approach of Association Rule

Step 1:find all frequent itemsets

Step 2:generate strong association rules from frequent itensets

Brute-force approach

Each itemset in the lattice is a candidate frequent itemset

Count the support of each candidate by scanning the database

O(NWM) => M=2^d

Apriori Algorithm

Apriori property

all non-empty subsets of a frequent itemset must also be frequent

Apriori principle holds due to the following property of the support measure

For any(X,Y) : (X include in Y ) => s(X) >= s(Y)

support of an items never exceeds the support of its subsets

notation

frequent k-itemset(denoted as Lk):satisfy mini support

candidate k-itemset(denoted as Ck):possible frequent k-itemset

Level-wise approach

(k-1)-itemsets are used to explore k-itemsets

prune Ck by subset test

generate Lk by scanning transaction DB

How to Count Supports of candidates

Candidate itemisers are stored in a hash-tree

Leaf node of hash-tree contains a list of itemisers and counts

Interior node contains a hash table

Challenges

Multiple scans of transaction database

Huge number of candidates

Tedious workload of support counting for candidats

improvement

reduce passes of transaction database scans

shrink number of candidates

facilitate support counting of candidates

DHP

Observation of performance in association rule mining

initial candidate set generation is key issue to improve

amount of transaction data that must be scanned

Major features of DHP

efficient generation for frequent itemsets

effective reduction on transaction database size

option of reducing #(database scan) required

Partitioning

Observation

Partition size is chosen to be resident in main memory

Observation: any potential frequent itemset appears as a frequent itemset in at least one of the partitions.

Transaction DB is divided into non- overlapping partitions

Twophasesscanning

Firstscan:generatesasetofallpotentially frequent itemsets Each partition generates the local frequent itemsets

Secondscan:actualsupportismeasured Collection of local frequent itemset = global candidate itemset Global frequent itemsets are found by scan DB

Advantages:

Adapts to available main memory Easily parallelized Maximum number of database scans is two

Disadvantage

May have many candidates during second scan

Sampling

Observation

Sample the database and apply Apriori to the sample.

Potentially Large Itemsets (PL): Large itemsets from sample Negative Border (BD - ): Generalization of Apriori-Gen applied to itemsets of varying sizes. Minimal set of itemsets which are not in PL, but whose subsets are all in PL.

Advantages

Reduces number of database scans to one in the best case and two in worst n Scales better

Disadvantages

Potentially large number of candidates in second pass

结束

要点1

要点2

要点3