The data mining Apriori algorithm

The Python 3 code is as follows:

# coding: utf-8
import numpy
# from python_util import fileread  # module not available here; a simple stand-in is defined below

"""
    The required part of the program:
        Create an initial candidate set
        Generate Lk+1 from Lk
        Calculate the support for each candidate set
        Calculate confidence
        generateCandidate(Lk): filter out inappropriate
        Data reading
"""


"""
    Create an initial candidate set
    INPUT:
        Data_set:a list of list stores the data of the transaction
    OUTPUT:
        C1: list, each element is a collection
"""


Def createInitCandidates(data_set):
    C1 = []
    For data in data_set:
        For item in data:
            If [item] not in C1:
                C1.append([item])
    C1.sort()
    Return [frozenset(c) for c in C1]


"""
    Calculation support
    INPUT:
        Data_set: transaction data
        Candidate: candidate set
    OUTPUT:
        
"""


Def calSupport(data_set, candidates):
    # First convert data_set to a collection
    Support = {}

    Datas = [frozenset(data) for data in data_set]
    For candidate in candidates:
        Tmp = 0
        For data in datas:
            If candidate.issubset(data):
                Tmp += 1
        Support[candidate] = tmp/len(data_set)
    Return support


"""
    Filter out less than the minimum support according to the support level
    INPUT:
        Candidates: candidate collection
        Min_support: minimum support
        Support: support for candidates
    OUTPUT:
        New_candidates: is Lk
"""


Def filterCandidates(candidates, min_support, support):
    New_candidates = []
    For candidate in candidates:
        If support[candidate] >= min_support:
            New_candidates.append(candidate)
    New_candidates.sort()
    Return new_candidates


"""
    GnenerateCandidates
    INPUT:
        Lk: a list consists of set
    OUTPUT:
        Candidate: Ck+1
"""


Def generateCandidates(Lk):
    Counting = []
    # Find the length of the current candidate set
    k = len(Lk[0])
    len_Lk = len(Lk)
    #self_joining
    For len_kl1 in range(len_Lk):
        Tmp1 = list(Lk[len_kl1])[: k - 1]
        For len_kl2 in range(len_kl1+1, len_Lk):
            Tmp2 = list(Lk[len_kl2])[: k - 1]
            If tmp1 == tmp2:
                Candidate = Lk[len_kl1].union(Lk[len_kl2])
                Count.append(candidate)
    Counts.sort()
    Return candidate


"""
    Apriori algorithm implementation
    INPUT:
        Data_set: data set, a list consists of list
        Min_support: minimum support
"""


Def Apriori(data_set, min_support):
    # Generate initial item collection
    C1 = createInitCandidates(data_set)
    Support = calSupport(data_set, C1)
    # Filter out less than the minimum support
    L1 = filterCandidates(C1, min_support, support)
    L = [L1]
    k = 2
    While len(L[k-2]) > 0:
        Biscribing = generateCandidates(L[k-2])
        Tmp_support = calSupport(data_set, candidates)
        Lk = filterCandidates(candidates, min_support, tmp_support)
        #support.append(tmp_support)
        Support.update(tmp_support)
        L.append(Lk)
        k += 1
    Return support, L




"""
    Still need to do the work:
    Confidence in output frequent items above the lowest value
"""

Def calConfidence(supportdata,set1,set2):
    Tmp1=[]
    Tmp2=[]
    Set2.update(set1)
    For key in supportdata:
        If (set1 == key):
            Tmp1 = supportdata[key]
        If (set2 == key):
            Tmp2 = supportdata[key]
    If tmp1 == [] or tmp2 == []:
        Print("less than minimum support or does not exist")
        Return 0
    Else:
        Confidence = tmp2/tmp1
        Return confidence
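# A sketch of the remaining work described above (the name generateRules and
# its interface are assumptions, not part of the original code): enumerate
# rules with a single-item consequent from the frequent itemsets and keep
# those whose confidence reaches min_confidence.
def generateRules(support, L, min_confidence):
    rules = []
    for Lk in L[1:]:                      # rules need itemsets with at least 2 items
        for freq_set in Lk:
            for item in freq_set:
                antecedent = freq_set - {item}
                confidence = support[freq_set] / support[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, frozenset([item]), confidence))
    return rules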



if __name__ == "__main__":
    # pathname = input("Please enter a file path: ")
    data_set = fileread("testInput.txt")
    support, L = Apriori(data_set, 0.2)
    print("Support of all itemsets is", support)
    print("Frequent itemsets are", L)
    # print(len(support))
    confidence = calConfidence(support, {1}, {2})
    print(confidence)

where the format of testInput.txt is as follows:

1 2 5
2 4
2 3
1 2 4
1 3
2 3
1 3
1 2 3 5
1 2 3
2 4 5

At the IEEE International Conference on Data Mining (ICDM) held in December 2006, experts at the conference selected the top 10 data mining algorithms of the time; the details can be found in [1]. Several of those top-ten algorithms have already been introduced on this blog.

The Apriori algorithm is a representative algorithm for association rule mining and is itself on that top-ten list. Association rule mining is a very important research direction in data mining and a long-standing topic; its main task is to discover the hidden relationships between items.


Introduction: Data Mining and Machine Learning

People are sometimes confused by the two terms machine learning and data mining. If you open a textbook titled Machine Learning and another titled Data Mining, you will find a great deal of overlapping content: machine learning books talk about decision trees and support vector machines, and data mining books also spend considerable space on decision trees and support vector machines. The two clearly overlap to a large extent, yet examined closely they are indeed different fields.

In general, data mining can be seen as the intersection of databases, machine learning, and statistics. Simply put, for data mining, databases provide data management techniques, while machine learning and statistics provide data analysis techniques. You can therefore think of machine learning as supplying a large part of the arsenal of data mining. Since it is only part of the arsenal, data mining also contains techniques that do not belong to machine learning, and the Apriori algorithm is one such non-machine-learning data mining technique.

As we all know, data mining is the process of extracting information and knowledge that is hidden in advance, but potentially useful, from a large amount of incomplete, noisy, fuzzy, and random data. Machine learning is a class of techniques that uses data to build or train a model and then uses that model for data analysis; the trained model can of course be regarded as the potentially meaningful information and knowledge extracted from the data. With non-machine-learning data mining techniques, we do not build such a model; instead, we try to analyze the information or knowledge hidden behind the data directly from the original data set. When the Apriori algorithm is introduced below, you will feel this characteristic quite clearly.

Basic Concepts

Many commercial companies accumulate a large amount of transaction data in day-to-day operations; for example, a supermarket's checkout counters collect a large amount of customer shopping data every day. The following table gives an example of such a data set, which we usually refer to as market basket transactions. Each row in the table corresponds to a transaction, containing a unique identifier TID and the collection of items purchased by a particular customer. Retailers are interested in analyzing this data in order to understand the buying behavior of their customers. This valuable information can be used to support a variety of practical business applications, such as marketing promotions, inventory management, and customer relationship management.

Let $I=\{i_1,i_2,\cdots,i_d\}$ be the set of all items in the basket data, and $T=\{t_1,t_2,\cdots,t_N\}$ be the set of all transactions. A collection containing zero or more items is called an itemset. If an itemset contains $k$ items, it is called a $k$-itemset. Obviously, the itemset contained in each transaction $t_i$ is a subset of $I$.

An association rule is an implication of the form $X \rightarrow Y$, where $X$ and $Y$ are disjoint itemsets, i.e., $X \cap Y = \emptyset$. The strength of an association rule can be measured by its support and confidence. Support determines how frequently the rule applies to the data set, while confidence determines how often $Y$ appears in transactions that contain $X$. The two metrics, $s$ (the fraction of transactions that contain both $X$ and $Y$) and $c$ (how often items in $Y$ appear in transactions that contain $X$), are defined as follows:

$$s(X \rightarrow Y)=\frac{\sigma(X \cup Y)}{N}, \qquad c(X \rightarrow Y)=\frac{\sigma(X \cup Y)}{\sigma(X)}$$

For example, consider the rule $\{\text{Milk, Diaper}\} \rightarrow \{\text{Beer}\}$; it is easy to get:

$$s=\frac{\sigma(\text{Milk, Diaper, Beer})}{|T|}=\frac{2}{5}=0.4, \qquad c=\frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})}=\frac{2}{3}=0.67$$
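To make the definitions concrete, here is a small, self-contained Python check of these two numbers. The five transactions are an assumption for illustration (the classic market-basket example), chosen to be consistent with the counts used here:

# Transactions assumed for illustration (consistent with the counts above).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    # support count: number of transactions containing the itemset
    return sum(itemset <= t for t in transactions)

X = {"Milk", "Diaper"}
Y = {"Beer"}
s = sigma(X | Y) / len(transactions)   # 2 / 5 = 0.4
c = sigma(X | Y) / sigma(X)            # 2 / 3 ≈ 0.67
print(s, c)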

Association rule mining task: given a set of transactions $T$, the goal of association rule mining is to find all rules having

  • support ≥ minsup threshold
  • confidence ≥ minconf threshold

Therefore, most association rule mining algorithms adopt a strategy of decomposing the mining task into the following two main subtasks.

  1. Frequent itemset generation: the goal is to find all itemsets that meet the minimum support threshold; these are called frequent itemsets.
  2. Rule generation: the goal is to extract all high-confidence rules from the frequent itemsets found in the previous step; these rules are called strong rules.

In general, the computational overhead required to generate frequent itemsets is much greater than the computational overhead required to generate rules.

Perhaps the most direct way one might think of to mine association rules is the brute-force method:

  • List all possible association rules
  • Compute the support and confidence for each rule
  • Prune rules that fail the minsup and minconf thresholds

However, the computational cost of the brute-force approach is far too large, so this method is not realistic. A lattice structure is often used to enumerate all possible itemsets; the figure below shows the itemset lattice for $I=\{a, b, c, d, e\}$. In general, excluding the empty set, a data set containing $k$ items can produce $2^k-1$ candidate itemsets. Since $k$ may be very large in practical applications, the search space of itemsets that needs to be explored can be exponentially large.

A naive method for discovering frequent itemsets is to determine the support count of every candidate itemset in the lattice structure. To accomplish this, each candidate must be compared against every transaction, as shown in the figure below: if the candidate is contained in a transaction, its support count is incremented. For example, since the itemset {Bread, Milk} appears in transactions 1, 4, and 5, its support count is incremented three times. The cost of this method can be very high because it requires $O(NMw)$ comparisons, where $N$ is the number of transactions, $M=2^k-1$ is the number of candidate itemsets, and $w$ is the maximum transaction width (that is, the largest number of items in a transaction).
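A minimal sketch of this brute-force counting, on an assumed toy data set, makes the $O(NMw)$ cost visible: every one of the $2^k-1$ candidates is checked against every transaction.

from itertools import combinations

# Toy transactions assumed purely for illustration
transactions = [frozenset(t) for t in (["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"])]
items = sorted(set().union(*transactions))   # k distinct items

support_count = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):        # 2^k - 1 candidates in total
        c = frozenset(candidate)
        # compare the candidate against every transaction
        support_count[c] = sum(c.issubset(t) for t in transactions)

print(len(support_count))   # 2**len(items) - 1 = 31 candidates were counted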

The Apriori Principle

At the end of the previous section we saw that the brute-force approach is not practical, so we must try to reduce the computational cost of generating frequent itemsets. Here we can use support to prune the candidate itemsets, which is the a priori principle that Apriori relies on:

Apriori's Law 1: If a set is a frequent itemset, then all of its subsets are frequent itemsets.

Example: assume the set {A, B} is a frequent itemset, i.e., the number of records in which A and B appear together is greater than or equal to the minimum support min_support; then the number of records containing its subsets {A} and {B} must also be greater than or equal to min_support, i.e., its subsets are frequent itemsets.

Apriori's Law 2: If a set is not a frequent itemset, then none of its supersets are frequent itemsets.

Example: assume the set {A} is not a frequent itemset, i.e., the number of occurrences of A is less than min_support; then the number of occurrences of any superset such as {A, B} must also be less than min_support, so its supersets cannot be frequent itemsets.

The following figure shows that once {A, B} is found to be infrequent, all of its supersets are also infrequent, which means they can all be pruned away.

The Apriori Algorithm and an Example

R. Agrawal and R. Srikant proposed the Apriori algorithm in the literature [2] in 1994. The algorithm is described as follows:

  • Let k = 1
  • Generate frequent itemsets of length k
  • Repeat until no new frequent itemsets are identified 
    • Generate candidate itemsets of length (k+1) from frequent itemsets of length k
    • Prune candidate itemsets containing infrequent subsets of length k
    • Count the support of each candidate by scanning the DB
    • Eliminate candidates that are infrequent, leaving only those that are frequent

In other materials, the following formal description is also common (note that it is consistent with the textual description above):

Below is a concrete example. There are 4 transactions in the database: {A, C, D}, {B, C, E}, {A, B, C, E}, {B, E}. Using min_support = 2 as the support threshold, the frequent itemset we finally select is {B, C, E}. This can be reproduced with the implementation given earlier, as shown below.
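A quick check with the Apriori function from the code listing above (a minimum support count of 2 out of 4 transactions corresponds to a support fraction of 0.5):

data_set = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
support, L = Apriori(data_set, 0.5)
print(L[2])   # the only frequent 3-itemset: {'B', 'C', 'E'}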

The most noteworthy part of the above example is the step from $L_2$ to $C_3$. This is the step marked by the first blue box in the pseudocode, $C_{k+1}=\text{GenerateCandidates}(L_k)$, i.e., the specific strategy used by the Apriori algorithm, which is as follows:

As can be seen, the generation strategy consists of two parts. The first part is self-joining. For example, suppose we have $L_3=\{abc, abd, acd, ace, bcd\}$ (note that it is already sorted). We select two itemsets satisfying the condition that their first $k-1$ items are the same and only the last item differs, and merge them into a new itemset $c$ of $C_{k+1}$. As shown in the figure below, {abc} and {abd} are joined into {abcd}, and {acd} and {ace} are joined into {acde}. The second part of the strategy is pruning: for an itemset $c$ in $C_{k+1}$, let $s$ be a subset of $c$ of size $k$; if $s$ does not exist in $L_k$, then $c$ is deleted from $C_{k+1}$. As shown in the figure below, since the subset {cde} of {acde} does not exist in $L_3$, we remove {acde} from $C_4$; the resulting $C_4$ contains only the single itemset {abcd}. A sketch of this pruning step is given below.
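The generateCandidates implementation above performs only the self-join, so here is a sketch of the pruning half (the function name pruneCandidates is my own and is not part of the original code):

from itertools import combinations

def pruneCandidates(candidates, Lk):
    # Drop any candidate (k+1)-itemset that has a k-subset not present in Lk,
    # following the pruning rule described above.
    Lk_set = set(Lk)
    pruned = []
    for candidate in candidates:
        k = len(candidate) - 1
        if all(frozenset(sub) in Lk_set for sub in combinations(candidate, k)):
            pruned.append(candidate)
    return pruned

With $L_3$ = {abc, abd, acd, ace, bcd}, this removes {acde} (its subset {cde} is not in $L_3$) and keeps {abcd}, matching the example above.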

Returning to the earlier example, from $L_2$ to $C_3$ we can only obtain {B, C, E}. The above is the core idea of the Apriori algorithm. Of course, in a concrete implementation, how to count the supports of the candidates efficiently is also an issue to consider; we skip that discussion here, and interested readers can refer to [3] to learn more.

Ref

[1] Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Philip, S.Y. and Zhou, Z.H., 2008. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), pp.1-37. (http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf)

[2] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile, September 1994. (http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf)

[3] Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining (Chinese edition translated by Fan Ming et al.). Posts & Telecom Press, 2011.