Bank Deposit Prediction Based on Decision Tree

Foreword

This article mainly describes the concrete implementation of a [decision tree], covering the calculation of single-feature information entropy, the calculation of information gain, the construction and saving of the decision tree, and model training and prediction. Finally, the resulting model is evaluated by predicting bank deposits.
Bank deposit [dataset]: Machine Learning – Decision Tree Data 2

1. What is a decision tree?

Decision tree: a tree-shaped, inductive [classification algorithm]. By learning from the training data, it mines rules that are then used to predict the test data.
The decision tree is a typical classification method:
1) the data is first processed, and an inductive algorithm generates readable rules and a decision tree;
2) new data is then classified using those decisions.
Essentially, a decision tree classifies data through a series of rules.

2. Experimental principle

[Calculate the information gain] of each feature separately, take the feature with the largest information gain as the current splitting feature, and split the current table into sub-tables according to each value of that feature. Repeat these steps until the labels in a sub-table contain only a single value, or until a preset depth threshold is reached. After the decision tree has been built, the preprocessed data is fed into the model for prediction, and the predictions are compared with the labels to analyse the results.
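
For reference, the standard definitions behind these steps (where D is the current table, p_k the proportion of label k in D, and D_v the sub-table in which feature a takes the value v) are:

Ent(D) = - sum_k p_k * log2(p_k)
Gain(D, a) = Ent(D) - sum_v (|D_v| / |D|) * Ent(D_v)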

3. The specific implementation

1. Find the information entropy of each feature

The code is as follows (example):

import math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


# Calculate information entropy
def Ent(data, feature, target):  # data: DataFrame, feature: feature name (string), target: label name (string)
    data = data.loc[:, [feature, target]]  # select the feature column and the label column
    data_target = set(data.loc[:, target].values)
    len1 = data.shape[0]  # len1: total number of rows
    sum = 0
    for i in data_target:
        len2 = data[data[target] == i].shape[0]  # len2: number of rows whose label is i
        if len2/len1 != 1:
            sum += -(len2/len1)*math.log2(len2/len1)
    return sum
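
A minimal usage sketch (the toy DataFrame and column names below are made up purely for illustration):

# Illustrative toy data: a 'weather' feature and a 'play' label
toy = pd.DataFrame({
    'weather': ['sunny', 'sunny', 'rainy', 'rainy'],
    'play':    ['yes',   'no',    'no',    'no']
})
print(Ent(toy, 'weather', 'play'))  # entropy of the 'play' labels, about 0.811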

2. Find the information gain of each feature and select the feature with the largest information gain

As the depth of the tree increases, the entropy of the nodes should decrease rapidly. The faster the entropy decreases, the better, since this gives us hope of obtaining a decision tree of minimal height.

The code is as follows (example):

# Calculate information gain
def Gain(data, feature, target):
    # Information entropy before the split
    s0 = Ent(data, feature, target)
    # Information entropy after the split
    tz_set = set(data.loc[:, feature].values)  # all possible values of the feature
    count = data.shape[0]  # number of rows in the table before the split
    sum = 0  # weighted sum of the entropies of the sub-tables
    for i in tz_set:
        data_temp = data[data[feature] == i]
        count_temp = data_temp.shape[0]  # number of rows in the sub-table
        sum += (count_temp/count) * Ent(data_temp, feature, target)
    return s0 - sum  # return the information gain


# Find the feature with the largest information gain
def Get_MaxGain(data, target):
    column = data.columns
    max = 0
    for i in column:  # calculate the information gain of each feature
        if i == target:
            continue
        sum = Gain(data, i, target)
        if sum > max:
            max = sum
            max_feature = i
    return max_feature
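
Continuing the toy example from above (again purely illustrative), the gain of a single feature and the best feature overall can be checked like this:

toy = pd.DataFrame({
    'weather': ['sunny', 'sunny', 'rainy', 'rainy'],
    'windy':   ['yes',   'yes',   'yes',   'no'],
    'play':    ['yes',   'no',    'no',    'no']
})
print(Gain(toy, 'weather', 'play'))  # about 0.311
print(Get_MaxGain(toy, 'play'))      # 'weather', the feature with the largest gain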

3. Implement the tree structure of the decision tree and save it

Since Python does not have a dedicated tree type, I save the tree as a nested list. Each node consists of the feature name followed by pairs of feature value and the sub-tree for that value: [feature name, feature value 1, [sub-tree for value 1], feature value 2, [sub-tree for value 2], …, feature value n, [sub-tree for value n]].
For example, the decision tree structure could be: ['age', 'more than 20 years old', ['height', 'more than 180cm', ['result', 0], 'less than 180cm', ['result', 1]], 'less than 20 years old', ['weight', 'less than 50kg', ['result', 0], 'over 70kg', ['result', 1]]]
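
To illustrate how this encoding is read (using the example tree above), the sub-tree for a given feature value sits immediately after that value in the list:

tree = ['age', 'more than 20 years old', ['height', 'more than 180cm', ['result', 0],
        'less than 180cm', ['result', 1]],
        'less than 20 years old', ['weight', 'less than 50kg', ['result', 0],
        'over 70kg', ['result', 1]]]

k = tree.index('more than 20 years old')  # position of the feature value
subtree = tree[k + 1]                     # the sub-tree that follows it
print(subtree[0])                         # 'height'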

# Recursively build the decision tree structure and save it
def DisTree(data, target, k, ListTree):  # k: current depth, used for formatted output
    ListNode = []  # current node
    if len(set(data.loc[:, target])) == 1:
        ListNode.append('result')  # add a result node, the marker for the bottom level of the tree
        ListNode.append(data.loc[:, target].values[0])
        # print('{}The result is: {}'.format((k-1)*2*' ', data.loc[:, target].values[0]))
        ListTree.append(ListNode)  # add the leaf node to the tree
        return ListTree
    if k == 6:  # the tree has five levels of decision conditions and one level of results
        ListNode.append('result')
        ListNode.append(list(data[target].mode())[0])  # take the mode of the labels; if several are tied, take the first
        # print('{}The result is: {}'.format((k-1)*2*' ', list(data[target].mode())[0]))
        ListTree.append(ListNode)  # add the leaf node to the tree
        return ListTree
    feature = Get_MaxGain(data, target)  # best splitting feature for the current table
    ListNode.append(feature)  # add the feature name
    tz_set = set(data.loc[:, feature].values)  # all possible values of the feature
    for i in tz_set:
        # print('{}When {} is {}'.format((k-1)*2*' ', feature, i))
        ListNode.append(str(i))  # add the feature value
        data_temp = data[data[feature] == i]
        DisTree(data_temp, target, k+1, ListNode)
    ListTree.append(ListNode)  # add this node to the tree
    return ListTree

# Wrapper training function
def fit(data, target):
    model = DisTree(data, target, 1, [])
    return model
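
A quick sketch of calling fit on a toy table (illustrative values only); the returned model is a one-element list whose first item is the nested tree, and the order of the feature values may vary between runs because they come from a set:

toy = pd.DataFrame({
    'weather': ['sunny', 'sunny', 'rainy', 'rainy'],
    'play':    ['yes',   'yes',   'no',    'no']
})
model = fit(toy, 'play')
print(model)  # e.g. [['weather', 'sunny', ['result', 'yes'], 'rainy', ['result', 'no']]]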

4. Determine the result when a decision condition does not appear in the model

Because the data preprocessing is not ideal, prediction sometimes encounters feature values that never appeared during training. In that case the strategy is to stop at the current position and analyse in place: the data point has been matched down to this node, so all of the leaf results below this node are collected, an equal-weight vote is taken over them, and the winning result is used as the prediction for this data point.
For example, when an age of 68 is encountered:
The result is: 0

# Determine the result when a decision condition does not appear in the model
def Not_in_model(model, result):
    if model[0] == 'result':
        result.append(model[1])
        return result
    m = 2
    while m < len(model):
        Not_in_model(model[m], result)
        m += 2
    return result
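
A short sketch of the fallback vote on a hand-written sub-tree (the values below are illustrative):

subtree = ['height', 'more than 180cm', ['result', 0], 'less than 180cm', ['result', 1],
           'about 170cm', ['result', 1]]
votes = Not_in_model(subtree, [])
print(votes)                                     # [0, 1, 1]
print(max(votes, key=lambda v: votes.count(v)))  # 1, the majority result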

5. Prediction function

Using the trained model, each data point to be predicted is handled as follows: take the feature name at the current node of the model, look up the value of that feature in the data point, find the index of that value in the node's list, and add one to the index to obtain the sub-tree of the next level. Continue with the feature name of that sub-tree until a list whose feature name is "result" is reached, which marks the bottom level of the decision tree; the element at index 1 of that list is the prediction.

# Prediction function
def predict(model, data):
    result = []  # the predictions are collected here
    for i in range(len(data)):  # predict each data point
        models = model[0]
        feature = models[0]  # the first feature name to test
        while feature != 'result':
            # print("\n"*5)  # show how the model changes
            # print(models)
            value = str(data.loc[data.index[i], feature])  # value of this feature for the current data point
            try:
                k = models.index(value)  # find the index of value in the current node
            except:  # the value never appeared during training, so vote over the results below this node
                temp = Not_in_model(models, [])
                result.append(max(temp, default='the list is empty', key=lambda v: temp.count(v)))
                break
            models = models[k+1]  # the element right after the value is the sub-tree of the next level
            feature = models[0]
        if models[0] == 'result':
            result.append(models[1])
    return result
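
Putting fit and predict together on a toy table (illustrative values; the unseen value 'cloudy' triggers the majority-vote fallback):

toy = pd.DataFrame({
    'weather': ['sunny', 'rainy', 'overcast', 'sunny'],
    'play':    ['yes',   'no',    'yes',      'yes']
})
model = fit(toy, 'play')
new = pd.DataFrame({'weather': ['rainy', 'cloudy']})
print(predict(model, new))  # ['no', 'yes'] - 'cloudy' was never seen, so its result is a vote over all leaves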

6. Make predictions on bank data

1) Read in and preprocess the data
Analyse the distribution of each continuous-valued feature and discretise it.
Distribution plots were drawn for the features 'balance', 'duration', 'previous', 'campaign' and 'pdays' (figures omitted); the cut points below follow these distributions.

# Data processing
data.drop('day', axis=1, inplace=True)
data.drop('month', axis=1, inplace=True)

# Discretise each continuous feature into 0/1/2 (the highest bin is assigned before the middle one so its rows are not swallowed by the middle rule)
data.loc[data["age"] <= 18, "age"] = 0
data.loc[data["age"] > 50, "age"] = 2
data.loc[data["age"] > 18, "age"] = 1

data.loc[data["balance"] <= 20000, "balance"] = 0
data.loc[data["balance"] > 40000, "balance"] = 2
data.loc[data["balance"] > 20000, "balance"] = 1

data.loc[data["duration"] <= 500, "duration"] = 0
data.loc[data["duration"] > 2000, "duration"] = 2
data.loc[data["duration"] > 500, "duration"] = 1

data.loc[data['previous'] <= 10, 'previous'] = 0
data.loc[data['previous'] > 20, 'previous'] = 2
data.loc[data['previous'] > 10, 'previous'] = 1

data.loc[data['campaign'] <= 10, 'campaign'] = 0
data.loc[data['campaign'] > 20, 'campaign'] = 2
data.loc[data['campaign'] > 10, 'campaign'] = 1

data.loc[data['pdays'] <= 200, 'pdays'] = 0
data.loc[data['pdays'] > 400, 'pdays'] = 2
data.loc[data['pdays'] > 200, 'pdays'] = 1
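
For comparison, the same binning could be written more compactly with pandas' cut; this is only an alternative sketch (with the 'balance' thresholds copied from the rules above), not the code used in this experiment:

# Alternative: bin 'balance' into 0/1/2 with pd.cut using the same cut points
data['balance'] = pd.cut(data['balance'],
                         bins=[-float('inf'), 20000, 40000, float('inf')],
                         labels=[0, 1, 2]).astype(int)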

2) Split the data into a [training set] and a test set, then train the model and predict on the test set

X_train, X_test, y_train, y_test = train_test_split(data, data.loc[:, target], test_size=0.25, random_state=7)
model = fit(X_train, target)
print(model)
result = predict(model, X_test)
score(result, y_test)

4. Results

The processed data, the trained model and the prediction results are shown in the program output (screenshots omitted).

5. Analysis of results

  1. Because the data preprocessing is not precise enough, the prediction error is large.
  2. The model stores many repeated values, and saving the tree structure as a flat list is not an ideal solution (a possible alternative is sketched below).
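
On point 2, one possible alternative (only a sketch, not what the code above does) is to store each internal node as a dict keyed by feature value, which removes the index arithmetic on flat lists:

# Hypothetical dict-based layout, loosely following the example tree from section 3
tree = {
    'feature': 'age',
    'children': {
        'more than 20 years old': {'feature': 'height',
                                   'children': {'more than 180cm': {'result': 0},
                                                'less than 180cm': {'result': 1}}},
        'less than 20 years old': {'result': 0},
    }
}

def predict_one(node, row):
    # Walk down the dicts until a leaf with a 'result' key is reached
    while 'result' not in node:
        node = node['children'][row[node['feature']]]
    return node['result']

print(predict_one(tree, {'age': 'more than 20 years old', 'height': 'less than 180cm'}))  # 1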

6. Source code

import math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


# Calculate information entropy
def Ent(data, feature, target):  # data: DataFrame, feature: feature name (string), target: label name (string)
    data = data.loc[:, [feature, target]]  # select the feature column and the label column
    data_target = set(data.loc[:, target].values)
    len1 = data.shape[0]  # len1: total number of rows
    sum = 0
    for i in data_target:
        len2 = data[data[target] == i].shape[0]  # len2: number of rows whose label is i
        if len2/len1 != 1:
            sum += -(len2/len1)*math.log2(len2/len1)
    return sum

# Calculate information gain
def Gain(data, feature, target):
    # Information entropy before the split
    s0 = Ent(data, feature, target)
    # Information entropy after the split
    tz_set = set(data.loc[:, feature].values)  # all possible values of the feature
    count = data.shape[0]  # number of rows in the table before the split
    sum = 0  # weighted sum of the entropies of the sub-tables
    for i in tz_set:
        data_temp = data[data[feature] == i]
        count_temp = data_temp.shape[0]  # number of rows in the sub-table
        sum += (count_temp/count) * Ent(data_temp, feature, target)
    return s0 - sum  # return the information gain

# Find the feature with the largest information gain
def Get_MaxGain(data, target):
    column = data.columns
    max = 0
    for i in column:  # calculate the information gain of each feature
        if i == target:
            continue
        sum = Gain(data, i, target)
        if sum > max:
            max = sum
            max_feature = i
    return max_feature

# Recursively build the decision tree structure and save it
def DisTree(data, target, k, ListTree):  # k: current depth, used for formatted output
    ListNode = []  # current node
    if len(set(data.loc[:, target])) == 1:
        ListNode.append('result')  # add a result node, the marker for the bottom level of the tree
        ListNode.append(data.loc[:, target].values[0])
        # print('{}The result is: {}'.format((k-1)*2*' ', data.loc[:, target].values[0]))
        ListTree.append(ListNode)  # add the leaf node to the tree
        return ListTree
    if k == 6:  # the tree has five levels of decision conditions and one level of results
        ListNode.append('result')
        ListNode.append(list(data[target].mode())[0])  # take the mode of the labels; if several are tied, take the first
        # print('{}The result is: {}'.format((k-1)*2*' ', list(data[target].mode())[0]))
        ListTree.append(ListNode)  # add the leaf node to the tree
        return ListTree
    feature = Get_MaxGain(data, target)  # best splitting feature for the current table
    ListNode.append(feature)  # add the feature name
    tz_set = set(data.loc[:, feature].values)  # all possible values of the feature
    for i in tz_set:
        # print('{}When {} is {}'.format((k-1)*2*' ', feature, i))
        ListNode.append(str(i))  # add the feature value
        data_temp = data[data[feature] == i]
        DisTree(data_temp, target, k+1, ListNode)
    ListTree.append(ListNode)  # add this node to the tree
    return ListTree

# Wrapper training function
def fit(data, target):
    model = DisTree(data, target, 1, [])
    return model

# Determine the result when a decision condition does not appear in the model
def Not_in_model(model, result):
    if model[0] == 'result':
        result.append(model[1])
        return result
    m = 2
    while m < len(model):
        Not_in_model(model[m], result)
        m += 2
    return result


# Prediction function
def predict(model, data):
    result = []  # the predictions are collected here
    for i in range(len(data)):  # predict each data point
        models = model[0]
        feature = models[0]  # the first feature name to test
        while feature != 'result':
            # print("\n"*5)  # show how the model changes
            # print(models)
            value = str(data.loc[data.index[i], feature])  # value of this feature for the current data point
            try:
                k = models.index(value)  # find the index of value in the current node
            except:  # the value never appeared during training, so vote over the results below this node
                temp = Not_in_model(models, [])
                result.append(max(temp, default='the list is empty', key=lambda v: temp.count(v)))
                break
            models = models[k+1]  # the element right after the value is the sub-tree of the next level
            feature = models[0]
        if models[0] == 'result':
            result.append(models[1])
    return result

def score(result, target):
    error = 0
    for i in range(len(result)):
        if result[i] != target[target.index[i]]:
            # print(target[target.index[i]], result[i])
            error += 1
    print(result)
    print('The number of errors is: {}, the error rate is: {}'.format(error, error/len(result)))

def test():
    data = pd.read_csv(r'D:\python111\machine learning\course design\bank.csv')
    target = data.columns.values[-1]

    # Data processing
    data.drop('day', axis=1, inplace=True)
    data.drop('month', axis=1, inplace=True)

    data.loc[data["age"]<=18,"age"]=0
    data.loc[data["age"]>50,"age"]=2
    data.loc[data["age"]>18,"age"]=1

    data.loc[data["balance"]<=20000,"balance"]=0
    data.loc[data["balance"]>40000,"balance"]=2
    data.loc[data["balance"]>20000,"balance"]=1

    data.loc[data["duration"]<=500,"duration"]=0
    data.loc[data["duration"]>2000,"duration"]=2
    data.loc[data["duration"]>500,"duration"]=1

    data.loc[data['previous']<=10,'previous']=0
    data.loc[data['previous']>20,'previous']=2
    data.loc[data['previous']>10,'previous']=1

    data.loc[data['campaign']<=10,'campaign']=0
    data.loc[data['campaign']>20,'campaign']=2
    data.loc[data['campaign']>10,'campaign']=1

    data.loc[data['pdays']<=200,'pdays']=0
    data.loc[data['pdays']>400,'pdays']=2
    data.loc[data['pdays']>200,'pdays']=1

    print(data)
    X_train, X_test, y_train, y_test = train_test_split(data, data.loc[:, target],test_size=0.25, random_state=7)
    model = fit(X_train, target)
    print(model)
    result = predict(model, X_test)
    score(result, y_test)

test()
