
# Foreword

This article walks through a concrete implementation of a [decision tree]: computing the information entropy of a single feature, computing information gain, building and saving the decision tree, training the model, and making predictions. Finally, the model is evaluated by predicting bank deposit subscriptions.
Bank deposit [dataset]: Machine Learning – Decision Tree Data 2

# 1. What is a decision tree?

Decision tree: a tree-shaped inductive [classification algorithm]. By learning from the training data, it mines rules that are then used to predict the test data.
The decision tree is a classic classification method:
1) The data is first processed, and an inductive algorithm generates readable rules and a decision tree;
2) new data is then classified using those rules.
In essence, a decision tree classifies data through a series of rules.

# 2. Experimental principle

[Calculate the information gain] of each feature separately, take the feature with the largest information gain as the current splitting feature, and partition the current table into sub-tables according to each value of that feature. Repeat these steps until the labels in a table contain only one value, or a preset depth threshold is reached. After the decision tree is built, the preprocessed test data is fed into the model for prediction, and the predictions are compared with the labels to evaluate the results.
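The two quantities in the procedure above can be checked by hand on a toy label set. The following standalone sketch uses an invented 9/5 label split (not the bank dataset) to compute the entropy before a split and the gain of one illustrative binary split:

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c)/n) * math.log2(labels.count(c)/n)
                for c in set(labels))

# Hypothetical parent set: 9 positive and 5 negative labels
labels = ['yes'] * 9 + ['no'] * 5
print(round(entropy(labels), 3))  # 0.94

# An illustrative binary split: information gain is the parent entropy
# minus the size-weighted average of the child entropies.
left = ['yes'] * 6 + ['no'] * 1
right = ['yes'] * 3 + ['no'] * 4
gain = entropy(labels) - (len(left)/14) * entropy(left) - (len(right)/14) * entropy(right)
print(round(gain, 3))  # 0.152
```

The split is worth making exactly when this gain is positive; the feature whose split maximizes it is chosen at each node.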

# 3. Specific implementation

## 1. Find the information entropy of each feature

The code is as follows (example):

```
import math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Calculate information entropy
def Ent(data, feature, target):  # data: DataFrame, feature: feature name (string), target: label name (string)
    data = data.loc[:, [feature, target]]  # select the feature column and the label column
    data_target = set(data.loc[:, target].values)
    len1 = data.shape[0]  # len1: total number of rows
    ent = 0
    for i in data_target:
        len2 = data[data[target] == i].shape[0]  # len2: number of rows with label i
        if len2/len1 != 1:
            ent += -(len2/len1)*math.log2(len2/len1)
    return ent
```
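As a quick sanity check, here is a hypothetical usage of `Ent` (the function is repeated so the snippet runs standalone, and the toy frame `df` is invented for illustration). A 2/2 label split should give the maximum entropy of 1 bit:

```python
import math
import pandas as pd

def Ent(data, feature, target):  # as defined in the section above
    data = data.loc[:, [feature, target]]
    len1 = data.shape[0]
    ent = 0
    for i in set(data.loc[:, target].values):
        len2 = data[data[target] == i].shape[0]
        if len2/len1 != 1:
            ent += -(len2/len1)*math.log2(len2/len1)
    return ent

# Invented toy frame: a 2/2 label split has the maximum entropy of 1 bit
df = pd.DataFrame({'age': [0, 0, 1, 1], 'y': ['yes', 'no', 'yes', 'no']})
print(Ent(df, 'age', 'y'))  # 1.0
```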

## 2. Find the information gain of each feature and select the feature with the largest gain

As the depth of the tree increases, the entropy at each node should decrease rapidly; the faster the entropy decreases, the better, since this yields the shallowest possible decision tree.

The code is as follows (example):

```
# Calculate information gain
def Gain(data, feature, target):
    # information entropy before the split
    s0 = Ent(data, feature, target)
    # information entropy after the split
    tz_set = set(data.loc[:, feature].values)  # all possible values of the feature
    count = data.shape[0]  # number of rows before the split
    total = 0  # weighted sum of the entropies of the sub-tables
    for i in tz_set:
        data_temp = data[data[feature] == i]
        count_temp = data_temp.shape[0]  # number of rows in this sub-table
        total += (count_temp/count) * Ent(data_temp, feature, target)
    return s0 - total  # information gain

# Find the feature with the largest information gain
def Get_MaxGain(data, target):
    column = data.columns
    max_gain = 0
    max_feature = None  # avoids an unbound name when no feature has positive gain
    for i in column:  # compute the information gain of each feature
        if i == target:
            continue
        gain = Gain(data, i, target)
        if gain > max_gain:
            max_gain = gain
            max_feature = i
    return max_feature
```
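A quick check of `Get_MaxGain` on an invented four-row table: a feature that perfectly predicts the label should beat a pure-noise feature. The three functions are repeated so the snippet runs standalone:

```python
import math
import pandas as pd

def Ent(data, feature, target):
    data = data.loc[:, [feature, target]]
    len1 = data.shape[0]
    ent = 0
    for i in set(data.loc[:, target].values):
        len2 = data[data[target] == i].shape[0]
        if len2/len1 != 1:
            ent += -(len2/len1)*math.log2(len2/len1)
    return ent

def Gain(data, feature, target):
    s0 = Ent(data, feature, target)
    count = data.shape[0]
    total = 0
    for i in set(data.loc[:, feature].values):
        data_temp = data[data[feature] == i]
        total += (data_temp.shape[0]/count) * Ent(data_temp, feature, target)
    return s0 - total

def Get_MaxGain(data, target):
    max_gain, max_feature = 0, None
    for i in data.columns:
        if i == target:
            continue
        gain = Gain(data, i, target)
        if gain > max_gain:
            max_gain, max_feature = gain, i
    return max_feature

# Invented table: 'weather' perfectly predicts 'y', 'coin' is pure noise
df = pd.DataFrame({
    'weather': ['sun', 'sun', 'rain', 'rain'],
    'coin':    ['h', 't', 'h', 't'],
    'y':       ['yes', 'yes', 'no', 'no'],
})
print(Get_MaxGain(df, 'y'))  # 'weather'
```

`weather` splits the table into two pure sub-tables (gain 1.0 bit), while `coin` leaves each sub-table at maximum entropy (gain 0), so `weather` is chosen.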

## 3. Implement the tree structure of the decision tree and save it

Since Python does not have a built-in tree type, I store the tree as a nested list. Each node list consists of the feature name followed by alternating feature values and subtrees: [feature name, feature value 1, [subtree for value 1], feature value 2, [subtree for value 2], …, feature value n, [subtree for value n]].
For example, the decision tree structure: ['age', 'over 20', ['height', 'over 180cm', ['result', 0], 'under 180cm', ['result', 1]], 'under 20', ['weight', 'under 50kg', ['result', 0], 'over 70kg', ['result', 1]]]
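To make the layout concrete, a tree like the example above can be traversed in a few lines: the subtree of a value always sits one index after the value itself. The tree and the shorthand value strings here are hand-built for illustration:

```python
# A hand-built tree in the nested-list layout described above
tree = ['age',
        '>20', ['height', '>180', ['result', 0], '<=180', ['result', 1]],
        '<=20', ['weight', '<50', ['result', 0], '>=70', ['result', 1]]]

def walk(node, sample):
    """Follow feature values down to a ['result', label] leaf."""
    while node[0] != 'result':
        value = sample[node[0]]             # the sample's value for this node's feature
        node = node[node.index(value) + 1]  # the subtree sits right after its value
    return node[1]

print(walk(tree, {'age': '>20', 'height': '<=180'}))  # 1
```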

```
# Recursively build the decision tree structure and save it
def DisTree(data, target, k, ListTree):  # k: current depth (used for formatted output)
    ListNode = []  # current node
    if len(set(data.loc[:, target])) == 1:
        ListNode.append('result')  # 'result' marks a leaf at the bottom of the tree
        ListNode.append(data.loc[:, target].values[0])
        # print('{}The result is: {}'.format((k-1)*2*' ', data.loc[:, target].values))
        ListTree.append(ListNode)  # attach the leaf to the tree
        return ListTree
    if k == 6:  # the tree has five levels of decision conditions and one level of results
        ListNode.append('result')
        ListNode.append(list(data[target].mode())[0])  # take the mode of the labels; ties are broken arbitrarily
        # print('{}The result is: {}'.format((k-1)*2*' ', list(data[target].mode())))
        ListTree.append(ListNode)  # attach the leaf to the tree
        return ListTree
    feature = Get_MaxGain(data, target)  # best splitting feature for the current table
    ListNode.append(feature)  # add the feature name
    tz_set = set(data.loc[:, feature].values)  # all possible values of the feature
    for i in tz_set:
        # print('{}When {} is {}'.format((k-1)*2*' ', feature, i))
        ListNode.append(str(i))  # add the feature value
        data_temp = data[data[feature] == i]
        DisTree(data_temp, target, k+1, ListNode)
    ListTree.append(ListNode)  # attach this node to the tree
    return ListTree

# Wrapper training function
def fit(data, target):
    model = DisTree(data, target, 1, [])
    return model
```

## 4. Determining the result when a decision condition does not appear in the model

Because the data preprocessing is imperfect, prediction may encounter feature values that never appeared during training. In that case the descent stops at the current node and the result is decided in place: all leaf results below that node are collected, an equal-weight vote is taken among them, and the most common result becomes the prediction.
For example, when encountering age 68:
The result is: 0

```
# Determine the result when a decision condition does not appear in the model
def Not_in_model(model, result):
    if model[0] == 'result':
        result.append(model)
        return result
    m = 2
    while m < len(model):
        Not_in_model(model[m], result)
        m += 2
    return result
```
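The fallback can be sketched in isolation: `Not_in_model` flattens every leaf below the current node, and `max` with a count key performs the equal-weight vote. The subtree here is invented for illustration:

```python
def Not_in_model(model, result):
    """Collect every ['result', label] leaf below this node."""
    if model[0] == 'result':
        result.append(model)
        return result
    m = 2
    while m < len(model):
        Not_in_model(model[m], result)
        m += 2
    return result

# Hypothetical subtree with three leaves: two vote 0, one votes 1
subtree = ['height',
           '>180', ['result', 0],
           '=180', ['result', 0],
           '<180', ['result', 1]]
leaves = Not_in_model(subtree, [])
winner = max(leaves, key=lambda v: leaves.count(v))
print(winner)  # ['result', 0]
```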

## 5. Prediction function

Each row to be predicted is handled as follows: look up the row's value for the feature named at the current node, find the index of that value in the node list, and take the element at index plus one as the next-level subtree. Continue with the feature name of that subtree until a list whose first element is 'result' is found, which marks the bottom result layer of the tree; the element at index 1 of that list is the prediction.

```
# Prediction function
def predict(model, data):
    result = []  # predicted results
    for i in range(len(data)):  # predict each row
        models = model[0]
        feature = models[0]  # first feature name to test
        while feature != 'result':
            # print("\n"*5)  # show model changes
            # print(models)
            value = str(data.loc[data.index[i], feature])  # the row's value for this feature
            try:
                k = models.index(value)  # index of value in the current node
            except ValueError:  # the value never appeared in training: vote among the leaves below
                temp = Not_in_model(models, [])
                vote = max(temp, default=['result', None], key=lambda v: temp.count(v))
                result.append(vote[1])
                break
            models = models[k+1]  # the subtree sits right after its value
            feature = models[0]
        if models[0] == 'result':
            result.append(models[1])  # index 1 of a leaf holds the predicted label
    return result
```

## 6. Make predictions on bank data

1) Preprocess the data
Analyze the distribution of each continuous feature ('age', 'balance', 'duration', 'previous', 'campaign', 'pdays') and bin it into discrete intervals. (Distribution plots omitted.)

```
# Data processing
data.drop('day', axis=1, inplace=True)
data.drop('month', axis=1, inplace=True)

# Bin each continuous feature into codes 0/1/2. The assignment order matters:
# the '>upper' rows are recoded to 2 first, and since the codes 0 and 2 no
# longer satisfy the '>lower' condition, the final line only hits the middle band.
data.loc[data["age"]<=18, "age"] = 0
data.loc[data["age"]>50, "age"] = 2
data.loc[data["age"]>18, "age"] = 1

data.loc[data["balance"]<=20000, "balance"] = 0
data.loc[data["balance"]>40000, "balance"] = 2
data.loc[data["balance"]>20000, "balance"] = 1

data.loc[data["duration"]<=500, "duration"] = 0
data.loc[data["duration"]>2000, "duration"] = 2
data.loc[data["duration"]>500, "duration"] = 1

data.loc[data['previous']<=10, 'previous'] = 0
data.loc[data['previous']>20, 'previous'] = 2
data.loc[data['previous']>10, 'previous'] = 1

data.loc[data['campaign']<=10, 'campaign'] = 0
data.loc[data['campaign']>20, 'campaign'] = 2
data.loc[data['campaign']>10, 'campaign'] = 1

data.loc[data['pdays']<=200, 'pdays'] = 0
data.loc[data['pdays']>400, 'pdays'] = 2
data.loc[data['pdays']>200, 'pdays'] = 1
```
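The chained `.loc` assignments work, but pandas can express the same three-way binning in a single call with `pd.cut`. A sketch on invented ages, reusing the document's age thresholds:

```python
import pandas as pd

# Hypothetical ages covering the three bands (<=18, 19-50, >50)
ages = pd.Series([15, 30, 60])
binned = pd.cut(ages, bins=[-float('inf'), 18, 50, float('inf')], labels=[0, 1, 2])
print(list(binned))  # [0, 1, 2]
```

`pd.cut` uses right-closed intervals by default, which matches the `<=`/`>` boundaries used above.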

2) Split the data into a [training set] and a test set, then train the model and predict on the test set

```
X_train, X_test, y_train, y_test = train_test_split(data, data.loc[:, target], test_size=0.25, random_state=7)
model = fit(X_train, target)
print(model)
result = predict(model, X_test)
score(result, y_test)
```
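For reference, `train_test_split` with `test_size=0.25` holds out a quarter of the rows; a minimal sketch on an invented eight-row frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'x': range(8), 'y': [0, 1] * 4})  # invented toy frame
X_train, X_test, y_train, y_test = train_test_split(df, df['y'], test_size=0.25, random_state=7)
print(len(X_train), len(X_test))  # 6 2
```

Fixing `random_state` makes the split reproducible across runs, which is why the document pins it to 7.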

# 4. Results

The processed data, the trained model, and the prediction results are printed by the script below. (Output screenshots omitted.)

# 5. Analysis of results

1. Because the data preprocessing is not precise enough, the prediction error is large.
2. The model stores many repeated values, and a nested list is not a good way to save the tree structure.
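On point 2, one alternative worth sketching is a dict keyed by feature value: lookups replace the linear `index()` scans, and the alternating value/subtree convention disappears. This is a hypothetical redesign with invented value strings, not the document's implementation:

```python
# The section-3 toy tree, restructured as nested dicts
tree = {'feature': 'age',
        'children': {'>20': {'feature': 'height',
                             'children': {'>180': {'result': 0},
                                          '<=180': {'result': 1}}},
                     '<=20': {'result': 0}}}

def walk(node, sample):
    """Descend by dict lookup until reaching a node carrying 'result'."""
    while 'result' not in node:
        node = node['children'][sample[node['feature']]]
    return node['result']

print(walk(tree, {'age': '>20', 'height': '<=180'}))  # 1
```

An unseen value then surfaces as a `KeyError` at a single well-defined point, where the voting fallback from section 4 could be applied.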

# 6. Source code

```
import math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Calculate information entropy
def Ent(data, feature, target):  # data: DataFrame, feature: feature name (string), target: label name (string)
    data = data.loc[:, [feature, target]]  # select the feature column and the label column
    data_target = set(data.loc[:, target].values)
    len1 = data.shape[0]  # len1: total number of rows
    ent = 0
    for i in data_target:
        len2 = data[data[target] == i].shape[0]  # len2: number of rows with label i
        if len2/len1 != 1:
            ent += -(len2/len1)*math.log2(len2/len1)
    return ent

# Calculate the information gain
def Gain(data, feature, target):
    # information entropy before the split
    s0 = Ent(data, feature, target)
    # information entropy after the split
    tz_set = set(data.loc[:, feature].values)  # all possible values of the feature
    count = data.shape[0]  # number of rows before the split
    total = 0  # weighted sum of the entropies of the sub-tables
    for i in tz_set:
        data_temp = data[data[feature] == i]
        count_temp = data_temp.shape[0]  # number of rows in this sub-table
        total += (count_temp/count) * Ent(data_temp, feature, target)
    return s0 - total  # information gain

# Find the feature with the largest information gain
def Get_MaxGain(data, target):
    column = data.columns
    max_gain = 0
    max_feature = None  # avoids an unbound name when no feature has positive gain
    for i in column:  # compute the information gain of each feature
        if i == target:
            continue
        gain = Gain(data, i, target)
        if gain > max_gain:
            max_gain = gain
            max_feature = i
    return max_feature

# Recursively build the decision tree structure and save it
def DisTree(data, target, k, ListTree):  # k: current depth (used for formatted output)
    ListNode = []  # current node
    if len(set(data.loc[:, target])) == 1:
        ListNode.append('result')  # 'result' marks a leaf at the bottom of the tree
        ListNode.append(data.loc[:, target].values[0])
        # print('{}The result is: {}'.format((k-1)*2*' ', data.loc[:, target].values))
        ListTree.append(ListNode)  # attach the leaf to the tree
        return ListTree
    if k == 6:  # the tree has five levels of decision conditions and one level of results
        ListNode.append('result')
        ListNode.append(list(data[target].mode())[0])  # take the mode of the labels; ties are broken arbitrarily
        # print('{}The result is: {}'.format((k-1)*2*' ', list(data[target].mode())))
        ListTree.append(ListNode)  # attach the leaf to the tree
        return ListTree
    feature = Get_MaxGain(data, target)  # best splitting feature for the current table
    ListNode.append(feature)  # add the feature name
    tz_set = set(data.loc[:, feature].values)  # all possible values of the feature
    for i in tz_set:
        # print('{}When {} is {}'.format((k-1)*2*' ', feature, i))
        ListNode.append(str(i))  # add the feature value
        data_temp = data[data[feature] == i]
        DisTree(data_temp, target, k+1, ListNode)
    ListTree.append(ListNode)  # attach this node to the tree
    return ListTree

# Wrapper training function
def fit(data, target):
    model = DisTree(data, target, 1, [])
    return model

# Determine the result when a decision condition does not appear in the model
def Not_in_model(model, result):
    if model[0] == 'result':
        result.append(model)
        return result
    m = 2
    while m < len(model):
        Not_in_model(model[m], result)
        m += 2
    return result

# Prediction function
def predict(model, data):
    result = []  # predicted results
    for i in range(len(data)):  # predict each row
        models = model[0]
        feature = models[0]  # first feature name to test
        while feature != 'result':
            value = str(data.loc[data.index[i], feature])  # the row's value for this feature
            try:
                k = models.index(value)  # index of value in the current node
            except ValueError:  # the value never appeared in training: vote among the leaves below
                temp = Not_in_model(models, [])
                vote = max(temp, default=['result', None], key=lambda v: temp.count(v))
                result.append(vote[1])
                break
            models = models[k+1]  # the subtree sits right after its value
            feature = models[0]
        if models[0] == 'result':
            result.append(models[1])  # index 1 of a leaf holds the predicted label
    return result

def score(result, target):
    error = 0
    for i in range(len(result)):
        if result[i] != target[target.index[i]]:
            # print(target[target.index[i]], result[i])
            error += 1
    print(result)
    print('The number of errors is: {}, the error rate is: {}'.format(error, error/len(result)))

def test():
    data = pd.read_csv(r'D:\python111\machine learning\course design\bank.csv')
    target = data.columns.values[-1]

    # Data processing
    data.drop('day', axis=1, inplace=True)
    data.drop('month', axis=1, inplace=True)

    data.loc[data["age"]<=18, "age"] = 0
    data.loc[data["age"]>50, "age"] = 2
    data.loc[data["age"]>18, "age"] = 1

    data.loc[data["balance"]<=20000, "balance"] = 0
    data.loc[data["balance"]>40000, "balance"] = 2
    data.loc[data["balance"]>20000, "balance"] = 1

    data.loc[data["duration"]<=500, "duration"] = 0
    data.loc[data["duration"]>2000, "duration"] = 2
    data.loc[data["duration"]>500, "duration"] = 1

    data.loc[data['previous']<=10, 'previous'] = 0
    data.loc[data['previous']>20, 'previous'] = 2
    data.loc[data['previous']>10, 'previous'] = 1

    data.loc[data['campaign']<=10, 'campaign'] = 0
    data.loc[data['campaign']>20, 'campaign'] = 2
    data.loc[data['campaign']>10, 'campaign'] = 1

    data.loc[data['pdays']<=200, 'pdays'] = 0
    data.loc[data['pdays']>400, 'pdays'] = 2
    data.loc[data['pdays']>200, 'pdays'] = 1

    print(data)
    X_train, X_test, y_train, y_test = train_test_split(data, data.loc[:, target], test_size=0.25, random_state=7)
    model = fit(X_train, target)
    print(model)
    result = predict(model, X_test)
    score(result, y_test)

test()
```