BP neural network from derivation to entry-level understanding


1. Derivation of the formula

The derivation is cumbersome and can be hard to follow. If it is unclear, it is recommended to watch the video explanation of the formulas at the link below:

[Machine Learning in Action] [Python 3 version] [code explanation] (bilibili)

2. Code implementation

2.1. The first implementation method (only the key code is shown here, because the full code is long)

  • First import the packages we need to use

import random
import numpy as np

  • Create a Network class

class Network(object):
    # Network initialization
    def __init__(self, sizes):
        self.num_layers = len(sizes)  # number of layers (3 for sizes = [784, 30, 10])
        self.sizes = sizes            # number of neurons in each layer, e.g. 784, 30, 10
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        # bias initialization: shapes [30x1, 10x1], drawn at random
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
        # weight initialization: shapes [30x784, 10x30]; zip pairs the two lists
        # element by element and stops at the shorter one

In this code, the list sizes contains the number of neurons in each layer. For example, if we wanted to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the last layer, we would write the code like this:

net = Network([2, 3, 1])  # a three-layer network

The biases and weights in the Network object are initialized randomly, using NumPy's np.random.randn function to draw samples from a Gaussian distribution with mean 0 and standard deviation 1.
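For the small Network([2, 3, 1]) example below, a quick check of the resulting shapes (my own illustration, not part of the original code) would look like this:

net = Network([2, 3, 1])
print([b.shape for b in net.biases])   # [(3, 1), (1, 1)]  -> one bias column per non-input layer
print([w.shape for w in net.weights])  # [(3, 2), (1, 3)]  -> weights[l] maps layer l to layer l+1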

We then add a feedforward method to the Network class that, given an input a to the network, returns the corresponding output. What this method does is apply the equation a' = sigmoid(w·a + b) layer by layer.

    # Forward pass
    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return a
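feedforward relies on a sigmoid helper (and the backprop code below relies on sigmoid_prime); neither is included in this excerpt. A minimal sketch consistent with how they are called would be:

def sigmoid(z):
    # logistic function, applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid function
    return sigmoid(z) * (1 - sigmoid(z))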

For learning, we create an SGD method implementing stochastic gradient descent:

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):

        training_data = list(training_data)   # 50000 samples
        n = len(training_data)

        if test_data:
            test_data = list(test_data)   # 10000 samples
            n_test = len(test_data)

        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k + mini_batch_size]
                for k in range(0, n, mini_batch_size)]  # split the 50000 samples into mini-batches of size mini_batch_size
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)  # update the network once per mini-batch
            if test_data:
                print("Epoch {} : {} / {}".format(j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {} complete".format(j))

where training_data is a list of (x, y) tuples representing the training inputs and their corresponding expected outputs. The variables epochs and mini_batch_size are what you would expect: the number of epochs, and the size of each mini-batch when sampling. eta is the learning rate η.
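A typical call, assuming MNIST-style data with a 784-30-10 network (the hyperparameter values here are only illustrative, not prescribed by the code):

net = Network([784, 30, 10])
# 30 epochs, mini-batches of 10 samples, learning rate eta = 3.0
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)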

The update_mini_batch method then applies a single step of gradient descent to one mini-batch, using backpropagation to compute the gradients:

    # Mini-batch gradient descent
    def update_mini_batch(self, mini_batch, eta):

        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:  # backpropagation is still computed one sample at a time
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # the parameters are updated only once per mini-batch, using the averaged gradients
        self.weights = [w - (eta / len(mini_batch)) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / len(mini_batch)) * nb
                       for b, nb in zip(self.biases, nabla_b)]

Finally, the backprop method implements the backpropagation algorithm:

    # Backpropagation
    def backprop(self, x, y):

        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x]  # list to store all the activations, layer by layer
        zs = []  # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)

        # backward pass: compute the error of the output layer
        delta = self.cost_derivative(activations[-1], y) * \
                sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # propagate the error back from the penultimate layer to the second layer
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())

        return (nabla_b, nabla_w)
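The code above also calls self.cost_derivative and self.evaluate, which are not listed in this excerpt. A minimal sketch consistent with those calls (quadratic cost, and accuracy counted by comparing the arg-max of the output with the label) might look like this:

    def cost_derivative(self, output_activations, y):
        # gradient of the quadratic cost with respect to the output activations
        return (output_activations - y)

    def evaluate(self, test_data):
        # number of test inputs for which the predicted digit (index of the
        # largest output activation) matches the label
        results = [(np.argmax(self.feedforward(x)), y) for (x, y) in test_data]
        return sum(int(pred == y) for (pred, y) in results)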

2.2. Improvements to the BP neural network

2.2.1. BP neural network improvements

  • Cross-entropy cost function:

“Serious errors” lead to slow learning: if the weights and biases are initialized so that the output deviates badly from what is expected, it takes many iterations of training to offset this deviation and return to normal learning.

(1) The cross-entropy cost function is introduced to address the problem that some neurons learn very slowly at the start of training. It is aimed mainly at the sigmoid activation function (see the sketch after this list)

(2) If you use an activation function that does not saturate, you can continue to use the sum of squared errors as the loss function

(3) If the output neuron is a sigmoid neuron, cross entropy is generally a better choice

(4) If the output neurons are linear, the quadratic cost function no longer causes the slowdown in learning; in that case the quadratic cost is an appropriate choice

(5) Cross-entropy cannot improve the slow learning that occurs in neurons in the hidden layer

(6) The cross-entropy loss function only improves the slow learning that occurs when the network output “obviously deviates from expectations”

(7) The application of cross-entropy loss cannot improve or avoid neuron saturation, but can avoid the problem of slow learning when the output layer neurons are saturated.
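As a rough illustration of points (1) and (7) (my own example, reusing the sigmoid and sigmoid_prime helpers sketched earlier), compare the output-layer error delta of the quadratic cost, (a - y) * sigmoid_prime(z), with that of the cross-entropy cost, (a - y), for a saturated output neuron:

# A saturated sigmoid output neuron: z is large, so a = sigmoid(z) is close to 1,
# while the desired output y is 0 (a "seriously wrong" output).
z, y = 5.0, 0.0
a = sigmoid(z)

quadratic_delta = (a - y) * sigmoid_prime(z)  # ~0.0066: tiny, because sigmoid_prime(z) is near 0
cross_entropy_delta = (a - y)                 # ~0.99: stays large, so learning is not slowed
print(quadratic_delta, cross_entropy_delta)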

  • Four regularization techniques:

(1) Early stopping: track how the accuracy on the validation dataset changes during training, and stop training once it no longer improves

(2) Regularization (for example, L2 weight decay; see the sketch after this list)

(3) Dropout

(4) Artificially expanding the training set
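A minimal sketch of item (2), assuming L2 regularization (weight decay) added to the update_mini_batch step shown earlier; the extra parameters lmbda (regularization strength) and n (size of the whole training set) are my own additions and would have to be passed in by the caller:

    # Sketch: update_mini_batch with L2 regularization (weight decay)
    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # the only change: each weight is first shrunk by the decay factor (1 - eta*lmbda/n)
        self.weights = [(1 - eta * (lmbda / n)) * w - (eta / len(mini_batch)) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / len(mini_batch)) * nb
                       for b, nb in zip(self.biases, nabla_b)]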

  • Better weight initialization method

A poor weight initialization can lead to saturation problems, while a good one not only speeds up training but can also noticeably improve the final performance.

2.3. The second implementation method (the code is long, so only the improved parts are given here)

  • Sum-of-squared-errors (quadratic) cost function

class QuadraticCost(object):  # sum-of-squared-errors cost function

    @staticmethod
    def fn(a, y):
        return 0.5 * np.linalg.norm(a - y) ** 2

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer."""
        return (a - y) * sigmoid_prime(z)

  • Cross-entropy cost function

class CrossEntropyCost(object):  # cross-entropy cost function

    @staticmethod
    def fn(a, y):
        # np.nan_to_num replaces nan with 0 and inf with a large finite number
        return np.sum(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log(1 - a)))

    @staticmethod
    def delta(z, a, y):
        return (a - y)

  • Initialization improvements:

def default_weight_initializer(self):  # recommended weight initialization

    self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]  # bias initialization is unchanged
    self.weights = [np.random.randn(y, x) / np.sqrt(x)  # dividing by sqrt(x) reduces the variance and helps avoid saturation
                    for x, y in zip(self.sizes[:-1], self.sizes[1:])]
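To see why dividing by np.sqrt(x) helps, here is a quick check (my own illustration, not part of the original code) of the spread of a neuron's weighted input z = w·a under both initializations:

import numpy as np

np.random.seed(0)
x = 784                       # inputs per neuron, e.g. the 784 pixels feeding the first hidden layer
a = np.ones((x, 1))           # suppose every input is 1

# z = w.a for 1000 neurons under each initialization
z_default = np.random.randn(1000, x) @ a                 # old scheme
z_scaled = (np.random.randn(1000, x) / np.sqrt(x)) @ a   # improved scheme

# The old scheme gives z a standard deviation of about sqrt(784) = 28, so the
# sigmoid is almost always saturated; the scaled scheme keeps it near 1.
print(z_default.std(), z_scaled.std())   # roughly 28 vs roughly 1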
