
# 1. Derivation of the formula

The derivation is cumbersome and hard to follow. If it is unclear, the video walkthrough at the following link is recommended:

*Machine Learning in Action* (Python 3 edition) code walkthrough, on bilibili

# 2. Code implementation

## 2.1. The first implementation method (the full code is long, so only the key parts are shown)

• First import the packages we need to use

```
import random
import numpy as np
```

• Create a Network class

```
class Network(object):
    # Network initialization
    def __init__(self, sizes):
        self.num_layers = len(sizes)  # e.g. sizes = [784, 30, 10]: neurons per layer
        self.sizes = sizes
        # Bias initialization: one (y, 1) column per non-input layer,
        # e.g. [30x1, 10x1], randomly initialized
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        # Weight initialization, e.g. [30x784, 10x30]; zip pairs adjacent layers
        # and stops at the shorter sequence
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
```

In this code, the list sizes contains the number of neurons in each layer. For example, if we wanted to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the last layer, we would write the code like this:

`net = Network([2, 3, 1])  # a three-layer network`

The biases and weights in the Network object are initialized randomly, using NumPy’s np.random.randn function to generate Gaussian-distributed values with mean 0 and standard deviation 1.
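As a standalone sanity check (mirroring the constructor above, with the small illustrative layer sizes from the `Network([2, 3, 1])` example), the shapes come out as follows:

```python
import numpy as np

sizes = [2, 3, 1]  # 2 input, 3 hidden, 1 output neuron
# One (y, 1) bias column per non-input layer
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# One (y, x) weight matrix between each pair of adjacent layers
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

print([b.shape for b in biases])   # [(3, 1), (1, 1)]
print([w.shape for w in weights])  # [(3, 2), (1, 3)]
```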

We then add a feedforward method to the Network class that, given an input a, returns the corresponding output by applying the equation a′ = σ(wa + b) layer by layer.

```
# Forward pass
def feedforward(self, a):
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a) + b)
    return a
```
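`feedforward` relies on a `sigmoid` helper defined elsewhere in the file; a minimal version, together with the `sigmoid_prime` derivative that `backprop` uses later, might look like this:

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic function 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))
    return sigmoid(z) * (1 - sigmoid(z))

print(sigmoid(0.0))        # 0.5
print(sigmoid_prime(0.0))  # 0.25
```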

For learning, we add an `SGD` method implementing stochastic gradient descent:

```
def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    training_data = list(training_data)  # e.g. 50000 samples
    n = len(training_data)

    if test_data:
        test_data = list(test_data)  # e.g. 10000 samples
        n_test = len(test_data)

    for j in range(epochs):
        random.shuffle(training_data)
        # Split the shuffled samples into mini-batches of size mini_batch_size
        mini_batches = [
            training_data[k:k + mini_batch_size]
            for k in range(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)  # one update per batch
        if test_data:
            print("Epoch {} : {} / {}".format(j, self.evaluate(test_data), n_test))
        else:
            print("Epoch {} complete".format(j))
```

Here, `training_data` is a list of `(x, y)` tuples pairing each training input with its expected output. The variables `epochs` and `mini_batch_size` are what you would expect: the number of epochs to train for and the size of each mini-batch. `eta` is the learning rate, η.
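The mini-batch slicing inside `SGD` can be tried in isolation; with 10 toy samples and `mini_batch_size = 3`, the final batch is simply shorter:

```python
# Stand-ins for (x, y) tuples, just to show the slicing
training_data = list(range(10))
mini_batch_size = 3
n = len(training_data)

# Same list comprehension as in SGD
mini_batches = [training_data[k:k + mini_batch_size]
                for k in range(0, n, mini_batch_size)]
print(mini_batches)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```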

Each mini-batch is then used for a single gradient-descent step, implemented in `update_mini_batch`:

```
# Gradient-descent step for one mini-batch
def update_mini_batch(self, mini_batch, eta):
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:  # backpropagate one sample at a time
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    # Update weights and biases only once per mini-batch,
    # using the gradient averaged over the batch
    self.weights = [w - (eta / len(mini_batch)) * nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b - (eta / len(mini_batch)) * nb
                   for b, nb in zip(self.biases, nabla_b)]
```
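The update rule above is w ← w − (η/m)·Σ∇w. A tiny standalone sketch of one such averaged step, using made-up numbers for a single weight:

```python
import numpy as np

eta = 3.0  # learning rate (illustrative value)
# Made-up per-sample gradients for one weight, over a mini-batch of 2
grads = [np.array([[2.0]]), np.array([[4.0]])]
w = np.array([[1.0]])

nabla_w = sum(grads)                   # accumulate over the mini-batch
w = w - (eta / len(grads)) * nabla_w   # average, then take one step
print(w)  # [[-8.]]  since 1 - (3/2) * 6 = -8
```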

The gradients themselves are computed by the backpropagation method:

```
# Backpropagation
def backprop(self, x, y):
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # Feedforward pass
    activation = x
    activations = [x]  # list to store all the activations, layer by layer
    zs = []  # list to store all the z vectors, layer by layer
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    # Backward pass: error of the output layer
    delta = self.cost_derivative(activations[-1], y) * \
        sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    # Propagate the error back from the penultimate layer to the second layer
    for l in range(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())

    return (nabla_b, nabla_w)
```
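A standard way to validate backprop code is a finite-difference gradient check. The standalone sketch below does this for a single sigmoid neuron with quadratic cost, using the same output-layer error δ = (a − y)·σ′(z) as above (all values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, x, y = 0.6, -0.3, 0.9, 1.0
z = w * x + b
a = sigmoid(z)

# Analytic gradient from the backprop formulas:
# delta = cost_derivative * sigmoid_prime, grad_w = delta * input
delta = (a - y) * a * (1 - a)
grad_w = delta * x

# Numerical gradient of C = 0.5 * (a - y)^2 with respect to w
eps = 1e-6
cost = lambda w_: 0.5 * (sigmoid(w_ * x + b) - y) ** 2
grad_w_num = (cost(w + eps) - cost(w - eps)) / (2 * eps)

print(abs(grad_w - grad_w_num) < 1e-8)  # True: analytic matches numerical
```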

## 2.2. Improvements to the BP neural network

### 2.2.1. BP neural network improvements

• Cross-entropy cost function:

“Serious errors” lead to slow learning: if the initial weights and biases happen to produce an output far from the target, many training iterations are spent just offsetting that deviation before normal learning resumes.

(1) The purpose of introducing the cross-entropy cost function is to fix the slow learning some neurons show at the start of training; it is aimed mainly at the sigmoid activation function.

(2) With an activation function that does not saturate, the sum of squared errors remains a usable loss function.

(3) If the output neurons are sigmoid neurons, cross-entropy is generally the better choice.

(4) If the output neurons are linear, the quadratic cost no longer causes a learning slowdown, and in that case it is an appropriate choice.

(5) Cross-entropy does not fix slow learning in hidden-layer neurons.

(6) The cross-entropy loss only speeds up learning when the network output “obviously deviates from expectations”.

(7) Cross-entropy cannot prevent neuron saturation itself, but it does avoid the learning slowdown when output-layer neurons saturate.
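Points (1) and (7) are visible directly in the output-layer error term: the quadratic cost gives δ = (a − y)·σ′(z), which collapses when the neuron saturates, while cross-entropy gives δ = (a − y), which does not. A small standalone comparison, with an illustrative saturated input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 8.0                  # strongly saturated neuron
a, y = sigmoid(z), 0.0   # output near 1, target 0: a "serious error"

quadratic_delta = (a - y) * a * (1 - a)  # scaled down by sigmoid'(z), nearly 0
xent_delta = a - y                       # stays close to 1

print(quadratic_delta < 0.001)  # True: gradient nearly vanishes, learning stalls
print(xent_delta > 0.99)        # True: full-size error signal survives
```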

• Four regularization techniques:

(1) Early stopping. Track how accuracy on a validation dataset changes during training; if accuracy on the validation data stops improving, stop training.

(2) L2 regularization (weight decay)

(3) Dropout

(4) Artificially expanding the training set
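Early stopping (technique 1) amounts to a patience loop over the validation accuracy. The sketch below simulates the accuracy curve with made-up numbers, since it is only meant to show the stopping logic:

```python
# Stop when validation accuracy has not improved for `patience`
# consecutive epochs. The accuracy curve here is simulated.
accuracies = [0.90, 0.92, 0.93, 0.931, 0.930, 0.929, 0.928, 0.927]

patience = 3
best, best_epoch = -1.0, 0
for epoch, acc in enumerate(accuracies):
    if acc > best:
        best, best_epoch = acc, epoch
    elif epoch - best_epoch >= patience:
        break  # no improvement for `patience` epochs: stop training

print(best_epoch, best)  # 3 0.931
```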

• Better weight initialization method

A poor weight initialization can cause saturation problems. A good initialization not only speeds up training but can also noticeably improve final performance.
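The effect can be checked empirically: with 784 active inputs, plain `randn` weights give a weighted input z with standard deviation around √784 = 28, deep in the sigmoid's flat tails, while dividing each weight by √n_in keeps it near 1. A standalone sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, trials = 784, 2000
a = np.ones(n_in)  # worst case: every input neuron active

# z = w . a for many independently drawn weight vectors
z_plain = rng.standard_normal((trials, n_in)) @ a                     # std roughly sqrt(784) = 28
z_scaled = (rng.standard_normal((trials, n_in)) / np.sqrt(n_in)) @ a  # std roughly 1

print(z_plain.std(), z_scaled.std())
```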

## 2.3. The second implementation method (the code is long, so only the improved parts are shown)

• Sum-of-squared-errors (quadratic) cost function:

```
class QuadraticCost(object):  # Sum-of-squared-errors cost function

    @staticmethod
    def fn(a, y):
        return 0.5 * np.linalg.norm(a - y) ** 2

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer."""
        return (a - y) * sigmoid_prime(z)
```

• Cross-entropy cost function:

```
class CrossEntropyCost(object):  # Cross-entropy cost function

    @staticmethod
    def fn(a, y):
        # np.nan_to_num replaces nan with 0 and inf with a large finite value
        return np.sum(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log(1 - a)))

    @staticmethod
    def delta(z, a, y):
        return (a - y)
```
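That the cross-entropy delta is simply (a − y) follows from differentiating the cost through the sigmoid; a quick finite-difference confirmation of dC/dz = a − y (standalone sketch, illustrative values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, y):
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

z, y = 0.7, 1.0
a = sigmoid(z)

# Central-difference estimate of dC/dz with C = cross_entropy(sigmoid(z), y)
eps = 1e-6
numeric = (cross_entropy(sigmoid(z + eps), y)
           - cross_entropy(sigmoid(z - eps), y)) / (2 * eps)

print(abs(numeric - (a - y)) < 1e-8)  # True: dC/dz reduces to a - y
```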

• Initialization improvements:

```
def default_weight_initializer(self):  # Recommended weight initialization
    # Bias initialization is unchanged
    self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
    # Divide by sqrt(n_in) to reduce the variance of z and avoid saturation
    self.weights = [np.random.randn(y, x) / np.sqrt(x)
                    for x, y in zip(self.sizes[:-1], self.sizes[1:])]
```