
Training this neural network means determining the values of its 11,935 parameters.
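The parameter count can be checked with a little arithmetic. The sketch below assumes a fully connected network with 784 inputs, one hidden layer of 15 neurons, and 10 outputs; the hidden-layer size is an assumption borrowed from Nielsen's example network, and `count_parameters` is a hypothetical helper name.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the layers
        total += n_out         # one bias per neuron in the receiving layer
    return total

# Assumed layer sizes: 784 inputs, 15 hidden neurons, 10 outputs.
print(count_parameters([784, 15, 10]))  # 11935
```

Under these assumptions there are 784 × 15 + 15 × 10 = 11,910 weights and 15 + 10 = 25 biases.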

The goal of training can be roughly summarized as follows: for each training sample, the output corresponding to the correct digit should be as close to 1 as possible, while all other outputs should be as close to 0 as possible.
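As a minimal sketch of what such a target looks like, the expected output for a digit can be written as a vector with a single 1. The function name `expected_output` is hypothetical, and the code assumes that output number k corresponds to digit k.

```python
def expected_output(digit, num_classes=10):
    """Return the target vector for a digit: 1.0 at its index, 0.0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]

print(expected_output(3))
# [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```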

According to the experimental results reported by Michael Nielsen, this network structure easily achieves a recognition rate of 95% without any tuning. The core code is only 74 lines!

After adopting the ideas of deep learning and convolutional networks, the recognition rate eventually reached 99.67%. The best historical result for the MNIST dataset is a recognition rate of 99.79%, achieved by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus in 2013.

Considering that this dataset contains some barely legible digits, this result is quite amazing! It has surpassed the recognition ability of the human eye.

Adjusting the weight and bias values step by step during this process requires the gradient descent algorithm.

During the training process, our neural network needs a practical learning algorithm to gradually adjust the parameters.

The ultimate goal is to make the actual output of the network as close as possible to the expected output. We need an expression that characterizes this closeness. This expression is called the cost function.
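Assuming the quadratic (mean squared error) cost that Nielsen's book uses for this network, the expression can be written as follows. The factor of 1/2 is a common convention that simplifies the derivative.

```latex
C(w, b) = \frac{1}{2n} \sum_{x} \left\| y(x) - a \right\|^{2}
```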

x represents a training sample, that is, the input of the network. In fact, each x stands for 784 input values.

y(x) represents the expected output when the input is x, and a represents the actual output when the input is x. Both y(x) and a consist of 10 output values (represented as vectors). The square of their difference measures how close the actual output is to the expected output: the closer they are, the smaller the value.

n is the number of training samples. If there are 50,000 training samples, then n is 50,000. Because the cost is summed over many training samples, it is divided by n to average over all of them.

The notation C(w, b) treats the cost as a function of all the weights w and biases b in the network. Why view it this way? During training, the input x is fixed (it is a training sample) and does not change. With the input held constant, the formula can be regarded as a function of w and b. So where do w and b appear on the right-hand side of the equation? They are inside a: y(x) is also a fixed value, but a is a function of w and b.
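The definitions above can be turned into a small calculation. This is a sketch that assumes the quadratic cost C = 1/(2n) Σ ‖y(x) − a‖², the form used in Nielsen's book; `quadratic_cost` is a hypothetical name, and the toy vectors have 3 outputs instead of 10 to keep the example short.

```python
def quadratic_cost(expected, actual):
    """C = 1/(2n) * sum over samples of ||y(x) - a||^2.

    expected: list of target vectors y(x), one per training sample
    actual:   list of network output vectors a, in the same order
    """
    n = len(expected)
    total = 0.0
    for y, a in zip(expected, actual):
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)

# Two toy samples with 3 outputs each (the real network has 10).
y = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
close = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]
far = [[0.2, 0.7, 0.1], [0.6, 0.1, 0.3]]
print(quadratic_cost(y, close) < quadratic_cost(y, far))  # True
```

Outputs that sit nearer to the targets give the smaller cost, and outputs identical to the targets give a cost of zero.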

In summary, C(w, b) characterizes how close the actual output of the network is to the expected output. The closer they are, the smaller the value of C(w, b). Learning is therefore the process of finding a way to reduce C(w, b). Whatever form C(w, b) takes, it is a function of w and b, so learning becomes an optimization problem: finding the minimum of a function.

Since C(w, b) has a complicated form and many parameters, solving for the minimum directly by mathematical analysis is very difficult.

To solve this problem with computer algorithms, computer scientists proposed the gradient descent algorithm.

In essence, this algorithm works in a multi-dimensional space: at each step it moves a small distance downhill, in the direction determined by the slope (the tangent) along each dimension, until it eventually reaches a minimum.

Since a multi-dimensional space cannot be visualized, people usually fall back on three-dimensional space as an analogy. When C(w, b) has only two parameters, its graph can be drawn in three-dimensional space.

It is as if a small ball were rolling continuously down the slope of a valley and might eventually reach the bottom. This picture remains basically valid when extended back to multi-dimensional space.
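The rolling-ball picture can be sketched for a cost with only two parameters. The bowl-shaped `cost` function below is a made-up stand-in for C(w, b), not the network's real cost, and the learning rate and step count are arbitrary illustrative choices.

```python
def cost(w, b):
    """A toy bowl-shaped cost with its minimum at w = 3, b = -1."""
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def gradient(w, b):
    """Partial derivatives of the toy cost with respect to w and b."""
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)

def gradient_descent(w, b, learning_rate=0.1, steps=100):
    """Repeatedly take a small step against the gradient (downhill)."""
    for _ in range(steps):
        dw, db = gradient(w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

w, b = gradient_descent(w=0.0, b=0.0)
print(w, b, cost(w, b))
```

Starting from (0, 0), the parameters settle very close to the bottom of the bowl at (3, -1). On a real cost surface with several valleys, the same procedure may stop in a local minimum instead.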

However, because the number of training samples is large (tens of thousands, hundreds of thousands, or even more), computing C(w, b) directly as defined above requires a great deal of computation, and learning is slow.

This led to the stochastic gradient descent algorithm, which is an approximation of gradient descent.

In this algorithm, each learning step no longer uses the whole training set. Instead, a randomly selected part of the training set is used to calculate C(w, b); the next step draws randomly from the remaining samples, and so on until the entire training set has been used up. The process is then repeated again and again.
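The sampling scheme described above can be sketched as shuffling the training set and cutting it into mini-batches, with one full pass over all batches repeated many times. To keep the example small, it fits a one-weight, one-bias linear model rather than a neural network; the model, the toy data, and all names and hyperparameters are assumptions for illustration.

```python
import random

def mini_batches(samples, batch_size, rng):
    """Shuffle the training set and split it into mini-batches (one pass)."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]

def sgd(samples, epochs=500, batch_size=5, learning_rate=0.05, seed=0):
    """Fit a = w*x + b by stochastic gradient descent on the quadratic cost."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in mini_batches(samples, batch_size, rng):
            # Gradient of 1/(2m) * sum (a - y)^2 over this mini-batch only.
            dw = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            db = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * dw
            b -= learning_rate * db
    return w, b

# Toy training set generated from y = 2x + 1 (an assumption for illustration).
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(20)]
w, b = sgd(data)
print(w, b)
```

Each update looks at only 5 of the 20 samples, yet the fitted values end up close to the w = 2 and b = 1 used to generate the data.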

Deep neural networks (those with multiple hidden layers) have structural advantages over shallow neural networks: they can build abstractions at multiple levels.

Since the 1980s and 1990s, researchers have tried to apply stochastic gradient descent to the training of deep neural networks, but they ran into the vanishing gradient or exploding gradient problem. Learning became abnormally slow, and deep neural networks were basically unusable.
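A rough feel for why gradients vanish or explode comes from a heavily simplified model: a chain with one sigmoid neuron per layer, where the gradient picks up a factor of weight × sigmoid'(z) at every layer on the way back. The sigmoid activation, the single-neuron layers, and the name `gradient_scale` are assumptions made for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum value is 0.25, at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_scale(num_layers, weight=1.0, z=0.0):
    """Factor a gradient picks up passing back through num_layers layers,
    each contributing weight * sigmoid'(z)."""
    return (weight * sigmoid_prime(z)) ** num_layers

for layers in (1, 5, 10):
    print(layers, gradient_scale(layers), gradient_scale(layers, weight=8.0))
```

With a weight of 1 the factor shrinks by at least a factor of four per layer, so early layers barely learn. With a large weight such as 8 it doubles per layer instead, which is the exploding case.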

However, since 2006, people have begun to use new techniques to train deep networks, and breakthroughs have followed one after another. These techniques include, but are not limited to:

Using convolutional networks;

Dropout regularization;

Rectified linear units;

Using GPUs to obtain greater computing power, etc.
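Two of the techniques in this list are simple enough to sketch in a few lines. The code below shows a rectified linear unit and the "inverted" variant of dropout, which rescales the surviving activations; this variant, the function names, and the drop probability are illustrative choices rather than anything prescribed here.

```python
import random

def relu(z):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, z)

def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each activation with probability drop_prob and
    rescale the survivors so the expected value stays the same."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
hidden = [relu(z) for z in (-2.0, -0.5, 0.0, 1.5, 3.0)]
print(hidden)                     # [0.0, 0.0, 0.0, 1.5, 3.0]
print(dropout(hidden, 0.5, rng))  # some entries zeroed, survivors doubled
```

Dropout is applied only during training; at prediction time all neurons are kept.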

The advantages of deep learning are obvious: it is a brand-new way of programming. We do not design algorithms and write programs directly for the problem to be solved; instead, we program the training process.

During training, the network learns the correct way to solve the problem by itself. This lets us use simple algorithms to solve complex problems, and in many areas it outperforms traditional methods.

Training data plays a more important role in this process: a simple algorithm with complex data may work far better than a complex algorithm with simple data.

Deep learning also has its drawbacks. Deep networks often contain a large number of parameters, which runs against the principle of Occam's razor, and people usually have to spend a lot of effort tuning these parameters;

Training deep networks requires a lot of computing power and time;

Overfitting always accompanies the training of neural networks, and slow learning has always plagued people. All of this easily gives people a fear of losing control, and it also creates obstacles to applying the technology further in some important settings.

The BetaCat story is about an artificial intelligence program that gradually comes to rule the world through self-learning.

So, will the current development of artificial intelligence technology cause this to happen? Probably not yet. Most people believe there are two main reasons:

First, the self-learning of current artificial intelligence is still confined to the ways people specify. It only learns to solve specific problems and is still not a general intelligence.

Second, people need to feed well-structured training data into the training process of current artificial intelligence, and the system's input and output still impose strict requirements on data format. This means that even if an artificial intelligence program were connected to the Internet, it could not learn from the unstructured data there the way BetaCat does.

However, this applies only to ordinary artificial intelligence. A true networked intelligent life form like Origin must be able to do both of those things.