Part 2: Gradient-based learning

The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most cost functions of interest to become non-convex. This means that neural networks are usually trained with iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the exact linear equation solvers used to train linear regression models, or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. Convex optimization converges starting from any initial parameters (in theory; in practice it is also very robust, but can encounter numerical problems). Stochastic gradient descent applied to a non-convex loss function has no such convergence guarantee, and is sensitive to the values of the initial parameters. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values.
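As a minimal sketch of these ideas, the NumPy snippet below trains a tiny feedforward network by full-batch gradient descent: weights are initialized to small random values, biases to zero, and the loop merely drives a non-convex mean-squared-error cost down, with no guarantee of reaching a global minimum. The network shape, data, learning rate, and step count are illustrative choices, not anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny feedforward network: 2 inputs -> 4 hidden (ReLU) -> 1 output.
# Weights start as small random values; biases start at zero.
W1 = rng.normal(scale=0.1, size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.1, size=(4, 1))
b2 = np.zeros(1)

# Toy regression data with a nonlinear target (hypothetical example).
X = rng.normal(size=(64, 2))
y = X[:, :1] * X[:, 1:]

def forward(X):
    h = np.maximum(0, X @ W1 + b1)   # ReLU hidden layer
    return h, h @ W2 + b2

lr = 0.1
losses = []
for step in range(200):
    h, pred = forward(X)
    losses.append(np.mean((pred - y) ** 2))
    # Backpropagate the mean-squared-error gradient by hand.
    g = 2 * (pred - y) / len(X)      # dL/dpred
    gW2 = h.T @ g
    gb2 = g.sum(axis=0)
    gh = g @ W2.T
    gh[h <= 0] = 0                   # ReLU passes gradient only where active
    gW1 = X.T @ gh
    gb1 = gh.sum(axis=0)
    # Gradient step: drives the cost lower, but only to a "very low value".
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(losses[0], "->", losses[-1])
```

Re-running with a different random seed can end at a different final cost, illustrating the sensitivity to initial parameter values described above; initializing all weights to zero instead would leave every hidden unit identical and is exactly what the small random initialization avoids.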