# A Optimal Control Approach for Deep Learning

##### Posted on

This note is based on Li, Q., & Hao, S. (2018). An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks. ArXiv:1803.01299 [Cs].

## The Optimal Control Viewpoint

- $T\in \bbZ_+$: the number of layers
- ${x_{s,0}\in\bbR^{d_0}:s=0,\ldots,S}$: a collection of fixed inputs
- $S\in \bbZ_+$: sample size

Consider the dynamical system

\[x_{s,t+1} = f_t(x_{s,t},\theta_t),\quad t=0,1,\ldots,T-1\,,\](represent affine transformation and nonlinear activation in a layer?)

the goal of training a neural network is to adjust the weights $\theta = {\theta_t:t=0,\ldots,T-1}$ so as to minimize some loss function that measures the difference between the final network output $x_{s,T}$ and some true targets $y_s$ of $x_{s,0}$.

Define a family of real-valued function $\Phi_s:\bbR^{d_T}\rightarrow \bbR$ acting on $x_{s,T}$ ($y_s$ are absorbed into the definition of $\Phi_s$).

In addition, consider some regularization terms for each layer $L_t:\bbR^{d_t}\times\Theta_t\rightarrow \bbR$ that has to be simultaneously minimized. Here slightly more general case regularize the sate at each layer, not only trainable parameters (!!).

In summary,

\[\min_{\theta\in\Theta}\;J(\theta) = \frac 1S\sum_{s=1}^S\Phi_s(x_s,T) + \frac 1S\sum_{s=1}^S\sum_{t=0}^{T-1}L_t(x_{s,t}, \theta_t)\]subject to

\[x_{s,t+1} = f_t(x_{s,t}, \theta_t), t=0,\ldots,T-1,s\in[S]\]where

- $\Theta={\Theta_0\times\cdots\times \Theta_{T-1}}$
- $[S]={1,\ldots,S}$.

### The Pontryagainâ€™s Maximum Principle

Maximum principles of the Pontryagain type usually consist of necessary conditions for optimality in the form of the maximization of a certain Hamiltonian function.

Distinguishing feature: does not assume differentiability (or even continuity) of $f_t$ w.r.t. $\theta$

Thus, the optimality condition and the algorithms based on it need not rely on gradient-descent type updates.

Define the Hamiltonian function $H_t:\bbR^{d_t}\times \bbR^{d_{t+1}}\times \Theta_t\rightarrow\bbR$ by

\[H_t(x,p,\theta) = pf_t(x,\theta) - \frac 1SL_t(x,\theta)\,.\]Let $\theta^*$ be a solution. The necessary conditions are:

- eq (4): the forward propagation equation
- eq (5): the evolution of the co-state $p_s^*$
- eq (6): an optimal solution $\theta^*$ must globally maximize the summed Hamiltonian for each layer $t=0,\ldots,T-1$

### The Convexity Assumption

#### Convexity of ${f_t(x,\theta):\theta\in\Theta_t}$

Two types of layers:

- non-trainable (non-linear activation function): trivially convex since singleton
- trainable:
- $f_t$ is usually (?) affine in $\theta$: fully connected layers, convolution layers and batch normalization layers
- as long as the admissable set $\Theta_t$ is convex.
- residual networks also satisfy the convexity constraints if one introduces auxillary variables.
- if $\Theta_t$ is not convex, generally PMP not constitute necessary conditions.

## The Method of Successive Approximations

Algorithm based on the maximum principle: how to find the solution.

Each of eq. (4)-(6) represent a manifold in solution space consisting of all possible $\theta,{x_s,s\in[S]}$ and ${p_s,s\in [S]}$, and the intersection of these three manifolds must contain an optimal solution, if one exists.

Thus, an iterative projection method that successively projects a guessed solution onto each of the manifolds.

### An Error Estimate for the MSA

(the discontinuity not account for the continuity of $\Theta_t$??)

Intuitively, the Hamiltonian maximization step is generally the right direction, but we will incur nonnegative penalty terms when changing the parameters from $\theta$ to $\phi$.

## Neural Networks with Discrete Weights

Main strength: not gradient-descent, thus useful when the trainable parameters only take values in a discrete set.

- binary networks: weights are restricted to ${-1,+1}$
- ternary networks: ${-1,+1,0}$

### Binary networks

for simplicity only the fully connected case

### Ternary networks

goal: explore the sparsification of the network

results: similar performance but high degrees of sparsity.

## Discussion and Related Work

- convexity assumption are justified for conventional neural networks, but not necessarily so for all neural networks
- the usual BP with gradient-descent can be regarded as a simple modification of MSA with differentiability assumption
- overfitting in MSA might imply that different way with stochastic gradient based optimization

Literatures:

- here the optimal control approach is different from prevailing viewpoint of nonlinear programming
- optimal control and dynamical systems viewpoint has been discussed and dynamical systems based discretization schemes, but most of them have theoretical basis in continuous-time dynamical systems
- the paper gives a discrete-time formulation
- the connection between optimal control and deep learning has been qualitative discussed, but fewer works relating optimal control algorithms to training neural networks beyond the classical gradient-descent with back-propagation.
- binary MSA is competitive, but need to reduce overfitting
- future work is required to rigorously establish the convergence of these algorithms

## Conclusion and Outlook

questions:

- the applicability of the PMP for discrete-weight neural networks, which does not satisfy the convexity assumptions.
- establish the convergence of the algorithms presented.