1 Introduction
Deep learning has achieved state-of-the-art performance in various applications [lecun2015deep], such as computer vision [krizhevsky2012imagenet], language modeling [brown2020language], and scientific discovery [long2018pde, zhang2018deep]. Despite the empirical success of deep learning, how gradient descent or its variants lead deep neural networks to be biased towards solutions with good generalization performance on the test set is still a major open question. To develop a theoretical foundation for deep learning, many studies have investigated the implicit bias of gradient descent in different settings [li2018algorithmic, amari2020does, vaswani2020each, soudry2018implicit, lyu2019gradient, arora2019implicit]. It is well acknowledged that well-trained end-to-end deep architectures can effectively extract features relevant to a given label. Although the theoretical analysis of deep learning has made progress in recent years [dlbook, goldblum2019truth], most studies that analyze the properties of the final output function fail to characterize the features learned by neural networks. Recently, [papyan2020prevalence] observed that the features within the same class collapse to their mean and these class means converge to an equiangular tight frame (ETF) during the terminal phase of training, that is, the stage after achieving zero training error. This phenomenon, namely neural collapse [papyan2020prevalence], provides a clear view of how the last-layer features in the neural network evolve after interpolation and enables us to understand why training past zero training error yields better generalization and robustness. To theoretically analyze the neural collapse phenomenon,
[fang2021layer] proposed the layer-peeled model (LPM) as a simple surrogate for neural networks, where the last-layer features are modeled as free optimization variables. In particular, in a balanced $K$-class classification problem using a neural network with $d$ neurons in the last hidden layer, the LPM takes the following form:

(1) $\min_{W,H}\ \frac{1}{Kn}\sum_{k=1}^{K}\sum_{i=1}^{n}\mathcal{L}(W h_{k,i}, y_k)\quad \text{s.t.}\quad \frac{1}{K}\sum_{k=1}^{K}\|w_k\|^2 \le E_W,\ \ \frac{1}{Kn}\sum_{k=1}^{K}\sum_{i=1}^{n}\|h_{k,i}\|^2 \le E_H,$

where $E_W$ and $E_H$ are positive constants. Here, $W = [w_1, \dots, w_K]^\top$ is the weight of the final linear classifier, $h_{k,i}$ is the last-layer feature of the $i$-th sample in the $k$-th class, and $y_k$ is the corresponding label. The intuition behind the LPM is that modern deep networks are often highly overparameterized, with the capacity to learn any representation of the input data. It has been shown that an equiangular tight frame (ETF), i.e., a feature configuration exhibiting neural collapse, is the only global optimum of the LPM objective (1) [fang2021layer, lu2020neural, wojtowytsch2020emergence, zhu2021geometric].
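To make the global-optimality claim concrete, here is a small numerical sketch of our own (an illustration, not the exact constrained program of [fang2021layer]): we instantiate a simplex-ETF configuration with collapsed, self-dual features and compare its cross-entropy loss against a random configuration with identical Frobenius norms.

```python
import numpy as np

np.random.seed(0)
K, d, n = 4, 8, 5                     # classes, feature dim, samples per class
labels = np.repeat(np.arange(K), n)

def cross_entropy(W, H, labels):
    """Average cross-entropy of the linear classifier W on features H."""
    logits = W @ H
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -logp[labels, np.arange(H.shape[1])].mean()

# Simplex-ETF configuration: classifier rows and collapsed class features coincide.
P, _ = np.linalg.qr(np.random.randn(d, K))                # orthonormal columns
M = 3.0 * np.sqrt(K / (K - 1)) * P @ (np.eye(K) - np.ones((K, K)) / K)
W_etf, H_etf = M.T, np.repeat(M, n, axis=1)

# Random configuration with exactly the same Frobenius norms.
W_rnd = np.random.randn(K, d)
W_rnd *= np.linalg.norm(W_etf) / np.linalg.norm(W_rnd)
H_rnd = np.random.randn(d, K * n)
H_rnd *= np.linalg.norm(H_etf) / np.linalg.norm(H_rnd)

loss_etf = cross_entropy(W_etf, H_etf, labels)
loss_rnd = cross_entropy(W_rnd, H_rnd, labels)
```

Under the same norm budget, the ETF configuration attains a far smaller loss than the random one, in line with the cited global-optimality results.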
However, the feature constraints in LPMs are an unrealistic setting that is never used in practice. In this study, we directly deal with the unconstrained model and show that gradient flow can find those neural collapse solutions without the help of explicit constraints or regularization. To do this, we build a connection between neural collapse and recent theories on max-margin implicit regularization [lyu2019gradient, wei2018margin], and use it to provide a convergence result to the first-order stationary point of a minimum-norm separation problem. Furthermore, we illustrate that the cross-entropy loss enjoys a benign global landscape in which all the critical points are strict saddles in the tangent space, except for the only global minimizers, which exhibit the neural collapse phenomenon. Finally, we verify our insights via empirical experiments.
In contrast to previous theoretical works on neural collapse, our analysis does not incorporate any explicit regularization or constraint on the features. A comparison with other results can be found in Table 1, and we defer a detailed discussion to Section 5.2. The reasons why we investigate the unregularized objective are summarized as follows:

Feature regularization or constraints are an unrealistic setting that is never used in practice. Indeed, previous studies have shown that neural networks continue to perform well without any regularization or constraint [zhang2021understanding].

As shown in this study, neural collapse emerges even in the unconstrained setting, which implies that its emergence should be attributed to gradient descent and the cross-entropy loss rather than explicit regularization.

Feature regularization or constraints can be barriers for existing theories of neural networks [jacot2018neural, lyu2019gradient]. By allowing the features to be totally free, we hope our results can inspire further analyses that plug in a realistic neural network.
Table 1: Comparison with previous results on neural collapse.

| Reference | Contribution | Loss Function |
| --- | --- | --- |
| [papyan2020prevalence] | Empirical Results | Cross-Entropy Loss |
| [wojtowytsch2020emergence, lu2020neural, fang2021layer] | Global Optimum | Cross-Entropy Loss |
| [mixon2020neural, poggio2020explicit, han2021neural] | Modified Training Dynamics | MSE Loss |
| [zhu2021geometric] | Landscape Analysis | Cross-Entropy Loss |
1.1 Contributions
The contributions of the present study can be summarized as follows.

We build a relationship between the max-margin analysis [soudry2018implicit, nacson2019convergence, lyu2019gradient] and neural collapse, and provide an implicit bias analysis for the features rather than the output function. Although both the parameters and the features diverge to infinity, we prove that the convergent direction is along a direction of the minimum-norm separation problem.

Previous works [lyu2019gradient, ji2020gradient] only prove that gradient flow on homogeneous neural networks converges to a KKT point of the corresponding minimum-norm separation problem. However, the minimum-norm separation problem remains highly nonconvex, and a KKT point may not be a neural collapse solution. In this study, we perform a more detailed characterization of the convergence direction via landscape analysis.

Previous analyses of neural collapse rely on explicit regularization or constraints. In this study, we show that the implicit regularization effect of gradient flow is sufficient to lead to a neural collapse solution. The emergence of neural collapse should be attributed to gradient descent and the loss function rather than explicit regularization or constraints. We provide a detailed discussion in Section 5.2.
1.2 Related Works
Implicit Bias of Gradient Descent:
To understand how gradient descent or its variants helps deep learning find solutions with good generalization performance on the test set, a recent line of research has studied the implicit bias of gradient descent in different settings. For example, gradient descent is biased toward solutions with smaller weights under the squared loss [li2018algorithmic, amari2020does, vaswani2020each] and converges to large-margin solutions under the logistic loss [soudry2018implicit, nacson2019convergence, lyu2019gradient, chizat2020implicit, ji2020gradient]. For linear networks, [arora2019implicit, razin2020implicit, gidel2019implicit] have shown that gradient descent finds low-rank solutions.
Loss Landscape Analysis:
Although the practical optimization problems encountered in machine learning are often nonconvex, recent works have shown that their landscapes can enjoy benign properties that allow further analysis. In particular, these landscapes exhibit no spurious local minimizers or flat saddles and can be efficiently optimized via gradient-based methods [ge2015escaping]. Examples include phase retrieval [sun2018geometric], low-rank matrix recovery [ge2016matrix, ge2015escaping], dictionary learning [sun2016complete, qu2019analysis, laurent2018deep], and blind deconvolution [lau2019short].

2 Preliminaries and Problem Setup
2.1 Preliminaries
We consider a balanced dataset with $K$ classes and $n$ samples per class. A standard fully connected neural network can be represented as:

(2) $f(x) = W_L\,\sigma\big(W_{L-1}\cdots\sigma(W_1 x + b_1)\cdots + b_{L-1}\big) + b_L.$

Here, the $W_l$ denote the weight matrices in each layer, the $b_l$ denote the bias terms, and $\sigma$ denotes the nonlinear activation function, for example, ReLU or sigmoid. Let $h_{k,i}$ denote the last-layer feature for the $i$-th data point in the $k$-th class and $\bar h_k$ denote the feature mean within the $k$-th class. To provide a formal definition of neural collapse, we first introduce the concept of a simplex equiangular tight frame (ETF):

Definition 2.1 (Simplex ETF).
A matrix $M \in \mathbb{R}^{d \times K}$ is said to be a simplex equiangular tight frame (ETF) if

(3) $M = \sqrt{\frac{K}{K-1}}\, P\left(I_K - \frac{1}{K}\mathbf{1}_K\mathbf{1}_K^\top\right),$

where $P \in \mathbb{R}^{d \times K}$ is a matrix with orthonormal columns, i.e., $P^\top P = I_K$.
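The defining properties of Definition 2.1 can be checked numerically; the following sketch (assuming the parameterization $M=\sqrt{K/(K-1)}\,P(I_K-\tfrac{1}{K}\mathbf{1}\mathbf{1}^\top)$ above) verifies that the columns of a simplex ETF have equal lengths and that every pairwise cosine equals $-1/(K-1)$.

```python
import numpy as np

np.random.seed(0)
K, d = 5, 10

# Build M = sqrt(K/(K-1)) * P (I - 11^T / K) with P having orthonormal columns.
P, _ = np.linalg.qr(np.random.randn(d, K))
M = np.sqrt(K / (K - 1)) * P @ (np.eye(K) - np.ones((K, K)) / K)

norms = np.linalg.norm(M, axis=0)            # column lengths (all equal)
G = (M.T @ M) / np.outer(norms, norms)       # pairwise cosine matrix
off_diag = G[~np.eye(K, dtype=bool)]         # cosines between distinct columns
```

The off-diagonal cosines are exactly $-1/(K-1)$, the most negative value achievable by $K$ equiangular unit vectors, which is the "maximally separated" property behind (NC2) below.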
Let $W = [w_1, \dots, w_K]^\top$ be the weight of the final-layer classifier. The four criteria of neural collapse can be formulated precisely as:

(NC1) Variability collapse: As training progresses, the within-class variation of the activations becomes negligible as these activations collapse to their class means $\bar h_k$.

(NC2) Convergence to Simplex ETF:
The vectors of the class means (after centering by their global mean) converge to having equal lengths, forming equal-sized angles between any given pair, and being the maximally pairwise-distanced configuration subject to the previous two properties.

(NC3) Convergence to self-duality: The linear classifiers and class means converge to align with each other, up to appropriate rescaling; that is, there exists a universal constant $c > 0$ such that $w_k = c\,(\bar h_k - \bar h_G)$ for every class $k$, where $\bar h_G$ denotes the global feature mean.

(NC4) Simplification to nearest class-center: For a given deep-net activation $h$, the network classifier converges to choosing whichever class has the nearest training class mean, i.e., $\arg\max_k \langle w_k, h\rangle \to \arg\min_k \|h - \bar h_k\|$.
In this paper, we say that a point $(W, H)$ satisfies the neural collapse conditions, or is a neural collapse solution, if the above four criteria are all satisfied.
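The criteria above can be monitored numerically. The sketch below uses our own choice of normalizations (not necessarily the exact metrics of [papyan2020prevalence]): NC1 as within-class over between-class variability, NC2 as the deviation of pairwise cosines from $-1/(K-1)$, and NC3 as the distance between the normalized classifier and the normalized centered class means; all three vanish at an exactly collapsed, self-dual configuration.

```python
import numpy as np

def nc_metrics(H, W, labels, K):
    """H: (d, N) last-layer features, W: (K, d) classifier, labels: (N,)."""
    means = np.stack([H[:, labels == k].mean(axis=1) for k in range(K)], axis=1)
    M = means - means.mean(axis=1, keepdims=True)         # centered class means

    # NC1: within-class variability relative to between-class variability
    within = sum(np.sum((H[:, labels == k] - means[:, [k]]) ** 2)
                 for k in range(K))
    nc1 = within / np.sum(M ** 2)

    # NC2: pairwise cosines of centered means should approach -1/(K-1)
    Mn = M / np.linalg.norm(M, axis=0, keepdims=True)
    cos = Mn.T @ Mn
    nc2 = np.max(np.abs(cos[~np.eye(K, dtype=bool)] + 1.0 / (K - 1)))

    # NC3: alignment of the normalized classifier with the centered means
    nc3 = np.linalg.norm(W / np.linalg.norm(W) - M.T / np.linalg.norm(M))
    return nc1, nc2, nc3

# Sanity check on an exactly collapsed, self-dual simplex-ETF configuration.
np.random.seed(0)
K, d, n = 4, 6, 3
P, _ = np.linalg.qr(np.random.randn(d, K))
M0 = np.sqrt(K / (K - 1)) * P @ (np.eye(K) - np.ones((K, K)) / K)
H = np.repeat(M0, n, axis=1)                  # every feature equals its mean
labels = np.repeat(np.arange(K), n)
nc1, nc2, nc3 = nc_metrics(H, M0.T, labels, K)
```

All three metrics are zero here; during real training one tracks their decay toward zero.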
2.2 Problem Setup
We mainly focus on the neural collapse phenomenon, which involves only the classifiers and features in the last layer. Since a general analysis of the highly nonsmooth and nonconvex neural network is difficult, we peel off the last layer of the neural network and propose the following unconstrained layer-peeled model (ULPM) as a simplification that captures the main characteristics related to neural collapse during the training dynamics. A similar simplification is commonly used in previous theoretical works [lu2020neural, fang2021layer, wojtowytsch2020emergence, zhu2021geometric], but ours does not place any constraint or regularization on the features and is closer to realistic neural network models. It should be mentioned that although [mixon2020neural, han2021neural] also studied the unconstrained model, their analysis adopted an approximation to the real dynamics and was highly dependent on the MSE loss, which is rarely used in classification tasks. Compared with their works, we directly deal with the real training dynamics and cover the most popular cross-entropy loss for classification tasks.
Let $W \in \mathbb{R}^{K \times d}$ and $H \in \mathbb{R}^{d \times Kn}$ be the matrices of classifiers and features in the last layer, where $K$ is the number of classes and $n$ is the number of data points in each class. The ULPM is defined as follows:

(4) $\min_{W,H}\ \mathcal{L}(W,H) = \sum_{k=1}^{K}\sum_{i=1}^{n} -\log\frac{\exp(w_k^\top h_{k,i})}{\sum_{j=1}^{K}\exp(w_j^\top h_{k,i})}.$

Here, we do not place any constraint or regularization on the features, which corresponds to the absence of weight decay in deep learning training. The objective function (4) is generally nonconvex in $(W, H)$, and we aim to study its landscape. Furthermore, we consider the gradient flow of the objective function:

$\dot W(t) = -\nabla_W \mathcal{L}(W(t), H(t)), \qquad \dot H(t) = -\nabla_H \mathcal{L}(W(t), H(t)).$
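As a sanity check on this unconstrained setup, the following sketch (plain gradient descent as a discretization of the flow; the sizes and step size are arbitrary choices of ours) runs gradient descent on objective (4) with free features: the loss decreases, the parameter norms grow, and every sample ends up classified with positive margin.

```python
import numpy as np

np.random.seed(0)
K, d, n = 3, 4, 3
N = K * n
labels = np.repeat(np.arange(K), n)
Y = np.eye(K)[:, labels]                       # one-hot targets, shape (K, N)

W = 0.1 * np.random.randn(K, d)                # free classifier
H = 0.1 * np.random.randn(d, N)                # free last-layer features

def loss_and_grads(W, H):
    logits = W @ H
    logits = logits - logits.max(axis=0, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=0, keepdims=True)          # softmax probabilities
    loss = -np.mean(np.log(P[labels, np.arange(N)]))
    G = (P - Y) / N                            # d(loss)/d(logits)
    return loss, G @ H.T, W.T @ G              # loss, grad_W, grad_H

loss0 = loss_and_grads(W, H)[0]
norm0 = np.linalg.norm(W)
for _ in range(10000):                         # discretized gradient flow
    loss, gW, gH = loss_and_grads(W, H)
    W -= 0.5 * gW
    H -= 0.5 * gH

logits = W @ H
wrong = np.where(Y.astype(bool), -np.inf, logits).max(axis=0)
min_margin = (logits[labels, np.arange(N)] - wrong).min()
```

Since the loss has no finite minimizer, the iterates diverge in norm while the loss keeps shrinking; what remains meaningful is the direction of $(W, H)$, which is the object studied in the next section.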
Notations.
$\|\cdot\|_F$ denotes the Frobenius norm, $\|\cdot\|_2$ denotes the matrix spectral norm, $\|\cdot\|_*$ denotes the nuclear norm, $\|\cdot\|$ denotes the vector $\ell_2$ norm, and $\mathrm{tr}(\cdot)$ is the trace of a matrix. We use $[n]$ to denote the set of indices up to $n$.
3 Main Results
In this section, we present our main results regarding the training dynamics and landscape analysis of (4). This section is organized as follows: First, in Section 3.1, we show the relationship between the margin and neural collapse in our surrogate model. Inspired by this relationship, we propose a minimum-norm separation problem (5) and show the connection between the convergence direction of the gradient flow and the KKT points of (5). In addition, we explicitly solve for the global optimum of (5) and show that it must satisfy the neural collapse conditions. However, owing to the nonconvexity, we present Example 3.1 in Section 3.2, which shows that there exist bad KKT points at which a simple gradient flow can get stuck without converging to the neural collapse solution, which is proved to be optimal in Theorem 3.3. Then, we present our second-order analysis in Theorem 3.4 to show that those bad points exhibit decreasing directions in the tangent space; thus, gradient descent and its variants can avoid those bad directions and converge only to the neural collapse solutions [lee2016gradient, ge2015escaping].
3.1 Convergence to the First-Order Stationary Point
Heuristically speaking, the simplex ETF (Definition 2.1) gives a set of vectors with the maximum average angle between them. As a result, neural collapse implies that neural networks tend to maximize the angles between each class and the corresponding classifiers. At a high level, such behavior is quite similar to margin maximization, which is known to be an implicit regularization effect of gradient descent and has been extensively studied in [soudry2018implicit, nacson2019convergence, lyu2019gradient, ji2020gradient]. First, we illustrate the connection between the margin and neural collapse. Recall that the margin of a single data point with the associated feature $h_{k,i}$ is $q_{k,i} = w_k^\top h_{k,i} - \max_{j \neq k} w_j^\top h_{k,i}$. Then, the margin of the entire dataset can be defined as $q_{\min} = \min_{k,i} q_{k,i}$.
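A minimal sketch of these definitions in code, assuming the margin takes the standard form $q_{k,i} = w_k^\top h_{k,i} - \max_{j\neq k} w_j^\top h_{k,i}$: note that the margin is bilinear in $(W, H)$, so it can be inflated arbitrarily by rescaling, which is why only normalized (directional) statements about margins are meaningful.

```python
import numpy as np

def sample_margin(W, h, k):
    """q_{k,i} = w_k . h - max_{j != k} w_j . h."""
    logits = W @ h
    return logits[k] - np.max(np.delete(logits, k))

def dataset_margin(W, H, labels):
    """Minimum margin over the whole dataset."""
    return min(sample_margin(W, H[:, i], labels[i]) for i in range(H.shape[1]))

np.random.seed(0)
W = np.random.randn(3, 4)
H = np.random.randn(4, 6)
labels = np.array([0, 0, 1, 1, 2, 2])

q = dataset_margin(W, H, labels)
# Bilinearity: scaling W by a > 0 and H by b > 0 scales every margin by a * b,
# so only the direction of (W, H) matters, not its norm.
q_scaled = dataset_margin(2 * W, 3 * H, labels)
```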
The following theorem demonstrates that neural collapse yields the maximum margin solution of our ULPM model:
Theorem 3.1 (Neural collapse as maxmargin solution).
For the ULPM model (4), the margin of the entire dataset always satisfies
and the equality holds if and only if $(W, H)$ satisfies the neural collapse conditions with the norms of $W$ and $H$ balanced.
Based on this finding, we present an analysis of the convergence of the gradient flow on the ULPM (4). Following [lyu2019gradient], we link the gradient flow on crossentropy loss to a minimumnorm separation problem.
Theorem 3.2.
For problem (4), let $(W(t), H(t))$ be the path of the gradient flow at time $t$. If there exists a time $t_0$ such that $\mathcal{L}(W(t_0), H(t_0)) < \log 2$, then any limit point of the normalized path
is along the direction (i.e., a constant multiple) of a Karush-Kuhn-Tucker (KKT) point of the following minimum-norm separation problem:

(5) $\min_{W,H}\ \frac{1}{2}\left(\|W\|_F^2 + \|H\|_F^2\right)\quad \text{s.t.}\quad (w_k - w_j)^\top h_{k,i} \ge 1,\quad \forall\, j \neq k,\ \forall\, k \in [K],\ i \in [n].$
Remark 3.1.
Here, we write (5) as a constrained problem, but the constraint is introduced by the implicit regularization effect of gradient flow on our ULPM objective (4). Since the training dynamics diverge to infinity, an appropriate normalization is needed to show that the divergence direction is closely related to neural collapse, which is why the statement takes the form of a constrained optimization problem. Our goal is to show that the neural collapse phenomenon is caused by the properties of the loss function and the training dynamics rather than by an explicit regularization or constraint, which seems to be necessary in previous studies [fang2021layer, lu2020neural, wojtowytsch2020emergence].
Remark 3.2.
In Theorem 3.2, we assume that there exists a time $t_0$ such that $\mathcal{L}(W(t_0), H(t_0)) < \log 2$. Note that the loss function can be rewritten as:

(6) $\mathcal{L}(W,H) = \sum_{k=1}^{K}\sum_{i=1}^{n} \log\Big(1 + \sum_{j \neq k} \exp\big((w_j - w_k)^\top h_{k,i}\big)\Big);$

the requirement $\mathcal{L} < \log 2$ implies that the loss of every individual sample is below $\log 2$, which is equivalent to $w_k^\top h_{k,i} > w_j^\top h_{k,i}$ for all $j \neq k$; that is, every feature is separated perfectly by the classifier. This assumption is common in the study of implicit bias in the nonlinear setting [nacson2019lexicographic, lyu2019gradient, ji2020gradient], and its validity is supported by the fact that neural collapse is found only in the terminal phase of training deep neural networks, where the training accuracy has reached 100%. It is also an interesting direction to remove this assumption and study the early-stage dynamics of training, which is beyond the scope of this study and is left for future exploration.
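The separation argument in this remark can be checked directly: a per-sample cross-entropy loss below $\log 2$ forces the softmax probability of the true class above $1/2$, which in turn forces correct classification. (The $\log 2$ threshold here is the standard condition from [lyu2019gradient]; we treat the exact constant as an assumption of this sketch.)

```python
import numpy as np

np.random.seed(1)

def ce_and_correct(logits, label):
    """Per-sample cross-entropy loss and correctness of the arg-max decision."""
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    return -np.log(p[label]), int(np.argmax(logits)) == label

# Whenever the per-sample loss is below log 2, the true-class probability
# exceeds 1/2 and hence strictly dominates every other class probability,
# so the sample must be classified correctly.
violations = 0
for _ in range(10000):
    logits = 3.0 * np.random.randn(5)
    label = np.random.randint(5)
    loss, correct = ce_and_correct(logits, label)
    if loss < np.log(2) and not correct:
        violations += 1
```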
Theorem 3.2 indicates that the convergence direction of the gradient flow is restricted to max-margin directions, which usually enjoy good robustness and generalization performance. In general, the KKT conditions are not sufficient for global optimality because the minimum-norm separation problem (5) is nonconvex. On the other hand, based on Theorem 3.1, we can precisely characterize its global optimum in the ULPM case:
Corollary 3.1.
Every global optimum of the minimumnorm separation problem (5) is also a KKT point that satisfies the neural collapse conditions.
3.2 Second-Order Landscape Analysis
In convex optimization problems, the KKT conditions are usually equivalent to global optimality. Unfortunately, owing to the nonconvex nature of the objective (4), KKT points can also be saddle points or local optima rather than global optima. In this section, we show via landscape analysis that this nonconvex optimization problem is in fact benign. More specifically, we prove that except for the global optimum given by neural collapse, all the other KKT points are saddle points that can be avoided by gradient flow.
In contrast to previous landscape analyses of nonconvex problems, where the aim is to show that the objective has a negative directional curvature around any stationary point [sun2015nonconvex, zhang2020symmetry], our analysis is slightly different. Since the model is unconstrained, once the features can be perfectly separated, the ULPM objective (4) always decreases along the direction of the current point, and the optimum is attained only at infinity. Although growing along any of these perfectly separating directions lets the loss decrease to 0, the speed of decrease differs, and there exists an optimal direction with the fastest decrease. As shown in Section 3.1, a first-order analysis of the training dynamics fails to distinguish this optimal direction from other KKT points, and we need a second-order analysis to fully characterize the realistic training dynamics. First, we provide an example to illustrate the motivation and necessity of the second-order landscape analysis.
Example 3.1 (A Motivating Example).
Consider the following case and point:
One can easily verify that this point enables our model to classify all features perfectly. Furthermore, we can show that it is along the direction of a KKT point of the minimum-norm separation problem (5) by constructing the Lagrangian multipliers as follows:
To see this, simply write down the corresponding Lagrangian (note that we aim to show the point is along the direction of a KKT point of (5); to make it a true KKT point, one needs to multiply by an appropriate scaling factor):
Taking derivatives, we find that the point satisfies the KKT conditions. However, its gradient is:
We can see that the directions of the gradient and the parameters align with each other (i.e., the gradient of each block is parallel to that block), which implies that simple gradient descent may get stuck in this direction and only grow the parameter norm. However, if we construct:
By a simple calculation, we find that the constructed point satisfies the desired inequality. This implies that for any scaling, we can choose an appropriate perturbation such that:
Example 3.1 shows that some KKT points of the minimum-norm separation problem (5) are not globally optimal, but there exist better points close to them; thus, gradient-based methods can easily avoid them (see [lee2016gradient, ge2015escaping] for a detailed discussion). In the following theorem, we show that the best directions are the neural collapse solutions, in the sense that the loss function is lowest among all growing directions.
Theorem 3.3.
The optimal value of the loss function (4) on a sphere of any given radius is attained if and only if $(W, H)$ satisfies the neural collapse conditions and the norms of $W$ and $H$ are balanced.
Remark 3.3.
Note that the second condition is necessary because the neural collapse conditions do not specify the ratio between the norms of $W$ and $H$. That is, if $(W, H)$ satisfies the neural collapse conditions, then any rescaled pair $(\alpha W, \beta H)$ with $\alpha, \beta > 0$ also satisfies them, but only certain ratios are optimal.
Now, we turn to those points that are not globally optimal. To formalize our discussion in the motivating Example 3.1, we first introduce the tangent space:
Definition 3.1 (tangent space).
The tangent space of a point $(W, H)$ is defined as the set of directions that are orthogonal to $(W, H)$:

$T_{(W,H)} = \left\{(\Delta_W, \Delta_H) : \langle \Delta_W, W\rangle + \langle \Delta_H, H\rangle = 0\right\}.$
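Operationally, restricting a direction to the tangent space just means removing its radial component; a minimal sketch, flattening the pair $(W, H)$ into a single parameter array as in our proofs:

```python
import numpy as np

def tangent_project(delta, theta):
    """Remove from `delta` its component along `theta`, leaving a direction
    in the tangent space of theta (orthogonal to the radial direction)."""
    t, d = theta.ravel(), delta.ravel()
    return delta - ((d @ t) / (t @ t)) * theta

np.random.seed(0)
theta = np.random.randn(6, 4)         # stands for the stacked parameter (W, H)
delta = np.random.randn(6, 4)         # an arbitrary perturbation direction
proj = tangent_project(delta, theta)
inner = float(np.sum(proj * theta))   # inner product with theta; should vanish
```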
Our next result justifies the observation in Example 3.1: for those non-optimal points, there exists a direction in the tangent space such that moving along it leads to a lower objective value.
Theorem 3.4.
If $(W, H)$ is not an optimal solution in the sense of Theorem 3.3 (i.e., it is not a neural collapse solution, or it is a neural collapse solution whose norms are unbalanced), then there exist a direction in the tangent space and a constant such that, for all sufficiently large scalings,
(7) 
Furthermore, it implies that for any scaling there exists a point such that:
(8)  
Remark 3.4.
The result in (7) gives us a decreasing direction orthogonal to the current point. As shown in Example 3.1, the gradient at these non-optimal points might be parallel to the point itself; thus, a first-order analysis fails to explain the prevalence of neural collapse. Here, the decreasing direction must be obtained by analyzing the Hessian, which further indicates that these points are saddle points in the tangent space; this is why we call our approach a second-order landscape analysis. A formal statement and proof are presented in Appendix C. Previous works have shown that a large family of gradient-based methods avoid saddle points and converge only to minimizers [lee2016gradient, ge2015escaping, panageas2019first]; thus, our landscape analysis indicates that the gradient flow dynamics find only neural collapse directions.
4 Empirical Results
To evaluate our theory, we trained VGG13 [simonyan2014very] and ResNet18 [he2016deep] on the FashionMNIST [xiao2017fashion] and CIFAR10 [krizhevsky2009learning] datasets without weight decay and tracked the convergence of the last-layer features to the neural collapse solution every few epochs to see how it changes during the terminal phase of training. All the networks were trained for 500 epochs using stochastic gradient descent without weight decay. The results are plotted in Figure 1. A detailed definition of the metrics can be found in Appendix D. We observe that in all the experiments, the variation of the norms became smaller than 0.1 after 500 training epochs (a non-decreasing variation of norms was also found in [papyan2020prevalence]; this may be attributed to the network architecture and the characteristics of real-world datasets); the cosines between pairs of last-layer features and those of the classifiers converged to the equiangular state with maximum angles; the distance between the normalized centered classifiers and the normalized last-layer features decreased steadily after 100 epochs; and the within-class variation also decreased approximately steadily after 100 epochs. All the aforementioned quantities either decrease or remain small during the training process, implying that neural collapse occurs given sufficient training epochs.

5 Conclusion and Discussion
5.1 Conclusion
To understand the implicit bias of neural features under gradient descent training, we built a connection between the max-margin implicit bias and the neural collapse phenomenon and proposed the ULPM. We proved that the gradient flow of the ULPM converges to a KKT point of a minimum-norm separation problem, whose global optimum satisfies the neural collapse conditions. Although the ULPM is nonconvex, we showed that it has a benign landscape in which all the stationary points are strict saddle points in the tangent space, except for the global neural collapse solutions. Our study helps to demystify the neural collapse phenomenon and sheds light on the generalization and robustness of deep networks during the terminal phase of training in classification problems.
5.2 Relationship with Other Results on Neural Collapse
Theoretical analyses of neural collapse were first provided by [lu2020neural, wojtowytsch2020emergence, fang2021layer], who showed that the neural collapse solution is the only global minimum of a simplified nonconvex objective function. In particular, [wojtowytsch2020emergence, lu2020neural] studied a continuous integral form of the loss function and showed that the learned features should have a uniform distribution on the sphere. A more realistic discrete setting was studied in [fang2021layer], where the constraint is placed on the entire feature matrix rather than on individual features. Our result utilizes the implicit bias of the cross-entropy loss function to remove the feature norm constraint, which is not practical in real applications.

Although the global optimum can be fully characterized by the neural collapse conditions, the ULPM objective is still highly nonconvex. Regarding optimization, [mixon2020neural, poggio2020explicit, han2021neural] analyzed the unconstrained feature model with the MSE loss and established convergence results of gradient descent to collapsed features. However, these results fail to generalize to the more practical cross-entropy loss used in classification tasks: their analysis relies heavily on the MSE loss to obtain a closed-form gradient flow and still requires additional approximations to guarantee global convergence.
The most relevant study is the concurrent work [zhu2021geometric], which provides a landscape analysis of the regularized unconstrained feature model. [zhu2021geometric] turns the feature norm constraint in [fang2021layer] into feature norm regularization and still preserves the neural collapse global optimum. At the same time, it shows that the modified regularized objective enjoys a benign landscape, where all the critical points are strict saddles except for the global one. Although our study and [zhu2021geometric] discover similar landscape results, we believe our characterization remains closer to the real algorithms, since we do not introduce any constraint or regularization on the feature norm, following the conventional setting in realistic training. The regularization of features introduced in [zhu2021geometric] is still different from weight decay regularization [krogh1992simple]. Moreover, weight decay on homogeneous neural networks is equivalent to gradient descent with a scaled step size on an unregularized objective [li2019exponential, zhang2018deep]. Furthermore, neural networks are found to perform well in the absence of weight decay, which highlights the importance of implicit regularization [zhang2021understanding]. As a result, instead of an explicit feature norm constraint or regularization, we consider the implicit regularization brought by the gradient flow on the cross-entropy loss, and we show that this implicit regularization is sufficient to make the dynamics converge to the neural collapse solution.
Acknowledgments
We are grateful to Qing Qu and X.Y. Han for helpful discussions and feedback on an early version of the manuscript. Wenlong Ji is partially supported by the elite undergraduate training program of the School of Mathematical Sciences at Peking University. Yiping Lu is supported by the Stanford Interdisciplinary Graduate Fellowship (SIGF).
References
Appendix A Elements of optimization
In this section, we introduce some basic definitions and results from optimization theory. In the following discussion, we consider a standard inequality-constrained optimization problem:

(9) $\min_{x}\ f(x)$
s.t. $g_i(x) \le 0, \quad i = 1, \dots, m.$

In addition, we assume that the functions $f$ and $g_i$ are twice differentiable. A point $x$ is said to be feasible if and only if it satisfies all of the constraints in (9), i.e., $g_i(x) \le 0$ for all $i \in [m]$. The Lagrangian of problem (9) is defined as follows:

$L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x).$
A.1 Karush-Kuhn-Tucker Conditions
Let us first introduce the definitions of a Karush-Kuhn-Tucker (KKT) point and an approximate KKT point. Here we follow the definition of a KKT point as in [lyu2019gradient].
Definition A.1 (Definition of KKT point).
A feasible point $x$ is said to be a KKT point of problem (9) if there exist multipliers $\lambda_1, \dots, \lambda_m \ge 0$ such that the following Karush-Kuhn-Tucker (KKT) conditions hold: $\nabla f(x) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x) = 0$ and $\lambda_i g_i(x) = 0$ for all $i \in [m]$.
Definition A.2 (Definition of approximate KKT point).
Generally speaking, the KKT conditions might not be necessary for global optimality. We need some additional regularity conditions to make them necessary. For example, as shown in [Dutta2013ApproximateKP], we can require the problem to satisfy the following Mangasarian-Fromovitz constraint qualification (MFCQ):
Definition A.3 (Mangasarian-Fromovitz constraint qualification (MFCQ)).
Problem (9) is said to satisfy the MFCQ at a feasible point $x$ if there exists a direction $v$ such that $\nabla g_i(x)^\top v < 0$ for every active constraint $i$, i.e., every $i$ with $g_i(x) = 0$.
Moreover, when the MFCQ holds, we can build a connection between approximate KKT points and KKT points; see the detailed proof in [Dutta2013ApproximateKP]:
Appendix B Omitted proofs from Section 3.1
Recall our ULPM problem:
Let us review some basic notation defined in the main body. The margin of a single feature $h_{k,i}$ is defined to be $q_{k,i} = w_k^\top h_{k,i} - \max_{j \neq k} w_j^\top h_{k,i}$, and the margin of the entire dataset is $q_{\min} = \min_{k,i} q_{k,i}$. We first prove Theorem 3.1 from the main body.
Proof of Theorem 3.1.
First, note that the margin does not change if we subtract a common vector from all the classifiers. Thus, denoting the mean of the classifiers by $\bar w = \frac{1}{K}\sum_{k=1}^{K} w_k$ and centering each classifier, we have:
(11) 
Summing this inequality over all classes and samples, we have:
By the Cauchy-Schwarz inequality, we have:
(12) 
Summing (12) over $k$ and $i$, we have:
(13) 
On the other hand, we know that:
Then we can conclude that:
as desired. When equality holds, substituting the equality condition back into (13), all of the equalities in (12) and (11) must hold, which gives us:
Substituting this into (11), we have:
which implies neural collapse conditions. ∎
Now let us turn to the training dynamics and prove Theorem 3.2. Before starting the proof, we introduce some additional notation. Since the classifiers and features are all optimization variables here, we denote the whole parameter by $\theta = (W, H)$ for simplicity; all of the previous functions can be defined on $\theta$ by matching the corresponding parameters. Denote by $\rho(t)$ the norm of $\theta(t)$. Now we can state our first lemma, which shows how the training dynamics of the gradient flow on the ULPM objective (B) relate to a KKT point of (5).
Lemma B.1.
If there exists a time $t_0$ such that $\mathcal{L}(\theta(t_0)) < \log 2$, then for any $t > t_0$, the normalized parameter is an approximate KKT point of the following minimum-norm separation problem. More precisely, we have
where:
denotes the angle between the parameter $\theta(t)$ and its corresponding gradient.
Proof.
The training dynamics is given by gradient flow:
Then, by the chain rule, we have:
(14) 
This indicates that the loss function is monotonically decreasing; in particular, once the loss drops below the threshold at $t_0$, it stays below it for all $t > t_0$. On the other hand, note that
(15) 
which gives us the claimed inequality.
Note that we can rewrite the ULPM objective function (B) accordingly. By the chain rule and the gradient flow equation, we have
where the quantity above is the corresponding gradient. Now, constructing the candidate Lagrange multipliers, we only need to show:
(16) 
(17) 
To prove (16), we only need to compute the following (recall the definitions above):
Note that:
(18) 
Then we have the following inequality:
(19) 