assorted schoolwork, personal projects, independent research
(wip! needs editing)
Intro: As sigmoidally-activated neural networks progress through their learning lifecycles, they begin in an untrained phase where weight-learning dynamics are nearly those of a linear dynamical system, and they eventually reach a trained, saturated phase where learning dynamics are nearly static and the network resembles a graph. After making these concepts formal, a theoretical analysis sheds light on how networks pass through these two previously-unspecified (and under-appreciated) training phases. A preliminary result is that juvenile networks approach a clearly-defined intermediate or hybrid phase, where learning proceeds partly as if the network were a linear dynamical system and partly as if it were a binary decision graph, and that they exit this intermediate phase into the third phase more slowly than they enter it from the first. One consequence of the analysis is a possible initial-training scheme that drastically cuts the time initial training requires. The analysis also identifies some non-sigmoidal activations, such as ReLU, as having special properties that promote continued learning on ever more data and help avoid the model saturation suffered by many non-deep machine learning methods.
For context, this project started when I noticed more or less simultaneously (a) that small weights moving along a straight line act effectively linearly, and I've had linear dynamics and eigenvalue solutions to steady-state linear dynamical systems on my mind lately for ESORMA and such; (b) that a network of ones and zeros is a binary graph and can be interpreted as a collection of logical evaluations (e.g. True = A or C or F or G), which had also been on my mind; and (c) that a neural net passes through both of these seemingly-incompatible regimes and has to somehow transition from one to the other. I'm presenting the material below roughly in the order I came to understand it myself.
Observation 1: If a vanilla deep neural network with logistic or tanh nonlinearities (et al.*) starts with small initial values (as in Glorot initialization, for any nontrivial layer sizes), then adjusting the weights by small amounts acts in a linear way. That is, the dynamics of learning initially behave like a linear system even with strongly nonlinear activation functions. (This also holds for ReLU at values away from zero.) The reason is the Taylor expansions around zero: tanh(z) ≈ z and sigm(z) ≈ 1/2 + z/4, with the second-order terms vanishing at zero, so for small pre-activations the nonlinearities act (affine-)linearly. * Softmax requires the pre-activation values of each component to be similar. While the logistic on two variables would be y_1 = a(W_1 x) = 1/(1 + e^(-W_1 x)), y_2 = a(W_2 x) = 1/(1 + e^(-W_2 x)), softmax instead has y_1 = e^(W_1 x)/(e^(W_1 x) + e^(W_2 x)), which is in phase 1 only when W_1 x and W_2 x are approximately equal (regardless of magnitude or sign), in addition to being within the phase-2 peaks.
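To make Observation 1 concrete, here is a minimal NumPy sketch (the 0.01 weight scale and the layer sizes are arbitrary choices for illustration, not taken from any actual experiment) comparing a small-weight two-layer tanh network against its purely linear counterpart:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small Glorot-style initial weights keep pre-activations deep in the
# near-linear region of tanh (tanh(z) ~ z when |z| << 1).
n_in, n_hid, n_out = 20, 50, 5
W1 = 0.01 * rng.standard_normal((n_hid, n_in))
W2 = 0.01 * rng.standard_normal((n_out, n_hid))

x = rng.standard_normal(n_in)

y_nonlinear = W2 @ np.tanh(W1 @ x)   # full forward pass
y_linear = W2 @ (W1 @ x)             # tanh replaced by the identity

rel_err = np.abs(y_nonlinear - y_linear).max() / np.abs(y_linear).max()
print(rel_err)  # small relative difference: the net is effectively linear here
```

At this weight scale the nonlinearity barely registers, so the early gradient steps see what is essentially a product of matrices, i.e. a linear system.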
Observation 2: When a network is highly learned, continuing to train it on the same dataset will only increase the absolute values of the weights; that is, the weights tend toward +/- infinity. Put through tanh or sigmoid (wlog, let's use sigmoid), the outputs of layers, i.e. the y values in y = sigm(Wx), will tend to be just below 1 or just above 0. A layer whose outputs are essentially zero or one acts as a binary operator, which can be interpreted as a graph or a logical decision process, in the sense that each w·x gets assigned a 1 or a 0, which can be read as a truth-value assignment. Even if a multilayer network's input features are not {0,1}-valued, the output of a saturated (aka fully-learned) first layer will be {0,1}-valued, which upholds the binary-graph interpretation of the subsequent layers. (Saturation can also be gauged by how many weight-update signs flip between training epochs.)
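And an equally minimal sketch of the saturated regime (the weight magnitude of 50 is just a stand-in for "weights that have drifted very large", not anything measured):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in for a "fully learned" layer: weights have drifted to large
# magnitudes, so pre-activations sit far out on the sigmoid's tails.
W = 50.0 * rng.standard_normal((6, 8))
x = rng.standard_normal(8)

y = sigm(W @ x)
print(np.round(y, 3))   # with weights this large, entries are essentially 0 or 1

# Reading the saturated outputs as truth-value assignments:
truth = y > 0.5          # a boolean vector, i.e. one row of a binary "graph"
print(truth)
```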
So, several questions arise:
Neural network attractors and dynamical systems, 2021. A project made of a few parts so far. I'm asking questions about attractors of recurrent Hopfield/Boltzmann neural networks. Attractors serve as state systems and have been used to model associative memory and short-term memory in the cortex. So I'm curious about the shape and distribution of attractors, to further understand these networks' computational complexity and expressiveness. To this end, I first employ standard theory of dynamical systems, stability theory, eigenspectra, and random matrix models of biologically-plausible networks. [ Next steps are making distributed analogues, relating them to logical circuits, automata, & LSTMs, and quantifying expressiveness per unit of resources. ]
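For a flavor of the random-matrix side, here's a small sketch (N, the gain g, and the Gaussian connectivity are illustrative choices echoing the standard circular-law setup, not the project's actual code) showing how the eigenspectrum of random connectivity controls linear stability:

```python
import numpy as np

rng = np.random.default_rng(2)

# Random connectivity J for N "neurons", entries ~ N(0, g^2 / N).
# Circular law: the eigenvalues fill a disk of radius ~g, so the
# linearized dynamics  dx/dt = -x + J x  are stable when g < 1.
N, g = 500, 0.9
J = (g / np.sqrt(N)) * rng.standard_normal((N, N))

eigvals = np.linalg.eigvals(J)
print(np.abs(eigvals).max())   # spectral radius, close to g for large N
```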
This is a hypothetical grant proposal that outlines, directly and succinctly, the problem I'm working on, why I care about it, and how I'm going about it. Definitely draft material, but it should be halfway legible to people who aren't me. I think I'm going to start labeling this project ESORMA: EigenSpectra Of Random Matrices / Attractors. The name doesn't actually make perfect sense and is a bit redundant, but it helps get the point across roughly and makes a nice soundbite.
WIP Précis, 2021. This is a very rough draft document of a new project I'm working on, kept for my own reference. Feel free to take a look, but it's very unedited thoughtdump/wordbarf!
Here is a GitHub repo of computational analyses of (1) the spectra of various random matrices and (2) finding the zeros of an equation related to spin glasses, in an attempt to build intuition for the classic result that attractor networks with N neurons can hold up to about 0.138N patterns.
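For a sense of what that capacity result looks like numerically, here's a crude sketch (Hebbian storage with a single synchronous update; N and the loadings are arbitrary, and this is not the repo's code, which does the spin-glass analysis properly):

```python
import numpy as np

rng = np.random.default_rng(3)

def one_step_error(N, P, rng):
    """Store P random +/-1 patterns in an N-neuron Hopfield net via the
    Hebb rule, then report the fraction of bits of the first pattern
    that flip after one synchronous update."""
    patterns = rng.choice([-1, 1], size=(P, N))
    W = (patterns.T @ patterns) / N
    np.fill_diagonal(W, 0.0)                 # no self-connections
    h = W @ patterns[0]
    recalled = np.where(h >= 0, 1, -1)
    return np.mean(recalled != patterns[0])

N = 400
for alpha in (0.05, 0.10, 0.138, 0.20):      # loading alpha = P / N
    print(alpha, one_step_error(N, int(alpha * N), rng))
```

The one-step bit-error rate climbs with the loading alpha = P/N; the sharp breakdown near 0.138 is a statement about the full recurrent (spin-glass) dynamics, which is what the repo digs into.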