

Rho Thoughts

This is a reference sheet, mostly for myself but also for whoever cares to peruse it, on a general problem in mathematics that underpins making biologically realistic neuromodulator action in artificial neural networks.

Related to the fundamental problem linked here:

Also see my GitHub project on flow_forces for a great survey/summary of the half-exponential problem.

Find a function \(f(x)\), real or complex, whose functional iterates continuously interpolate to \(e^x\). That is, for each \(\alpha \in (0,1)\), find a fractional iterate \(f_\alpha = \exp^{[\alpha]}\) such that the family satisfies \(f_\alpha \circ f_\beta = f_{\alpha+\beta}\) and \(f_1 = \exp\). Here \(f^{[n]}\) denotes function iteration, i.e., \(n\) compositions of a function; e.g., \(f^{[3]}(x) := f(f(f(x)))\). A simpler version of the problem: find \(f\) such that \(f(f(x)) = e^x\), i.e., \(f = \exp^{[1/2]}\), the half-exponential.
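As a concrete (and much easier) illustration of what fractional iteration means, here is a small Python sketch of my own. It uses a toy function, \(x \mapsto x^2\) on \(x>1\), where conjugation by \(\log\) makes the fractional iterates explicit; the names (`square_iterate`) are just illustrative, and nothing here solves the actual half-exponential problem.

```python
# Toy illustration of fractional iteration by conjugation (Schroeder-style).
# For f(x) = x**2 on x > 1, conjugating by log turns composition into
# multiplication: log(f(exp(u))) = 2*u.  The alpha-th iterate is therefore
# f^[alpha](x) = x**(2**alpha), and the iterates form a one-parameter
# semigroup: f^[alpha] o f^[beta] = f^[alpha + beta], with f^[1] = f.

def square_iterate(x, alpha):
    """alpha-th functional iterate of x -> x**2 (valid for x > 1)."""
    return x ** (2.0 ** alpha)

x = 1.7
half_then_half = square_iterate(square_iterate(x, 0.5), 0.5)  # f^[1/2] twice
print(half_then_half, x ** 2)  # both ~2.89

# The same trick does not transfer directly to exp: it has no real fixed
# point to conjugate around, which is exactly what makes the
# half-exponential problem hard.
```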


These problems are equivalent, by various justifications:

uncertain equivalences:

hand-wavy equivalences (don't quote me!):

other resolutions:

some important knowns:

Leads:

approximations:

A thought on attention models, 2022. Transformers (Vaswani et al., 2017) successfully use attention mechanisms. Attention mechanisms are powerful, and Vaswani et al. demonstrated they can sometimes be sufficient for state-of-the-art deep network performance without any of the other improvements, such as the convolutional or recurrent layer, made to the ANN model since its inception some 60 years ago. While their transformer attention component is egregiously and unnecessarily bloated (e.g. Tay et al., 2021, "Efficient Transformers"), it can still be gleaned that the core attentional interaction within the transformer mechanism involves (Vx)*exp(Wx) for the input vector x and weight matrices V and W. Many other attention mechanisms in the literature likewise use an exponential function of the input vector x multiplied by the input x again (e.g. the Tay surveys, 2020 and 2021). Pending a thorough lit review, I'd say almost all popular attention models use this fundamental schematic of multiplying x by itself with an exp in the middle somewhere. There are almost no exceptions to this form; almost always the exp is specifically a softmax (i.e., x*exp(x)/sum_X(exp(X))), with the notable exceptions, which diverge from (albeit were inspired by) this direct model, being the class of efficient-attention transformer models such as Linformer (Wang et al., 2020).

Anthropomorphically, in (Vx)*exp(Wx) the "Vx" contribution is being modulated by the attention mechanism given by the "exp(Wx)" part. You could call Vx the 'attendee' and Wx (or exp(Wx)) the 'attendor'. Intuitively, however, it's unclear why one contributing part should be tasked with the attendor role and the other with the attendee role; the assignment is rather arbitrary and restrictive. Would (Vx)exp((Wx)exp(Ux)) make sense? Where do the arbitrary human decisions stop -- just whenever you feel like it? Indeed, in the human body, while it can be the case that one neurotransmitter fully modulates another, there are scenarios where two neurotransmitters mutually modulate each other in a more equal way, with neither clearly being the modulator of the other (e.g. some ionotropic ones).

Correspondingly, it's unclear why we can't have the two contributing parts Vx and Wx co-modulate - that is, give each some exp-like effect on the other but not the full effect, so that there's still 'material' to attend to. A functional square root of the exp function would be the intuitive substitute, creating instead f(Vx)*f(Wx), where each part modulates the other based on its value but together they keep the functionality of (Vx)*exp(Wx). The obvious stepping stone would then be to dismantle and equalize the structure so that all the inputs x_i of x each have both an aspect of standard linear interaction (Wx) and some superlinear effect on each other, causing each input and weight to have a mix of attendor and attendee behaviour, and giving a more nuanced improvement on the concept of attention as it stands currently. Furthermore, separating the propagation into Vx and Wx also goes against the trend of successful models; it's maybe best to somehow eliminate the distinct divisions Vx and Wx in our attention model. A minimal sketch contrasting the two forms follows below.
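To make the contrast concrete, here is a minimal Python sketch of the schematic above, not of actual multi-head transformer attention. The function names (`standard_attention_mix`, `comodulated_mix`) and the `half_exp` argument are my own placeholders; in particular, exp(z/2) is not a functional square root of exp and is only there so the example runs.

```python
import numpy as np

# Minimal toy sketch contrasting the usual "attendor/attendee" form
# (Vx) * softmax(Wx) with the co-modulated variant f(Vx) * f(Wx) suggested
# above, where f would ideally be exp^[1/2].

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
V = rng.normal(size=(d, d))
W = rng.normal(size=(d, d))

def standard_attention_mix(x, V, W):
    """(Vx) modulated elementwise by softmax(Wx): the asymmetric form."""
    scores = np.exp(W @ x)
    return (V @ x) * scores / scores.sum()

def comodulated_mix(x, V, W, half_exp):
    """Symmetric variant: each branch gets the same 'half' nonlinearity."""
    return half_exp(V @ x) * half_exp(W @ x)

print(standard_attention_mix(x, V, W))
# Placeholder only: exp(z/2) is NOT exp^[1/2]; it just makes the sketch run.
print(comodulated_mix(x, V, W, half_exp=lambda z: np.exp(0.5 * z)))
```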
Generally speaking, it's been observed time and time again in neural network progress that building structure into the fundamental layer, and imposing restrictions such as tying parts to fixed roles, is counterproductive (compare random initializations vs. initialization by an expert hand, or tying features to outputs vs. letting them be learned from a blank slate). While it seems obvious to me that (a) looking beyond the multiplication of one linear transformation of x by an exponential of another linear transformation of x is ripe for improving network performance, and (b) restricting the roles of units (attendor vs. attendee) and building within-layer structure (these x multiply those x) have consistently been shown to be counterproductive for neural networks, I have yet to find a single instance of such an improvement in the literature since 2017, when I started looking.