Rho Thoughts
This is a reference sheet, mostly for myself but also for whoever cares to peruse it, on a general problem in mathematics that underpins biologically realistic neuromodulator action in artificial neural networks.
Related to the fundamental problem linked to from here:
Also see my github project on flow_forces for a great survey/summary of the half-exponential problem.
Find a function \(f(x)\) that admits continuous fractional iterates equaling \(e^x\), real or complex, for some amount of iteration.
That is, find \(\exp^{[\alpha]}\) for every \(\alpha \in (0,1)\), where \(f^{[n]}\) denotes function iteration, i.e. \(n\) compositions of a function (e.g., \(f^{[3]}:=f(f(f(x)))\)), and the iterates compose consistently: \(\exp^{[\alpha]}\circ\exp^{[\beta]}=\exp^{[\alpha+\beta]}\) with \(\exp^{[1]}=\exp\).
A simpler version of the problem: Find f such that \(f(f(x)) = e^x\).
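For concreteness, here is a minimal numerical sketch of one (far from canonical) solution family: seed an Abel function \(A\), satisfying \(A(\exp(x)) = A(x) + 1\), with the arbitrary choice \(A(x)=x\) on \([0,1)\), extend it by the functional equation, and set \(\exp^{[\alpha]}(x) := A^{-1}(A(x)+\alpha)\). The function names below (`abel`, `exp_iter`) are mine, not standard.

```python
import math

def abel(x):
    """Abel function A for exp, with the (arbitrary) linear seed A(x) = x on [0, 1)."""
    k = 0
    while x >= 1.0:        # pull x down into [0, 1) with logs
        x = math.log(x)
        k += 1
    while x < 0.0:         # push x up into [0, 1) with exp (one step suffices)
        x = math.exp(x)
        k -= 1
    return x + k

def abel_inv(y):
    """Inverse of abel(); valid on its range y > -1."""
    k = math.floor(y)
    x = y - k              # lands in [0, 1), where the seed is the identity
    for _ in range(k):
        x = math.exp(x)
    for _ in range(-k):
        x = math.log(x)
    return x

def exp_iter(x, alpha):
    """Fractional iterate exp^[alpha](x) := A^{-1}(A(x) + alpha)."""
    return abel_inv(abel(x) + alpha)

f = lambda x: exp_iter(x, 0.5)              # a half-exponential
print(f(f(0.3)), math.exp(0.3))             # agree up to float rounding
print(exp_iter(exp_iter(2.0, 0.25), 0.75),  # iterates compose: 0.25 + 0.75 = 1
      math.exp(2.0))
```

This family is continuous and composes consistently, but the linear seed leaves kinks at the seams \(x = 0, 1, e, e^e, \dots\); choosing the seed so that the result is smooth, analytic, or otherwise canonical is exactly where the hard part of the problem lives.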
The following problems are all equivalent, by various justifications:
- interpolating between addition and multiplication monotonically and continuously (see the sketch after this list);
- finding continuous fractional iterates of the exp function;
- interpolating the id and exp functions;
- interpolating the id and log functions;
- optimizing the information-entropy impact of a single neuron in a neural network layer, at least for uniform input distributions;
- finding a time complexity class between \(O(\log^{[m+1]}(n))\) and \(O(\log^{[m]}(n))\) for all \(m \in \mathbb{N}\);
- find a satisfying continuous analog of the Ackermann function;
uncertain equivalences:
- finding the possibly unique functional class that is analytic and monotonic;
- disprove the continuum hypothesis (not equivalent, but the two are bidirectionally informative);
- finding the eigenfunction for \(\rho\): \(\mathbb{E}\rho x = \lambda \mathbb{E}x\), \(x \in\) distribution \(\mathfrak{D}^\alpha\), where \(\alpha\) and/or \(\lambda\) lies in a space of fractional measure \(\mu \notin \mathbb{N}\);
hand-wavy equivalences (don't quote me!):
- finding a commutative exponential function, so-to-speak;
- re-characterize 'continuous' (or holomorphic where applicable) interpolation;
- interpolate a sum and an integral;
- find a particular 'exponential functional' / 'most beautiful' quasi-function / principal golden branch ...
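As a quick illustration of the first equivalence above (interpolating addition and multiplication), any choice of \(\exp^{[\alpha]}\) induces a one-parameter family of operations \(a \oplus_\alpha b := \exp^{[\alpha]}\big(\exp^{[-\alpha]}(a) + \exp^{[-\alpha]}(b)\big)\), which is ordinary addition at \(\alpha=0\) and multiplication at \(\alpha=1\). A sketch reusing `exp_iter` from the block above (and therefore inheriting its arbitrary seed):

```python
def interp_op(a, b, alpha):
    """a (+_alpha) b := exp^[alpha]( exp^[-alpha](a) + exp^[-alpha](b) )."""
    g = lambda t: exp_iter(t, -alpha)   # exp^[-alpha], a fractional log
    return exp_iter(g(a) + g(b), alpha)

print(interp_op(2.0, 3.0, 0.0))   # 5.0 -> plain addition
print(interp_op(2.0, 3.0, 1.0))   # 6.0 -> exp(log 2 + log 3), i.e. multiplication
print(interp_op(2.0, 3.0, 0.5))   # the half-way operation; its exact value, and whether
                                  # it varies monotonically in alpha, depend on the seed
```

Whether such a family can be made monotone and canonical in \(\alpha\) is one way of restating the whole problem.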
related problems, auxiliary problems, partial solutions:
- finding *rational* fractional iterates of the exp (or, equivalently, log) function;
- the ModIAN problem of assigning credit over a cross product during backpropagation
- the ModIAN-based conjecture that one can construct a partially-pi, partially-sigma 'layer' using common tensors
- finding a particular, sufficient generalization of generalized Dirichlet series
- find the best possible solution using only tetration
- resolve exp\(^{[\alpha]}\) for \(\alpha>1\)
- characterize the possible roles of hypercomplex numbers
- trade off the measure of a vector field's dimensions against the measure of the set of values the field takes on, under some constraint of fixed information entropy (cf. continuum hypothesis)
- find the best possible entropy of neurons in a layer (relatively easy)
other resolutions:
- find a satisfactory invertible transform
- data-driven learned models given by trained neural networks, svms
- identify a bio-realistic dynamical system that numerically implements a partial exponential function
- test fitness of different numerical solutions on some target task, such as one employing neuromodulation
- resolve how the continuity is not tautological with the need for a parameter; uniqueness
- find alternative conditions besides finding exp\(^{[\alpha]}\) that implement systematic, consistent neuromodulation
- find a curve C and a force 'inverse-metric' (such as inverse-distance-squared) whereby, if two collections of fixed but force-projecting points form the Id line and the Exp curve in their densest limit (i.e. integrate over both), the curve C remains unchanged when subjected to the combined force of the Id and Exp points. Such a C would satisfy the time complexity class between \(O(\log^{[m+1]}(n))\) and \(O(\log^{[m]}(n))\) under m+1 transformations of the space into logspace. Can elementary functions regress C perfectly?
- related to the point above: find the points that are fixed under this force subjection, and contrast them with points that 'flow' along a curve C. Is there more than one fixed point? Are there points that form small cycles, chaotic cycles, or whatnot?
some important knowns:
- such a fractional iterate of exp is not expressible with an arithmetic series endowed with exp(x) or log(x) on \(\mathbb{R}\)
- - i.e., exp\(^{[\alpha]}\) is transcendental in a
- is contingent on the use of distributions and Jensen's inequality
- the answers for \(\mathbb{C}\) and \(\mathbb{R}\) are related but quite different problems
- exp(x) - 1 has a fixed point (at 0; see the series sketch after this list)
- tetration algebra is linear like exp and x and +
- - power function(al)s behave well under tetration
- the axiom of choice can't be assumed de novo! (???? IDK about this)
- the infinite iteration of exp\(^{[-\epsilon]}\) and the infinite iteration of log should converge, even if \(\epsilon\) is irrational
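The fixed point noted above is what makes \(e^x - 1\) so much friendlier than \(e^x\): at \(0\) the equation \(f(f(x)) = e^x - 1\) can be solved order by order as a formal power series, \(f(x) = x + x^2/4 + x^3/48 + \dots\). A small exact-arithmetic sketch of that computation (the resulting series should be treated as formal/asymptotic, not convergent):

```python
from fractions import Fraction
from math import factorial

N = 6  # truncation order

def compose(outer, inner, n):
    """Coefficients of outer(inner(x)) mod x^(n+1); lists indexed by power, constant term 0."""
    out = [Fraction(0)] * (n + 1)
    power = [Fraction(0)] * (n + 1)
    power[0] = Fraction(1)                      # inner(x)^0 = 1
    for k in range(1, n + 1):
        nxt = [Fraction(0)] * (n + 1)           # power <- power * inner, truncated
        for i, a in enumerate(power):
            if a:
                for j, b in enumerate(inner):
                    if b and i + j <= n:
                        nxt[i + j] += a * b
        power = nxt
        for m in range(n + 1):
            out[m] += outer[k] * power[m]
    return out

g = [Fraction(0)] + [Fraction(1, factorial(k)) for k in range(1, N + 1)]   # e^x - 1

f = [Fraction(0), Fraction(1)] + [Fraction(0)] * (N - 1)                   # start from f(x) = x
for k in range(2, N + 1):
    # [x^k] of f(f(x)) is affine in f[k] with slope exactly 2; solve for f[k]
    base = compose(f, f, N)[k]
    f[k] = (g[k] - base) / 2

print(f[:5])   # [0, 1, 1/4, 1/48, 0]  ->  f(x) = x + x^2/4 + x^3/48 + ...
```

The same regular-iteration-at-a-fixed-point idea is what the complex-plane approaches to exp itself lean on, since exp has no real fixed point.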
Leads:
- understand solving the right Abel equation, starting with exp\((f(x))=f(x+1) \)
- solve for functors \(\eta^{\{N\}}_Y\circ F(f)=G(f)\circ\eta^{\{e^{-N}\}}_X\)
- study functional analysis, functional equations, infinite and fractional iterates of ~, fixed point theory
- study Knuth up-arrow notation, other hyperoperations, and algebras of them
- plot Godfrey and Gashler's functions
- examine why several plotted results (as in Urban and Smagt's Schröder and chi interpolations) aren't monotonic. > plot points without connecting lines?
- https://en.wikipedia.org/wiki/Large_countable_ordinal
- dissect what a neural network function does to approximate an exp\(^{[\alpha]}\) (pending Surya or Geoff figuring out a framework for analyzing NNs)
- a nice little ML lead: analyze generalization over training of a growing random forest
- look at graphs of convergence of various optimization approximations, under training for example.
- locate a hierarchy transcendent function that, loosely speaking, satisfies: \( \pm\alpha e^x \mp \beta x = f(x,\epsilon) \) as \(\alpha\rightarrow \epsilon\in(\alpha,0), \beta \rightarrow\alpha\).
- partial dimensional map-reduces for, e.g., approximately preserving information across all rho between sigma and pi
- infinitely partitioning linear matrix operations, ...
- The Steiner tree problem is NP-hard in general (though some variants and approximation algorithms run in quasi-polynomial time, 2^poly(log n)). It amounts to each node of a graph being in one of two sets: required terminals and optional Steiner points. If only two nodes are required, it reduces to the shortest path problem (far below O(E*V^2)); if all nodes are required, it reduces to the minimum spanning tree problem (O(m log n), or O(m + n log n) with better heaps). That is, at the two poles the problem is simpler. Compare: the information entropy is low on the two curves Id and Exp (etc.), but the space of possibilities which reduce to them in the limit cases has higher entropy, be it spatial randomness, location uncertainty / probability, or some kind of oscillation across dimensions. (E.g., as time t++, points pop up or bounce around in a way that's cyclic over time, which is important for them being stable in some sense.)
- work with Stirling's formula [\(\log(n!) = n \log(n) - n + O(\log(n))\)] \(\equiv\) [\( e^n \sim \sqrt{2\pi}\sqrt{n}(n^n)/(n!) \)]
- study what we empirically know about neuromodulation
- study fractional calculus, ugh
- study irrational powers and roots of various values, especially -1, and how they relate logic and analysis
- work on the polynomial series terms instead; something like: \(f(f(x)) = x^n/n! \)
- do to the Lambert function what the gamma function did to factorials
- apply the Euler-Maclaurin formula to the expansions of log, sine, exp-convolve-exp, variants of such, etc.
- study the algebraic changes when converting between a sum and an integral; e.g., gamma<->factorial, the Riemann zeta(?). Relatedly:
- - examine the interactions of particles at different time and space granularities (dendrite circuits vs neural circuits, slow- vs fast-acting receptors), especially the materials-science understanding of the effects of particulate size on dynamics
- analyze how the discretization of specific quantities changes outcomes (exp\(^{[\epsilon]}\) is convex!)
- what is \(\sum_{i=0}^{3}\mathrm{Ackermann}(a,b,i)\)? what about convolutions of related functions? do primitive recursive functions matter here?
- what's so special about 0.318+1.337i, the fixed point of complex log (and the limit of the iterates \(\log^{[n]}(x)\); see the small check after this list), and about other functions that converge in a similar way
- examine the first iterations of complex log. Especially, compare them to the gradual convergence from sigma to pi
- assuming an exp\(^{[\alpha]}\), look at space-transform equivalences. If sums in exp-space are full-rank-factorizable products, are there sums in exp\(^{[\alpha]}\)-space that can be fully factorized as well?
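A quick check of the complex fixed point flagged a few items above: iterating the principal branch of complex log from nearly any seed converges to \(z^* \approx 0.3181 + 1.3372i\), which satisfies \(\log z^* = z^*\) and hence \(e^{z^*} = z^*\); it is attracting for log and repelling for exp.

```python
import cmath

z = 1 + 1j                      # arbitrary seed away from the branch cut
for _ in range(200):
    z = cmath.log(z)            # principal branch
print(z)                        # ~ (0.31813 + 1.33724j)
print(abs(cmath.exp(z) - z))    # ~ 0: it is also a fixed point of exp
print(abs(1 / z))               # ~ 0.73 < 1: attracting for log, hence repelling for exp
```

As I understand it, this fixed point is the usual base point for Schröder-style regular iteration of exp over \(\mathbb{C}\) (and for Kneser's construction), which is one reason the complex and real versions of the problem feel so different.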
approximations:
- use arbitrary numerical approximation (e.g., piecewise-linear or Fourier-harmonic) on:
- - tables of input-output values
- - dynamical system constraints
- - smoothly swapping zeros of \(\Sigma(x)\) and \(\Pi(x)\) (given random x, which is needed for commutativity)
- identify an infinite binary string {L, R} for selecting and routing arithmetic-geometric mean
- find a closed-form solution of Urban and Smagt's 2016 chi function
- use Kneser's f(x)=exp(c+log(x))-esque Abel equation solution for tetration
- use polynomial regression
- use Carleman matrices (see the sketch after this list)
- expand Taylor series
- Cauchy integral? https://en.citizendium.org/wiki/Tetration#Polynomial_approximation
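Here is a sketch of the Carleman-matrix route from the list above, applied to \(g(x) = e^x - 1\) so that the fixed point at 0 keeps everything triangular; it assumes NumPy and SciPy are available. With \(M[j,k] = [x^k]\,g(x)^j\) one has \(M(h_1 \circ h_2) = M(h_1)\,M(h_2)\), so a matrix square root of \(M(g)\) encodes a formal half-iterate, and row 1 of that root holds the half-iterate's Taylor coefficients.

```python
import numpy as np
from math import factorial
from scipy.linalg import sqrtm

N = 8
g = np.array([0.0] + [1.0 / factorial(k) for k in range(1, N)])   # Taylor coeffs of e^x - 1

M = np.zeros((N, N))
M[0, 0] = 1.0                       # g(x)^0 = 1
row = M[0].copy()                   # running coefficients of g(x)^j
for j in range(1, N):
    nxt = np.zeros(N)
    for i in range(N):
        if row[i]:
            nxt[i:] += row[i] * g[:N - i]   # truncated multiplication by g
    row = nxt
    M[j] = row                      # M[j, k] = [x^k] g(x)^j

H = sqrtm(M)                        # principal root; real here since M is unit-triangular
print(np.real(H[1, :5]))            # ~ [0, 1, 0.25, 0.0208333, 0] = x + x^2/4 + x^3/48 + ...
```

The coefficients reproduce the order-by-order series from the earlier sketch, which is a decent cross-check of both.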
A thought on attention models, 2022. Transformers (Vaswani et al., 2017) successfully use attention mechanisms. Attention mechanisms are powerful, and Vaswani et al. demonstrated they can sometimes be sufficient for state-of-the-art deep-network performance without any of the other architectural improvements, such as convolutional or recurrent layers, made to the ANN model since its inception roughly 60 years ago. While their transformer attention component is egregiously and unnecessarily bloated (e.g. Tay et al., 2021, "Efficient Transformers"), it can still be gleaned that the core attentional interaction within the transformer mechanism is (Vx)*exp(Wx) for the input vector x and weight matrices V and W. Many other attention mechanisms in the literature likewise use an exponential function of the input vector x multiplied by the input x again (e.g. the Tay et al. surveys, 2020 and 2021). Pending a thorough lit review, I'd say almost all popular attention models use this fundamental schematic of multiplying x by itself with an exp in the middle somewhere. There are almost no exceptions to this form; almost always the exp is specifically a softmax (i.e., x*exp(x)/sum_X(exp(X)) ), with the notable exceptions that diverge from (though were inspired by) this direct model being the efficient low-rank and kernelized attention models such as Linformer (Wang et al., 2020).

Anthropomorphically, in (Vx)*exp(Wx) the "Vx" contribution is being modulated by the attention mechanism given by the "exp(Wx)" part. You could call Vx the 'attendee' and Wx, or exp(Wx), the 'attendor'. Intuitively, however, it's unclear why one contributing part should be tasked with the attendor role and the other with the attendee role - it's rather arbitrary and restrictive. Would (Vx)exp((Wx)exp(Ux)) make sense? Where do the arbitrary human decisions stop -- just whenever you feel like it? Indeed, in the human body, while it can be the case that one neurotransmitter fully modulates another, there are scenarios where two neurotransmitters mutually modulate each other in a more equal way, with one not clearly being the modulator of the other (e.g. some ionotropic ones).

Correspondingly, it's unclear why we can't have the two contributing parts Vx and Wx co-modulate - that is, give each some exp-like effect on the other but not a full effect, so that there's still 'material' to attend to. A functional square root of the exp function would be an intuitive substitute, creating instead an f(Vx)*f(Wx) where each part modulates the other based on its value while together they retain the functionality of (Vx)*exp(Wx) (a toy numerical sketch appears at the end of this note). The obvious stepping stone would then be to dismantle and equalize the structure further, so that all the inputs x_i of x each carry both an aspect of standard linear interaction (Wx) and some superlinear effect on each other, causing each input and weight to have a mix of attendor and attendee behaviour - a more nuanced refinement of attention as it currently stands. Furthermore, separating the propagation into Vx and Wx also goes against the trend of successful models; it may be best to somehow eliminate the distinct division into Vx and Wx in our attention models altogether. Generally speaking, it has been observed time and time again in neural network progress that building structure into the fundamental layer, and restrictions such as tying parts to fixed roles, is counterproductive (compare random initializations vs. initializing by an expert hand, or tying features to outputs vs. letting them be learned from a blank slate).

While it seems obvious to me that (a) looking beyond the product of one linear transformation of x with an exponential of another linear transformation of x is ripe for improving network performance, and (b) restricting the roles of units (attendor vs. attendee) and building within-layer structure (these x multiply those x) have consistently been shown to be counterproductive for neural networks, I have yet to find a single instance of such an improvement in the literature since 2017, when I started looking.
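Finally, the toy numerical sketch promised above: it reuses the half-exponential `exp_iter` from the first sketch in this note (so it inherits that construction's arbitrary seed), and the names `V`, `W`, `f_half` are purely illustrative, not drawn from any published model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
V = rng.normal(size=(d, d))
W = rng.normal(size=(d, d))

f_half = np.vectorize(lambda t: exp_iter(t, 0.5))   # a numerical half-exponential

standard = (V @ x) * np.exp(W @ x)        # the usual asymmetric attendee * exp(attendor)
co_mod   = f_half(V @ x) * f_half(W @ x)  # both parts carry half of the modulating role
print(standard)
print(co_mod)
```

This only makes the proposal concrete; whether f(Vx)*f(Wx) recovers useful attention behaviour, and how it interacts with the softmax normalization, is exactly the open empirical question.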