Hey Alan--can you help me check my understanding here?
From the paper, it looks like Definition 1 is what lets us say that the main "work" a neural network does is computing a similarity function K(x, xᵢ) between the query point x and the training examples.
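Just so we're looking at the same thing, here's the kernel machine form as I understand it (my paraphrase, so check me on this):

    y = Σᵢ aᵢ K(x, xᵢ) + b

where the sum runs over the training examples xᵢ -- so the prediction at a new x is literally a similarity-weighted vote over the training set.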
It's weird that aᵢ depends on x, but that dependence seems to come through the derivative of an ordinary loss function (MSE, cross-entropy, etc.) averaged along the training path -- and it wouldn't take away from the fundamental point that neural networks are still doing instance-based learning rather than some kind of magic "general function approximation," from what I can tell.
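Concretely, if I'm reading the main result right, the weights come out as something like

    aᵢ = -L̄'(yᵢ*, yᵢ)

i.e. minus the loss derivative for example i, averaged over the gradient descent path, with the average weighted by the tangent kernel K^g(x, xᵢ). If that's right, the kernel weighting is the only place x sneaks into aᵢ. (I'm using K^g for the instantaneous tangent kernel along the path -- not positive my notation matches the paper's.)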
I still find the proof a little hard to follow, but the step where you multiply and divide by a common term seems to track for me: since that term is pinned down by Definition 1, I don't see how you could swap the K(x, xᵢ) similarity function for some other arbitrary function. I could easily be missing something, though!
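For reference, here's my reading of that step, writing K^p(x, xᵢ) = ∫ K^g(x, xᵢ) dt for the path kernel (again, my paraphrase -- correct me if I've mangled it). Under gradient flow the chain rule gives

    dy/dt = -Σᵢ L'(yᵢ*, yᵢ) K^g(x, xᵢ)

and integrating over training,

    y = y₀ - ∫ Σᵢ L'(yᵢ*, yᵢ) K^g(x, xᵢ) dt.

Multiplying and dividing each term by K^p(x, xᵢ) then gives

    y = Σᵢ [ -∫ L' K^g dt / ∫ K^g dt ] K^p(x, xᵢ) + y₀ = Σᵢ aᵢ K^p(x, xᵢ) + b,

which is also exactly where the x-dependence of aᵢ comes from: K^g(x, xᵢ) sits inside that weighted average. Does that match your reading?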