Re: So how does it work?
I don't think so. Viterbi decoders (someone correct me if I'm wrong here) are basically MLE (Maximum Likelihood Estimation) processes for Markov models (HMMs or MEMMs or otherwise): given a sequence of observations, they recover the single most probable sequence of hidden states.
This article (and some other bits I've skimmed about SpeechMatics) mentions deep recurrent neural networks. Like a Markov model, a neural network has a bunch of hidden nodes connected by weighted edges; but where an MM's edge weights are transition probabilities between states, a NN's edge weights scale the signals flowing between nodes, and each node fires when the weighted sum of its inputs crosses a threshold (or passes through some other nonlinear activation). So with a NN you can have multiple nodes "active" at any step, whereas an MM has only one current state.
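To make the one-current-state vs. many-active-nodes contrast concrete, here's a throwaway NumPy sketch (all the numbers are made up for illustration):

```python
import numpy as np

# Markov model: the state is a single index. One step selects exactly
# one next state (here greedily, for simplicity).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])            # T[i, j] = P(next state j | state i)
state = 0
state = int(np.argmax(T[state]))      # still exactly one active state

# Neural net layer: edge weights are arbitrary reals (rows need not sum
# to 1), and after thresholding any subset of units may be active at once.
W = np.array([[0.7, -0.3, 0.5],
              [0.2,  0.9, -0.1]])     # weights from 2 inputs to 3 units
x = np.array([1.0, 0.5])
h = np.maximum(W.T @ x, 0.0)          # ReLU threshold on each unit's sum
active = int((h > 0).sum())           # here all 3 units fire simultaneously
```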
Then add recurrence - feedback connections that carry a node's activations forward as input at the next time step - and you have a recurrent neural network: its internal state is constantly changing as input streams in, as opposed to the typical MM, which just steps through a fixed state graph. (Back-propagation, the bit Hinton's talking about, is the training procedure that sets the edge weights; for RNNs it's "back-propagation through time." Once trained, the weights are typically as static as an MM's, unless the system keeps adapting online.)
The Viterbi algorithm is basically a tweak to the Forward algorithm: where the Forward algorithm computes the total probability of an observation sequence by summing over all possible hidden-state paths, Viterbi finds the single most probable hidden-state path by taking a max where Forward takes a sum. I think in theory it should be possible to convert any RNN to an MM (and thus to a Viterbi decoder), but you'd have a combinatorial increase in the number of states.
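For the curious, here's a toy HMM showing the sum-vs-max relationship between the two algorithms (the probabilities are invented for illustration):

```python
import numpy as np

# Toy HMM: 2 hidden states, 2 observation symbols. All numbers made up.
pi = np.array([0.6, 0.4])           # initial state distribution
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])         # A[i, j] = P(next state j | state i)
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])         # B[i, k] = P(observe symbol k | state i)

def forward(obs):
    """Total probability of the observation sequence (SUMS over paths)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # sum over predecessors
    return alpha.sum()

def viterbi(obs):
    """Most probable hidden-state path (takes a MAX over paths instead)."""
    delta = pi * B[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = delta[:, None] * A          # scores[i, j]: best path i -> j
        back.append(scores.argmax(axis=0))   # remember best predecessor
        delta = scores.max(axis=0) * B[:, o] # max over predecessors
    path = [int(delta.argmax())]
    for ptr in reversed(back):               # trace the best path backwards
        path.append(int(ptr[path[-1]]))
    return path[::-1]
```

The only structural difference between the two functions is `sum` vs. `max` (plus the bookkeeping Viterbi needs to recover the winning path).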
Viterbi decoders are still used in ML, for example in the simpler stages of natural language processing such as part-of-speech identification (for many applications, not universally). But there's too much ambiguity in natural language to do much NLP with them. A lot of simple sentiment analysis used to be done with HMMs or MEMMs, for example, but they can't cope with sarcasm, references to other subjects, etc. (There was a good paper in CACM a few years ago about a new approach to implementing Rhetorical Structure Theory for sentiment analysis that demonstrates some of the issues.)
One of the things that's interesting to me about SpeechMatics is that they're using a deep RNN stack rather than deep convolutional neural nets. CNNs were the Next Big Thing for a while, with e.g. Google pushing them heavily. (Extremely simplified: CNNs use convolution layers - learned filters that act as signal-shape matchers - in place of fully-connected layers. They're still trained with back-prop; the convolution is about how the layers are wired, not how they learn.) But there's so much research in ML that deep-learning stack architecture has become really complex, with dozens of layers that mix various kinds of networks with mixing layers and such.
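To illustrate the "signal-shape matcher" intuition, here's a tiny made-up example: sliding a fixed kernel along a 1-D signal. The response peaks exactly where the signal locally matches the kernel's shape. (In a real CNN the kernel values are learned, not hand-set like this.)

```python
import numpy as np

# A signal containing two copies of a triangular bump, and a kernel
# shaped like that bump. Numbers are arbitrary, chosen for clarity.
signal = np.array([0., 0., 1., 2., 1., 0., 0., 0., 1., 2., 1., 0.])
kernel = np.array([1., 2., 1.])               # the "shape" we look for

response = np.correlate(signal, kernel, mode="valid")
peaks = np.flatnonzero(response == response.max())
# peaks marks the offsets where the signal best matches the kernel:
# both places the bump occurs.
```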
And now we have things like generative adversarial networks, where one network (the generator) forges data, another network (the discriminator) tries to tell the forgeries from real samples, and the two train against each other. Interesting stuff.
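That adversarial alternation can be caricatured in a few lines. This is a deliberately silly 1-D toy with made-up update rules - not real GAN losses or gradients - but it shows the two players taking turns: the discriminator adjusts to separate real from fake, then the generator adjusts to fool it.

```python
import numpy as np

rng = np.random.default_rng(0)

REAL_MEAN = 3.0                                # the "real" data distribution

def sample_real(n):
    return rng.normal(REAL_MEAN, 0.2, n)

def generate(n, shift):
    # Toy generator: a single learnable shift applied to noise.
    return rng.normal(0.0, 0.2, n) + shift

shift, threshold = 0.0, 0.0
for _ in range(200):
    real = sample_real(64)
    fake = generate(64, shift)
    # Discriminator step: place its decision threshold between the two
    # sample means, the best it can do to separate them.
    threshold = (real.mean() + fake.mean()) / 2.0
    # Generator step: nudge the fakes toward the side the critic
    # currently calls "real".
    shift += 0.1 * np.sign(real.mean() - fake.mean())
# By now the generator's output is centered near the real data, and the
# discriminator can no longer separate the two.
```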