All we really care about is joules per inference. Raw processing power is more of a pissing contest. Google is offering the TPU 2.0 at about half the price of Nvidia hardware on AWS for the same amount of work.
I am curious how much further Google was able to lower that with the TPU 3.0. It appears Google is now two generations ahead of Nvidia, as Nvidia has yet to catch up to the TPU 2.0.
It seems pretty obvious why Google needed the TPUs. They are doing the most real-sounding text to speech I have ever heard, using a neural network that generates audio at 16k samples a second. The computational power needed to get that better result would be just incredible, and the problem is that the old method used very, very little computational power. BTW, I would now consider text to speech a solved problem.
So Google had to significantly lower the cost of the compute, in joules, or they would not have been able to offer the service at a competitive price. There is just no third-party silicon that can do that. State of the art is Nvidia, and they do not have anything close as of today.
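To make the joules argument concrete, here is a rough back-of-the-envelope sketch. It assumes the network is evaluated once per audio sample, which is my assumption and not something Google has confirmed here. The function names and every number you pass in are hypothetical placeholders, not measurements of any TPU or GPU.

```python
SAMPLE_RATE_HZ = 16_000  # audio at 16k samples a second, as mentioned above


def joules_per_inference(avg_power_watts: float, seconds: float, inferences: int) -> float:
    """Energy per inference: average power times elapsed time, divided by inferences done."""
    if inferences <= 0:
        raise ValueError("inferences must be positive")
    return avg_power_watts * seconds / inferences


def joules_per_audio_second(joules_per_sample: float,
                            sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Energy to synthesize one second of audio, ASSUMING one network evaluation per sample."""
    return joules_per_sample * sample_rate_hz


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    per_sample = joules_per_inference(avg_power_watts=100.0, seconds=10.0, inferences=1000)
    print(per_sample, joules_per_audio_second(per_sample))
```

The point is just that whatever one sample costs gets multiplied by 16,000 for every second of speech, so halving joules per inference halves the cost of the whole service.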
It is also how they are able to do the John Legend voice. We now have 6 voices, but I would not expect to see a ton of them. The issue is that most of the power used today goes to memory access, and moving each voice's model into memory would be too expensive.
I will be most interested to see how long it takes for anyone else to offer what Google is doing in this area. They just keep raising the bar. We need someone to do it. Apple appears to have fallen asleep. MS could have been doing it, but without mobile it kind of just does not matter.