Quite a bit happening this week! A big part of this is the aftermath of the NIPS deadline, as papers slowly trickle onto arXiv, but so much else is happening too. As it is, I'm getting buried under papers.
The Future of Go Summit
by Google DeepMind
AlphaGo at the Future of Go Summit, 23-27 May 2017
by Google DeepMind
AlphaGo's next move
by Google DeepMind
The biggest event of the week has to be DeepMind's Future of Go Summit, with AlphaGo once again challenging the top minds in Go. AlphaGo's victory was decisive. It won 3-0 against Ke Jie, the top-ranked player in the world, as well as a match against a team of five human players (it's less clear whether the latter is that good a measure of AlphaGo's prowess, considering its less conventional format). All games are available to watch on YouTube, and DeepMind has posted some well-deserved, self-congratulatory write-ups about AlphaGo's performance and its path forward. In short, it seems they intend to retire AlphaGo as a competitive player and use it more as a learning tool for the Go community. Meanwhile, the DeepMind team will begin to shift their focus to more general learning problems, using their experience with Go as a starting point.
Unfortunately, no code or papers for now. DeepMind does promise a final paper later this year, though.
It's a little disappointing that this round of matches got a lot less fanfare than the Lee Sedol ones, though I guess at that point DeepMind had already demonstrated its human-expert-level capability. Its later performance online under the "Master" pseudonym, with an apparently spotless 60-0 record, probably cemented the idea in the Go community that the result was a given.
In any case, a big shout-out to DeepMind for pushing the boundaries of machine learning in Go, a game once considered "unsolvable" by computers, and also handling the roll-out and publicity of the matches excellently. They have earned their place in history.
OpenAI continues to make good on its mission to promote more open deep learning research, this time by publishing baseline models and algorithms for standard reinforcement learning problems.
This is hugely important. Deep learning is hard to get right, and deep reinforcement learning is even harder. The less interpretable and more black-box-like the models are, and the more high-dimensional the problems get, the harder and slower it is to tell whether there is a bug in your implementation or your evaluation of your model. Kudos to the OpenAI team.
The blog post is well worth a read - the best practices section boils down lessons the OpenAI folks probably spent months learning the hard way.
A quick round-up of papers I really haven't had time to read (carefully):
On-the-fly Operation Batching in Dynamic Computation Graphs
by Graham Neubig, Yoav Goldberg, Chris Dyer
Modern deep learning frameworks are pretty good about abstracting away much of the bookkeeping around computation graphs (most importantly gradient computation), but even today batches remain front-and-center in every deep learning framework when designing your model programmatically. This paper seeks to perform batching automatically for dynamic computation graphs (i.e. PyTorch-style frameworks), in the same vein as gradients - "define for one; apply to batches". This is done by identifying operations with the same "signatures" (i.e. ones designed to really be the same operation) and executing them in order of 1) having no unresolved dependencies and 2) average depth (the latter being a heuristic). This would not only make the definition of networks simpler, it could also provide optimizations and improvements in computation time.
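To make the scheduling idea concrete, here's a minimal sketch of that greedy, agenda-based heuristic as I understand it. The `Node` class and all names are my own simplification, not the paper's implementation; real frameworks would launch one fused kernel per batch rather than just record a schedule.

```python
from collections import defaultdict

class Node:
    """A toy graph node: an op with a batching signature and input deps."""
    def __init__(self, signature, deps):
        self.signature = signature  # e.g. ("matmul", (64, 128))
        self.deps = deps            # nodes this op consumes
        self.depth = 1 + max((d.depth for d in deps), default=0)
        self.done = False

def batch_schedule(nodes):
    """Greedily emit batches of same-signature ops with no unresolved deps."""
    pending = list(nodes)
    schedule = []
    while pending:
        # 1) only ops whose inputs are already computed are candidates
        ready = [n for n in pending if all(d.done for d in n.deps)]
        groups = defaultdict(list)
        for n in ready:
            groups[n.signature].append(n)
        # 2) heuristic: run the ready group with the smallest average depth
        sig, batch = min(groups.items(),
                         key=lambda kv: sum(n.depth for n in kv[1]) / len(kv[1]))
        schedule.append((sig, len(batch)))
        for n in batch:
            n.done = True
            pending.remove(n)
    return schedule
```

With two independent embedding lookups each feeding a matmul, the scheduler should collapse four node-level launches into two batched ones.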
Unfortunately the authors' implementation is in DyNet, which I have little experience with. The PyTorch folks seem interested though.
PixColor: Pixel Recursive Colorization
by Sergio Guadarrama, Ryan Dahl, David Bieber, Mohammad Norouzi, Jonathon Shlens, Kevin Murphy
Another paper with a simple trick and great results. The problem being tackled here is automatic colorization of pictures. PixelCNNs (and PixelRNNs) have been pretty good for these tasks, but they are slow, and the resulting images thus tend to be fairly small. The simple solution proposed here is to run the colorization PixelCNN at a small output resolution (thus getting the desired colors), and then have a separate network scale it up to a higher resolution. The inspiration for this apparently comes from humans generally perceiving color at a lower spatial frequency (i.e. the resolution of our "color recognition" is pretty low), and apparently JPEG already takes advantage of this in its compression!
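That underlying observation is easy to demonstrate on its own. This isn't the PixColor model, just a toy numpy illustration of why predicting color at low resolution and upscaling afterwards is a reasonable decomposition: a smooth, low-frequency color field survives a round-trip through quarter resolution almost unchanged (the luma/chroma split here is deliberately crude; the paper works in a proper color space).

```python
import numpy as np

def downsample_2x(channel):
    """Average 2x2 blocks, halving each spatial dimension."""
    h, w = channel.shape
    return channel.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample_2x(channel):
    """Nearest-neighbour upsampling back to twice the resolution."""
    return channel.repeat(2, axis=0).repeat(2, axis=1)

# A smooth, low-spatial-frequency "chroma" field: one sine period across
# a 64x64 image, constant along rows.
xs = np.tile(np.arange(64) / 64.0, (64, 1))
chroma = 0.5 * np.sin(2 * np.pi * xs)

# Round-trip through quarter resolution (two 2x steps each way).
low = downsample_2x(downsample_2x(chroma))
restored = upsample_2x(upsample_2x(low))
err = np.abs(restored - chroma).max()  # stays well under 0.1
```

Had `chroma` been high-frequency noise instead (like fine luminance detail), the round-trip error would be large - which is exactly why the expensive PixelCNN pass can be spent at low resolution.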
Snapshot Ensembles: Train 1, get M for free
by Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, Kilian Q. Weinberger
Yet another paper with a simple trick and great results, though this one has potentially more theory to it. Normally in training we throw in some kind of SGD optimizer, train till convergence, then call it a day. If we want an ensemble of networks (not a "pseudo-ensemble" like dropout), we'll need to do this $N$ times. This paper proposes an alternative approach for getting ensembles without having to train start-to-finish multiple times. The basic idea is to train with a cyclic (think cosine) learning rate: first going from fast to slow (so we sort of reach a minimum), taking a snapshot, and then jacking it back up to a fast learning rate again (so we hop out of that minimum, presumably headed toward another one). In this manner, we get multiple local minima that we can ensemble. Amazingly, we can do this in approximately the same time as training a single network start-to-end. I'm guessing that's because, since we ensemble multiple minima, we're less concerned with getting any one minimum just right and can thus train less carefully/more quickly.
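The cyclic schedule is simple enough to sketch. This is my paraphrase of the shifted-cosine annealing with warm restarts that the paper builds on, not their code; function names and the iteration-indexed formulation are my own.

```python
import math

def snapshot_lr(t, total_iters, n_cycles, lr_max):
    """Cosine-annealed learning rate that restarts n_cycles times.

    Within each cycle the rate decays from lr_max toward 0 along a half
    cosine; at the cycle boundary it jumps back up to lr_max, kicking
    the optimizer out of the current minimum.
    """
    cycle_len = math.ceil(total_iters / n_cycles)
    pos = t % cycle_len  # position within the current cycle
    return lr_max / 2 * (math.cos(math.pi * pos / cycle_len) + 1)

def snapshot_points(total_iters, n_cycles):
    """Iterations at which to save a snapshot: the end of each cycle,
    when the learning rate (and hopefully the loss) is at its lowest."""
    cycle_len = math.ceil(total_iters / n_cycles)
    return [c * cycle_len - 1 for c in range(1, n_cycles + 1)]
```

With, say, 1000 iterations and 5 cycles, the rate starts each 200-iteration cycle at `lr_max`, decays to nearly zero by iteration 199, and snaps back; the five saved snapshots are what get ensembled at test time.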
Look, Listen and Learn
by Relja Arandjelović, Andrew Zisserman
Yet another paper with a simple trick and great results. The context here is that we have a lot of audio/video content that isn't labeled, so ordinarily we could only do unsupervised training. The idea is to manufacture a labeled dataset for supervised training: separate the audio and video, present them separately to the network (where they may or may not come from the same time slice), and ask the network to guess whether the audio and video match. In this way, we're forcing the network to learn how well the audio and video should match, and getting some semantic content out of it. The model can then be transfer-learned to other pure audio/visual tasks. Might be a good idea to run on the YouTube-8M dataset.
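The self-supervised labeling trick can be sketched in a few lines. This is a hypothetical data-pipeline fragment of my own, not the paper's code: every (frame, audio) pair drawn from the same moment of the same clip is a positive example, and pairing the frame with audio from a different clip gives a free negative - no human labels required.

```python
import random

def make_avc_pairs(clips, rng):
    """Build (frame, audio, label) training pairs.

    clips: list of (video_frames, audio_segments) tuples, where the two
    lists within each clip are aligned by time index.
    """
    pairs = []
    for i, (frames, audio) in enumerate(clips):
        t = rng.randrange(len(frames))
        # positive: frame and audio taken from the same time slice
        pairs.append((frames[t], audio[t], 1))
        # negative: same frame, audio drawn from some other clip
        j = rng.choice([k for k in range(len(clips)) if k != i])
        other_audio = clips[j][1]
        pairs.append((frames[t], rng.choice(other_audio), 0))
    return pairs
```

A two-stream network (one vision tower, one audio tower, fused at the top) then just trains as a binary classifier on these labels, and the towers pick up transferable features along the way.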
More PyTorch Tutorials
by Yunjey Choi
I've been having a ton of fun with PyTorch (being able to debug models in-Python is such a game-changer), but it's still a relatively nascent framework, with best practices for many common setups still being worked out. Tutorials and examples like this are a godsend for learning PyTorch and learning it well. I'll find myself working through many of these in the coming weeks, I presume.
Contents of this post are intended for entertainment, and only secondarily for information purposes. All opinions, omissions, mistakes and misunderstandings are my own.