Jason Phang, Fri 30 June 2017, Machine learning mailing list

After some consideration, I will be shortening / condensing this newsletter in favor of more consistency and complete coverage of news items. There may be 1-2 news items that I decide to dive into, but I'll be moving toward having shorter summaries of things I've read (especially when most of them are just short summaries + "you should read this!"). This post will sort of be a hybrid format between the two.

In other news, I will be going to the Deep RL Bootcamp in August! Can't wait!

A quick round-up of papers I really haven't had time to read (carefully):

**Do GANs actually learn the distribution? An empirical study**

*by Sanjeev Arora, Yi Zhang*

This paper proposes a new experiment for determining whether GANs are actually capable, empirically, of learning a full distribution (as some theory posits), or if they are only able to generate visually appealing but semantically uninteresting samples - in other words, very similar outputs to the training set. The experiment proposed is pretty straightforward and based on the birthday paradox - sample N images from the GAN generator, and see how many pairs within those N are really similar (first based on Euclidean distance, then human judgment). The idea is that if we start to see almost-duplicates even within that small a sample, then the generator is likely still collapsing to several modes (with some visually appealing noise).
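To make the first (Euclidean distance) stage of that test concrete, here is a minimal sketch. It assumes samples are plain lists of floats; the names `closest_pairs` and `has_near_duplicate` and the `threshold` parameter are made up for illustration and are not from the paper. In the paper, the closest pairs are then handed to a human for judgment, which this sketch does not attempt.

```python
import math
from itertools import combinations


def closest_pairs(samples, top_k=3):
    """Return the top_k closest pairs as (distance, i, j), nearest first."""
    pairs = [
        (math.dist(a, b), i, j)
        for (i, a), (j, b) in combinations(enumerate(samples), 2)
    ]
    pairs.sort()
    return pairs[:top_k]


def has_near_duplicate(samples, threshold):
    """True if any two samples are closer than `threshold` (a hypothetical cutoff)."""
    return closest_pairs(samples, top_k=1)[0][0] < threshold


# Toy "generator output": four 2-D samples, two of which nearly coincide.
samples = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [9.0, 1.0]]
print(closest_pairs(samples, top_k=1))
print(has_near_duplicate(samples, threshold=0.5))
```

The brute-force pairwise comparison is quadratic in N, which is fine here because the whole point of the birthday-paradox argument is that N can stay small.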

I'll wait for the rest of the community to determine if this is some kind of damning test, but it should be a relatively low bar for GANs to clear once they are modified with this test in mind.

**GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium**

*by Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, Sepp Hochreiter*

More theory and interesting work from Hochreiter! This paper has several small but somewhat meaningful contributions. First and most importantly, they show that under several reasonable conditions, notably Lipschitz gradients (so no ReLUs) and different learning rates for the generator and discriminator, GAN training indeed converges to a Nash equilibrium. The condition of different learning rates seems minor but is necessary - with the same update rates, they cannot show convergence. The key is that we need the faster update to converge while the slower update is still mobile. The other two contributions are analogizing the update dynamics of Adam to a "Heavy Ball with Friction" (HBF), and a new metric for judging the quality of GAN images. The latter two I am less qualified to comment on, but this is still a good dense read.
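As a rough illustration of what "two time-scales" means mechanically, here is a toy sketch. Everything in it is an assumption for illustration: the objective f(g, d) = 0.5g² + gd - 0.5d² is a made-up scalar min-max game standing in for the GAN loss, and the learning rates are arbitrary. This toy game is well-behaved enough to converge with equal rates too, so it shows only the shape of the update rule, not the paper's result.

```python
def ttur_toy(lr_gen=0.01, lr_disc=0.1, steps=2000):
    """Simultaneous gradient descent-ascent with separate learning rates.

    Toy objective (an assumption, not from the paper):
        f(g, d) = 0.5 * g**2 + g * d - 0.5 * d**2
    The "generator" g minimizes f; the "discriminator" d maximizes it.
    The unique equilibrium is (g, d) = (0, 0).
    """
    g, d = 1.0, 1.0
    for _ in range(steps):
        grad_g = g + d  # df/dg
        grad_d = g - d  # df/dd
        g -= lr_gen * grad_g   # slow time scale: descent step
        d += lr_disc * grad_d  # fast time scale: ascent step
    return g, d


print(ttur_toy())
```

In a real GAN the same idea amounts to giving the discriminator and generator optimizers different learning rates.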

**Attention Is All You Need**

*by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin*

This paper seems to build further on the attention mechanism by having a model run primarily off it. One interesting and obvious innovation is the addition of "multi-head attention". Take the RNN case - attention normally just learns a softmax over the history of hidden activations. Well, multi-head attention learns multiple such softmaxes! I would not be surprised if multi-head attention becomes more widely adopted after this. The model within this paper does still involve several other network architecture elements, like position-wise feed-forward layers (which the paper notes can also be described as convolutions with kernel size 1), so I will need to study it further. In any case, TensorFlow and PyTorch implementations have already appeared, so those would be good references.
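To make "multiple such softmaxes" concrete, here is a minimal pure-Python sketch of scaled dot-product attention with several heads, for a single query. One loud simplification: the paper gives each head its own learned linear projections of queries, keys and values, whereas this sketch just slices each vector into equal chunks as a stand-in. The function names are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return context, weights


def split_heads(vec, num_heads):
    """Slice a vector into num_heads equal chunks (stand-in for learned projections)."""
    size = len(vec) // num_heads
    return [vec[h * size:(h + 1) * size] for h in range(num_heads)]


def multi_head_attend(query, keys, values, num_heads):
    """Run one softmax per head, then concatenate the per-head contexts."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    context, all_weights = [], []
    for h in range(num_heads):
        ctx, weights = attend(
            q_heads[h], [k[h] for k in k_heads], [v[h] for v in v_heads]
        )
        context.extend(ctx)
        all_weights.append(weights)
    return context, all_weights


# Two positions in the "history"; each head ends up favoring a different one.
keys = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
context, weights = multi_head_attend([4.0, 0.0, 4.0, 0.0], keys, keys, num_heads=2)
print(weights)
```

The point of the toy input is that the two heads put most of their weight on different positions, which a single softmax over the full vectors could not do.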

**Device Placement Optimization with Reinforcement Learning**

*by Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, Jeff Dean*

Apparently you can learn how best to allocate your deep learning computations on GPUs via deep learning?

**CleverHans** - Blog Post

*by Ian Goodfellow, Nicolas Papernot*

A project by Ian Goodfellow and others chipping away at the problem of adversarial examples and attacks on deep learning models. Research has shown that it's pretty easy to generate adversarial examples for a model, and subsequent attempts to build defenses against them have been unsuccessful. The theory around this field is still relatively nascent, but I suspect we will see more in the near future. In the meantime, this library provides some useful tools for benchmarking the vulnerability of networks to adversarial examples.
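For a sense of how easy "pretty easy" is, here is a minimal sketch of one well-known attack, the fast gradient sign method (FGSM), against a tiny logistic-regression "model". This is not the CleverHans API; the weights, input and `eps` are made-up values, and the function names are hypothetical. The attack nudges each input feature by `eps` in the direction that increases the loss.

```python
import math


def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def sign(value):
    return (value > 0) - (value < 0)


def fgsm(weights, bias, x, label, eps):
    """Fast gradient sign method for the cross-entropy loss.

    For logistic regression, d(loss)/dx = (p - label) * weights.
    """
    p = predict(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Made-up model and input, for illustration only.
weights, bias = [2.0, -3.0, 1.0], 0.0
x, label = [1.0, -1.0, 0.5], 1

x_adv = fgsm(weights, bias, x, label, eps=1.0)
print(predict(weights, bias, x), predict(weights, bias, x_adv))
```

With these toy numbers the perturbation is large relative to the input; on image models the unsettling part is that a perturbation too small to see can be enough.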

**Performance RNN** - GitHub

*by Magenta (Google)*

Magenta has more cool applications of neural networks to music generation. This time, it's composing its own notes and also outputting notes at a more varied time scale.

**keras-vis (Keras Visualization Toolkit)** - GitHub

*by Raghavendra Kotikalapudi*

A great new tool for visualizing networks, weights and gradients for models in Keras. I appear to have gotten off the Keras boat right as it became widely adopted (this toolkit, CoreML). I should get back on.

**BAIR Blog: Learning to reason with Neural Module Networks**

*by Berkeley AI Research*

BAIR (Berkeley AI Research) is starting its own research blog! Its first article is not all that interesting, but I am looking forward to more coming from Berkeley.

**A Review of Neural Network Architectures**

*by Eugenio Culurciello*

A pretty good review article on the history of image-oriented network architectures. Not a whole lot of new content here, but it's a pretty good review with some short commentary. It also brought to my attention two review papers (1, 2) that I want to check out.

**A guide to receptive field arithmetic for Convolutional Neural Networks**

*by Dang Ha The Hien*

Good article on calculating the receptive fields of convolutional kernels. The formula for the number of output features in a convolutional layer is pretty well known (in one dimension, $n_\text{out} = \left\lfloor \frac{n_\text{in} + 2p - k}{s} \right\rfloor + 1$, see paper), but this article goes over the computation of the receptive field (how much of the original input is covered by nested convolutional layers) for a single convolutional output. This arithmetic is often overlooked given that most convolutional layers are pre-built nowadays, but it is nevertheless useful for deeper analysis of ConvNets, or for networks that need to be careful about their receptive field, such as the causal convolutions in WaveNet.
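The arithmetic is short enough to sketch. The version below tracks three quantities per layer in one dimension: the number of output features, the "jump" (distance in input pixels between adjacent features), and the receptive field size. The recurrences are the standard ones (jump multiplies by the stride; the receptive field grows by (k - 1) times the incoming jump); the class and function names, and the two example layers, are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerGeometry:
    n: int     # number of features along one dimension
    jump: int  # spacing between adjacent features, in input pixels
    rf: int    # receptive field size, in input pixels


def apply_layer(geom, kernel, stride, padding):
    """Geometry after one conv or pooling layer."""
    n_out = (geom.n + 2 * padding - kernel) // stride + 1
    jump_out = geom.jump * stride
    rf_out = geom.rf + (kernel - 1) * geom.jump
    return LayerGeometry(n_out, jump_out, rf_out)


def trace(n_in, layers):
    """Apply a list of (kernel, stride, padding) layers to an input of width n_in."""
    geom = LayerGeometry(n=n_in, jump=1, rf=1)
    history = [geom]
    for kernel, stride, padding in layers:
        geom = apply_layer(geom, kernel, stride, padding)
        history.append(geom)
    return history


# Example stack: an 11-wide stride-4 conv followed by a 3-wide stride-2 pool.
for geom in trace(227, [(11, 4, 0), (3, 2, 0)]):
    print(geom)
```

Note how the second layer's small 3-wide kernel still adds 8 pixels to the receptive field, because each of its inputs is already 4 input pixels apart.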

**What I've learned about neural network quantization**

*by Pete Warden*

A wonderful article about quantizing neural networks (operating on them using fewer bits, for memory and computational efficiency). I have not studied much about the topic on my own, but this article does give a very solid rundown of the tips and observations the author has gathered both from experience and from the literature. Worth a read if this is something you may look into in the future.
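For anyone who hasn't seen quantization before, here is a minimal sketch of the basic idea under simple assumptions: a linear mapping from a known float range onto 8-bit integer codes, and back. The function names, the example weights and the chosen range are made up for illustration, and real frameworks layer a lot more on top of this.

```python
def quantize(values, lo, hi, num_bits=8):
    """Map floats in [lo, hi] onto integer codes 0 .. 2**num_bits - 1."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    if scale <= 0:
        raise ValueError("hi must be greater than lo")
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


# Made-up weights, quantized over the range [-1, 1].
weights = [-1.0, -0.3, 0.0, 0.42, 1.0]
codes, scale = quantize(weights, lo=-1.0, hi=1.0)
restored = dequantize(codes, lo=-1.0, scale=scale)
print(codes)
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

For values inside the range, the round-trip error is at most half a quantization step, which is the kind of noise the article argues networks tend to tolerate.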

**More Improved Training of Wasserstein GANs and DRAGAN**

*by Thomas Viehmann*

More on the training of WGANs and DRAGANs. I missed the boat entirely on DRAGANs, so I'll need to catch up on this.

**Notes on the Cramer GAN**

*by Arthur Gretton*

More deep, deep math on GANs and convergence, this time on the Cramer GAN. With this many mathematically oriented minds picking away at the problem, I would not be surprised if we soon have solid and ground-breaking theory on GANs.