Jason Phang, Sun 11 June 2017, Machine learning mailing list

A bit of drama (or discourse!) unfolded across the Deep Learning-scape over the weekend. It started with a blog post by Yoav Goldberg, critiquing a DeepAI paper and broader issues in the community's overwhelming use of arXiv for distributing preprints and preliminary research. His criticism of the paper, "Adversarial Generation of Natural Language", pertained to its grandiose title when the substance of the research is fairly thin, the novelties unremarkable, and research is nowhere near solving the problem the title names. He argues that the paper reflects a more pervasive habit among researchers (surprisingly, particularly at big research labs) of "flag-planting": getting preprints out as early and as quickly as possible, without significant improvements or peer review, in order to get credit (and citations) for subsequent developments in the field, even if the preprints don't make much of a contribution. He further points out that the authors barely scratched the surface of the problem domain in their evaluation section, and that this rather flippant publication disregards the significant work already put into the field by other approaches. In summary, in his view, people are gaming the arXiv preprint system, and he worries about its impact on budding researchers who come to see these preliminary works from big labs as the gold standard in research.

This argument seemed to resonate with many on the twitter-verse, though there was equally some blowback, and he followed up with some clarifications. More significantly, Yann LeCun swung back, not only defending arXiv for its significant contribution to open research in the field, but also criticizing Yoav for being overly defensive about his own field. Yann himself has had previous experience dealing with domains that are adversarial to newcomers, and sees Yoav as echoing that sentiment for criticizing the quickly iterative, if noisy, research paradigm in deep learning today.

The debate continues. Yoav responds defending his particular arguments against the abuse of arXiv. Others interjected (a little unrelatedly) by thoughtfully adding their history of language models, responding to Yoav's paper criticisms, and generally coming around to a balanced view on the whole issue.

As an outsider, it is hard for me to comment meaningfully on this issue. Nevertheless, I have been in awe of not only how quickly the field has iterated, but also how almost all entities involved, even the biggest companies, have been devoted to publicizing their research. Granted, a big part of this is for the PR benefit, and often the largest data sets are not released, but even so, the dedication to open sharing of research has been astounding and a great boon to the present and future development of the field. I have personally benefitted greatly from being able to access the latest papers and research on arXiv, and I can speak to how difficult it is in other fields (e.g. academic finance, which I once followed somewhat closely) to get the latest research while having to navigate paywalls and subscriptions. This letter is well worth the read for anyone who is interested or currently in the field, and I think we owe a lot to these early pioneers of open machine learning for taking a stand for the open sharing of research.

**Self-Normalizing Neural Networks**

*by Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter*

Playing out at the same time as the arXiv drama, this paper was published on arXiv and started sending waves throughout the online deep learning community. I'm talking within a day, tutorials started getting written.

The big new innovation here is the SELU, or **Scaled Exponential Linear Unit**. By now we're all familiar with ReLUs, the Rectified Linear Units:

$$ \text{ReLU}(x) = \max(0, x) $$

These are simple, computationally fast, and pretty effective. They're the building blocks of all modern deep learning networks (other than LSTM cells - even then, they still find use in the broader RNN model usually).

Sort of unrelated to this is the idea of **internal covariate shift**, which is a fancy term for describing the issue where, when we have many layers of networks, because we're doing many matrix multiplications in between, the scales of our activations can get out of hand quickly. One effective solution from 2015 was **batch normalization**, which is basically an additional operation we insert in between layers (around the same location as ReLUs, where there's some minor discussion about whether it happens before or after ReLUs), where we normalize the activations of that layer to something that has mean=0 and standard deviation=1. This is not so different from the way that we often normalize variables first before doing analyses on them, in more standard statistical problems. BatchNorm not only significantly improved training by reducing internal covariate shift, but it also had some auxiliary benefits like adding some regularization, since the normalization was performed *per-batch*, which means the randomness of SGD provided some regularizing effect.
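To make the normalization step concrete, here is a minimal NumPy sketch of what BatchNorm does to one minibatch at training time. The function name `batch_norm` and the constants `gamma`, `beta` and `eps` are my own illustrative choices (in a real layer, the scale and shift are learned and running statistics are tracked for inference), so treat this as a sketch rather than any framework's implementation.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) of a minibatch to mean 0, std 1.

    x has shape (batch_size, num_features). gamma and beta stand in for the
    learnable scale and shift; here they are plain constants (an assumption).
    """
    mean = x.mean(axis=0, keepdims=True)   # per-feature mean over the batch
    var = x.var(axis=0, keepdims=True)     # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Activations whose scale has "gotten out of hand".
rng = np.random.default_rng(0)
activations = rng.normal(loc=3.0, scale=10.0, size=(256, 4))
normalized = batch_norm(activations)
```

Because the mean and variance are computed from whichever samples happen to land in the batch, the output for a given sample changes slightly from batch to batch, which is where the regularizing noise comes from.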

So back to SELUs. The authors (by the way, Sepp Hochreiter is the co-creator of the LSTM) basically asked "is it possible to have an activation function that encourages activations to look statistically like standard normals, without any additional operations?" In other words, can we get something BatchNorm-like, without having to go through BatchNorm and just using a different activation function?

The answer is yes. That's SELU.

$$ \text{SELU}(x) = \lambda \begin{cases} x, &\text{if }x>0 \\ \alpha e^x-\alpha, &\text{otherwise} \end{cases} $$

Through some math (a 90-page appendix!), they show that this has exactly the properties we want, for the right values of $\lambda$ and $\alpha$. Now what are those values? It turns out they're:

$$ \begin{aligned} \alpha &= 1.6732632423543772848170429916717 \\ \lambda &= 1.0507009873554804934193349852946 \end{aligned} $$

(Don't worry, there *are* closed form solutions for them!)

Granted, SELUs will be more computationally expensive than ReLUs. But they're *much* cheaper than BatchNorm, and elegant in their own way. The results so far seem pretty robust across tasks, at least for pure feed-forward neural networks. Deep learning people across the web quickly experimented and found that it delivered exactly as promised, maintaining activations with near standard normal distributions even across many layers. This is akin to ResNets in allowing for much deeper networks, but I think the simplicity and atomicity of this innovation means it could be even more significant.
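The kind of quick experiment people ran can be sketched in a few lines of NumPy. This is my own toy setup, not the authors' code: the layer width, depth and batch size are arbitrary, and I'm assuming weights drawn with zero mean and variance $1/n$, which is my understanding of the initialization the paper's analysis relies on.

```python
import numpy as np

# Constants from the text above, truncated to double precision.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """SELU(x) = lambda * x for x > 0, lambda * (alpha * e^x - alpha) otherwise."""
    # np.minimum keeps expm1 from overflowing on large positive inputs.
    negative_branch = ALPHA * np.expm1(np.minimum(x, 0.0))
    return LAMBDA * np.where(x > 0, x, negative_branch)

rng = np.random.default_rng(0)
n = 256                                   # layer width (arbitrary choice)
h = rng.standard_normal((1024, n))        # standard-normal inputs

# Push the batch through 32 random fully connected layers, no BatchNorm.
for _ in range(32):
    W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
    h = selu(h @ W)

final_mean, final_std = h.mean(), h.std()
print(f"mean={final_mean:.3f}, std={final_std:.3f}")
```

If the self-normalizing property holds up, the printed mean should sit near 0 and the standard deviation near 1, even though nothing in the loop explicitly normalizes anything.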

What a day.

**Improved Training of Wasserstein GANs / WGAN-GP** - GitHub

*by Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville*

**The Cramer Distance as a Solution to Biased Wasserstein Gradients / Cramer-GAN**

*by Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, Rémi Munos*

In the further evolution of GANs, two recent competing papers have come out, both significantly improving the training stability of GANs. Both have good theory behind them, and more importantly, are highly accessible reads. We will have to wait for the dust to settle to determine what the best way of training GANs will be, but in the meantime, it is always good to have more theory!

Wasserstein-GANs enforce the Lipschitz constraint via weight-clipping, which can lead to unstable or pathological results. WGAN-GP (WGANs with Gradient Penalties) directly incorporates the Lipschitz constraint by adding an additional cost term penalizing violation of the Lipschitz constraint:

$$ L = \underbrace{ \underset{ \tilde{x} \sim \mathbb{P}_g }{ \mathbb{E} }{ \big[D(\tilde{x})\big] } - \underset{ x \sim \mathbb{P}_r }{ \mathbb{E} }{ \big[D(x)\big] } }_{ \text{WGAN critic loss} } + \underbrace{ \lambda \underset{ \hat{x} \sim \mathbb{P}_{\hat{x}} }{ \mathbb{E} }{ \big[ \big( ||\nabla_{\hat{x}}D(\hat{x})||_2 - 1 \big)^2 \big] } }_{ \text{Gradient Penalty} } $$

If the previous paper shows that the Lipschitz constraint of WGANs can be targeted more directly, the Cramer-GAN paper goes even further to show that Wasserstein distance may not be the ideal probability metric. The authors argue that there are 3 ideal properties of a probability metric:

- Scale sensitivity,
- Sum invariance and
- Unbiased sample gradients (for SGD).

KL-divergence has properties 2 and 3, while Wasserstein distance has 1 and 2. All 3 are desirable for stable training. The authors in turn propose the use of Cramer distance, which satisfies all three properties.

$$ \text{Cramer Distance} = l_2^2(P,Q) = \int_{-\infty}^{\infty}\big( F_P(x) - F_Q(x) \big)^2 dx $$

where $F_P$ and $F_Q$ are cumulative distribution functions for two distributions. (Note, it's the square root of the Cramer distance that is a proper distance metric).
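For intuition, the integral above can be evaluated exactly when both distributions are 1-D empirical samples, since the empirical CDFs are step functions. The sketch below is my own illustration of the formula (the function name `cramer_distance` is made up), not the estimator the paper actually trains with.

```python
import numpy as np

def cramer_distance(samples_p, samples_q):
    """Squared l2 (Cramer) distance between two 1-D empirical distributions."""
    p = np.sort(np.asarray(samples_p, dtype=float))
    q = np.sort(np.asarray(samples_q, dtype=float))
    grid = np.sort(np.concatenate([p, q]))
    # Empirical CDFs F_P and F_Q evaluated at every grid point.
    cdf_p = np.searchsorted(p, grid, side="right") / p.size
    cdf_q = np.searchsorted(q, grid, side="right") / q.size
    # Both CDFs are constant on [grid[i], grid[i+1]), so the integral is a sum
    # of (F_P - F_Q)^2 times the width of each interval.
    widths = np.diff(grid)
    return np.sum((cdf_p[:-1] - cdf_q[:-1]) ** 2 * widths)

# Two point masses one unit apart: the CDFs differ by 1 on an interval of length 1.
print(cramer_distance([0.0], [1.0]))
```

Taking `np.sqrt` of the result gives the quantity that behaves as a proper metric, per the note above.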

I really like this paper. The liberal use of math in this paper appears intimidating but is surprisingly accessible. The theory behind GAN training is getting ever more established, and this is a good thing.

A quick round-up of papers I really haven't had time to read (carefully):

**Learning to Compute Word Embeddings on the Fly**

*by Dzmitry Bahdanau, Tom Bosc, Stanisław Jastrzębski, Edward Grefenstette, Pascal Vincent, Yoshua Bengio*

Word embeddings like *GloVe* and *word2vec* have been a stunningly successful approach to simplifying sequence tasks by first projecting words to a smaller, fixed-dimension space before carrying out any sequence-level processing, but there nevertheless remain some faults in the approach. One such weakness is the poor ability to handle rare words in the vocabulary. Without sufficient observations of these rare words, the embedding function is unable to generate a good embedding for those words. In this paper, the authors propose an approach where, for rare words, they train a separate embedding model that takes in the definition, or other auxiliary information about the rare word, and substitute its output for the word embedding if necessary. Granted, it sounds a little like cheating in moving away from the truly "end-to-end" approach of language learning, but it is not so different from how people deal with rare words (look it up in the dictionary!). The results do not seem ground-breaking but do indicate some improvement - I believe similar approaches of incorporating auxiliary data for rare words will evolve out of this one.

**The Atari Grand Challenge Dataset**

*by Vitaly Kurin, Sebastian Nowozin, Katja Hofmann, Lucas Beyer, Bastian Leibe*

A large dataset of human playthroughs of a subset of the usual suite of Atari games for RL. This data is collected via a web application that emulates Atari games in-browser and records human playthroughs. This data would be useful for alternative forms of RL, such as Inverse Reinforcement Learning and Imitation Learning. Presumably, this could also speed up the training of standard RL agent models, which otherwise start playing the game completely ignorant and could spend many cycles stuck doing nonsensical actions. As with Google's *Quick, Draw!*, I'm always excited for applications that incorporate human interactions to collect useful data sets. This tends to be much more efficient than going out to specially collect data, especially when couched in the right manner where people will willingly supply their interactions.

**Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour**

*by Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He*

FAIR has a nice report on how it managed to train a competitive ImageNet model in 1 hour. Note that this used to take weeks. There's not a lot of novel research here; it's more a bag of tricks and observations than a groundbreaking piece of research. The main trick seems to be to crank up the minibatch-size (from 256 to 8192 samples per minibatch), while making the appropriate adjustments. In particular, one key is to scale up the learning rate linearly with the increase in minibatch-size (say, by a factor of $k$). This works (approximately) because each minibatch can be taken as an independent sample, so as long as the gradients don't change much over those steps, one big step is sort of equivalent to just doing the $k$ smaller minibatch updates at once. This gets tricky with other training regimes (e.g. BatchNorm, momentum-based optimizers), and the report covers several tricks around that. I'm not deep into optimization myself, but it's worth a quick read, as these tricks may trickle down into common practice soon.
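As a rough sketch of the linear scaling idea: multiply a reference learning rate by $k$, and ramp up to it gradually over the first few epochs, since the "gradients don't change much" assumption is shakiest early in training. The function name, the reference rate of 0.1 per 256 samples and the 5-epoch warmup below are illustrative defaults I picked for this sketch, not a faithful reproduction of the report's schedule.

```python
def scaled_learning_rate(batch_size, epoch, base_lr=0.1,
                         base_batch_size=256, warmup_epochs=5):
    """Linear learning-rate scaling with a linear warmup.

    All default values are assumptions for illustration. `epoch` may be
    fractional so the warmup can be applied per iteration.
    """
    k = batch_size / base_batch_size      # how much larger the minibatch is
    target_lr = base_lr * k               # linear scaling rule
    if epoch < warmup_epochs:
        # Ramp linearly from the reference rate up to the scaled rate.
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr

for epoch in [0, 1, 2.5, 5, 10]:
    print(epoch, scaled_learning_rate(8192, epoch))
```

With a minibatch of 8192 against a reference of 256, $k = 32$, so the rate climbs from the reference value to 32 times that value by the end of the warmup.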

**TensorFlow Implementation of Tacotron**

*by Kyubyong*

Unfortunately, the authors of Tacotron, the ridiculously named but surprisingly competitive text-to-speech model, have yet to release an official version of their code, so we currently have to make do with people attempting to replicate their results.

I probably missed this before, but this seems to be a new major release for TorchCraft, an interface for doing RL on StarCraft: Brood War with Torch.

Turn all (anime) girls to silver hair!

**Google Brain Residency**

*by Ryan Dahl*

This is an insightful and honest recounting of one researcher's time at his Google Brain Residency, a year-long apprenticeship at Google Brain. The author recounts his experiments, thought processes, and more importantly, mis-steps and failures. I like this article a lot. Most of what we see in research are the "cherry-picked" highlight reels of months and years of slogging away at a problem and chasing down dead-ends, and it's refreshing to read about someone else's negative results, if only to put ours in context. Definitely worth a read.

**Convolutional Methods for Text**

*by Tal Perry*

Here's a pretty insightful blog post talking pretty generally about the application of CNNs to sequence tasks. We've seen this approach before, and from what I see this tends to be pushed more by the Facebook/Yann school. While some CNN architectures are competitive with RNNs, RNNs still tend to dominate benchmarks. That may simply be because more time and resources have been put into optimizing RNNs for sequence tasks. Using CNNs does come with several advantages, however. Frameworks and hardware are better optimized for convolutional operations, convolutional networks are better understood and easier to conceptualize, and there are other potentially interesting theoretical/intuitive properties too (e.g. conditioning on the entire sequence rather than simply on the previous time-steps). Anyway, back to this blog post - it's a pretty good overview of the background and foundations of applying CNNs to text, and also covers several other network architectures (ResNets, DenseNets, WaveNets) and what they brought to the table. A great read for an intermediate reader.

**Exploring LSTMs**

*by Edwin Chen*

A good exploration of how LSTMs work, along with a deep dive into a toy example. This post focuses a lot on building intuition, but the toy example shows some astoundingly interpretable results.

Good lord, I remember when I used to write up a paragraph for each of these. But with all that's been happening, this will have to do:

- A Deep Reinforcement Learning Bootcamp, taught by some of the most prominent names in the field. Worth checking out if you have the time/money to go.
- At WWDC, Apple announced Core ML, a framework for integrating machine learning models into iOS apps. It supports models from a good (but incomplete) range of frameworks, including Caffe2, Keras, XGBoost, scikit-learn and LIBSVM.
- Microsoft releases CNTK 2.0
- OpenAI has some new research on training multiple agents, who learn pretty nifty behaviors like "letting one guy be the decoy".