12 Jan 2018 # **A Tutorial On Boundary-Seeking Generative Adversarial Networks**

Devon Hjelm and Athul Paul Jacob

In this post, we discuss our paper, Boundary-Seeking Generative Adversarial Networks which was recently accepted as a conference paper to ICLR 2018.

Generative adversarial networks(a.k.a GANs)[Goodfellow et. al., 2014] is a unique generative learning framework that uses two separate models, a generator and discriminator, with opposing or *adversarial* objectives. Training a GAN only requires back-propagating a learning signal that originates from a learned objective function, which corresponds to the loss of the discriminator trained in an adversarial manner. This framework is powerful because it trains a generator without relying on an explicit formulation of the probability density, using only samples from the generator to train. We will give a brief overview of this framework.

Suppose we have empirical samples from a distribution \(\mathbb{P}\), \({x^{(i)} \in \mathcal{X}}_{i=1}^M\), where \(\mathcal{X}\) is the domain of images, word or character-based representations of natural language etc.

Our objective is then to find an induced distribution, \(\mathbb{Q}_\theta\) that describes these samples well.

We start with a simple prior, \(h(z)\) which is typically a gaussian or a uniform distribution and two *players*: The **generator** (\(\mathbb{G}_{\theta}: \mathcal{Z} \rightarrow \mathcal{X}\)) and the **discriminator** (\(\mathbb{D}_{\phi}: \mathcal{X} \rightarrow \mathcal{Z}\))

The goal of the generator is to find the parameters \(\theta\) to find the induced distribution \(\mathbb{Q}_\theta\). While, the goal of the discriminator is to classify the real and fake(generated) samples correctly.

In essense, the generator is trained to fool the discriminator into thinking that the generated samples comes from the true distribution \(\mathbb{P}\). While, the discriminator, is trained to distiguish between the samples from \(\mathbb{P}\) and the samples from \(\mathbb{Q}_\theta\).

This can then be formalized as a minimax game:

GANs have been shown to generate often-diverse and realistic samples even when trained on high-dimensional large-scale continuous data. GANs however have a serious limitation on the type of variables they can model, because they require the composition of the generator and discriminator to be fully differentiable.

With discrete variables, this is not true. For instance, consider using a step function at the end of a generator in order to generate a discrete value. In this case, back-propagation alone cannot provide the training signal, because the derivative of a step function is 0 almost everywhere. This is problematic, as many important real-world datasets are discrete, such as character- or word-based representations of language.

Let’s take a step back and try to understand one of the key choices that differentiates GANs. Suppose we are given two probability distributions, \(\realprob\) and \(\genprobp\). We need a *difference* measure between these distributions. The goal is to essentially, find \(\theta\) such that this difference measure is minimized. There are a variety of difference measure families such as:

**f-divergences:**KL-divergence, reverse KL-divergence, Hellinger distance, Total variation distance, \(\mathcal{X}^{2}\)-divergence etc.**IPM:**Kantorovich (Wasserstein dual), MMD, Fisher distance etc.

In the next section, we will focus on the \(f\)-divergence family of distances.

Suppose, we introduce a **convex** (lower semi-continuous) **function**:

The divergence generated by \(f\):

Note that,

Examples: KL, Jensen-Shannon, Squared Hellinger, Pearson \(\mathcal{X}^2\) etc.

This is the foundation of many generative learning algorithms.

Consider the **convex conjugate** of \(f\), \(\fdc\) and a family of functions, \(\SN\).

Then the dual form of the \(f\)-divergence is:

However, finding this supremum is hard. So we can instead use a family of neural networks(classifiers), \(\SN_{\phi}(x)\) which can be learnt!.

Hence, we have:

Now, we can proceed to show how BGAN can handle discrete data \TODO

Given a perfect discriminator (i.e \(\SNo\)), we show in Theorem 1 of our paper that,

Now we don’t have a perfect discriminator (i.e \(\SNo\)), but we do have a sub-optimal discriminator(i.e \(\SNp(x)\)). Let \(\iw(x) = (\pd{\fdc}{\SN})(\SNp(x))\) and \(\beta = \EE_{\genprob_{\dparams}}[\iw(x)]\) be a partition function. Then we can have a \(\fd\)-divergence importance weight estimator, \(\realdensest(x)\) as:

With this importance weight estimator, we have an option for training the generator in an adversarial way with the gradient of the \(KL\):

This gradient resembles other importance sampling methods for training generative models in the discrete setting. However, the variance of this estimator will be high, as it requires estimating the partition function, \(\beta\) (Through say, Monte-Carlo sampling). We address reducing this issue next.

We can use the expected KL over the conditionals instead! And so, we have,

We can then derive the normalized weights:

And the new partition function:

So now, the gradient for training the generator becomes:

From figure 7, we can see that this new policy gradient estimator (bold) has lower variance than the original policy gradient estimator (dashed) in estimating \(2 \times JSD - log 4\).

This can be implemented in Pytorch as:

REINFORCE is a common technique for dealing with discrete data in GANs. The lower-variance policy gradient estimator described above is a policy gradient in the special case that the reward is the normalized importance weights. This reward approaches the likelihood ratio in the non-parametric limit of an optimal discriminator. And so, we also make a connection to REINFORCE as it is commonly used, with baselines, by deriving the gradient of the reversed KL-divergence:

For continuous variables, minimizing the variational lower-bound suffices as an optimization technique as we have the full benefit of back-propagation to train the generator parameters, \(\gparams\). However, while the convergence of the discriminator is straightforward, to our knowledge there is no general proof of convergence for the generator except in the non-parametric limit or near-optimal case. What’s worse is the value function can be arbitrarily large and negative. Let us assume that \(\max \SN = M < \infty\) is unique. As \(\fdc\) is convex, the minimum of the lower-bound over \(\gparams\) is:

The generator objective is optimal when the generated distribution, \(\QQ_{\gparams}\), is nonzero only for the set \(\{x \mid \SN(x) = M \}\). Even outside this worst-case scenario, the additional consequence of this minimization is that this variational lower-bound can become looser w.r.t. the \(\fd\)-divergence, with no guarantee that the generator would actually improve. Generally, this is avoided by training the discriminator in conjunction with the generator, possibly for many steps for every generator update. However, this clearly remains one source of potential instability in GANs. And so, we can instead aim for the decision boundary and this can improve stability. We observe that for a given estimator \(\realdensest(x)\), \(\gendensp(x)\) matches when \(\iw(x) = (\pd{\fdc}{\SN})(\SN(x)) = 1\). And so, we can define the continuous BGAN objective as:

This objective can be seen as changing a concave optimization problem (which is poor convergence properties) to a convex one.

Our paper presents 10 sets of experiments. The code is available both in Pytorch as well as in Theano.

Reinterpreting the generator objective to match the proposal target distribution reveals a novel learning algorithm for training a generative adversarial network (GANs, Goodfellow et al., 2014). This proposed approach of boundary-seeking provides us with a unified framework under which learning algorithms for both discrete and continuous variables are derived. Empirically, we verified our approach quantitatively and showed the effectiveness of training a GAN with the proposed learning algorithm, which we call a boundary-seeking GAN (BGAN), on both discrete and continuous variables, as well as demonstrated some properties of stability.