diff --git a/tutorial/source/svi_part_i.ipynb b/tutorial/source/svi_part_i.ipynb index abd3a9009a..742fce54ff 100644 --- a/tutorial/source/svi_part_i.ipynb +++ b/tutorial/source/svi_part_i.ipynb @@ -25,9 +25,29 @@ "\n", "1. we can sample from each $p_i$\n", "2. we can compute the pointwise log pdf $p_i$ \n", - "3. $p_i$ is differentiable w.r.t. the parameters $\\theta$\n", - "\n", - "\n", + "3. $p_i$ is differentiable w.r.t. the parameters $\\theta$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "import os\n", + "import torch\n", + "import torch.distributions.constraints as constraints\n", + "import pyro\n", + "from pyro.optim import Adam\n", + "from pyro.infer import SVI, Trace_ELBO\n", + "import pyro.distributions as dist" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## Model Learning\n", "\n", "In this context our criterion for learning a good model will be maximizing the log evidence, i.e. we want to find the value of $\\theta$ given by\n", @@ -45,34 +65,64 @@ "$$ p_{\\theta_{\\rm{max}}}({\\bf z} | {\\bf x}) = \\frac{p_{\\theta_{\\rm{max}}}({\\bf x} , {\\bf z})}{\n", "\\int \\! d{\\bf z}\\; p_{\\theta_{\\rm{max}}}({\\bf x} , {\\bf z}) } $$\n", "\n", - "Note that the denominator of this expression is the (usually intractable) evidence. Variational inference offers a scheme for finding $\\theta_{\\rm{max}}$ and computing an approximation to the posterior $p_{\\theta_{\\rm{max}}}({\\bf z} | {\\bf x})$. Let's see how that works.\n", - "\n", + "Note that the denominator of this expression is the (usually intractable) evidence. Variational inference offers a scheme for finding $\\theta_{\\rm{max}}$ and computing an approximation to the posterior $p_{\\theta_{\\rm{max}}}({\\bf z} | {\\bf x})$. Let's see how that works." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## Guide\n", "\n", "The basic idea is that we introduce a parameterized distribution $q_{\\phi}({\\bf z})$, where $\\phi$ are known as the variational parameters. This distribution is called the variational distribution in much of the literature, and in the context of Pyro it's called the **guide** (one syllable instead of nine!). The guide will serve as an approximation to the posterior.\n", "\n", "Just like the model, the guide is encoded as a stochastic function `guide()` that contains `pyro.sample` and `pyro.param` statements. It does _not_ contain observed data, since the guide needs to be a properly normalized distribution. Note that Pyro enforces that `model()` and `guide()` have the same call signature, i.e. both callables should take the same arguments. \n", "\n", - "Since the guide is an approximation to the posterior $p_{\\theta_{\\rm{max}}}({\\bf z} | {\\bf x})$, the guide needs to provide a valid joint probability density over all the latent random variables in the model. Recall that when random variables are specified in Pyro with the primitive statement `pyro.sample()` the first argument denotes the name of the random variable. These names will be used to align the random variables in the model and guide. To be very explicit, if the model contains a random variable `z_1`\n", - "\n", - "```python\n", + "Since the guide is an approximation to the posterior $p_{\\theta_{\\rm{max}}}({\\bf z} | {\\bf x})$, the guide needs to provide a valid joint probability density over all the latent random variables in the model. Recall that when random variables are specified in Pyro with the primitive statement `pyro.sample()` the first argument denotes the name of the random variable. These names will be used to align the random variables in the model and guide. 
To be very explicit, if the model contains a random variable `z_1` with a standard normal distribution" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "def model():\n", - " pyro.sample(\"z_1\", ...)\n", - "```\n", - "\n", - "then the guide needs to have a matching `sample` statement\n", - "\n", - "```python\n", + " return pyro.sample(\"z_1\", dist.Normal(0,1))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "then the guide needs to have a matching `sample` statement. The distributions can differ, but the names must line up one-to-one." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "def guide():\n", - " pyro.sample(\"z_1\", ...)\n", - "```\n", - "\n", - "The distributions used in the two cases can be different, but the names must line-up 1-to-1. \n", + " return pyro.sample(\"z_1\", dist.Beta(1,1))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that returning the result of `pyro.sample()` is not necessary but can be useful to generate samples.\n", "\n", "Once we've specified a guide (we give some explicit examples below), we're ready to proceed to inference.\n", "Learning will be setup as an optimization problem where each iteration of training takes a step in $\\theta-\\phi$ space that moves the guide closer to the exact posterior.\n", - "To do this we need to define an appropriate objective function. \n", - "\n", + "To do this we need to define an appropriate objective function. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## ELBO\n", "\n", "A simple derivation (for example see reference [1]) yields what we're after: the evidence lower bound (ELBO). The ELBO, which is a function of both $\\theta$ and $\\phi$, is defined as an expectation w.r.t. 
to samples from the guide:\n", @@ -92,28 +142,63 @@ "\n", "This KL divergence is a particular (non-negative) measure of 'closeness' between two distributions. So, for a fixed $\\theta$, as we take steps in $\\phi$ space that increase the ELBO, we decrease the KL divergence between the guide and the posterior, i.e. we move the guide towards the posterior. In the general case we take gradient steps in both $\\theta$ and $\\phi$ space simultaneously so that the guide and model play chase, with the guide tracking a moving posterior $\\log p_{\\theta}({\\bf z} | {\\bf x})$. Perhaps somewhat surprisingly, despite the moving target, this optimization problem can be solved (to a suitable level of approximation) for many different problems.\n", "\n", - "So at high level variational inference is easy: all we need to do is define a guide and compute gradients of the ELBO. Actually, computing gradients for general model and guide pairs leads to some complications (see the tutorial [SVI Part III](svi_part_iii.ipynb) for a discussion). For the purposes of this tutorial, let's consider that a solved problem and look at the support that Pyro provides for doing variational inference. \n", - "\n", + "So at high level variational inference is easy: all we need to do is define a guide and compute gradients of the ELBO. Actually, computing gradients for general model and guide pairs leads to some complications (see the tutorial [SVI Part III](svi_part_iii.ipynb) for a discussion). For the purposes of this tutorial, let's consider that a solved problem and look at the support that Pyro provides for doing variational inference. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## `SVI` Class\n", "\n", "In Pyro the machinery for doing variational inference is encapsulated in the `SVI` class.\n", "\n", - "The user needs to provide three things: the model, the guide, and an optimizer. 
We've discussed the model and guide above and we'll discuss the optimizer in some detail below, so let's assume we have all three ingredients at hand. To construct an instance of `SVI` that will do optimization via the ELBO objective, the user writes\n", - "\n", - "```python\n", - "import pyro\n", - "from pyro.infer import SVI, Trace_ELBO\n", - "svi = SVI(model, guide, optimizer, loss=Trace_ELBO())\n", - "```\n", - "\n", + "The user needs to provide three things: the model, the guide, and an optimizer. We've discussed the model and guide above and we'll discuss the optimizer in some detail below, so let's assume we have all three ingredients at hand. To construct an instance of `SVI` that will do optimization via the ELBO objective, the user defines an optimizer (see below) and a loss" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "optimizer = Adam({\"lr\": 0.0005, \"betas\": (0.90, 0.999)})\n", + "loss = Trace_ELBO()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "and then writes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "svi = SVI(model, guide, optimizer, loss)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "The `SVI` object provides two methods, `step()` and `evaluate_loss()`, that encapsulate the logic for variational learning and evaluation:\n", "\n", "1. The method `step()` takes a single gradient step and returns an estimate of the loss (i.e. minus the ELBO). If provided, the arguments to `step()` are piped to `model()` and `guide()`. \n", "\n", "2. The method `evaluate_loss()` returns an estimate of the loss _without_ taking a gradient step. 
Just like for `step()`, if provided, arguments to `evaluate_loss()` are piped to `model()` and `guide()`.\n", "\n", - "For the case where the loss is the ELBO, both methods also accept an optional argument `num_particles`, which denotes the number of samples used to compute the loss (in the case of `evaluate_loss`) and the loss and gradient (in the case of `step`). \n", - "\n", + "For the case where the loss is the ELBO, both methods also accept an optional argument `num_particles`, which denotes the number of samples used to compute the loss (in the case of `evaluate_loss`) and the loss and gradient (in the case of `step`). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## Optimizers\n", "\n", "In Pyro, the model and guide are allowed to be arbitrary stochastic functions provided that\n", @@ -127,36 +212,57 @@ "\n", "All of this is controlled by the `optim.PyroOptim` class, which is basically a thin wrapper around PyTorch optimizers. `PyroOptim` takes two arguments: a constructor for PyTorch optimizers `optim_constructor` and a specification of the optimizer arguments `optim_args`. At high level, in the course of optimization, whenever a new parameter is seen `optim_constructor` is used to instantiate a new optimizer of the given type with arguments given by `optim_args`. \n", "\n", - "Most users will probably not interact with `PyroOptim` directly and will instead interact with the aliases defined in `optim/__init__.py`. Let's see how that goes. There are two ways to specify the optimizer arguments. In the simpler case, `optim_args` is a _fixed_ dictionary that specifies the arguments used to instantiate PyTorch optimizers for _all_ the parameters:\n", - "\n", - "```python\n", - "from pyro.optim import Adam\n", - "\n", + "Most users will probably not interact with `PyroOptim` directly and will instead interact with the aliases defined in `optim/__init__.py`. Let's see how that goes. 
There are two ways to specify the optimizer arguments. In the simpler case, `optim_args` is a _fixed_ dictionary that specifies the arguments used to instantiate PyTorch optimizers for _all_ the parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "adam_params = {\"lr\": 0.005, \"betas\": (0.95, 0.999)}\n", - "optimizer = Adam(adam_params)\n", - "```\n", - "\n", + "optimizer = Adam(adam_params)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "The second way to specify the arguments allows for a finer level of control. Here the user must specify a callable that will be invoked by Pyro upon creation of an optimizer for a newly seen parameter. This callable must have the following signature:\n", "\n", "1. `module_name`: the Pyro name of the module containing the parameter, if any\n", "2. `param_name`: the Pyro name of the parameter\n", "\n", - "This gives the user the ability to, for example, customize learning rates for different parameters. For an example where this sort of level of control is useful, see the [discussion of baselines](svi_part_iii.ipynb). Here's a simple example to illustrate the API:\n", - "\n", - "```python\n", - "from pyro.optim import Adam\n", - "\n", + "This gives the user the ability to, for example, customize learning rates for different parameters. For an example where this sort of level of control is useful, see the [discussion of baselines](svi_part_iii.ipynb). 
Here's a simple example to illustrate the API:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "def per_param_callable(module_name, param_name):\n", " if param_name == 'my_special_parameter':\n", " return {\"lr\": 0.010}\n", " else:\n", " return {\"lr\": 0.001}\n", "\n", - "optimizer = Adam(per_param_callable)\n", - "```\n", - "\n", - "This simply tells Pyro to use a learning rate of `0.010` for the Pyro parameter `my_special_parameter` and a learning rate of `0.001` for all other parameters.\n", - "\n", + "optimizer = Adam(per_param_callable)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This simply tells Pyro to use a learning rate of `0.010` for the Pyro parameter `my_special_parameter` and a learning rate of `0.001` for all other parameters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "## A simple example\n", "\n", "We finish with a simple example. You've been given a two-sided coin. You want to determine whether the coin is fair or not, i.e. whether it falls heads or tails with the same frequency. You have a prior belief about the likely fairness of the coin based on two observations:\n", @@ -184,9 +290,15 @@ "source": [ "To learn something about the fairness of the coin that is more precise than our somewhat vague prior, we need to do an experiment and collect some data. Let's say we flip the coin 10 times and record the result of each flip. 
In practice we'd probably want to do more than 10 trials, but hey this is a tutorial.\n", "\n", - "Assuming we've collected the data in a list `data`, the corresponding model is given by\n", - "\n", - "```python\n", + "Assuming we've collected the data in a list `data`, the corresponding model is given by" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "import pyro.distributions as dist\n", "\n", "def model(data):\n", @@ -199,14 +311,24 @@ " for i in range(len(data)):\n", " # observe datapoint i using the bernoulli \n", " # likelihood Bernoulli(f)\n", - " pyro.sample(\"obs_{}\".format(i), dist.Bernoulli(f), obs=data[i])\n", - "```\n", - "\n", + " pyro.sample(\"obs_{}\".format(i), dist.Bernoulli(f), obs=data[i])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "Here we have a single latent random variable (`'latent_fairness'`), which is distributed according to $\\rm{Beta}(10, 10)$. Conditioned on that random variable, we observe each of the datapoints using a bernoulli likelihood. Note that each observation is assigned a unique name in Pyro.\n", "\n", - "Our next task is to define a corresponding guide, i.e. an appropriate variational distribution for the latent random variable $f$. The only real requirement here is that $q(f)$ should be a probability distribution over the range $[0.0, 1.0]$, since $f$ doesn't make sense outside of that range. A simple choice is to use another beta distribution parameterized by two trainable parameters $\\alpha_q$ and $\\beta_q$. Actually, in this particular case this is the 'right' choice, since conjugacy of the bernoulli and beta distributions means that the exact posterior is a beta distribution. In Pyro we write:\n", - "\n", - "```python\n", + "Our next task is to define a corresponding guide, i.e. an appropriate variational distribution for the latent random variable $f$. 
The only real requirement here is that $q(f)$ should be a probability distribution over the range $[0.0, 1.0]$, since $f$ doesn't make sense outside of that range. A simple choice is to use another beta distribution parameterized by two trainable parameters $\\alpha_q$ and $\\beta_q$. Actually, in this particular case this is the 'right' choice, since conjugacy of the bernoulli and beta distributions means that the exact posterior is a beta distribution. In Pyro we write:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "def guide(data):\n", " # register the two variational parameters with Pyro.\n", " alpha_q = pyro.param(\"alpha_q\", torch.tensor(15.0), \n", @@ -214,9 +336,13 @@ " beta_q = pyro.param(\"beta_q\", torch.tensor(15.0), \n", " constraint=constraints.positive)\n", " # sample latent_fairness from the distribution Beta(alpha_q, beta_q)\n", - " pyro.sample(\"latent_fairness\", dist.Beta(alpha_q, beta_q))\n", - "```\n", - "\n", + " pyro.sample(\"latent_fairness\", dist.Beta(alpha_q, beta_q))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "There are a few things to note here:\n", "\n", "- We've taken care that the names of the random variables line up exactly between the model and guide.\n", @@ -224,22 +350,35 @@ "- The variational parameters are `torch.tensor`s. The `requires_grad` flag is automatically set to `True` by `pyro.param`.\n", "- We use `constraint=constraints.positive` to ensure that `alpha_q` and `beta_q` remain non-negative during optimization.\n", "\n", - "Now we can proceed to do stochastic variational inference. \n", - "\n", - "```python\n", - "# set up the optimizer\n", - "adam_params = {\"lr\": 0.0005, \"betas\": (0.90, 0.999)}\n", - "optimizer = Adam(adam_params)\n", - "\n", + "Now we can proceed to do stochastic variational inference. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "# setup the inference algorithm\n", - "svi = SVI(model, guide, optimizer, loss=Trace_ELBO())\n", + "svi = SVI(model, guide, optimizer, loss)\n", + "\n", + "# create some data with 6 observed heads and 4 observed tails\n", + "data = []\n", + "for _ in range(6):\n", + " data.append(torch.tensor(1.0))\n", + "for _ in range(4):\n", + " data.append(torch.tensor(0.0))\n", "\n", "n_steps = 5000\n", "# do gradient steps\n", "for step in range(n_steps):\n", - " svi.step(data)\n", - "``` \n", - "\n", + " svi.step(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "Note that in the `step()` method we pass in the data, which then get passed to the model and guide. \n", "\n", "The only thing we're missing at this point is some data. So let's create some data and assemble all the code snippets above into a complete script:" @@ -298,7 +437,7 @@ " beta_q = pyro.param(\"beta_q\", torch.tensor(15.0), \n", " constraint=constraints.positive)\n", " # sample latent_fairness from the distribution Beta(alpha_q, beta_q)\n", - " pyro.sample(\"latent_fairness\", dist.Beta(alpha_q, beta_q))\n", + " return pyro.sample(\"latent_fairness\", dist.Beta(alpha_q, beta_q))\n", "\n", "# setup the optimizer\n", "adam_params = {\"lr\": 0.0005, \"betas\": (0.90, 0.999)}\n", @@ -377,7 +516,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.10" + "version": "3.9.1" } }, "nbformat": 4,