**References**

**Notes**

**Summary**:

**Key points**:

We want to learn a model $p_\theta (y \mid x)$ that approximates the true conditional distribution $p(y \mid x)$.

A good model should make the data look probable.

Assuming the data points in $\mathcal{D}$ are sampled independently, we choose $\theta$ such that $p(\mathcal{D})=\prod_{i} p\left(x_{i}\right) p_{\theta}\left(y_{i} \mid x_{i}\right)$ is maximized.

However, this raises a numerical problem: we are multiplying together many numbers that are typically less than one, so the product quickly underflows toward zero.

To solve the problem, we can take the $\log$, which converts multiplication into addition. Because $\log$ is monotonically increasing, the maximizing $\theta$ is unchanged.

$\log p(\mathcal{D})=\sum_{i} \left[\log p\left(x_{i}\right)+\log p_{\theta}\left(y_{i} \mid x_{i}\right)\right] =\sum_{i} \log p_{\theta}\left(y_{i} \mid x_{i}\right)+\text{const}$
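To see why the log form matters numerically, here is a minimal sketch. The per-example likelihood values are made up for illustration (2000 examples, each with likelihood 0.5); they do not come from any real model.

```python
import math

# Hypothetical per-example likelihoods: 2000 values, each equal to 0.5.
probs = [0.5] * 2000

# Direct product: 0.5**2000 is far below the smallest positive float,
# so the running product underflows to exactly 0.0.
product = math.prod(probs)

# Sum of logs: the same quantity on a log scale stays representable.
log_likelihood = sum(math.log(p) for p in probs)

print(product)         # the product has lost all information
print(log_likelihood)  # roughly 2000 * log(0.5)
```

Once the product hits zero, every $\theta$ looks equally bad, while the summed log values can still be compared.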

$\theta^{\star} \leftarrow \arg \max _{\theta} \sum_{i} \log p_{\theta}\left(y_{i} \mid x_{i}\right)$

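This can also be formulated as a minimization problem by negating the objective, i.e. minimizing the negative log-likelihood: $\theta^{\star} \leftarrow \arg \min _{\theta} -\sum_{i} \log p_{\theta}\left(y_{i} \mid x_{i}\right)$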
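As a rough illustration of the minimization view, the sketch below minimizes the negative log-likelihood, $-\sum_{i} \log p_{\theta}(y_{i} \mid x_{i})$, with plain gradient descent. Everything specific here is an assumption for the example: the model is a one-dimensional logistic regression with $\theta = (w, b)$, the toy dataset is invented, and the names `nll` and `fit`, the learning rate, and the step count are arbitrary choices.

```python
import math

# Hypothetical toy dataset of (x, y) pairs with binary labels.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(theta, data):
    """Negative log-likelihood: -sum_i log p_theta(y_i | x_i)."""
    w, b = theta
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)  # model's p_theta(y = 1 | x)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total

def fit(data, lr=0.1, steps=500):
    """Minimize the NLL over theta = (w, b) by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            # Derivative of the per-example NLL with respect to the logit.
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

theta = fit(data)
print(nll((0.0, 0.0), data), nll(theta, data))
```

The fitted parameters should give a lower negative log-likelihood than the starting point $(0, 0)$, which is the same as saying they make the data look more probable under the model.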