**References**

**Notes**

**Summary**:

**Key points**:

How is a Dataset generated?

There exists an underlying data generating distribution $p(x)$.

The Conditional Probability distribution over labels given an input is represented as $p(y \mid x)$.

By the Chain Rule (Probability), the joint distribution of $(x, y)$ is $p(x, y) = p(x)\, p(y \mid x)$.

A training set, $\mathcal{D}=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{n}, y_{n}\right)\right\}$, is generated by the joint distribution of $(x, y)$.
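The generation process can be sketched in code as two sampling steps per example: draw an input, then draw a label conditioned on it. The following is a minimal illustration only. The standard normal choice for the input distribution, the logistic form of the label distribution, and the name `sample_dataset` are all assumptions made for the example, not part of these notes.

```python
import math
import random


def sample_dataset(n, seed=0):
    """Draw n pairs (x, y) by sampling x first, then y given x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Step 1: x ~ p(x). Assumed here to be a standard normal.
        x = rng.gauss(0.0, 1.0)
        # Step 2: y ~ p(y | x). Assumed here to be Bernoulli with a
        # logistic probability of the label being 1.
        p_y1 = 1.0 / (1.0 + math.exp(-2.0 * x))
        y = 1 if rng.random() < p_y1 else 0
        data.append((x, y))
    return data


D = sample_dataset(5)
for x, y in D:
    print(f"x = {x:+.3f}, y = {y}")
```

Each pair is drawn with a fresh, independent call to the same two distributions, which is what the i.i.d. assumption describes.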

What is $p(\mathcal{D})$?

One assumption we need to make here is the Independent and Identically Distributed (i.i.d.) assumption.

When the samples are Independent and Identically Distributed, the probability of the dataset factorizes over examples: $p(\mathcal{D})=\prod_{i=1}^{n} p\left(x_{i}, y_{i}\right) = \prod_{i=1}^{n} p\left(x_{i}\right) p\left(y_{i} \mid x_{i}\right)$.
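To make the factorization concrete, here is a small sketch that evaluates the product for a toy discrete distribution. The probability tables `P_X` and `P_Y_GIVEN_X`, the three-example dataset, and the function names are hypothetical values chosen for illustration. The log version is included because products of many small probabilities tend to underflow in floating point, so sums of logs are commonly used in practice.

```python
import math

# Hypothetical toy distributions over binary x and y.
P_X = {0: 0.5, 1: 0.5}                      # p(x)
P_Y_GIVEN_X = {0: {0: 0.9, 1: 0.1},         # p(y | x = 0)
               1: {0: 0.2, 1: 0.8}}         # p(y | x = 1)


def dataset_probability(data):
    """p(D) as a product of p(x_i) * p(y_i | x_i), assuming i.i.d. samples."""
    prob = 1.0
    for x, y in data:
        prob *= P_X[x] * P_Y_GIVEN_X[x][y]
    return prob


def dataset_log_probability(data):
    """log p(D) as a sum of per-example log terms."""
    return sum(math.log(P_X[x]) + math.log(P_Y_GIVEN_X[x][y])
               for x, y in data)


D = [(0, 0), (1, 1), (0, 1)]
print(dataset_probability(D))
print(math.exp(dataset_log_probability(D)))
```

Both printed values should agree up to floating-point rounding, since the second is the exponential of the log of the first.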