Machine Learning Garden

Dataset

References

Tags: concept

Sources:

Related notes:

Updates:

April 20th, 2021: created note.

Notes

Summary:

Key points:

How is a Dataset generated?

There exists an underlying data generating distribution $p(x)$ .

The Conditional Probability distribution over labels given an input is represented as $p(y \mid x)$ .

By the Chain Rule (Probability), the joint distribution of $(x, y)$ is $p(x, y) = p(x) \, p(y \mid x)$ .

A training set, $\mathcal{D}=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{n}, y_{n}\right)\right\}$ , is generated by the joint distribution of $(x, y)$ .
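The generation process can be sketched in code as a two-step draw: first $x$ from $p(x)$, then $y$ from the conditional distribution of the label given $x$. The note does not specify either distribution, so the standard normal for $x$, the logistic form for the label probability, and the names `sample_x`, `prob_y1_given_x`, and `sample_dataset` are all assumptions made only for illustration.

```python
import math
import random


def sample_x(rng):
    # Hypothetical p(x): a standard normal (an assumption, not from the note).
    return rng.gauss(0.0, 1.0)


def prob_y1_given_x(x):
    # Hypothetical p(y = 1 | x): a logistic function of x (also an assumption).
    return 1.0 / (1.0 + math.exp(-2.0 * x))


def sample_dataset(n, seed=0):
    # Draw n pairs (x_i, y_i): x_i first, then y_i conditioned on x_i.
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        x = sample_x(rng)
        y = 1 if rng.random() < prob_y1_given_x(x) else 0
        dataset.append((x, y))
    return dataset


D = sample_dataset(5)
for x, y in D:
    print(f"x = {x:+.3f}, y = {y}")
```

Every pair is drawn by the same procedure and no draw depends on an earlier pair, which is what the i.i.d. assumption discussed next formalizes.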

What is $p(\mathcal{D})$ ?

One assumption we need to make here is the Independent and Identically Distributed (i.i.d.) assumption.

Independent means every $(x_i, y_i)$ is independent of each $(x_j, y_j)$ with $j \neq i$ .

Identically distributed means every $(x_i, y_i)$ comes from the same distribution.

When the data is Independent and Identically Distributed, the probability of the dataset factorizes over the examples:

$p(\mathcal{D})=\prod_{i=1}^{n} p\left(x_{i}, y_{i}\right) = \prod_{i=1}^{n} p\left(x_{i}\right) p\left(y_{i} \mid x_{i}\right)$ .
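As a rough illustration of this factorization, the sketch below evaluates $\log p(\mathcal{D})$ for a tiny hand-written dataset by summing the per-example terms. Working in log space turns the product into a sum and avoids numerical underflow for large $n$. The densities are hypothetical stand-ins (a standard normal for $x$ and a logistic label model), and the names `log_p_x`, `log_p_y_given_x`, and `log_p_dataset` are invented for this example.

```python
import math


def log_p_x(x):
    # Log density of a standard normal, a hypothetical choice for p(x).
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x


def log_p_y_given_x(y, x):
    # Log probability of a binary label under a hypothetical logistic model.
    p1 = 1.0 / (1.0 + math.exp(-2.0 * x))
    return math.log(p1) if y == 1 else math.log(1.0 - p1)


def log_p_dataset(dataset):
    # Under i.i.d., log p(D) is the sum over examples of
    # log p(x_i) + log p(y_i | x_i).
    return sum(log_p_x(x) + log_p_y_given_x(y, x) for x, y in dataset)


toy = [(0.0, 1), (1.0, 1), (-1.0, 0)]
total = log_p_dataset(toy)
print(f"log p(D) = {total:.4f}")
```

Exponentiating `total` recovers the product form of $p(\mathcal{D})$ for this toy dataset.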

Referenced in

Machine Learning Concepts
