Machine Learning Garden


Dataset

References

Tags: concept

Sources:

Related notes:

Updates:

April 20th, 2021: created note.

Notes

Summary:

Key points:

How is a Dataset generated?

There exists an underlying data-generating distribution $p(x)$.

The Conditional Probability distribution over labels given an input is represented as $p(y \mid x)$.

By the Chain Rule (Probability), the joint distribution of $(x, y)$ is $p(x, y) = p(x)\, p(y \mid x)$.

A training set, $\mathcal{D}=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{n}, y_{n}\right)\right\}$, is generated by drawing $n$ pairs from the joint distribution of $(x, y)$.
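
The generation process can be sketched as ancestral sampling: draw an input from the marginal over $x$, then draw a label from the conditional over $y$ given that input. This is a minimal sketch under stated assumptions. The tables `P_X` and `P_Y_GIVEN_X` and the helper names are hypothetical toy choices for illustration and do not come from this note.

```python
import random

# Hypothetical toy marginal p(x) over inputs x in {0, 1, 2} (assumed, not from the note).
P_X = {0: 0.5, 1: 0.3, 2: 0.2}

# Hypothetical toy conditional p(y | x) over labels y in {0, 1} (assumed).
P_Y_GIVEN_X = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.5, 1: 0.5},
    2: {0: 0.2, 1: 0.8},
}


def sample_from(dist, rng):
    """Draw one outcome from a dict mapping outcome -> probability."""
    outcomes = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_dataset(n, seed=0):
    """Draw n independent (x, y) pairs: first x ~ p(x), then y ~ p(y | x)."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        x = sample_from(P_X, rng)
        y = sample_from(P_Y_GIVEN_X[x], rng)
        dataset.append((x, y))
    return dataset


D = generate_dataset(5)
print(D)
```

Each pair is drawn the same way and without reference to earlier pairs, so the samples in this sketch are independent and identically distributed by construction.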

What is $p(\mathcal{D})$?

When the samples are Independent and Identically Distributed, $p(\mathcal{D})=\prod_{i=1}^{n} p\left(x_{i}, y_{i}\right) = \prod_{i=1}^{n} p\left(x_{i}\right) p\left(y_{i} \mid x_{i}\right)$.
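
As a worked illustration of this factorization, the sketch below evaluates the probability of a small dataset, both as a direct product and as a sum of log-probabilities (the usual choice in practice, since products of many small numbers underflow). The probability tables and the three-sample dataset are hypothetical toy values, assumed only for this example.

```python
import math

# Hypothetical toy marginal p(x) and conditional p(y | x) (assumed, not from the note).
P_X = {0: 0.5, 1: 0.3, 2: 0.2}
P_Y_GIVEN_X = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.5, 1: 0.5},
    2: {0: 0.2, 1: 0.8},
}


def dataset_prob(dataset):
    """p(D) = prod_i p(x_i) * p(y_i | x_i), assuming i.i.d. samples."""
    prob = 1.0
    for x, y in dataset:
        prob *= P_X[x] * P_Y_GIVEN_X[x][y]
    return prob


def dataset_log_prob(dataset):
    """log p(D) = sum_i [log p(x_i) + log p(y_i | x_i)], assuming i.i.d. samples."""
    return sum(
        math.log(P_X[x]) + math.log(P_Y_GIVEN_X[x][y]) for x, y in dataset
    )


# A hypothetical three-sample dataset.
D = [(0, 0), (2, 1), (1, 0)]

p = dataset_prob(D)          # (0.5 * 0.9) * (0.2 * 0.8) * (0.3 * 0.5)
log_p = dataset_log_prob(D)  # same quantity in log space

print(p, math.exp(log_p))
```

Both routes give approximately 0.0108 for this toy dataset, up to floating-point rounding.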
