**References**

**Tags**: concept

**Sources**:

**Related notes**:

**Updates**:

April 19th, 2021: added lecture notes from CS 182: Lecture 2, Part 2: Machine Learning Basics.

April 18th, 2021: created note.

**Notes** {{word-count}}

**Summary**:

**Key points**:

In Supervised Learning, given $\mathcal{D}=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{n}, y_{n}\right)\right\}$, the objective is to learn $f_{\theta}(x) \approx y$.

Generally, our goal is to predict $y$ given some $x$.
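To make this concrete, here is a minimal sketch of the supervised setup: fitting a linear $f_{\theta}$ to a toy dataset with least squares. The data and the linear model are my own illustrative assumptions, not from the lecture.

```python
import numpy as np

# Toy dataset D = {(x_i, y_i)}: inputs with noisy linear targets (illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * x[:, 0] + 0.5 + rng.normal(scale=0.1, size=100)

# Model f_theta(x) = theta_1 * x + theta_0; fit theta by least squares.
X = np.hstack([x, np.ones((100, 1))])          # add a bias column
theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # theta = [slope, intercept]

print(theta)          # approximately [3.0, 0.5]
print(X[:5] @ theta)  # predictions f_theta(x) ~ y
```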

However, predicting an exact label is difficult because the real world is full of ambiguous boundary cases.

**Predicting probabilities**

So we use probabilities to represent the likelihood of a prediction falling into a certain category.

Predicting probabilities instead of labels can make training easier, because probabilities are smooth.

Intuitively, a discrete label cannot change by a little bit: it is either all or nothing, whereas a probability can move gradually toward the correct answer.
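A minimal sketch of this smoothness argument, assuming a one-parameter logistic model and a single toy example (both illustrative): as the parameter varies, the hard label jumps discretely while the predicted probability, and hence the cross-entropy loss on it, changes smoothly.

```python
import numpy as np

x, y = 1.0, 1  # one labeled example (illustrative)

# Sweep a single parameter theta of a logistic model p_theta(y=1|x).
for theta in np.linspace(-2, 2, 9):
    p = 1.0 / (1.0 + np.exp(-theta * x))  # smooth in theta
    label = int(p > 0.5)                  # hard label: jumps 0 -> 1 at theta = 0
    loss = -np.log(p)                     # cross-entropy for y=1: smooth in theta
    print(f"theta={theta:+.1f}  p={p:.3f}  label={label}  loss={loss:.3f}")
```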

Given $\mathcal{D}=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{n}, y_{n}\right)\right\}$, the objective is to learn $p_{\theta}(y \mid x)$.
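A minimal sketch of learning $p_{\theta}(y \mid x)$, assuming a logistic model and a made-up 1-D dataset (both are illustrative choices, not from the lecture): gradient descent on the negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary dataset (illustrative): y is more likely 1 for larger x.
x = rng.normal(size=200)
y = (x + 0.3 * rng.normal(size=200) > 0).astype(float)

# p_theta(y=1|x) = sigmoid(w*x + b); learn theta = (w, b) by gradient
# descent on the negative log-likelihood (cross-entropy).
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))  # predicted probabilities
    grad_w = np.mean((p - y) * x)           # d(NLL)/dw
    grad_b = np.mean(p - y)                 # d(NLL)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # learned parameters of p_theta(y|x)
```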

$x$ is a Random Variable representing the input.

$x$ is a random variable because we do not know what $x$ we will get. There is some true underlying process in the real world that gives rise to different $x$'s.

$y$ is a Random Variable representing the output.

$p(x, y)=p(x) p(y \mid x)$ by Chain Rule (Probability).

$\displaystyle p(y \mid x)=\frac{p(x, y)}{p(x)}$ by the definition of Conditional Probability.
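A small worked example of these two identities, using a made-up $2 \times 2$ joint distribution (the numbers are illustrative assumptions):

```python
import numpy as np

# Joint distribution p(x, y) over x in {0, 1} (rows) and y in {0, 1} (columns).
p_xy = np.array([[0.1, 0.3],
                 [0.2, 0.4]])

p_x = p_xy.sum(axis=1)             # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]  # conditional p(y|x) = p(x, y) / p(x)

# Chain rule: p(x, y) = p(x) * p(y|x).
assert np.allclose(p_xy, p_x[:, None] * p_y_given_x)
print(p_y_given_x)  # each row sums to 1
```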

A model that learns $p(y \mid x)$ is called a Discriminative Model because its goal is to discriminate between different $y$'s.

A model that learns $p(x, y)$ is called a Generative Model because it can also learn to generate $x$.
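As a hedged sketch of the generative side (the 1-D data and Gaussian class-conditionals are my own illustrative assumptions): fit $p(x, y) = p(y)\,p(x \mid y)$, recover $p(y \mid x)$ via $p(y \mid x) = p(x, y) / p(x)$ to discriminate, and sample from $p(x \mid y)$ to generate new $x$'s.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x | y=0 ~ N(-1, 1), x | y=1 ~ N(+1, 1), equal class priors.
n = 500
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

# Generative model: fit p(x, y) = p(y) * p(x|y) with one Gaussian per class.
prior = np.array([np.mean(y == 0), np.mean(y == 1)])
mu = np.array([x[y == 0].mean(), x[y == 1].mean()])
sigma = np.array([x[y == 0].std(), x[y == 1].std()])

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def p_y_given_x(x):
    joint = prior * gaussian(x, mu, sigma)  # p(x, y) for y = 0, 1
    return joint / joint.sum()              # p(y|x) = p(x, y) / p(x)

# A generative model can also sample new x's:
y_new = rng.integers(0, 2)
x_new = rng.normal(mu[y_new], sigma[y_new])

print(p_y_given_x(0.5))  # posterior over y at x = 0.5
print(y_new, x_new)      # a generated (y, x) pair
```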

When predicting probabilities, instead of representing the output as a hard object label, we represent it as a probability: how likely is it that this object falls into this category?
