- Simple: assumes error-free, noise-free data
- White-box model: prediction is interpretable and explainable
- A boolean-valued function over a set of input instances (each comprising input attributes)
- A form of supervised learning: infer an unknown boolean-value function from set of training examples
- Expressiveness: a quadratic function ($ax^2+bx+c$) is more expressive than a linear function ($y=mx+c$), since more functions can be represented by the quadratic function
- Input Instances, $X$: each instance $x \in X$ is represented by the input attributes describing $X$
    - eg. $X$ = Sky (with possible $x$ values Sunny, Cloudy, Rainy, etc.)
    - eg. given $X=[T/F]$, $Y=[0/1/2]$, number of possible input instances $= 2 \times 3 = 6$
- Unknown Target Concept/Function, $c$: a boolean-valued function over a set of input instances
    - eg. EnjoySport: $c : X \rightarrow \{0,1\}$
- Noise-free Training Examples, $D$: +ve and -ve examples of the target function
- Hypothesis, $h$: a conjunction of constraints on input attributes
- Hypothesis Space, $H$: contains all possible hypotheses $h \in H$
    - As expressiveness increases, $H$ grows, and usually more data is needed to find the target concept/function $c$
- Syntactically Distinct Hypotheses: add 2 other values, $?$ and $\emptyset$, to each input attribute
    - eg. given $X=[T/F]$, $Y=[0/1/2]$, number of syntactically distinct hypotheses $= 4 \times 5 = 20$
- Semantically Distinct Hypotheses: since $\emptyset$ is normally only taken in the hypothesis where all input attributes are $\emptyset$ (eg. $\langle \emptyset, \emptyset, \emptyset \rangle$ but not $\langle Sunny, \emptyset, Strong \rangle$), add only $?$ to each input attribute
    - eg. given $X=[T/F]$, $Y=[0/1/2]$, number of semantically distinct hypotheses $= 1 + 3 \times 4 = 13$
    - Presence of the empty set $\emptyset$ means hypothesis $h$ matches nothing, thus $\langle ?, \emptyset, ? \rangle = \langle ?, ?, \emptyset \rangle = \langle \emptyset, \emptyset, \emptyset \rangle$
- Satisfies: an input instance $x \in X$ satisfies a hypothesis $h \in H$ iff $h(x) = 1$; thus only the 1 other value $?$ needs to be considered for each input attribute
- Consistent: $h$ is consistent with a set of training examples $D$ iff $h(x) = c(x)$ $\forall \langle x,c(x) \rangle \in D$
Goal: search for a hypothesis $h \in H$ that is consistent with the training examples in $D$.
- Every hypothesis containing 1 or more $\emptyset$ symbols represents an empty set of input instances, hence classifying every instance as a -ve example
- $h_j \geq_g h_k$: $h_j$ is more general than or equal to $h_k$ iff any input instance $x$ that satisfies $h_k$ also satisfies $h_j$: $\forall x \in X, (h_k(x)=1) \rightarrow (h_j(x)=1)$
    - Intuitively, the set of instances satisfying $h_j$ is a superset of the set satisfying $h_k$
    - Negated form: $\exists x \in X, (h_k(x)=1) \wedge (h_j(x)=0)$
    - The $\geq_g$ relation defines a partial order (reflexive, antisymmetric, transitive)
- $h_j >_g h_k$: $h_j$ is strictly more general than $h_k$ iff $(h_j \geq_g h_k) \wedge (h_k \not\geq_g h_j)$
    - Equivalently, $h_k$ is strictly more specific than $h_j$
- Version Space, $VS_{H,D}$, wrt. hypothesis space $H$ and training examples $D$: the subset of $h \in H$ consistent with $D$: $VS_{H,D} = \{h\in H \mid h \text{ is consistent with } D\}$
    - $VS_{H,D}$ contains all consistent hypotheses
- General Boundary, $G$ of $VS_{H,D}$: the set of maximally general members of $H$ consistent with $D$
- Specific Boundary, $S$ of $VS_{H,D}$: the set of maximally specific members of $H$ consistent with $D$
- $h$ is consistent with $D$ iff every +ve training instance satisfies $h$ and every -ve training instance does not satisfy $h$
- Suppose that $c \in H$. Then $h_n$ is consistent with $D = \{\langle x_k, c(x_k) \rangle\}_{k=1,\dots,n}$
- An input instance $x$ satisfies every hypothesis in $VS_{H,D}$ iff $x$ satisfies every member of $S$
- An input instance $x$ satisfies none of the hypotheses in $VS_{H,D}$ iff $x$ satisfies none of the members of $G$
Idea: Start with most specific hypothesis. Whenever it wrongly classifies a +ve training example as −ve, “minimally” generalize it to satisfy its input instance.
- Init $h$ to the most specific hypothesis in $H$ (ie. $h = \langle \emptyset , \dots, \emptyset \rangle$)
- For each positive training instance $x$ (ie. ignore -ve training instances):
    - For each attribute constraint $a_i$ in $h$:
        - If $x$ satisfies constraint $a_i$ in $h$: do nothing
        - Else: replace $a_i$ in $h$ by the next more general constraint that is satisfied by $x$
- Output hypothesis $h$
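A minimal Python sketch of Find-S under the above assumptions (conjunctive hypotheses over discrete attributes; using `None` for the $\emptyset$ constraint and `'?'` for the fully general constraint is an illustrative choice, not from the notes):

```python
def find_s(examples):
    """Find-S: examples is a list of (attribute_tuple, is_positive) pairs.

    Returns the maximally specific conjunctive hypothesis consistent with the
    +ve examples; None stands for the empty constraint, '?' for "any value".
    Assumes noise-free data and that the target concept is in H.
    """
    h = [None] * len(examples[0][0])       # most specific hypothesis <∅, ..., ∅>
    for x, positive in examples:
        if not positive:                   # Find-S ignores -ve training examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:               # first +ve example: adopt its value
                h[i] = value
            elif h[i] != value:            # conflicting values: generalise to '?'
                h[i] = '?'
    return h

# The EnjoySport training examples listed later in this section:
D = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
print(find_s(D))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```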
- Unable to tell whether Find-S has indeed learnt the target concept
- Unable to tell when the training examples are inconsistent (since Find-S assumes noise-free training examples)
Idea: start with the most general and most specific hypotheses. Each training example "minimally" generalises $S$ or specialises $G$ as required, so that both boundaries remain consistent with the data.
- For each training example $d$:
    - If $d$ is a +ve example:
        - Remove from $G$ any hypothesis inconsistent with $d$
        - For each $s \in S$ not consistent with $d$:
            - Remove $s$ from $S$
            - Add to $S$ all minimal generalizations $h$ of $s$ s.t. $h$ is consistent with $d$, and some member of $G$ is more general than $h$
            - Remove from $S$ any hypothesis that is more general than another hypothesis in $S$
    - Else if $d$ is a −ve example:
        - Remove from $S$ any hypothesis inconsistent with $d$
        - For each $g \in G$ not consistent with $d$:
            - Remove $g$ from $G$
            - Add to $G$ all minimal specializations $h$ of $g$ s.t. $h$ is consistent with $d$, and some member of $S$ is more specific than $h$ (==why?==)
            - Remove from $G$ any hypothesis that is more specific than another hypothesis in $G$
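A hedged Python sketch of Candidate Elimination for conjunctive hypotheses (same illustrative representation as the Find-S sketch above; `domains` lists the possible values of each attribute):

```python
def satisfies(h, x):
    """Instance x satisfies hypothesis h ('?' = any value, None = empty)."""
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """h1 >=_g h2: every instance satisfying h2 also satisfies h1."""
    if None in h2:                      # h2 matches nothing, so any h1 covers it
        return True
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    """examples: list of (attribute_tuple, is_positive); domains: value lists."""
    S = [tuple([None] * len(domains))]  # most specific boundary
    G = [tuple(['?'] * len(domains))]   # most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if satisfies(g, x)]
            new_S = []
            for s in S:
                if satisfies(s, x):
                    new_S.append(s)
                    continue
                # minimal generalization of s that covers x
                h = tuple(xi if si in (None, xi) else '?' for si, xi in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.append(h)
            S = [s for s in new_S
                 if not any(s != t and more_general_or_equal(s, t) for t in new_S)]
        else:
            S = [s for s in S if not satisfies(s, x)]
            new_G = []
            for g in G:
                if not satisfies(g, x):
                    new_G.append(g)
                    continue
                # minimal specializations of g that exclude x
                for i, c in enumerate(g):
                    if c != '?':
                        continue
                    for v in domains[i]:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if any(more_general_or_equal(h, s) for s in S):
                                new_G.append(h)
            G = [g for g in new_G
                 if not any(g != t and more_general_or_equal(t, g) for t in new_G)]
    return S, G
```

With the EnjoySport training examples listed later in this section and their attribute domains, this reproduces $S = \{\langle Sunny, Warm, ?, Strong, ?, ? \rangle\}$ and $G = \{\langle Sunny, ?, ?, ?, ?, ? \rangle, \langle ?, Warm, ?, ?, ?, ? \rangle\}$.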
In lines 1.1.1.3 and 1.2.1.3, why is it enough to ascertain that "$h$ is consistent with
Intuition for line 1.2.1.2: make as many
- $S$ and $G$ might reduce to $\emptyset$ with sufficiently large data, due to:
    - Error/noise in the training data (a +ve example wrongly labeled as -ve)
    - Insufficiently expressive hypothesis representation (biased hypothesis space, $c \not\in H$)
- Majority vote is the most probable classification, assuming all hypotheses in $H$ are equally probable a priori
- Works for any hypothesis space (even conjunctions of hypotheses)
- Requires at least $\lceil\log_2|VS_{H,D}|\rceil$ training examples to find the target concept $c$
    - Since the version space reduces by at most half with each training example (==why?==)
- Any query from inside $G$ but outside $S$ will reduce the version space (by at most half)
- Any query from inside $S$ or outside $G$ will NOT reduce the version space
$\langle Sunny, Warm, Normal, Strong, Warm, Same \rangle, EnjoySport = Yes$ $\langle Sunny, Warm, High, Strong, Warm, Same \rangle, EnjoySport = Yes$
$\langle Rainy, Cold, High, Strong, Warm, Change \rangle, EnjoySport = No$
$\langle Sunny, Warm, High, Strong, Cool, Change \rangle, EnjoySport = Yes$
Final version space: $S = \{\langle Sunny, Warm, ?, Strong, ?, ? \rangle\}$ and $G = \{\langle Sunny, ?, ?, ?, ?, ? \rangle, \langle ?, Warm, ?, ?, ?, ? \rangle\}$.
We wish to assure that the hypothesis space $H$ contains the unknown target concept $c$, eg. by choosing an $H$ expressive enough to represent every possible subset of $X$.

However, this will make the learning algorithm unable to generalise beyond the observed examples.
Let $L$ be a concept learning algorithm and $L(x_i, D)$ the classification that $L$ assigns to instance $x_i$ after training on data $D$.

The inductive bias of $L$ is any minimal set of assertions $B$ such that, for any target concept $c$ and training examples $D$,

$$\forall x_i \in X,\ (B \wedge D \wedge x_i) \vdash L(x_i, D)$$

where the $\vdash$ symbol denotes deductive entailment, ie. the classification of $x_i$ follows deductively from $B$, $D$, and $x_i$.
- Assumes the target concept $c$ is contained in the given hypothesis space $H$
| | Concept Learning | DT Learning |
|---|---|---|
| Target Concept | Binary outputs | Discrete outputs |
| Training Data | Noise-free | Robust to noise |
| Hypothesis Space | Restricted | Complete, expressive |
| Search Strategy | Complete: version space; refine search per example | Incomplete: prefer shorter tree (soft bias); refine search using all examples; no backtracking |
| Exploit Structure | General-to-specific ordering | Simple-to-complex ordering |
- Continuous values (non-integer values like 2.2321) have to be classified into a discrete set of possible categories, eg. (0-5, 5-10)
- This can lead to classification problems
- Hypothesis Space (number of distinct binary decision trees) with $m$ boolean attributes:
    - = number of distinct truth tables with $2^m$ rows
    - = $2^{2^m}$
- Decision trees can express any function of the input attributes
- function $DTL(examples, attributes, parent\_examples)$ returns $tree$
    - if $examples$ is empty:
        - return $PluralityValue(parent\_examples)$
    - else if all $examples$ have the same classification:
        - return the classification
    - else if $attributes$ is empty:
        - return $PluralityValue(examples)$
    - else:
        - $A \leftarrow \text{argmax}_{a\in attributes} Importance(a, examples)$
        - $tree \leftarrow$ a new decision tree with root test $A$
        - for each value $v_k$ of $A$, do:
            - $exs \leftarrow \{e : e\in examples \wedge e.A = v_k\}$
            - $subtree \leftarrow DTL(exs, attributes - A, examples)$
            - add a branch to $tree$ with label $(A=v_k)$ and subtree $subtree$
        - return $tree$
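A compact Python sketch of the DTL procedure above, using information gain as the $Importance$ measure (representing each example as a dict of attribute values plus a label, and branching only on values observed in the examples, are simplifying assumptions):

```python
import math
from collections import Counter

def plurality_value(examples):
    """Most common classification among the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def entropy(examples):
    """Entropy (in bits) of the class labels in the examples."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attr):
    """Gain(C, A): node entropy minus expected remaining entropy after testing attr."""
    total = len(examples)
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        subset = [(x, t) for x, t in examples if x[attr] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def dtl(examples, attributes, parent_examples=()):
    if not examples:
        return plurality_value(parent_examples)
    labels = {t for _, t in examples}
    if len(labels) == 1:                       # all examples agree
        return labels.pop()
    if not attributes:
        return plurality_value(examples)
    A = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {A: {}}                             # nested dicts represent the tree
    for v in {x[A] for x, _ in examples}:
        exs = [(x, t) for x, t in examples if x[A] == v]
        tree[A][v] = dtl(exs, [a for a in attributes if a != A], examples)
    return tree
```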
Intuition: find a small tree consistent with the training examples by greedily choosing the "most important" attribute as root of (sub)tree
- Most Important Attribute: a good attribute splits the examples into subsets that are ideally all +ve or all -ve, ie. after splitting, it should make the classification clearer
- Forms a Learned Decision Tree: a substantially simpler tree compared to the "true" decision tree. More complex hypotheses may not be classified correctly (as compared to the "true" tree)
Measures the uncertainty of classification: $H(C) = B\left(\frac{p}{p+n}\right)$, where $B(q) = -q\log_2 q - (1-q)\log_2(1-q)$, and $p$, $n$ are the numbers of +ve and -ve examples at the node

- $H(C) = 1$: maximum uncertainty, equal proportion of +ve and -ve examples
- $H(C) = 0$: no uncertainty, only +ve OR -ve examples

Suppose a chosen attribute $A$ divides the training examples into subsets according to its distinct values, where the subset for value $i$ has $p_i$ +ve and $n_i$ -ve examples. The expected remaining entropy after testing $A$ is

$$H(C|A) = \sum_{i}\frac{p_i+n_i}{p+n}B\left(\frac{p_i}{p_i+n_i}\right)$$

- $B(\frac{p_i}{p_i+n_i})$: entropy of the child node
- $\frac{p_i+n_i}{p+n}$: weight, or the proportion of examples that is dedicated to attribute value $i$

Information gain of target concept $C$ from testing attribute $A$:

$$Gain(C,A) = B\left(\frac{p}{p+n}\right) - H(C|A)$$

- $B(\frac{p}{p+n})$: entropy of this node
- $H(C|A)$: expected remaining entropy after testing $A$
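A worked example with assumed counts (not from these notes): suppose a node has $p = 9$ +ve and $n = 5$ -ve examples, and attribute $A$ splits them into two subsets with $(3+, 4-)$ and $(6+, 1-)$. Then

$$
\begin{aligned}
B\left(\tfrac{9}{14}\right) &\approx 0.940 \\
H(C|A) &= \tfrac{7}{14}B\left(\tfrac{3}{7}\right) + \tfrac{7}{14}B\left(\tfrac{6}{7}\right) \approx 0.5(0.985) + 0.5(0.592) \approx 0.789 \\
Gain(C,A) &= B\left(\tfrac{9}{14}\right) - H(C|A) \approx 0.940 - 0.789 = 0.151
\end{aligned}
$$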
- Using $Gain(C,A) = B\left(\frac{p}{p+n}\right) - H(C|A)$, compute the $Gain$ for every attribute $A$ and find the highest
- Use the attribute $A$ with the highest $Gain$ as the root
- Branch out to all values of $A$, checking which value results in a definite answer
- If there is a definite answer for a value $v$ branch (ie. all training examples are $+$ or $-$ for $v$):
    - Simply branch out to that classification
- If there is no definite answer for a value $v$ branch (ie. a new subtree is needed for $v$):
    - Using $v$, construct a new table for the subset of examples that have value $v$, with the remaining attributes
    - Repeat from step 1 to create a new subtree on this subset
- If no more subsets of attributes can be formed, use the Plurality-Value of the examples to determine the classification
Assumes:
- Shorter trees are preferred
- Trees that place high information gain attributes close to the root are preferred
- If only (1) is considered, it is the exact inductive bias of BFS for the shortest consistent DT, which can be prohibitively expensive
- Bias is a preference for some hypotheses, rather than a restriction of hypothesis space
- Simple hypotheses are preferred: Occam's Razor, long/complex hypothesis that fits data may be coincidence
Hypothesis $h \in H$ overfits the training data if there is an alternative hypothesis $h' \in H$ such that $h$ has a smaller error than $h'$ over the training examples, but $h'$ has a smaller error than $h$ over the entire distribution of instances.

In other words, a hypothesis that fits the training data too closely may generalise poorly to unseen instances.
- Training examples contain random errors or noise
    - A more complex tree is drawn that fits these noisy examples but does not fit the true concept
- Limited data
    - Since data is costly, this problem is often more acute than noisy training examples
- Stop growing DT when expanding a node is not statistically significant
- Allow DT to grow, then post-prune it
- Idea: partition data into training and validation sets
- Produces smallest version of most accurate subtree
- Needs ample data
Do until further pruning is harmful:
- Evaluate impact on validation set of pruning each possible node
- Greedily remove the one that most improves validation set accuracy
- Infer the DT from the training set, growing the tree until the training data is fit as well as possible and allowing overfitting to occur
- Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node
- Prune each rule by removing any preconditions that result in improving its estimated accuracy
- Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
Define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals for testing.
For attributes with many values, use the gain ratio $GainRatio(C,A) = \frac{Gain(C,A)}{SplitInformation(C,A)}$ instead, which penalises attributes that split the examples into many small subsets.
Attributes like Temperature, BiopsyResult, and BloodTestResult vary significantly in their costs. In such cases, we prefer low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classifications.
Replace $Gain(C,A)$ with a cost-sensitive measure, eg. $\frac{Gain^2(C,A)}{Cost(A)}$,

where $Cost(A)$ is the cost of measuring attribute $A$.
Use the training example anyway and sort it through the DT:

- If node $n$ tests $A$, assign the most common value of $A$ among the other examples sorted to node $n$
- Assign the most common value of $A$ among the other examples sorted to node $n$ that have the same value of the output/target concept
- Assign probability $p_i$ to each possible value of $A$
    - Assign fraction $p_i$ of the example to each descendant in the DT
A robust approach to approximating real-valued, discrete-valued, and vector-valued target functions. Extremely popular in NLP, speech recognition, computer vision, and healthcare.
- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed process
- Emphasis on tuning weights automatically
Appropriate for problems with the following characteristics:
- Instances are represented by many attribute-value pairs
- Target function output may be discrete, real, or a vector of several real or discrete attributes
- Training examples may contain errors
- Long training times are acceptable
- Fast evaluation of the learned target function may be required
- Ability of humans to understand the learned target function is not important
Given inputs $x_1, \dots, x_n$, a perceptron outputs

$$o(x_1,\dots,x_n) = \begin{cases} 1 & \text{if } w_0 + w_1x_1 + \dots + w_nx_n > 0 \\ -1 & \text{otherwise} \end{cases}$$

where each $w_i$ is a real-valued weight ($w_0$ being the bias/threshold weight).

The line shown in (a) is the decision surface $\vec{w}\cdot\vec{x} = 0$. The weight vector $\vec{w}$ is normal (perpendicular) to this decision surface. Assuming an added input $x_0 = 1$, the output can be written compactly as $o(\vec{x}) = sgn(\vec{w}\cdot\vec{x})$.

Note: points that lie directly on the line are classified as negative examples.
Intuition: we want to determine a weight vector that causes the perceptron to produce the correct $\pm 1$ output for each of the given training examples.

Idea: initialise $\vec{w}$ to random weights, then iteratively apply the perceptron training rule to each training example, revising the weights whenever the example is misclassified.

Weights are modified at each step according to the perceptron training rule, which revises the weight $w_i$ associated with input $x_i$:

$$w_i \leftarrow w_i + \Delta w_i, \quad \Delta w_i = \eta(t-o)x_i$$

where:

- $t = c(x)$ is the target output for the training example $\langle x,c(x) \rangle$
- $o = o(x)$ is the perceptron output
- $\eta$ is a sufficiently small +ve constant called the learning rate
- $\Delta w_i$ is non-zero only if there is a misclassification
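A minimal NumPy sketch of the perceptron training rule (the function name, learning rate, and the AND-function example are illustrative):

```python
import numpy as np

def perceptron_train(X, t, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

    X: (N, n) array of input instances; t: (N,) array of targets in {-1, +1}.
    A bias input x_0 = 1 is prepended to every instance.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x_0 = 1
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random initial weights
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1              # perceptron output
            w += eta * (target - o) * x             # non-zero only if misclassified
    return w

# eg. learning the AND function (linearly separable, so the rule converges)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, -1, -1, 1])
w = perceptron_train(X, t)
print([1 if w @ np.r_[1, x] > 0 else -1 for x in X])   # [-1, -1, -1, 1]
```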
Intuition: search the hypothesis space of possible weight vectors for the one that best fits the training examples.

Consider a simpler linear unit: $o = w_0 + w_1x_1 + \dots + w_nx_n = \vec{w}\cdot\vec{x}$ (ie. the perceptron without the threshold).

Learn the $w_i$'s that minimise the squared error/loss

$$L_D(\vec{w}) = \frac{1}{2}\sum_{d\in D}(t_d - o_d)^2$$

where:

- $D$ is the set of training examples
- $t_d$ is the target output for training example $d$
- $o_d$ is the output of the linear unit for training example $d$
Idea: find the weight vector that minimises $L_D$ by starting with an arbitrary initial weight vector, then repeatedly changing it in small steps in the direction of steepest descent along the error surface.

This direction can be found by computing the derivative of $L_D$ wrt. each component of $\vec{w}$, ie. the gradient $\nabla L_D(\vec{w})$,

with the training rule

$$\vec{w} \leftarrow \vec{w} + \Delta\vec{w}, \quad \Delta\vec{w} = -\eta\nabla L_D(\vec{w})$$

which will yield, for the linear unit,

$$\Delta w_i = \eta\sum_{d\in D}(t_d - o_d)x_{id}$$
Idea: initialise each $w_i$ to a small random value, then repeatedly update the weights in the negative gradient direction until the termination condition is met.

Each training example is a pair $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value.
- Initialise each $w_i$ to some small random value
- Until the termination condition is met, do:
    - Initialise each $\Delta w_i$ to zero
    - For each $d \in D$, do:
        - Input the instance $\vec{x}$ to the unit and compute the output $o$
        - For each linear unit weight $w_i$, do: $\Delta w_i \leftarrow \Delta w_i + \eta(t-o)x_i$
    - For each linear unit weight $w_i$, do: $w_i \leftarrow w_i + \Delta w_i$
- Initialise each
- Perceptron training rule is guaranteed to converge if:
    - Training examples are linearly separable
    - Learning rate $\eta$ is sufficiently small
- Linear unit training rule utilising gradient descent is guaranteed to converge to the hypothesis with minimum squared error/loss:
    - If learning rate $\eta$ is sufficiently small
    - Even when the training examples are noisy and/or linearly non-separable by $H$
- Key practical difficulties in applying GD:
    - Converging to a local minimum can sometimes be quite slow (may require many thousands of GD steps)
    - If there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum
- SGD approximates the GD search by updating weights incrementally, following the calculation of the error for each individual training example
- Sigmoid function, $\sigma(net)=\frac{1}{1+e^{-net}}$
    - Output ranges between $0$ and $1$
    - Output increases monotonically with its input: $net \geq 0 \implies \text{output}\geq \frac{1}{2}$
- Derivative: $\frac{\partial\sigma(net)}{\partial net} = \sigma(net)(1-\sigma(net))$
$$
\begin{aligned}
\frac{\partial L_D}{\partial w_i} &= \frac{\partial}{\partial w_i}\frac{1}{2} \sum_{d\in D}(t_d - o_d)^2 \\
&= \frac{1}{2} \sum_{d\in D}\frac{\partial}{\partial w_i}(t_d - o_d)^2 \\
&= \frac{1}{2} \sum_{d\in D} 2(t_d-o_d) \frac{\partial}{\partial w_i}(t_d - o_d) \\
&= \sum_{d\in D} (t_d-o_d) \left( -\frac{\partial o_d}{\partial w_i} \right) \\
&= -\sum_{d\in D} (t_d-o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i} \\
\frac{\partial o_d}{\partial net_d} &= \frac{\partial\sigma(net_d)}{\partial net_d} = o_d(1-o_d) \\
\frac{\partial net_d}{\partial w_i} &= \frac{\partial(\vec{w}\cdot\vec{x}_d)}{\partial w_i} = x_{id} \\
\frac{\partial L_D}{\partial w_i} &= -\sum_{d \in D}(t_d-o_d)o_d(1-o_d)x_{id}
\end{aligned}
$$
Use GD to learn the network weights by minimising the squared loss summed over all output units:

$$L_D(\vec{w}) = \frac{1}{2}\sum_{d\in D}\sum_{k\in K}(t_{kd}-o_{kd})^2$$

- $K$: set of output units in the network
- $t_{kd}$: target output of the sigmoid unit associated with the $k$-th output unit and training example $d$
- $o_{kd}$: output of the sigmoid unit associated with the $k$-th output unit and training example $d$
Idea: initialise all network weights to small random values, then for each training example propagate the input forward through the network, propagate the errors backward, and update the weights; repeat until satisfied.
- Initialise all network weights to small random numbers (eg. between $-0.05$ and $0.05$)
- Until satisfied, do:
    - For each training example $\langle\vec{x},(t_k)_{k\in K}^\top \rangle$, do:
        - Input instance $\vec{x}$ to the network and compute the output of every sigmoid unit in the hidden and output layers
        - For each output unit $k$, compute its error $\delta_k \leftarrow o_k(1-o_k)(t_k-o_k)$
        - For each hidden unit $h$, compute its error $\delta_h \leftarrow o_h(1-o_h)\sum_{k\in K}w_{hk}\delta_k$
        - Update each weight $w_{hk} \leftarrow w_{hk}+\Delta w_{hk}$ where $\Delta w_{hk} = \eta\delta_k o_h$
        - Update each weight $w_{ih} \leftarrow w_{ih}+\Delta w_{ih}$ where $\Delta w_{ih} = \eta\delta_h x_i$
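A hedged NumPy sketch of stochastic backpropagation for one hidden layer of sigmoid units (bias weights are omitted for brevity, and all names/hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=3, eta=0.3, epochs=5000):
    """Stochastic backpropagation for a network with one hidden sigmoid layer.

    X: (N, n) inputs, T: (N, K) targets in [0, 1].
    W_ih: input->hidden weights, W_hk: hidden->output weights.
    """
    rng = np.random.default_rng(0)
    n, K = X.shape[1], T.shape[1]
    W_ih = rng.uniform(-0.05, 0.05, (n, n_hidden))
    W_hk = rng.uniform(-0.05, 0.05, (n_hidden, K))
    for _ in range(epochs):
        for x, t in zip(X, T):
            # forward pass
            o_h = sigmoid(x @ W_ih)                  # hidden layer outputs
            o_k = sigmoid(o_h @ W_hk)                # output layer outputs
            # backward pass: output errors, then hidden errors
            delta_k = o_k * (1 - o_k) * (t - o_k)
            delta_h = o_h * (1 - o_h) * (W_hk @ delta_k)
            # weight updates, one training example at a time
            W_hk += eta * np.outer(o_h, delta_k)
            W_ih += eta * np.outer(x, delta_h)
    return W_ih, W_hk
```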
- $L_D$ has multiple local minima
    - GD is guaranteed to converge to some local minimum, but not necessarily to the global minimum
- In practice, GD still often performs well, especially after using multiple random initialisations of $\vec{w}$
- Often include weight momentum, $\alpha \in [0,1)$:
    - $\Delta w_{hk} \leftarrow \eta\delta_k o_h + \alpha\Delta w_{hk}$
    - $\Delta w_{ih} \leftarrow \eta\delta_h x_i + \alpha\Delta w_{ih}$
- Easily generalised to feedforward networks of arbitrary depth
    - Step 3: let $K$ denote all units in the next deeper layer whose inputs include the output of $h$
    - Step 5: let $x_i$ denote the output of unit $i$ in the previous layer that is an input to $h$
- Expressive hypothesis space; requires limited depth feedforward networks
- Every Boolean function can be represented by a network with one hidden layer but may require exponential hidden units in no. of inputs
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers
- Approximate inductive bias
- Smooth interpolation between data points
Penalise large weights by adding a regularisation term to the squared loss:

$$L_D(\vec{w}) = \frac{1}{2}\sum_{d\in D}\sum_{k\in K}(t_{kd}-o_{kd})^2 + \gamma\sum_{j,l}w^2_{jl}$$

- $\gamma\sum_{j,l}w^2_{jl}$ is the regularisation term; an additional term on top of the loss function
- $\gamma$ is the tradeoff parameter
    - To focus on lowering the weights $w$, raise $\gamma$
        - So that the learned function looks more like a linear function/generalises well
    - To focus on reducing the squared loss, lower $\gamma$
        - So that we can predict as accurately as possible wrt. the training data (but might end up overfitting)
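A one-function sketch of how the regularisation term changes a gradient step (the extra $2\gamma w$ comes from differentiating $\gamma\sum w^2$; names are illustrative):

```python
def regularised_step(W, grad, eta=0.1, gamma=1e-3):
    """One weight-decay gradient step.

    W, grad: NumPy arrays of the same shape; grad is dLoss/dW of the squared-loss
    term only. The 2*gamma*W term is the derivative of gamma * sum(W**2).
    """
    return W - eta * (grad + 2 * gamma * W)
```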
Train on target values as well as slopes:

$$L_D(\vec{w}) = \frac{1}{2}\sum_{d\in D}\sum_{k\in K}\left[(t_{kd}-o_{kd})^2 + \mu\sum_{i=1}^n\left( \frac{\partial t_{kd}}{\partial x_{id}} - \frac{\partial o_{kd}}{\partial x_{id}} \right)^2\right]$$

- $(t_{kd}-o_{kd})^2$: error on the target value
- $\mu\sum_{i=1}^n\left( \frac{\partial t_{kd}}{\partial x_{id}} - \frac{\partial o_{kd}}{\partial x_{id}} \right)^2$: error on the slope
- $\mu$: the tradeoff parameter
Tie together weights: eg. phoneme recognition networks
- Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct
- Prior knowledge can be combined with observed data to determine the final probability of a hypothesis
- Accommodates hypotheses that make probabilistic predictions (eg. "this pneumonia patient has a 94% chance of complete recovery")
- New input instances can be classified by combining predictions of multiple hypotheses weighted by their probabilities/beliefs
- $P(h)$: prior belief of hypothesis $h$
    - Initial probability that $h$ holds before observing the training data
    - May reflect any background knowledge we have about the chance that $h$ is correct. If no such prior knowledge, then each hypothesis might simply get the same prior probability
- $P(D|h)$: likelihood of data $D$ given $h$
    - Denotes the probability of observing $D$ given some world in which hypothesis $h$ holds
- $P(D) = \sum_{h\in H}P(D|h)P(h)$: marginal likelihood/evidence of $D$
    - Probability of $D$ given no knowledge about which hypothesis holds
- $P(h|D)$: posterior belief of $h$ given $D$
    - Reflects our confidence that $h$ holds after we have seen the training data $D$
Definition: any maximally probable hypothesis given the observed data $D$ is called a maximum a posteriori (MAP) hypothesis, $h_{\text{MAP}}$:

$$
\begin{aligned}
h_{\text{MAP}} &= \text{argmax}_{h\in H}P(h|D) \\
&= \text{argmax}_{h\in H} \frac{P(D|h)P(h)}{P(D)} \\
&= \text{argmax}_{h\in H} P(D|h)P(h)
\end{aligned}
$$
$\text{argmax}_{h\in H}P(D|h)P(h)$ denotes the hypothesis$h$ such that$P(D|h)P(h)$ is maximised.
If every hypothesis in $H$ is equally probable a priori ($P(h_i)=P(h_j)$ for all $h_i, h_j \in H$), then $h_{\text{MAP}}$ reduces to the maximum likelihood (ML) hypothesis $h_{ML} = \text{argmax}_{h\in H}P(D|h)$.
Joint probability
If events
- For each hypothesis $h \in H$, compute the posterior belief $P(h|D) = \frac{P(D|h)P(h)}{P(D)}$
- Output the hypothesis $h_{\text{MAP}}$ with the highest posterior belief: $h_{\text{MAP}} = \text{argmax}_{h\in H}P(h|D)$
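A direct Python sketch of this brute-force procedure (assuming a finite hypothesis space and caller-supplied `prior` and `likelihood` functions; since $P(D)$ is the same for every $h$, it can be dropped from the argmax):

```python
def brute_force_map(H, prior, likelihood, D):
    """Brute-force MAP learning over a finite hypothesis space H.

    prior(h) = P(h); likelihood(D, h) = P(D|h). Unnormalised posteriors
    P(D|h) * P(h) suffice because P(D) does not depend on h.
    """
    return max(H, key=lambda h: likelihood(D, h) * prior(h))
```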
A learning algorithm is a consistent learner if it outputs a hypothesis that commits zero errors over the training examples.
Every consistent learner outputs a MAP hypothesis if we assume:

- A uniform prior probability distribution over $H$
- Deterministic, noise-free training data
Assumptions:
- Training data $D$ is noise-free
- Target concept $c$ is contained in the hypothesis space $H$
- No a priori reason to believe that any hypothesis is more probable than any other

Thus, it is reasonable for us to assign the same prior probability to every hypothesis: $P(h) = \frac{1}{|H|}$ for all $h \in H$.
Are there other probability distributions for $P(h)$ under which Find-S outputs MAP hypotheses? Yes.

- Because Find-S outputs a maximally specific hypothesis from the version space, its output hypothesis will be a MAP hypothesis relative to any prior probability distribution that favours more specific hypotheses
- More precisely, we can have a probability distribution $P(h)$ that assigns $P(h_1)\geq P(h_2)$ if $h_1$ is more specific than $h_2$
The probability of observing data $D$ given hypothesis $h$ is 1 if $D$ is consistent with $h$, and 0 otherwise:

$$P(D|h) = \begin{cases} 1 & \text{if } d_i = h(x_i) \text{ for all } \langle x_i, d_i \rangle \in D \\ 0 & \text{otherwise} \end{cases}$$

Thus, for the posterior belief $P(h|D) = \frac{P(D|h)P(h)}{P(D)}$, consider two cases:

Case 1: $h$ is inconsistent with $D$, so $P(h|D) = \frac{0 \cdot P(h)}{P(D)} = 0$

Case 2: $h$ is consistent with $D$, so $P(h|D) = \frac{1 \cdot \frac{1}{|H|}}{P(D)}$

Define the marginal likelihood $P(D) = \sum_{h \in H}P(D|h)P(h) = \sum_{h \in VS_{H,D}}1\cdot\frac{1}{|H|} = \frac{|VS_{H,D}|}{|H|}$

Thus, $P(h|D) = \frac{1/|H|}{|VS_{H,D}|/|H|} = \frac{1}{|VS_{H,D}|}$ for every consistent $h$, and $P(h|D) = 0$ otherwise.
Conclusion: every consistent hypothesis is a MAP hypothesis.
Consider any real-valued target function $f$ and training examples $\langle \vec{x}_d, t_d \rangle$, where the observed target value $t_d$ is corrupted by random noise:

- $t_d = f(\vec{x}_d) + \epsilon_d$
- $\epsilon_d$ is a random noise variable drawn independently for each $\vec{x}_d$ according to $\epsilon_d \sim N(0,\sigma^2)$

Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimises the sum of squared errors:

$$h_{ML}=\text{argmin}_{h\in H}\frac{1}{2}\sum_{d\in D}(t_d-h(\vec{x}_d))^2$$
$\text{argmax}_h [-F(h)] = \text{argmin}_h [F(h)]$
Consider a nondeterministic (probabilistic) target function/concept $c : X \rightarrow \{0,1\}$, eg.:

- $X$ denotes patients in terms of their symptoms, and $c(x)$ is of value 1 if patient $x$ survives, and 0 otherwise
- $X$ denotes loan applicants in terms of their past credit history, and $c(x)$ is of value 1 if loan applicant $x$ repays the loan, and 0 otherwise
We want to learn a neural network (or other hypothesis $h$) that outputs the probability that $c(x) = 1$, ie. $h(x) = P(c(x)=1)$. The maximum likelihood hypothesis is then the one maximising the log likelihood (cross entropy):

$$h_{ML}=\text{argmax}_{h\in H}\sum_{d\in D}t_d \ln h(x_d) + (1-t_d)\ln(1-h(x_d))$$
Occam's Razor: prefer shortest hypothesis that fits the data:
$$
\begin{aligned}
h_{MAP} &= \text{argmax}_{h\in H}P(D|h)P(h) \\
&= \text{argmax}_{h\in H} \log_2P(D|h) + \log_2P(h) \\
&= \text{argmin}_{h\in H} -\log_2P(D|h)-\log_2P(h)
\end{aligned}
$$
- $-\log_2P(h)$: description length of $h$ under optimal code for $H$
- $-\log_2P(D|h)$: description length of $D$ given $h$ under optimal code for describing data $D$

$$h_{MDL}=\text{argmin}_{h\in H}L_{C_1}(h) + L_{C_2}(D|h)$$

- $L_C(x)$: description length of $x$ under encoding $C$
Example: given $H$ = decision trees and $D$ = training data labels:

- $L_{C_1}(h)$: number of bits to describe tree $h$
- $L_{C_2}(D|h)$: number of bits to describe $D$ given $h$
    - $L_{C_2}(D|h)=0$ if the examples are classified perfectly by $h$; otherwise, only the misclassifications need to be described
- By minimising $length(tree)$ and $length(misclassifications(tree))$, $h_{MDL}$ trades off tree size for training errors to mitigate overfitting
Given a new instance $x$, what is its most probable classification? Note that $h_{\text{MAP}}(x)$ is not necessarily the most probable classification!

Example: consider three hypotheses with posterior beliefs $P(h_1|D)=0.4$, $P(h_2|D)=0.3$, and $P(h_3|D)=0.3$, where $h_1$ classifies the new instance $x$ as +ve while $h_2$ and $h_3$ classify it as -ve. Here $h_{\text{MAP}} = h_1$.
Thus, in terms of prediction, the MAP hypothesis will say that it is $+$ve for this instance
Bayes-optimal classification: marginalise over all hypotheses,

$$\text{argmax}_{t\in T}P(t|D)=\text{argmax}_{t\in T}\sum_{h\in H}P(t|h)P(h|D)$$

Let $P(+|h_1)=1$, $P(-|h_1)=0$, and $P(+|h_2)=P(+|h_3)=0$, $P(-|h_2)=P(-|h_3)=1$. Then:

$$
\begin{aligned}
\sum_{h\in H}P(+|h)P(h|D) &= (0.4\times 1) + (0.3\times 0) + (0.3 \times 0) = 0.4 \\
\sum_{h\in H}P(-|h)P(h|D) &= (0.4\times 0) + (0.3\times 1) + (0.3 \times 1) = 0.6
\end{aligned}
$$

Thus, $\text{argmax}_{t\in\{+,-\}}\sum_{h\in H}P(t|h)P(h|D) = -$, ie. the Bayes-optimal classification is -ve (with probability 0.6), even though $h_{\text{MAP}}$ predicts +ve.
Bayes-optimal classifier provides the best performance but is computationally costly if $H$ is large.
- Sample a hypothesis $h$ from the posterior belief $P(h|D)$
- Use $h$ to classify the new instance $x$

Supposing target concepts are sampled from some prior over $H$, the expected misclassification error of the Gibbs classifier is at most twice that of the Bayes-optimal classifier.
Limitations:
- Moderate or large amount of training data needed
- Input attributes are conditionally independent given classification
Consider a target function/concept $c : X \rightarrow T$, where each instance $x$ is described by attribute values $\langle x_1, \dots, x_n \rangle$. The most probable target value is

$$
\begin{aligned}
t_{MAP} &=\text{argmax}_{t\in T}P(t|x_1,\dots,x_n) \\
&=\text{argmax}_{t\in T}\frac{P(x_1,\dots,x_n|t)P(t)}{P(x_1,\dots,x_n)} \\
&=\text{argmax}_{t\in T}P(x_1,\dots,x_n|t)P(t)
\end{aligned}
$$

Thus, with the naive Bayes (conditional independence) assumption $P(x_1,\dots,x_n|t) = \prod_{i=1}^nP(x_i|t)$, the naive Bayes classifier is

$$t_{NB}=\text{argmax}_{t\in T}P(t)\prod_{i=1}^nP(x_i|t)$$
Naive-Bayes-Learn(D):

- For each value of target output $t$:
    - $\hat{P}(t) =$ estimate of $P(t)$ using $D$
    - For each value of attribute $x_i$:
        - $\hat{P}(x_i|t) =$ estimate of $P(x_i|t)$ using $D$

Classify-New-Instance(x):

- $t_{NB} = \text{argmax}_{t\in T}\hat{P}(t)\prod_{i=1}^n\hat{P}(x_i|t)$
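A minimal Python sketch of Naive-Bayes-Learn and Classify-New-Instance for discrete attributes (relative-frequency estimates with no smoothing; all names are illustrative):

```python
from collections import Counter, defaultdict

def naive_bayes_learn(D):
    """D: list of (attribute_tuple, target) pairs with discrete values.

    Returns prior estimates P(t) and conditional estimates P(x_i | t) as plain
    relative frequencies (no smoothing; see the m-estimate below).
    """
    prior = Counter(t for _, t in D)
    cond = defaultdict(Counter)                 # cond[(i, t)][value] = count
    for x, t in D:
        for i, v in enumerate(x):
            cond[(i, t)][v] += 1
    N = len(D)
    P_t = {t: c / N for t, c in prior.items()}
    P_x_given_t = {k: {v: c / prior[k[1]] for v, c in counts.items()}
                   for k, counts in cond.items()}
    return P_t, P_x_given_t

def naive_bayes_classify(P_t, P_x_given_t, x):
    """Return argmax_t P(t) * prod_i P(x_i | t); unseen values get probability 0."""
    def score(t):
        p = P_t[t]
        for i, v in enumerate(x):
            p *= P_x_given_t.get((i, t), {}).get(v, 0.0)
        return p
    return max(P_t, key=score)
```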
Predict target concept
Thus,
- Conditional independence assumption is often violated, but naive Bayes still works surprisingly well in practice
- Estimated posteriors $\hat{P}(t|x)$ need not be correct; we only need that $\text{argmax}_{t\in T}\hat{P}(t)\prod_{i=1}^n\hat{P}(x_i|t) = \text{argmax}_{t\in T}P(t)P(x_1,\dots,x_n|t)$
- If none of the training instances with target output value $t$ have attribute value $x_i$, then $\hat{P}(x_i|t) = 0$ and $\hat{P}(t)\prod_{i=1}^n\hat{P}(x_i|t) = 0$
Solution for problem 3: use the Bayesian (m-)estimate

$$\hat{P}(x_i|t) = \frac{|D_{tx_i}| + mp}{|D_t| + m}$$

where:

- $|D_t|$ is the number of training examples with target output value $t$
- $|D_{tx_i}|$ is the number of training examples with target output value $t$ and attribute value $x_i$
- $p$ is the prior estimate for $\hat{P}(x_i|t)$
- $m$ is the weight given to the prior $p$
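A one-function sketch of this estimate (names are illustrative):

```python
def m_estimate(n_tx, n_t, p, m):
    """Bayesian m-estimate of P(x_i | t): (|D_tx_i| + m*p) / (|D_t| + m)."""
    return (n_tx + m * p) / (n_t + m)
```

With $p = 1/k$ and $m = k$, where $k$ is the number of possible values of $x_i$, this reduces to Laplace (add-one) smoothing: $\frac{|D_{tx_i}| + 1}{|D_t| + k}$.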
Used when:
- Data is only partially observable
- Unsupervised clustering (target output unobservable)
- Supervised learning (some input attributes unobservable)
- To find max likelihood parameters of a model involving hidden/latent variables that cannot be directly observed from the data
- eg. depression is a hidden variable (the training data will most likely not record it directly)
Given:
- Instances from $X$ generated by a mixture of $M$ Gaussians with the same known variance $\sigma^2$
- Unknown means $\langle \mu_1,\dots,\mu_M \rangle$ of the $M$ Gaussians
- We do not know which instance $x_d$ is generated by which Gaussian
Determine the ML estimates of the means $h = \langle \mu_1,\dots,\mu_M \rangle$.

Consider the full description of each instance as $y_d = \langle x_d, z_{d1}, \dots, z_{dM} \rangle$, where:

- $z_{dm}$ is unobservable and is of value 1 if the $m$-th Gaussian was selected to generate $x_d$, and 0 otherwise (ie. an indicator variable)
- $x_d$ is observable
EM converges to a local ML hypothesis $h = \langle \mu_1,\dots,\mu_M \rangle$, along with estimates of the hidden variables $z_{dm}$.
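A minimal NumPy sketch of EM for this setting, assuming one-dimensional instances and equal mixing weights for the $M$ Gaussians (function name and the generated data are illustrative):

```python
import numpy as np

def em_gaussian_means(x, M=2, sigma=1.0, iters=100, seed=0):
    """EM for a mixture of M one-dimensional Gaussians with known, equal
    variance sigma^2 and equal mixing weights; only the means are estimated.
    x: 1-D array of observed instances x_d.
    """
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=M, replace=False)          # initial mean guesses
    for _ in range(iters):
        # E-step: expected value of the hidden indicator z_dm for each d, m
        resp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate each mean as the responsibility-weighted average
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

# eg. data drawn from two Gaussians with means near 0 and 5
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(em_gaussian_means(x, M=2))   # approx. [0, 5] (order may vary)
```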