No matter how big a dataset is, it ultimately contains a limited number of datapoints.
Data is valuable, and to develop robust machine learning methods we must use it carefully. WARNING: improper use of data is guaranteed to waste your time.
Cross-validation is a simple and robust method for using data carefully to systematically analyze how well a model can perform.
Cross-validation randomly splits the development data into C folds (usually five) of equal size.
We then train our model (from scratch) on C-1 of the development folds, and validate (i.e. assess model performance) on the remaining fold.
We repeat this procedure until each of the C folds has served as the validation fold, and we calculate the average cross-validation accuracy across folds.
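The fold-splitting procedure above can be sketched in NumPy as follows; `cross_validate` and `model_fn` are hypothetical names (not from the course materials), and accuracy is assumed to be the metric of a classification task.

```python
import numpy as np

def cross_validate(model_fn, X, y, C=5, seed=0):
    """Estimate accuracy via C-fold cross-validation.

    model_fn is a hypothetical callable that trains a fresh model on
    (X_train, y_train) and returns predicted labels for X_val.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))       # shuffle datapoints first
    folds = np.array_split(indices, C)      # C folds of (near-)equal size
    accuracies = []
    for k in range(C):
        val_idx = folds[k]                  # fold k is the validation fold
        train_idx = np.concatenate([folds[j] for j in range(C) if j != k])
        y_pred = model_fn(X[train_idx], y[train_idx], X[val_idx])
        accuracies.append(np.mean(y_pred == y[val_idx]))
    return np.mean(accuracies)              # average accuracy across folds
```

Note that the model is retrained from scratch in every iteration of the loop, so no information from a validation fold ever leaks into the training of the model that is evaluated on it.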
Note: some datasets have already been split into development and testing sets, and the development set may itself come with fixed training and validation subsets. In this situation it is recommended to follow the provided training and validation split instead of creating your own training and validation folds.
Given a datapoint $\mathbf{x}$ (i.e. with $D$ features) and a dependent observation $y$, can we use the equation of a line to describe the relationship between $\mathbf{x}$ and $y$?
To use this model, we must assume there will be an error $\epsilon$ that we will have to "tolerate", so that $y = \mathbf{w}^\top\mathbf{x} + b + \epsilon$, and find the variables $\mathbf{w}$ and $b$ using the objective function $J(\mathbf{w}, b) = \frac{1}{2N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$, where $N$ is the number of datapoints we are using to calculate the objective, and $\hat{y}_i = \mathbf{w}^\top\mathbf{x}_i + b$.
The objective function is effectively half the mean squared error between our model prediction $\hat{y}_i$ and the ground-truth value $y_i$.
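The objective (half the mean squared error) can be computed in a few lines of NumPy; `half_mse` is a hypothetical name, and the symbols `w` and `b` stand for the line's slope vector and intercept.

```python
import numpy as np

def half_mse(w, b, X, y):
    """Objective J(w, b) = (1 / (2N)) * sum_i (y_hat_i - y_i)^2,
    where y_hat_i = w . x_i + b is the linear model's prediction.

    X has shape (N, D), w has shape (D,), y has shape (N,).
    """
    y_hat = X @ w + b                    # model predictions for all N datapoints
    return np.mean((y_hat - y) ** 2) / 2.0
```

A model that fits the data perfectly gives an objective of exactly zero; any prediction error increases the objective quadratically.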
due Mar 1st at 11:59PM (Eastern Standard Time)
© Iran R. Roman 2022