100% Accuracy: Supremacy or Imperfection? (Overfitting vs. Underfitting)
When you train a machine learning model, your main expectation is that it should generalize well. Generalization means that once the model is trained on a given set of data points, it should learn the pattern from those training points and be able to apply that pattern to unseen data points. So once the model is trained, we need to assess it on some basis to know how well it is performing, and one of the most common approaches (though I would not say the best one) is to check the accuracy of the model.
In machine learning, accuracy means the number of correct predictions out of the total number of predictions. By this definition, accuracy treats all mistakes alike: it does not distinguish between the different kinds of wrong predictions (false positives and false negatives), which is why we should not rely on accuracy too heavily, especially on imbalanced data. Two more versatile and concrete measures of a model's fitness are precision and recall. Which fitness measure is better is a separate discussion, so let us rest it here.
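As a toy illustration (the labels and predictions below are made up for this sketch, not from any real model), here is how accuracy, precision, and recall are computed from a handful of hypothetical binary predictions:

```python
# Toy binary-classification results: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Count the four outcome types.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)  # 8 correct out of 10 -> 0.8
precision = tp / (tp + fp)          # of predicted positives, how many were right
recall = tp / (tp + fn)             # of actual positives, how many were found
print(accuracy, precision, recall)  # 0.8 0.666... 0.666...
```

Note that the 80% accuracy looks respectable, yet the model found only two of the three actual positives, which precision and recall expose directly.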
Once a machine learning model is trained and its training accuracy is calculated, there is a good chance that the accuracy will come out very high, probably in the nineties or even 100%. So, what does that mean? Does it mean that our model is 100% accurate and no one could do better? The answer is no. A very high accuracy measured on the training set is often the result of overfitting. So, what does overfitting mean?
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. An overfitted model has low bias and high variance.
In other words, the longer you train on the same points, the more the model starts treating the noise as data too and simply imitates the entire training set. An overfitting model is like a student who crams the entire syllabus for an exam: he will perform well if a question is presented exactly the way he memorised it, but will fail miserably when the same question is reframed as another problem.
Low bias: no careful decision about what to study and what to ignore.
High variance: trying to learn everything, even remembering the page number of the book.
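To make this concrete, here is a minimal sketch of memorisation using a 1-nearest-neighbour "model" in plain Python (the data points and the one noisy label are invented for illustration): it scores 100% on its own training points but stumbles on unseen points near the noise.

```python
# 1-nearest-neighbour: predict the label of the closest training point.
# The "true" rule in this toy data is: label 1 when x >= 5.
# One noisy training point, (6.0, 0), violates that rule.

def predict_1nn(train, x):
    """Return the label of the training point nearest to x."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]

train = [(1, 0), (2, 0), (3, 0), (4, 0), (6.0, 0),   # (6.0, 0) is noise
         (5, 1), (7, 1), (8, 1), (9, 1)]

# Each training point's nearest neighbour is itself, so the model
# "memorises" the data perfectly: 100% training accuracy.
train_acc = sum(predict_1nn(train, x) == y for x, y in train) / len(train)

# On unseen points that follow the true rule, the memorised noise hurts:
# x = 6.2 should be labelled 1, but its nearest neighbour is the noisy (6.0, 0).
test = [(2.5, 0), (6.2, 1), (8.5, 1)]
test_acc = sum(predict_1nn(train, x) == y for x, y in test) / len(test)

print(train_acc, test_acc)  # 1.0 0.666...
```

This is the low-bias, high-variance regime in miniature: the model makes no simplifying assumption at all, so it reproduces every quirk of the training set, noise included.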
Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, and as a result the model may not learn enough from the training data and may fail to find the best fit for the dominant trend in the data. An underfitted model has high bias and low variance.
In other words, even if you have enough data points to capture the pattern in the data, you restrict your model to only a few of them. Returning to our student example, this is the over-smart student who tries to pick the important topics out of the entire syllabus and finishes only those. When it comes to testing that knowledge, this student cannot perform well because his or her knowledge is incomplete.
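Here is the opposite extreme as a toy sketch (again with invented numbers): a "model" that simply predicts the mean of the targets cannot capture even an obvious linear trend, so its error is large on the very data it was trained on. That is the signature of underfitting.

```python
# An underfitting "model": always predict the mean of the training targets.
xs = [0, 1, 2, 3, 4]
ys = [0, 2, 4, 6, 8]          # a perfect linear trend, y = 2x

mean_y = sum(ys) / len(ys)    # 4.0 -- the constant model's only "knowledge"

# Mean squared error on the *training* data itself.
train_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
print(mean_y, train_mse)      # 4.0 8.0 -- high error even on training data
```

This is the high-bias, low-variance regime: the model makes a strong simplifying assumption (the output is constant), so it barely reacts to the data at all, and a simple training-error check already reveals that something is wrong.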
The problem of overfitting is more serious than the problem of underfitting, because with overfitting one might not realise whether the high accuracy is the result of overfitting or the model is genuinely performing well, while with underfitting one gets a direct hint that something is wrong with the model.
How to avoid the problem of overfitting
Both overfitting and underfitting degrade the performance of a machine learning model, but since overfitting is the harder problem to detect, here are some ways by which we can reduce its occurrence in our model.
- Cross-Validation
- Training with more data
- Removing features
- Early stopping of training
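As a sketch of the first of these ideas, here is a minimal k-fold cross-validation split in plain Python (the fold count and the data are placeholders for real samples): every sample is used exactly once for validation, so a model that merely memorises its training folds will be exposed by its validation scores.

```python
# Minimal k-fold cross-validation split (k = 3) in plain Python.
# Each fold serves once as the validation set; the rest is training data.
data = list(range(6))          # stand-in for 6 (x, y) samples
k = 3
fold_size = len(data) // k

folds = []
for i in range(k):
    val = data[i * fold_size:(i + 1) * fold_size]
    trn = data[:i * fold_size] + data[(i + 1) * fold_size:]
    folds.append((trn, val))
    print(f"fold {i}: train={trn} val={val}")
```

In practice you would train a fresh model on each `trn` split, score it on the corresponding `val` split, and average the k validation scores; a large gap between training and validation scores is the classic symptom of overfitting. Libraries such as scikit-learn provide ready-made utilities for this, but the splitting logic is exactly the loop above.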
Credit for the image: https://www.educative.io/api/edpresso/shot/6668977167138816/image/5033807687188480