Wine Quality Prediction

Project information

In this project, conducted as part of the MBAN - Machine Learning & AI course, we analyze datasets on the red and white variants of Portuguese "Vinho Verde" wine in order to predict wine quality and to classify wine type. Vinho Verde is a renowned product from the Minho region of Portugal, appreciated for its medium alcohol content and freshness, especially during the summer months. The datasets used in this project are publicly available for research purposes and were provided by P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis.

Problem Statement:

The main objective of this project is to predict wine quality, and in a final step wine type, from physicochemical properties using Random Forests. We will approach this problem in four parts:

  • Data Preprocessing: We will begin by downloading the datasets for both white and red wines from the UCI repository and performing necessary preprocessing steps to clean and prepare the data for analysis.
  • Predicting Wine Quality (Regression Problem): Using the dataset containing only white wines, we will treat wine quality as a continuous numerical variable and employ Random Forests to identify the most important predictors and make predictions. We will compare the performance of the Random Forest model with a parametric model.
  • Predicting Wine Quality (Classification Problem): In this phase, we will recode the outcome variable "quality" into a binary variable with the categories "less-than-average" and "better-than-average". We will again use Random Forests to identify important predictors and make predictions. We will report AUC, its uncertainty, and the confusion matrix to evaluate the model's performance.
  • Predicting Wine Type (Red or White): Finally, we will append both white and red wine datasets and create a unified dataset with an identifier for each wine type. We will then use parametric and non-parametric classifiers to predict the wine type and select the model with the best predictive power based on AUC and uncertainty metrics. Important predictors will be identified, and the final results using the test data will be reported.
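
The data preparation described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: it assumes the UCI files are semicolon-delimited with a `quality` column, and the helper names (`read_wine_csv`, `add_quality_label`), the `type` and `better_than_average` columns, the sample mean as the "average" threshold, and the tiny inline data are all hypothetical.

```python
import csv
import io
from statistics import mean

# Tiny made-up stand-ins for the two UCI files
# (assumed format: ';' delimiter, quoted header names).
WHITE_CSV = '"alcohol";"chlorides";"quality"\n9.5;0.045;5\n11.2;0.038;7\n10.1;0.050;6\n'
RED_CSV = '"alcohol";"chlorides";"quality"\n9.4;0.076;5\n10.8;0.065;6\n'


def read_wine_csv(text, wine_type):
    """Parse semicolon-delimited wine data and tag each row with its wine type."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for raw in reader:
        row = {name: float(value) for name, value in raw.items()}
        row["quality"] = int(row["quality"])
        row["type"] = wine_type  # identifier used later to predict red vs. white
        rows.append(row)
    return rows


def add_quality_label(rows):
    """Add a binary label in place: 1 if quality is above the sample mean, else 0."""
    threshold = mean(row["quality"] for row in rows)
    for row in rows:
        row["better_than_average"] = int(row["quality"] > threshold)
    return threshold


white = read_wine_csv(WHITE_CSV, "white")
red = read_wine_csv(RED_CSV, "red")

threshold = add_quality_label(white)  # binary target for the white-wine classification task
combined = white + red                # unified dataset for the wine-type task

print(threshold, [row["better_than_average"] for row in white])
print([row["type"] for row in combined])
```

With the real files, the inline strings would be replaced by the contents of the downloaded white and red CSVs; the rest of the flow stays the same.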

Approach:

  • Data Collection and Preprocessing: We will download the datasets for white and red wines from the UCI repository and perform data cleaning and preprocessing steps, including handling missing values, scaling features, and encoding categorical variables.
  • Predicting Wine Quality (Regression): For the regression problem, we will treat wine quality as a continuous numerical variable and use Random Forests to identify important predictors and make predictions. We will compare the performance of the Random Forest model with a parametric model such as linear regression.
  • Predicting Wine Quality (Classification): In this phase, we will categorize wine quality into "less-than-average" and "better-than-average" categories and use Random Forests to predict the category. We will evaluate the model's performance using AUC, its uncertainty, and the confusion matrix.
  • Predicting Wine Type (Red or White): Finally, we will combine both white and red wine datasets, create a unified dataset with identifiers for each wine type, and use various classifiers to predict the wine type. We will select the model with the best predictive power based on AUC and uncertainty metrics and report important predictors.

By following this approach, we aim to provide actionable insights into predicting wine quality and type using physicochemical properties, which can be valuable for wine companies and enthusiasts alike.
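
The regression step, comparing a Random Forest with a linear model by RMSPE and ranking predictors, might look like the sketch below. It uses scikit-learn and NumPy, which is an assumption (the write-up does not say which tools were used), and it runs on synthetic data with three made-up predictors rather than the wine data. It also uses a single train/test split for brevity, whereas the reported results are means.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def rmspe(y_true, y_pred):
    """Root mean squared prediction error on held-out data."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic stand-in for the white-wine data: three made-up predictors and a
# nonlinear "quality" score. Replace X and y with the real features and ratings.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(600, 3))
y = 5.0 + 2.0 * np.sin(6.0 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0.0, 0.2, size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

rmspe_forest = rmspe(y_test, forest.predict(X_test))
rmspe_linear = rmspe(y_test, linear.predict(X_test))

# Impurity-based importances; they are non-negative and sum to 1.
importances = forest.feature_importances_

print(f"RMSPE forest={rmspe_forest:.3f} linear={rmspe_linear:.3f}")
print("predictors ranked by importance:", np.argsort(importances)[::-1])
```

Repeating the split with different seeds and averaging the RMSPE values would give a mean RMSPE and a sense of its spread.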

Conclusion:

Based on the analysis conducted:

  • Prediction of Wine Quality:

    Random Forest models outperformed Linear Regression in predicting wine quality, as evidenced by lower Root Mean Squared Prediction Error (RMSPE).

    The Random Forest model achieved a mean RMSPE of approximately 0.656, which corresponds to a typical prediction error of roughly two-thirds of a point on the 0-to-10 quality scale.

    Alcohol content emerged as the most influential predictor of wine quality, followed by chlorides and citric acid.

  • Classification of Wine Quality:

    Random Forest models achieved a higher Area Under the Curve (AUC) than Linear Regression when classifying wines as "better-than-average" or "less-than-average" based on their quality ratings.

    The Random Forest model achieved a mean AUC of approximately 0.862, indicating its effectiveness in distinguishing between wine quality categories.

  • Prediction of Wine Type:

    Both Random Forest and Linear Regression models achieved high accuracy in classifying wines as white or red using the combined dataset.

    Random Forest models, however, performed slightly better than Linear Regression in this classification task.

  • Overall Accuracy and Conclusion:

    Random Forest models demonstrated superior predictive accuracy across all tasks, including wine quality prediction and wine type classification.

    This consistent performance suggests that Random Forest models are a strong choice for predicting wine quality and classifying wine types from physicochemical properties.
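
The classification metrics reported above (AUC and the confusion matrix) can be computed directly from predicted scores. The sketch below is illustrative only: the function names, the 0.5 threshold, and the toy labels and scores are assumptions, not the project's code or data. AUC is computed here as the share of positive/negative pairs in which the positive example receives the higher score.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def confusion_matrix(labels, scores, threshold=0.5):
    """Count true/false positives and negatives at a given score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        predicted = int(s >= threshold)
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts


# Made-up labels (1 = better-than-average) and predicted scores.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]

print(auc(labels, scores))
print(confusion_matrix(labels, scores))
```

The pairwise loop is quadratic in the number of examples, which is fine for a demonstration; a rank-based implementation would be preferable on the full dataset.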

In conclusion, the analysis indicates that Random Forest models are robust and accurate for predicting wine quality and classifying wine types, providing valuable insights for wine industry applications.