Citizen Data Scientist, Module II: Supervised Learning: Predicting the Future with Labeled Data

Supervised learning is at the core of many machine learning applications, where the goal is to predict an output based on labeled data. In this post, we’ll not only explore foundational concepts like regression and classification, but also dive deeper into logistic regression, precision and recall, decision trees, and learning rate optimization.


Supervised Learning: The Core Concept

Supervised learning revolves around using labeled data to train a model that can predict outputs for new, unseen inputs. The learning process involves two key steps:

  • Training: The model learns from labeled data, which pairs inputs (features) with correct outputs (labels).

  • Testing: The trained model is then tested on new data to evaluate its accuracy.

Supervised learning is widely applied in various fields:

  • Regression: Predicting continuous outcomes, like house prices.

  • Classification: Categorizing data, such as distinguishing spam emails from legitimate ones.


Regression: Predicting Continuous Outcomes

Regression is used to predict continuous values. One of the simplest examples is linear regression, which fits a straight line to the data points. This line can be expressed mathematically as:

y = mx + b

where m is the slope of the line and b is the intercept (the predicted value of y when x is 0).
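
To make this concrete, here is a minimal sketch using scikit-learn's LinearRegression on a few made-up house sizes and prices (the numbers are purely illustrative, not real data):

```python
# Minimal linear regression sketch (illustrative, made-up house data)
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size in square feet (feature) and price in $1,000s (label)
X = np.array([[800], [1000], [1200], [1500], [1800]])
y = np.array([150, 190, 230, 280, 330])

model = LinearRegression()
model.fit(X, y)  # learns the slope m and intercept b that best fit the points

print("slope m:", model.coef_[0])
print("intercept b:", model.intercept_)
print("predicted price for 1,400 sq ft:", model.predict([[1400]])[0])
```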


Classification: Predicting Categories

In contrast to regression, classification deals with discrete categories. Logistic regression is one of the most widely used methods for binary classification problems—those with two possible outcomes, like spam or not spam.

Unlike linear regression, logistic regression predicts a probability between 0 and 1. We use the logistic (or sigmoid) function to model this probability:

σ(z) = 1 / (1 + e^(-z))

where z is a linear combination of the input features. Because σ(z) always lies between 0 and 1, it can be interpreted as the probability of the positive class.

For problems with more than two categories, we extend logistic regression to softmax regression. For example, when classifying flower species in the famous Iris dataset (setosa, versicolor, and virginica), softmax regression models the probability of each class. The softmax function is given by:

P(y = k | x) = e^(z_k) / Σ_j e^(z_j)

where z_k is the model's score for class k, so the probabilities across all classes sum to 1.
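
As a sketch of what this looks like in practice, the snippet below fits scikit-learn's LogisticRegression to the Iris dataset; with three classes it produces softmax-style probabilities that sum to 1 for each flower (the settings here are just reasonable defaults, not tuned values):

```python
# Sketch: multi-class (softmax-style) logistic regression on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)

# Each row of predict_proba is a probability distribution over the three species
print(clf.predict_proba(X_test[:3]))
print("test accuracy:", clf.score(X_test, y_test))
```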


Precision, Recall, and F1-Score: Evaluating Classification Models

When it comes to classification, accuracy alone can be misleading, especially if the classes are imbalanced (e.g., rare disease detection). That’s where precision and recall come in.

Prediction Outcomes

  • True Positive (TP): The model correctly predicts a positive class (e.g., predicting "spam" when the email is actually spam)

  • False Positive (FP): The model incorrectly predicts a positive class (e.g., predicting "spam" when the email is not spam)

  • False Negative (FN): The model incorrectly predicts a negative class (e.g., predicting "not spam" when the email is spam)


Using these terms, we can define:

  • Precision = TP / (TP + FP): the fraction of predicted positives that are actually positive.

  • Recall = TP / (TP + FN): the fraction of actual positives the model correctly identifies.

  • F1-Score = 2 × (Precision × Recall) / (Precision + Recall): the harmonic mean of precision and recall.

For instance, in an email spam classifier, precision tells us how often emails classified as "spam" are indeed spam, while recall tells us how many of the actual spam emails were correctly identified.
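
Here is a small sketch of how these metrics are computed with scikit-learn, using made-up spam labels (1 = spam, 0 = not spam):

```python
# Sketch: precision, recall, and F1 on toy spam predictions (labels are made up)
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1-score: ", f1_score(y_true, y_pred))         # harmonic mean of the two
```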


Decision Trees: Learning by Asking Questions


Decision trees are another popular method for both regression and classification tasks. They work by recursively splitting the dataset based on the feature that maximizes information gain. Each decision node asks a yes/no question, leading to further branches until a final prediction is made.

Splitting Criteria: Gini Impurity and Cross-Entropy

To decide how to split the data, decision trees use criteria like Gini Impurity or Cross-Entropy:

  • Gini Impurity: G = 1 - Σ p_k^2

  • Cross-Entropy: H = -Σ p_k log(p_k)

where p_k is the proportion of samples in the node that belong to class k.

The goal is to minimize these values at each split, making the decision tree more confident in its predictions.
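
The snippet below is a small sketch of how both criteria can be computed by hand for the class labels in a single node; a pure node scores 0 and a perfectly mixed node scores the maximum:

```python
# Sketch: Gini impurity and cross-entropy for the labels in one tree node
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def cross_entropy(labels):
    # Cross-entropy (in bits): -sum of p * log2(p) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini([1, 1, 1, 1]), cross_entropy([1, 1, 1, 1]))  # pure node: both are 0
print(gini([1, 1, 0, 0]), cross_entropy([1, 1, 0, 0]))  # 50/50 node: 0.5 and 1.0
```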

Predicting Loan Repayment

Imagine we’re building a decision tree to predict whether a customer will repay a loan. We might start with a question like "Is the customer employed?" If yes, we proceed with further splits like "Does the customer have savings over $10,000?" Each branch of the tree represents a decision path that leads to a final prediction.
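
As a sketch, the snippet below trains scikit-learn's DecisionTreeClassifier on a handful of made-up loan records with exactly those two features, then prints the yes/no questions the tree learned:

```python
# Sketch: a tiny decision tree on hypothetical loan data (all values are made up)
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [is_employed (0/1), savings_over_10k (0/1)]; label: 1 = repaid, 0 = defaulted
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2)
tree.fit(X, y)

# Print the learned decision path as nested yes/no questions
print(export_text(tree, feature_names=["is_employed", "savings_over_10k"]))
print("employed, low savings ->", tree.predict([[1, 0]])[0])
```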


Gradient Descent and Learning Rate: Fine-Tuning Model Optimization

Gradient descent is the optimization technique used to minimize the cost function in regression and classification models. The model’s parameters (e.g., slope and intercept in linear regression) are updated iteratively in the direction of the steepest descent (calculated by the gradient of the cost function).


Imagine the cost function as a landscape with hills and valleys. The goal is to find the lowest point (the minimum error). At each step, gradient descent calculates the slope and moves the parameters in the direction of the steepest descent until it reaches a minimum.

A crucial component is the learning rate, which controls how big the update steps are. If the learning rate is too high, the model might overshoot the minimum and fail to converge. If it’s too low, the model will take too long to converge or get stuck in a suboptimal solution.

For example, in training a logistic regression model, a well-chosen learning rate ensures that the parameters are adjusted so that the loss decreases steadily without overshooting.
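
The sketch below makes this concrete with a deliberately simple cost function, J(w) = w^2, whose minimum is at w = 0; the cost, the starting point, and the learning rates are all made up purely to show the effect of the step size:

```python
# Sketch: gradient descent on J(w) = w^2, comparing different learning rates
def gradient_descent(learning_rate, steps=25):
    w = 10.0                       # start far from the minimum at w = 0
    for _ in range(steps):
        grad = 2 * w               # gradient of J(w) = w^2
        w = w - learning_rate * grad
    return w

print("lr = 0.1  ->", gradient_descent(0.1))   # converges close to 0
print("lr = 0.01 ->", gradient_descent(0.01))  # converges, but much more slowly
print("lr = 1.1  ->", gradient_descent(1.1))   # overshoots on every step and diverges
```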


Train-Test Split: Evaluating Model Performance

It’s essential to evaluate a model’s performance not only on the training data but also on unseen data. That’s why we split the dataset into training and testing sets. Typically, we use 70% of the data for training and 30% for testing. This ensures the model generalizes well to new data and doesn’t just memorize the training set.
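
A minimal sketch of a 70/30 split with scikit-learn's train_test_split, using synthetic data just to show the mechanics:

```python
# Sketch: 70/30 train-test split on a synthetic feature matrix (made-up data)
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)             # 100 samples, 4 features (synthetic)
y = np.random.randint(0, 2, size=100)  # binary labels (synthetic)

# 70% of the rows go to training, 30% are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("training samples:", X_train.shape[0])  # 70
print("test samples:", X_test.shape[0])       # 30
```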


Conclusion: A Balanced Approach to Supervised Learning

Supervised learning, whether through regression, classification, or decision trees, plays a pivotal role in machine learning applications. By understanding how models are trained, optimized, and evaluated, you can build robust predictive models that generalize well to new data. As you move deeper into the world of machine learning, these foundational concepts will continue to serve as the building blocks for more complex algorithms and applications.
