Random Forest Algorithm Clearly Explained!

Hello, people from the future, welcome to Normalized Nerd! Today we'll set up our camp in the Random Forest. First, we'll see why the random forest is better than our good old decision trees, and then I'll explain how it works with visualizations. If you wanna see more videos like this, please subscribe to my channel and hit the bell icon, because I make videos about machine learning and data science regularly.

So, without further ado, let's get started. To begin our journey, we need a dataset. Here I'm taking a small dataset with only 6 instances and 5 features. As you can see, the target variable y takes 2 values, 0 and 1, hence it's a binary classification problem. First of all, we need to understand why we even need the random forest when we already have decision trees. Let's draw the decision tree for this dataset. Now, if you don't know what a decision tree really is or how it is trained, then I'd highly recommend you to watch my previous video. In short, a decision tree splits the dataset recursively using decision nodes until we are left with pure leaf nodes. And it finds the best split by maximizing the information gain, i.e. the reduction in entropy.
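By the way, if you want to play with this splitting criterion, here's a minimal Python sketch (not something from the video) of how information gain could be computed for a binary target; the helper names and the tiny toy arrays are my own illustration.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a binary label array."""
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]                       # skip empty classes (0 * log 0 -> 0)
    return -np.sum(p * np.log2(p))

def information_gain(y, mask):
    """Entropy reduction from splitting y into y[mask] and y[~mask]."""
    left, right = y[mask], y[~mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - weighted

# Toy example: evaluate the candidate split "x0 <= 0" on a made-up feature column
y  = np.array([0, 0, 1, 1, 1, 0])
x0 = np.array([-1, -2, 3, 4, 1, -3])
print(information_gain(y, x0 <= 0))    # higher gain = better split
```

A decision tree simply tries many such candidate conditions at every node and keeps the one with the highest gain.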

If a data sample satisfies the condition at a decision node, then it moves to the left child; otherwise it moves to the right, and finally it reaches a leaf node where a class label is assigned to it. So, what's the problem with decision trees? Let's change our training data slightly. Focus on the row with id 1. We are changing the x0 and x1 features. Now if we train our tree on this modified dataset, we'll get a completely different tree. This shows us that decision trees are highly sensitive to the training data, which could result in high variance. So our model might fail to generalize.
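If you want to see this sensitivity for yourself, here's a small sketch assuming scikit-learn; the 6-row table below is a made-up stand-in, not the exact dataset from the video.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# A made-up 6-row, 5-feature dataset (stand-in for the one in the video)
X = np.array([[1, 0, 3, 2, 1],
              [2, 1, 0, 1, 3],
              [0, 2, 1, 3, 0],
              [3, 1, 2, 0, 2],
              [1, 3, 0, 2, 1],
              [2, 0, 1, 1, 3]])
y = np.array([0, 1, 0, 1, 1, 0])

tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)

# Perturb a single row (like changing x0 and x1 of the row with id 1)...
X2 = X.copy()
X2[1, [0, 1]] = [0, 3]
tree_b = DecisionTreeClassifier(random_state=0).fit(X2, y)

# ...and the learned structure can come out completely different
print(export_text(tree_a, feature_names=["x0", "x1", "x2", "x3", "x4"]))
print(export_text(tree_b, feature_names=["x0", "x1", "x2", "x3", "x4"]))
```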
trees and it’s much less sensitive to the training data. You can guess that we use multiple trees hence
the name forest. But why it’s called random? Keep this question in the back of your mind
you’ll get the answer by the end of this video.

Let me show you the process of creating a random forest. The first step is to build new datasets from our original data. To keep things simple, we'll build only 4. We are gonna randomly select rows from the original data to build our new datasets. And every dataset will contain the same number of rows as the original one. Here's the first dataset. Due to lack of space, I'm writing only the row ids. Notice that rows 2 and 5 appear more than once; that's because we are performing random sampling with replacement. That means after selecting a row, we put it back into the data. And here are the rest of the datasets. The process we just followed to create new data is called bootstrapping. Now we'll train a decision tree on each of the bootstrapped datasets independently. But here's a twist: we won't use every feature for training the trees. We'll randomly select a subset of features for each tree and use only them for training.

For example, in the first case, we'll only use the features x0 and x1. Similarly, here are the subsets used for the remaining trees.
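Here's a minimal numpy sketch of these two random steps, sampling rows with replacement and picking a feature subset per tree; the seed and the variable names are my own choices rather than anything from the video.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_features = 6, 5          # same shape as the toy dataset
n_trees, subset_size = 4, 2        # 4 trees, 2 features each

bootstraps = []
for _ in range(n_trees):
    # Sampling WITH replacement: same size as the original, rows may repeat
    row_ids = rng.integers(0, n_rows, size=n_rows)
    # Random feature subset for this tree, chosen WITHOUT replacement
    feat_ids = rng.choice(n_features, size=subset_size, replace=False)
    bootstraps.append((row_ids, feat_ids))

for i, (rows, feats) in enumerate(bootstraps):
    print(f"tree {i}: rows {rows.tolist()}, features {feats.tolist()}")
```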
Now that we have got the data and the feature subsets, let's build the trees. Just see how different the trees look from each other. And this, my friend, is the random forest containing 4 trees. But how do we make a prediction using this forest? Let's take a new data point. We'll pass this data point through each tree one by one and note down the predictions.
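Putting the pieces together, here's a rough sketch (assuming scikit-learn and a made-up stand-in dataset) of training one decision tree per bootstrapped sample on its feature subset, then passing a new point through each tree and collecting the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(6, 5))          # stand-in for the 6x5 table
y = np.array([0, 1, 0, 1, 1, 0])

forest = []                                   # list of (tree, feature subset)
for _ in range(4):
    rows = rng.integers(0, len(X), size=len(X))              # bootstrap rows
    feats = rng.choice(X.shape[1], size=2, replace=False)    # 2 of 5 features
    tree = DecisionTreeClassifier().fit(X[rows][:, feats], y[rows])
    forest.append((tree, feats))

# Pass a new data point through each tree one by one and note the predictions
x_new = np.array([[1, 2, 0, 3, 1]])
preds = [int(tree.predict(x_new[:, feats])[0]) for tree, feats in forest]
print(preds)                                  # e.g. [1, 0, 1, 1]
```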

Now we have to combine all the predictions. As it's a classification problem, we'll take a majority vote. Clearly, 1 is the winner, hence the prediction from our random forest is 1. This process of combining results from multiple models is called aggregation. So in the random forest, we first perform bootstrapping and then aggregation, and in the jargon this is called bagging.
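As a sketch of this aggregation step, a majority vote is just a counter over the per-tree outputs; and if you use scikit-learn, its RandomForestClassifier bundles the whole bagging pipeline (bootstrapping plus aggregation) for you. The data below is again a toy stand-in of my own.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Aggregation by hand: majority vote over the per-tree predictions
tree_preds = [1, 0, 1, 1]                         # example values from 4 trees
print(Counter(tree_preds).most_common(1)[0][0])   # -> 1

# The same idea end to end with scikit-learn
X = np.random.default_rng(0).integers(0, 4, size=(6, 5))
y = np.array([0, 1, 0, 1, 1, 0])
clf = RandomForestClassifier(n_estimators=4, bootstrap=True, random_state=0)
clf.fit(X, y)
print(clf.predict([[1, 2, 0, 3, 1]]))
```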
Okay, so that was how we build a random forest. Now I should discuss some very important points related to it. Why is it called a random forest? Because we have used two random processes: bootstrapping and random feature selection. But what is the reason behind bootstrapping and feature selection? Well, bootstrapping ensures that we are not using the same data for every tree, so in a way it helps our model to be less sensitive to the original training data. The random feature selection helps to reduce the correlation between the trees. If you use every feature, then most of your trees will have the same decision nodes and will act very similarly, so combining them won't reduce the variance much. There's another benefit of random feature selection.

Some of the trees will be trained on less important features, so they will give bad predictions, but there will also be some trees whose errors go in the opposite direction, so they will balance out. Next point: what is the ideal size of the feature subset? Well, in our case we took 2 features, which is close to the square root of the total number of features, i.e. 5. Researchers found that values close to the logarithm and the square root of the total number of features work well. How do we use this for regression? While combining the predictions, just take the average, and you are all set to use it for regression problems; there's a short sketch of this below. So that was all about it.
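If you use scikit-learn, those rules of thumb are available through the max_features parameter ("sqrt" or "log2"); note that scikit-learn draws the feature subset per split rather than per tree, but the idea is the same. And for regression you'd reach for RandomForestRegressor, which averages the trees' outputs. The synthetic data below is only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)     # made-up target

# max_features="sqrt" keeps the per-split feature subset small ("log2" also works)
reg = RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=0)
reg.fit(X, y)
print(reg.predict(X[:3]))     # each output is the average over the 50 trees
```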

I hope now you have a pretty good understanding of the random forest. If you enjoyed this video, please share it and subscribe to my channel. Stay safe, and thanks for watching!