Decision Tree

Shubham singh
7 min read · Jan 14, 2021

Introduction to the Decision Tree

A decision tree is a supervised learning algorithm that can be used for both classification and regression. It is very easy to understand and explain; few algorithms are as simple to interpret. A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. Perhaps you are wondering why the machine learning community came up with the name decision tree: look carefully at the figure of the algorithm and you will see that it takes the shape of a tree.

Decision tree

Decision trees mimic the human brain: we split any task into many smaller decisions before reaching the final one. Here is one more example to better understand how a decision tree functions. When you dial the toll-free number of an internet provider or a credit card company, you are redirected to an intelligent computerized assistant that asks you a set of questions. Each answer you select leads you to another set of questions, and the process continues until you reach the right person. Here, the company is using a decision tree to manage the system efficiently.

Terminology related to Decision Trees

A decision tree is a graphical representation of all possible solutions to a decision based on certain conditions. The tree starts from the root node, which is then split into decision nodes and leaf nodes. A subsection of the entire tree is called a branch or sub-tree, and a node that is split into child nodes is called a parent node.

Root node: the base, or first, node of the tree. It contains the entire sample of the data set.

Leaf node: a node that cannot be split any further; it marks the end of a path through the tree.

Splitting: dividing the root node or a sub-node into further sub-nodes based on certain conditions.

Branch: a subsection of the entire tree, also called a sub-tree.

Pruning: the opposite of splitting; we remove unnecessary branches or sub-nodes from the decision tree.

Parent/Child node: a node that is divided into sub-nodes is called a parent node, and its sub-nodes are called child nodes.

Depth of the tree: the length of the longest path from the root node to a leaf node.

Concept of Homogeneity

When a node contains only a single class of data, the split is completely pure and the node is called a homogeneous node (or the split a homogeneous split). Conversely, if a node contains a mix of classes, it is not homogeneous.

Homogeneous split

The functionality of the Decision tree

Now suppose we have a data set and we want to classify whether a tree is a lemon tree, a mango tree, or a grape tree. The root node receives the entire data set, while each decision node receives the subset of rows that reaches it. The accuracy of the decision tree model depends on the splitting process. Decision trees use several algorithms to decide how to split a node into two or more sub-nodes, and each split should increase the homogeneity of the resulting sub-nodes; in other words, the purity of the nodes increases with respect to the target variable. The tree considers splits on all available variables and then selects the split that produces the most homogeneous sub-nodes. Which splitting algorithm to use depends mainly on the type of the target variable.

1. ID3

The ID3 (Iterative Dichotomiser 3) algorithm uses a top-down, greedy search through the space of possible branches, with no backtracking, to build a decision tree. A greedy algorithm, as the name suggests, always makes the choice that seems best at that moment.
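
To make the greedy, top-down idea concrete, here is a toy sketch of an ID3-style recursion for purely categorical features. The data layout (a list of attribute dictionaries plus a label list) and the tiny fruit-tree example are made up for illustration; the entropy and information-gain helpers correspond to the formulas discussed later in this article.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    total = len(labels)
    weighted = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        weighted += (len(subset) / total) * entropy(subset)
    return entropy(labels) - weighted

def id3(rows, labels, attributes):
    """Greedy top-down construction: pick the best attribute now, never backtrack."""
    if len(set(labels)) == 1:                       # pure node -> leaf
        return labels[0]
    if not attributes:                              # nothing left to split on -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    node = {best: {}}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = id3([rows[i] for i in keep],
                                [labels[i] for i in keep],
                                [a for a in attributes if a != best])
    return node

# Tiny made-up example: classify a fruit tree by two categorical attributes.
rows = [{"leaf": "broad", "fruit_size": "large"},
        {"leaf": "broad", "fruit_size": "small"},
        {"leaf": "narrow", "fruit_size": "small"}]
labels = ["mango", "lemon", "grape"]
print(id3(rows, labels, ["leaf", "fruit_size"]))
```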

2. CHAID (Chi-square Automatic Interaction Detection)

The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods. It measures the statistical significance of the difference between a child node and its parent node, using the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.

It works with a categorical target variable and can produce two or more splits at a node. The higher the chi-square value, the more statistically significant the difference between the child node and the parent node.
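
The full CHAID procedure also merges categories and applies significance adjustments, but the core test is an ordinary chi-square test between a candidate split and the target. A minimal sketch, assuming SciPy is available and using made-up counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy contingency table (hypothetical counts): rows are the categories of a
# candidate split attribute, columns are the classes of the target variable.
observed = np.array([[30, 10],
                     [25, 25],
                     [ 5, 35]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4g}")
# A larger chi-square (smaller p-value) means the child nodes differ more
# strongly from the parent's class distribution, i.e. the split is more
# statistically significant.
```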

3. CART (Classification and regression tree)

The CART or Classification & Regression Trees method was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone as an umbrella term to refer to the following types of decision trees:

1. Classification trees: the target variable is categorical, and the tree is used to identify the “class” into which the target variable would most likely fall.

2. Regression trees: the target variable is continuous, and the tree is used to predict its value.
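
scikit-learn's tree module implements an optimized version of CART, so both flavours can be sketched in a few lines, assuming scikit-learn and NumPy are installed; the hyperparameter values below (for example max_depth=3) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# 1. Classification tree: categorical target (iris species).
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)
print("predicted class:", clf.predict(X[:1]))

# 2. Regression tree: continuous target (toy sine-wave data).
X_reg = np.arange(0, 10, 0.25).reshape(-1, 1)
y_reg = np.sin(X_reg).ravel()
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X_reg, y_reg)
print("predicted value:", reg.predict([[2.5]]))
```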

How to select the attributes?

Gini Index

The Gini index, or Gini impurity, measures the probability that a randomly chosen element would be wrongly classified if it were labelled at random according to the class distribution of the node. It works like a cost function for evaluating candidate splits in the data set. Before we move on, let’s understand what impurity means. Suppose one basket is full of apples and another basket holds only apple labels. If I pick one item from each basket, I am certain to get an apple and an apple label, so the impurity is zero. Now suppose the first basket holds four different fruits and the second basket holds four different labels; when I pick one item from each basket, there is no guarantee that the label matches the fruit, so the impurity is not zero. For a two-class problem, the maximum value of the Gini impurity is 0.5. Gini impurity is used when the target variable is categorical.
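
A small sketch of the calculation, using the basket example above (the fruit labels are made up):

```python
from collections import Counter

def gini_impurity(labels):
    """Probability of misclassifying a randomly chosen, randomly labelled item."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

# Basket with only apples -> pure node, impurity 0.
print(gini_impurity(["apple"] * 4))                         # 0.0
# Basket with four different fruits -> highly impure node.
print(gini_impurity(["apple", "mango", "grape", "lemon"]))  # 0.75
# Two balanced classes -> the two-class maximum of 0.5.
print(gini_impurity(["apple", "mango", "apple", "mango"]))  # 0.5
```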

Entropy

Entropy measures the impurity of a split; it expresses the degree of randomness in the data. For a binary target, the value of entropy ranges from zero to one. If the entropy of a node is one, the split is completely impure, whereas an entropy of zero means a pure split. A homogeneous data set has zero entropy, while a perfectly random (50/50) data set yields the maximum entropy of 1. The higher the entropy, the harder it is to draw any conclusion from that information.

In many state-of-the-art implementations, the Gini index is preferred over entropy because it is computationally cheaper: entropy requires logarithmic calculations while the Gini index does not.

For a binary target, entropy is defined as Entropy = −p₊ · log₂(p₊) − p₋ · log₂(p₋), where p₊ is the proportion of the positive class and p₋ is the proportion of the negative class.
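
A quick numeric check of these boundary values, as a minimal sketch of the binary entropy formula above:

```python
import math

def binary_entropy(p_pos, p_neg):
    """Entropy = -p+ * log2(p+) - p- * log2(p-), skipping zero terms."""
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(binary_entropy(1.0, 0.0))   # 0.0   -> pure, homogeneous node
print(binary_entropy(0.5, 0.5))   # 1.0   -> perfectly mixed node, maximum entropy
print(binary_entropy(0.8, 0.2))   # ~0.72 -> mostly pure node
```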

Information Gain

Information gain measures the reduction in entropy achieved by splitting the data set on a given attribute: it is the difference between the entropy before the split and the weighted average entropy of the sub-nodes after the split. If a split decreases the entropy, it yields a positive information gain. Constructing a decision tree is essentially a search for the attribute that returns the maximum information gain and therefore the lowest entropy in the sub-nodes; the attribute with the highest information gain is chosen as the decision node.

Information Gain = Entropy(before split) − Weighted average entropy(after split)
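
A short worked example with hypothetical class counts shows how a split that makes the children purer than the parent produces a positive information gain:

```python
import math

def binary_entropy(p):
    """Entropy of a node in which a fraction p of the examples is positive."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Parent node: 10 positive and 10 negative examples (made-up counts).
parent_entropy = binary_entropy(10 / 20)                    # 1.0

# A candidate split produces two children: (8+, 2-) and (2+, 8-).
left = binary_entropy(8 / 10)
right = binary_entropy(2 / 10)
weighted_children = (10 / 20) * left + (10 / 20) * right    # ~0.72

info_gain = parent_entropy - weighted_children
print(round(info_gain, 3))   # ~0.278 -> entropy dropped, so the split gains information
```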

Overfitting

If a decision tree is allowed to grow until every leaf is pure, it tends to memorize the training data instead of learning patterns that generalize. Two common ways to control this are truncation and pruning.

Truncation

Truncation stops the tree while it is still growing, so that it does not end up with leaves that contain very few data points. One way to do this is to set a minimum number of training samples that each leaf must contain.
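
In scikit-learn, truncation corresponds to the early-stopping hyperparameters of the tree estimators; a minimal sketch (the specific limits chosen here are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Truncation (early stopping): cap the depth and require a minimum number
# of training samples in every leaf so the tree stops growing early.
clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
clf.fit(X, y)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```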

Pruning

Pruning is another way to avoid overfitting in a decision tree. In pruning, the tree is first grown to its maximum size on the training data set. We then evaluate its performance on a validation data set and check whether each split actually improves the tree; if a split is not useful, we remove that node from the tree.
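
scikit-learn does not expose the exact grow-then-remove-splits procedure described above; instead it offers cost-complexity pruning via the ccp_alpha parameter, which can serve as a stand-in. A sketch that grows the full tree and then picks the pruning strength by validation accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow a full tree, then evaluate increasingly heavy pruning (larger ccp_alpha)
# on held-out data and keep the alpha that validates best.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score
print(f"best ccp_alpha = {best_alpha:.4f}, validation accuracy = {best_score:.3f}")
```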

Advantages and disadvantages of the decision tree

Advantages

  1. A decision tree is very simple to understand and interpret.
  2. It requires very little data preparation.
  3. The cost of making a prediction is logarithmic in the number of data points used to train the tree.
  4. It is able to handle both numerical and categorical data.
  5. It is possible to validate the model using statistical tests, which makes it possible to account for its reliability.

Disadvantages

  1. A decision tree can create an over-complex model that fits the training data closely but does not generalize well to unseen data. This is called overfitting.
  2. Decision tree learners can create biased trees if some classes dominate. It is therefore recommended to balance the data set before fitting the decision tree.
  3. Decision trees can be unstable, because small variations in the data can result in a completely different tree.

Inspired by KDnuggets’ article on the decision tree algorithm:

Decision Tree Algorithm, Explained (kdnuggets.com)
