An ID3 implementation: Node Class (Part 1/3)

4 minute read

It is natural and intuitive to classify a pattern through a sequence of questions. For example, we always asks if it is raining outside, or even cloudy, since we may need an umbrella as our shield.

There would not be an algorithmic way to replicate this behaviour? Yes, there is! A Decision Tree is a very good candidate to represent this “step-by-step” way of thinking and reasoning.

One decision tree learning algorithm is called ID3 (Iterative Dichotomiser 3), and in this serie of posts I will explain how I implemented it and generalized it for a multiclass setting.

I did it for this Mini-Project. The implementation for the project is incomplete, since the pre-prunning is not available, nor some “accessory functions” (for counting nodes and leaves; or for estimating accuracies; or for keeping track of the internal paths), which are needed to answer the proposed questions.

I used Python in Google Collaboratory, and you can find all the project in my GitHub.

For this post, I will focus on the Node Class, since I started the project with it.

Now let’s start the fun part!

Initial Considerations

The first step that I took was to define the basic Attributes and Methods I would like my Node Class to have.

A node basically needs to store an information and point to its own child nodes (if it is not a leaf node).

In the context of Decision Tree Learning, the node needs to store up to the following four things:

  1. What attribute must be tested. Remember: the instances are ordered vectors, and each position is an unambiguous feature. If the instance is array-like, the attribute to be tested can be stored like the feature position. Let’s call this node class attribute attribute_to_test.

  2. The attribute values. The feature specified in attribute_to_test will be used to classify the instance (to try to predict the class). The Decision Tree Learning Algorithm assumes that each attribute value must represent a branch. Then, I must annotate them. I will use the list-like node class attribute attribute_value.

  3. If the current node is a leaf (is_leaf node class attribute). Up to this moment, I only discussed the node as if it is a non-leaf one (a decision node), but it can be a leaf (node with no branch at all) indeed. So a boolean class attribute will let me aware what kind of node I am dealing with

  4. If the node is a leaf, it must present what label the tree should impute to the instance. Let’s create the node class attribute label to do the job.

The last node class attribute is the list-like child_pointers. If the node needs to know its children, let’s point to them all.

The few methods defined are only three:

  1. __init__: a class constructor, but void.
  2. add_branch: create a branch down the current node.
  3. turn_node_to_leaf: turn the current node in a leaf.

Code

Start coding the class with its constructor. Notice it does not create any value for any attribute at all.

class MtxslvNode:

  def __init__(self):
     self.attribute_to_test = None
     self.attribute_value = []
     self.child_node = []
     self.is_leaf = False
     self.label = None

The next thing needed is the add_branch method. It has three parameters: what attribute must be tested (attr_2_test), what value it must have(attr_value) and a child node to append.

Notice the inner working of it. The current node that runs this method must define what instance feature must be tested. “Testing the feature” is to see if the instance feature matches a value (the attr_value) so if it does, the next valid node to look for is child_no. This way of appending nodes (that is, creating branches), make the class attributes attribute_value and child_node to have a pair like behaviour: if the instance feature has attribute_value(k), then child_node(k) is the branch you should go to.

The other important thing the method does is to assure the class consistency. The node cannot be a test node and a leaf node at the same time. Then, if the method is called, is_leaf goes to False and label is None.

  def add_branch(self, attr_2_test, attr_value, child_no):
    if(attr_value == None):
      raise Exception('Attribute value cannot be None')
    elif(child_no == None):
      raise Exception('Child node cannot be None')
    else:  
      self.attribute_to_test = attr_2_test      
      self.attribute_value.append(attr_value)
      self.child_node.append(child_no)
      self.is_leaf = False
      self.label = None

The last method deals with leaf nodes. Actually, it turns a node into a leaf. The class function has three attributes: attr_2_test, attr_value and class_label. Turning a node into a leaf makes is_leaf go to True and label to class_label.

During the writing of this post I realized attr_value parameter ended having no purpose, since the searching algorithm defined in the Tree Class does not use it anymore. It is a deprecated idea.

  def turn_node_to_leaf(self, attr_2_test, attr_value, class_label):
    if(class_label == None):
      raise Exception('Class label cannot be None')
    elif(attr_value == None):
      raise Exception('Attribute value for testing cannot be None')
    elif(attr_2_test == None):
      raise Exception('No attribute to test')  
    else:
      self.is_leaf = True
      self.label = class_label
      self.attribute_value = [attr_value]
      self.attribute_to_test = attr_2_test

Notice the first part of the last two methods tests if any parameter was past None type.

Material Used and Special Thanks

This project was the first time I used Python Classes. A lot of doubts appeared, and I needed some references. I would like to write them down.

I would like also to thank two great friends of mine that helped me a lot with the language details and OOP/Tree Data Structure concepts: Cogitus and RodolfoStark.

Vocabulary Clues

  • attributes: each element in an instance.
  • instances: an example; a collection of numerical measures (phisical measures of any kind).
  • multiclass setting: a problem where we need to classify a pattern in more than two classes.
  • pattern: in the context of this post, it can be a collection of numerical measures (that is a synonym of instance)