An ID3 implementation: Math Functions (Part 2/3)
In the previous post I delimited the project goals and what was done and why. I also explained the first of two important classes: Node Class. The second one will be explained in the third post of this serie.
In this post I will explain a group of functions that is needed to make the Tree Class (yet to come) work.
Codes
The file groups four functions, so I will explain each separately.
Best Classifier Attribute
The ID3 algorithm will be properly explained in the next post, but it is important to mention the optimization/search step tries to get the tree that classifies the instances the best. This is done by choosing what attribute, at each question (remember the last post introdution?), best classifies the instance.
Then, to choose what attribute best classifies the examples is an important step to be made. This attribute will be chosen “to split on” (it means there are as much branches below the node which tests this attribute as this attribute values).
A good quantitative measure of the worth of an attribute is the statistical property of entropy (explained in a topic below). The function detailed here uses the mentioned value to select the attribute that is most useful for classifying examples.
Let’s take a look ate the project paper to know how to select the best attribute:
If we choose to split on an attribute ‘A’ taking m values, then the conditional entropy for the same is given by,
\(\hat{H}(D \vert A) := \sum\limits_{i = 1}^{m} \hat{p}(A = a_i \vert D)H(D[A = a_i])\), where
\(D[A = a_i] := \{(x,l) \in D \vert x[A] = a_i\}\), \(\hat{p}(A=a_i \vert D):= \frac{\vert D[A=a_i] \vert}{\vert D \vert}\).
Recall that in the original ID3 algorithm, the attribute to split on is chosen to be one which (…) minimizes the conditional entropy.
Let’s try to understand the expression for \(\hat{H}(D \vert A)\):
- Notice \(D[A = a_i]\) is literally the chunk of dataset that have attribute A with value \(a_i\).
- \(\hat{p}(A = a_i \vert D)\) is the probability of occurence that attribute A has value \(a_i\). It is calculated as the ratio of how many examples has attribute A with value \(a_i\) and how many examples exists at all.
The python function that indicates what is the best attribute to split on can be seen below. It has three attributes: features
and labels
(the dataset itself); and copy_of_attr_2_test
. A brief note about this last one: when the ID3 algorithm is running (creating/searching the best tree), it needs to keep track of what attributes where already tested. The last parameter is a boolean vector: each position represents an instance attribute (if the attribute was already tested, then False
, if not True
).
The function returns the position of the attribute that best classifies the data (best attribute to split on).
def best_classifier_attribute(features, labels, copy_of_attr_2_test):
if( np.shape(features)[1] != len(copy_of_attr_2_test) ):
raise Exception("Features' Attributes do not match Attributes vector given")
entropy_dataset_given_attributes = []
for i in range(np.shape(features)[1]):
if(copy_of_attr_2_test[i]):
all_possible_attribute_values = list(set(features[:,i]))
probability_A_equals_ai = []
entropy_A_equals_ai = []
for j in all_possible_attribute_values:
probability_A_equals_ai.append(features[:,i].tolist().count(j)/np.shape(features[:,i])[0])
dataset_A_equals_ai = []
for k in range(np.shape(features)[0]):
if(features[k,i] == j):
dataset_A_equals_ai.append(labels[k].tolist())
entropy_A_equals_ai.append( mtxslv_entropy(np.array(dataset_A_equals_ai)) )
entropy_dataset_given_attributes.append( np.dot(entropy_A_equals_ai,probability_A_equals_ai) )
else:
entropy_dataset_given_attributes.append( float('inf') )
return np.argmin(entropy_dataset_given_attributes)
The code starts testing if the shape of the given boolean vector matches the instances’ attributes quantity. It is done just to be sure the parameters are OK.
A void list named entropy_dataset_given_attributes
is allocated. Each position of it will receive the conditional entropy of the dataset given each attribute.
The program iterates over the attributes, or column by column (imagine the dataset as a table, the examples lies on the lines, and the last column is the labels). Two situations may appear: the attribute does not need to be tested (copy_of_attr_2_test
is False
), then entropy_dataset_given_attributes
at that column/attribute position store Inf
(an inifinity value, because I want the smallest one, so this will not bother me). If the attribute is testable, a process starts.
If the attribute is testable, let’s allocate three lists:
all_possible_attribute_values
: which collects all values \(a_i\) that attribute A can have.probability_A_equals_ai
: the probability of occurence that attribute A equals \(a_i\). Notice is the list of \(\hat{p}\) values.entropy_A_equals_ai
: the entropy of a chunk of the dataset (all the examples where A equals \(a_i\)). Notice it is the list of values related to \(\hat{H}(D[A = a_i])\).
Let’s start a nested loop: for each possible \(a_i\) value. Inside this loop the code calculates the probability of occurence of \(a_i\) by the ratio of the number of elements that equals \(a_i\) and all the number of examples.
In the same inner loop (for each possible \(a_i\) value) let’s allocate a list dataset_A_equals_ai
and fill it with all the examples for which the current attribute equals the value \(a_i\).
Back to the for each attribute values loop, we calculate the entropy of this dataset (and adds the value at the entropy_A_equals_ai
list).
The last step of the testable attribute situation is to calculate \(\hat{H}(D[A = a_i])\). It is basically the sum of the element-wise product between entropy_A_equals_ai
and probability_A_equals_ai
. If you think for a while, you will notice is the dot product between the lists.
The last command of the function is to return the index of the smallest value in entropy_dataset_given_attributes
. Remember: each position of this list is related to an instance attribute position. Then, the position n is returned because the nth attribute is the best to split on.
Entropy of a Dataset
The Entropy of a Dataset is a very important concept because it indicates the attribute that best classifies (or separates into “purer” chunks) the dataset.
A good definition of the Entropy of a Dataset can be seen here. The transcription of the paper explanation can be seen below.
\[\hat{H}(D) := \sum\limits_{i = 1}^{k} \hat{p_i}\log_2(\hat{p_i})\]Suppose the (current) dataset D \(= \{(x_j,l_j) \}\) contains |D| examples belonging to k different classes. Let the count for class ‘i’ in a set S be given by \(\nu_i := |\{(x,l) \in S | l = c_i\}|\). The estimated probability-of-occurence for each class is then given by, \(\hat{p_i} = \frac{\nu_i(D)}{|D|}\), and the (estimated) entropy of D is given by,
How to read this mathy thing? Let’s understand it step by step.
The dataset D is the collection of instances we have. It means all the vectors in the form \([features, labels]\). In the expression above, \(x_j\) means the features the instance have, and \(l_j\) the respective label.
Let’s suppose there are i different classes. For example, if I would classify a meteorological pattern as will rain or will not rain, i would equal 2. Now I may ask: how many instances have class i? Let’s call this value \(\nu_i\). Well, this value is defined simply by how many pairs (x,l) have l equal to i.
With the \(\nu_i\) value in hands, we can compute the probability of occurence for each class. It is just the amount of times class i appears in the dataset: \(\nu_i(D)/ \vert D \vert\). This value is called \(\hat{p_i}\).
That said, let’s dive in the code.
Notice that in order to compute the entropy \(\hat{H}\) there is no need to know the instances’ attributes. Knowing just the labels is enough to calculate it, because the mathematical formula deals only with quantities, not the values itself. So the parameters of the function are the label column (imagine the dataset as a table, the last column is the labels). I don’t know if it is interesting to change the logarithm base, then I will let it changeable (as a parameter bas
), though as an optional parameter with default value 2.
def mtxslv_entropy(labels,bas = 2):
existent_classes = list(set(labels[:,0]))
probability = []
# calculate the statistics of each class (probability of each one)
for i in existent_classes:
probability.append(labels.tolist().count(i)/np.shape(labels[:,0])[0])
return entropy(probability, base= bas)
The first step is to find out how many classes there exist. Let’s call it existent_classes
.
I also declare a void list probability
, which will collect the probability of each class (\(\hat{p}\)).
Then the code iterate for each class value i
. It appends in probability
how many values equals i
divided by how many labels there exists.
The function return the scipy function entropy, which receives probability
and bas
.
How to Find the Most Common Class
One step of the Decision Tree Learning Algorithm (explained in the third post of this serie) requires that we know what Class (label, in this context) appears the most in the dataset. It is important to know that because when trying to predict a label, the choice that is less likely to fail is that one that bet the most probable label.
The function most_common_class
needs only one parameter: labels
(column-like array).
def most_common_class(labels):
existent_classes = list(set(labels[:,0]))
probability = []
# calculate the statistics of each class (probability of each one)
for i in existent_classes:
probability.append(labels.tolist().count(i)/np.shape(labels[:,0])[0])
#turn probability list to numpy array so we can use np.argmax
np_prob = np.array(probability)
where_is_max_prob = np.argmax(np_prob)
return existent_classes[where_is_max_prob]
The initial behaviour is similar to the mtxslv_entropy
function: we find the existent labels (existent_classes
variable) and calculate the probability of each of them appear in the dataset (probability
variable).
We turn the probability
list into a numpy array (np_prob
) so we can find the index of the largest value (where_is_max_prob
), using np.argmax()
.
The function returns the class with high probability.
Notice each position of probability
list (and therefore np_prob
array) refers to each of the existent classes. Then, the position in the list where the probability is high is related to the most probable class.
Getting a Subset out of a Dataset
The ID3 algorithm (which will be explained in the following post), need to compute (or get) examples that have specific value for some attributes. This process is done in a separate function so the Class file do not become too unnecessarily long.
Then, the function mtxslv_get_subset()
has four parameters
features
: the instances attributes;labels
: the instances labels;attribute
: what attribute must be tested;attribute_value
: what value the attribute must have.
It returns two numpy arrays, one related to the subset features, and the other one related to the subset labels.
I allocate two void lists: subset_features
and subset_labels
. After a brief Exception treatment, the process begins properly.
The program iterates over the instances, comparing the attribute indicated by the parameter attribute
with the value attribute_value
. If they match, the attributes and label are appended in the respective lists.
The function return the numpy array of these lists.
Below the function can be seen.
def mtxslv_get_subset(features,labels,attribute,attribute_value):
subset_features = []
subset_labels = []
if(not(attribute_value in set(features[:,attribute]))):
raise Exception("Attribute Value does not exist in Assigned Attribute")
else:
for it in range(np.shape(features)[0]):
if(features[it,attribute] == attribute_value):
subset_features.append(features[it,:].tolist())
subset_labels.append(labels[it,:].tolist())
return np.array(subset_features),np.array(subset_labels)
Material Used
The concept of Entropy was very confused for me the first time I read about it. Then I was advised to read this Medium Article. It helped me a lot.