Overview

I recently had to create a Decision Tree-based classifier model for ranking a set of image labels (or classes), given a feature (or latent) space and a query image, for CSE515 at ASU. The caveat was that we weren’t permitted to use any libraries to build the decision tree classifier (which would’ve been trivial with something like scikit-learn).

I came across this blog post by Narsimha that heavily influenced my solution (most of the algorithm is exactly as he demonstrates there, so I really don’t take much credit here), and adapted it into the following:

Code

import numpy as np

class DecisionNode:
    def __init__(self, feature=None, threshold=None, left=None, right=None, class_counts=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.class_counts = class_counts        # For ranking
        self.value = value
 
def decision_tree_train(X, y, max_depth=10, min_samples_split=5):
    """
    Train decision tree with a set max depth to prevent overfitting
 
    Args:
        X: Feature matrix of shape (n_samples, n_features)
        y: Labels (n_samples, ) -> encoded as 0=glioma, 1=menin, 2=tumor
        max_depth: Maximum tree depth
        min_samples_split: Minimum samples required to split node
    """
 
    def build_tree(X, y, depth):
        num_samples, num_features = X.shape
 
        # Calculate class counts for ranking
        unique_labels, counts = np.unique(y, return_counts=True)
        class_counts = {int(label): int(count)
                            for label, count in zip(unique_labels, counts)}
 
        # Base cases
        if (num_samples < min_samples_split or
            len(set(y)) == 1 or
            depth >= max_depth):
            # Return leaf node with class counts for ranking
            majority_class = unique_labels[np.argmax(counts)]
            return DecisionNode(value=majority_class, class_counts=class_counts)
 
        # Find best split
        best_feature = None
        best_threshold = None
        best_gini = float('inf')
 
        for feature_index in range(num_features):
            # Sample thresholds to avoid excessive computation
            feature_values = X[:, feature_index]
            thresholds = np.percentile(feature_values, [25, 50, 75])
 
            for threshold in thresholds:
                left_mask = feature_values <= threshold
                right_mask = ~left_mask
 
                # Skip if all go one way
                if left_mask.sum() == 0 or right_mask.sum() == 0:
                    continue
 
                left_gini = gini_impurity(y[left_mask])
                right_gini = gini_impurity(y[right_mask])
                # Weighted by child sizes; skipping the division by num_samples doesn't change which split wins
                weighted_gini = (left_mask.sum() * left_gini + right_mask.sum() * right_gini)
 
                if weighted_gini < best_gini:
                    best_feature = feature_index
                    best_threshold = threshold
                    best_gini = weighted_gini
 
        # Handle edge case where all feature values are identical
        if best_feature is None:
            majority_class = unique_labels[np.argmax(counts)]
            return DecisionNode(value = majority_class, class_counts=class_counts)
 
        # Build subtrees
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
 
        left_subtree = build_tree(X[left_mask], y[left_mask], depth + 1)
        right_subtree = build_tree(X[right_mask], y[right_mask], depth + 1)
 
        return DecisionNode(feature=best_feature, threshold=best_threshold, left=left_subtree,
                            right=right_subtree, class_counts=class_counts)
 
    def gini_impurity(y):
        """Gini impurity: 1 - sum(p_i^2); 0 means the node is pure."""
        if len(y) == 0:
            return 0
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return 1 - np.sum(probabilities ** 2)
 
    return build_tree(X, y, depth=0)
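
To sanity-check the trainer end to end, here’s roughly how I’d call it; the random data below is just a stand-in for the actual 512-dimensional image features from my feature space:

rng = np.random.default_rng(0)
X = rng.random((300, 512))           # 300 images, 512-dimensional feature vectors (illustrative)
y = rng.integers(0, 3, size=300)     # labels: 0=glioma, 1=menin, 2=tumor

tree = decision_tree_train(X, y, max_depth=10, min_samples_split=5)
print(tree.feature, tree.threshold, tree.class_counts)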

Algorithm

I spent quite some time actually understanding the underlying algorithm (hence this study note), so I’ll explain it below along with the adaptation I made for my specific use-case:

Step 1: Start with all data

def build_tree(X, y, depth):
	# X shape: (num_images, num_features)
	# y shape: (num_images, ) - labels like [0, 0, 1, 2, 0, 1, ...] corresponding to different image classes
	
	num_samples, num_features = X.shape

Step 2: Check stopping conditions

Stop if:

  • Too few samples to split node (n_samples < min_samples_split)
  • All objects belong to the same class
  • Max depth reached

if (num_samples < min_samples_split or
    len(set(y)) == 1 or
    depth >= max_depth):

    # Create a LEAF NODE
    majority_class = unique_labels[np.argmax(counts)]
    return DecisionNode(value=majority_class,
                        class_counts=class_counts)
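
For example, a branch that hits the depth limit with 15 samples, 12 of them glioma, would produce a leaf like this (counts made up for illustration):

leaf = DecisionNode(value=0, class_counts={0: 12, 1: 2, 2: 1})   # majority class 0 = glioma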

Step 3: Find best split

  1. Try every feature and threshold
    • for each feature:
      • sample 3 thresholds (for example): np.percentile(feature_values, [25, 50, 75])

best_feature = None      # Which feature to split on
best_threshold = None    # What value to compare against
best_gini = float('inf') # Best "impurity" score (lower = better)

for feature_index in range(num_features):  # Try all 512 features
    feature_values = X[:, feature_index]
    # Get all values for this feature across all images
    # Example: [0.23, 0.45, 0.12, 0.89, ...]

    # Instead of trying ALL unique values, sample 3 thresholds
    thresholds = np.percentile(feature_values, [25, 50, 75])
    # Example: [0.15, 0.40, 0.70]
  2. For a given feature, find left and right Gini impurities
    • Gini impurity measures how “mixed” the classes are
    • then, find weighted Gini (weight by the number of samples in each child, left and right)

left_mask = feature_values <= threshold
right_mask = ~left_mask

left_gini = gini_impurity(y[left_mask])
right_gini = gini_impurity(y[right_mask])
weighted_gini = (left_mask.sum() * left_gini + right_mask.sum() * right_gini)
 
# ========================================
 
def gini_impurity(y):
	if len(y) == 0:
		return 0
	_, counts = np.unique(y, return_counts=True)
	probabilities = counts / len(y)
	return 1 - np.sum(probabilities ** 2)
  3. Evaluate each possible split
    • find the above weighted Gini impurity for each threshold in thresholds
  4. Keep track of best split
if weighted_gini < best_gini:
	best_feature = feature_index
	best_threshold = threshold
	best_gini = weighted_gini
  5. After trying all features and thresholds, we’ll have found the best split for that decision node
    • e.g., Best split: feature[42] <= 0.6
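
As a quick worked example with made-up labels: say a threshold sends [0, 0, 0, 1] left and [1, 2] right. Then gini(left) = 1 - (3/4)^2 - (1/4)^2 = 0.375, gini(right) = 1 - (1/2)^2 - (1/2)^2 = 0.5, and the weighted score is 4 * 0.375 + 2 * 0.5 = 2.5. Whichever feature/threshold pair yields the lowest such score becomes the split for the node.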

Step 4: Recursively build children

# Split the data using the best feature/threshold found
left_mask = X[:, best_feature] <= best_threshold
right_mask = ~left_mask
    
# RECURSIVELY build left and right subtrees
left_subtree = build_tree(X[left_mask], y[left_mask], depth + 1)
right_subtree = build_tree(X[right_mask], y[right_mask], depth + 1)
    
# Return an internal node (not a leaf)
return DecisionNode(
	feature=best_feature,
	threshold=best_threshold,
	left=left_subtree,
	right=right_subtree,
	class_counts=class_counts
)

This recursion continues until all branches reach leaf nodes (stopping conditions met).
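
Purely as a debugging aid (my own addition, not part of the reference algorithm), a small recursive walker makes it easy to see what the finished tree looks like, e.g. structures like the example in the next section:

def print_tree(node, depth=0):
    """Pretty-print the trained tree; leaves show the counts used for ranking."""
    indent = "  " * depth
    if node.left is None and node.right is None:
        # Leaf node: majority class plus the raw class counts
        print(f"{indent}Leaf -> class {node.value}, counts {node.class_counts}")
    else:
        print(f"{indent}feature[{node.feature}] <= {node.threshold:.3f}? counts {node.class_counts}")
        print_tree(node.left, depth + 1)
        print_tree(node.right, depth + 1)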

Complete Tree Example

                    Root (3016 images)
                    Feature[42] <= 0.6?
                         /        \
                       YES         NO
                       /             \
              Left (1800 imgs)    Right (1216 imgs)
              Feature[105] <= 0.3?   Feature[208] <= 0.7?
                /        \              /          \
              YES         NO          YES           NO
              /            \          /              \
         Leaf1          Leaf2     Leaf3           Leaf4
      (900 imgs)     (900 imgs)  (600 imgs)     (616 imgs)
      70% glioma     60% tumor   80% menin      55% glioma
      20% tumor      30% glioma  15% glioma     30% tumor
      10% menin      10% menin    5% tumor      15% menin

The Adaptation

The key adaptation I’ve made in my version over the reference is that each DecisionNode also maintains a record of class counts. This allows the labels in a leaf node to be ranked, by turning their counts into probabilities.

Often, with appropriate max depth and min samples parameters, leaf nodes end up homogeneous and assign a probability of 1 (100%) to the single, correct label. But when a leaf contains a more varied mix, this adaptation still yields a ranking over every label present in that leaf, so the query can still be classified sensibly.
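
To close the loop, here’s a sketch of how I use those counts at query time. decision_tree_rank is my own naming (not from the reference post); it walks the tree to a leaf and turns that leaf’s class counts into a ranked list of (label, probability) pairs:

def decision_tree_rank(node, x):
    """Rank labels for a single query feature vector x using leaf class counts."""
    # Descend until we reach a leaf (leaves have no children)
    while node.left is not None and node.right is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right

    # Convert the leaf's class counts into probabilities, highest first
    total = sum(node.class_counts.values())
    return sorted(
        ((label, count / total) for label, count in node.class_counts.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

For a query that lands in a leaf like Leaf1 above, this returns something like [(0, 0.7), (2, 0.2), (1, 0.1)], and the top-1 prediction is simply the first label in the ranking.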