Overview
I recently had to create a Decision Tree-based classifier for CSE515 at ASU that ranks a set of image labels (or classes), given a feature (or latent) space and a query image. The caveat was that we weren’t permitted to use any libraries to build the decision tree classifier (which would’ve been trivial with something like scikit-learn).
I came across this blog post that heavily influenced my solution (most of the algorithm is exactly what Narsimha demonstrates in his post, so I really don’t take any credit here), and used it to create the following:
Code
import numpy as np

class DecisionNode:
    def __init__(self, feature=None, threshold=None, left=None, right=None, class_counts=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.class_counts = class_counts  # For ranking
        self.value = value
def decision_tree_train(X, y, max_depth=10, min_samples_split=5):
    """
    Train decision tree with a set max depth to prevent overfitting

    Args:
        X: Feature matrix of shape (n_samples, n_features)
        y: Labels (n_samples,) -> encoded as 0=glioma, 1=menin, 2=tumor
        max_depth: Maximum tree depth
        min_samples_split: Minimum samples required to split a node
    """
    def build_tree(X, y, depth):
        num_samples, num_features = X.shape

        # Calculate class counts for ranking
        unique_labels, counts = np.unique(y, return_counts=True)
        class_counts = {int(label): int(count)
                        for label, count in zip(unique_labels, counts)}

        # Base cases
        if (num_samples < min_samples_split or
                len(set(y)) == 1 or
                depth >= max_depth):
            # Return leaf node with class counts for ranking
            majority_class = unique_labels[np.argmax(counts)]
            return DecisionNode(value=majority_class, class_counts=class_counts)
        # Find best split
        best_feature = None
        best_threshold = None
        best_gini = float('inf')

        for feature_index in range(num_features):
            # Sample thresholds to avoid excessive computation
            feature_values = X[:, feature_index]
            thresholds = np.percentile(feature_values, [25, 50, 75])

            for threshold in thresholds:
                left_mask = feature_values <= threshold
                right_mask = ~left_mask

                # Skip if all go one way
                if left_mask.sum() == 0 or right_mask.sum() == 0:
                    continue

                left_gini = gini_impurity(y[left_mask])
                right_gini = gini_impurity(y[right_mask])
                weighted_gini = (left_mask.sum() * left_gini + right_mask.sum() * right_gini)

                if weighted_gini < best_gini:
                    best_feature = feature_index
                    best_threshold = threshold
                    best_gini = weighted_gini
        # Handle edge case where all feature values are identical
        if best_feature is None:
            majority_class = unique_labels[np.argmax(counts)]
            return DecisionNode(value=majority_class, class_counts=class_counts)

        # Build subtrees
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        left_subtree = build_tree(X[left_mask], y[left_mask], depth + 1)
        right_subtree = build_tree(X[right_mask], y[right_mask], depth + 1)

        return DecisionNode(feature=best_feature, threshold=best_threshold, left=left_subtree,
                            right=right_subtree, class_counts=class_counts)
    def gini_impurity(y):
        if len(y) == 0:
            return 0
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return 1 - np.sum(probabilities ** 2)

    return build_tree(X, y, depth=0)
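Just for context, here’s roughly how this gets called. The array sizes below are dummy placeholders for illustration; the real X held the latent-space feature vectors for every database image (on the order of 3016 images x 512 features) and y held the encoded labels.

# Dummy stand-ins for illustration only
X_train = np.random.rand(200, 32)            # latent-space feature vectors
y_train = np.random.randint(0, 3, size=200)  # labels: 0=glioma, 1=menin, 2=tumor

root = decision_tree_train(X_train, y_train, max_depth=10, min_samples_split=5)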
Algorithm
I spent quite some time actually understanding the underlying algorithm (hence this study note), so I’ll explain it below, along with the adaptation I made for my specific use case:
Step 1: Start with all data
def build_tree(X, y, depth):
    # X shape: (num_images, num_features)
    # y shape: (num_images,) - labels like [0, 0, 1, 2, 0, 1, ...] corresponding to different image classes
    num_samples, num_features = X.shape
Step 2: Check stopping conditions
Stop if:
- Too few samples to split the node (num_samples < min_samples_split)
- All samples belong to the same class
- Max depth reached
if (num_samples < min_samples_split or
        len(set(y)) == 1 or
        depth >= max_depth):
    # Create a LEAF NODE
    majority_class = unique_labels[np.argmax(counts)]
    return DecisionNode(value=majority_class,
                        class_counts=class_counts)
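To make this concrete, here’s what a (hypothetical) leaf might look like for a node that received 70 images, using the label encoding from the docstring; the counts are invented for illustration:

# Hypothetical leaf: 63 glioma (0), 5 menin (1), 2 tumor (2) landed here,
# so the majority class is 0 and the counts are kept for ranking later
leaf = DecisionNode(value=0, class_counts={0: 63, 1: 5, 2: 2})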
Step 3: Find best split
- Try every feature and threshold
- For each feature, sample 3 thresholds (for example): np.percentile(feature_values, [25, 50, 75])
best_feature = None       # Which feature to split on
best_threshold = None     # What value to compare against
best_gini = float('inf')  # Best "impurity" score (lower = better)

for feature_index in range(num_features):  # Try all 512 features
    feature_values = X[:, feature_index]
    # Get all values for this feature across all images
    # Example: [0.23, 0.45, 0.12, 0.89, ...]

    # Instead of trying ALL unique values, sample 3 thresholds
    thresholds = np.percentile(feature_values, [25, 50, 75])
    # Example: [0.15, 0.40, 0.70]
- For a given feature, find left and right Gini impurities
- [i] Gini impurity measures how “mixed” the classes are (there’s a small worked example at the end of this step)
- then, find the weighted Gini (weighted by the number of samples in each child, left and right)
left_mask = feature_values <= threshold
right_mask = ~left_mask

left_gini = gini_impurity(y[left_mask])
right_gini = gini_impurity(y[right_mask])
weighted_gini = (left_mask.sum() * left_gini + right_mask.sum() * right_gini)
# ========================================
def gini_impurity(y):
    if len(y) == 0:
        return 0
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    return 1 - np.sum(probabilities ** 2)
- Evaluate each possible split
- find the above weighted Gini impurity for each threshold in thresholds
- Keep track of the best split
if weighted_gini < best_gini:
    best_feature = feature_index
    best_threshold = threshold
    best_gini = weighted_gini
- After trying all features and thresholds, we’ll have found the best split for that decision node
- e.g., Best split: feature[42] <= 0.6
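To make the impurity numbers concrete, here’s a small worked example with made-up counts for one candidate split:

# Hypothetical split: 6 images go left (4 glioma, 2 menin), 4 go right (all tumor)
left_gini = 1 - ((4/6)**2 + (2/6)**2)           # ~0.444 (mixed node)
right_gini = 1 - (4/4)**2                       # 0.0 (pure node)
weighted_gini = 6 * left_gini + 4 * right_gini  # ~2.67; compared against other candidates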
Step 4: Recursively build children
# Split the data using the best feature/threshold found
left_mask = X[:, best_feature] <= best_threshold
right_mask = ~left_mask

# RECURSIVELY build left and right subtrees
left_subtree = build_tree(X[left_mask], y[left_mask], depth + 1)
right_subtree = build_tree(X[right_mask], y[right_mask], depth + 1)

# Return an internal node (not a leaf)
return DecisionNode(
    feature=best_feature,
    threshold=best_threshold,
    left=left_subtree,
    right=right_subtree,
    class_counts=class_counts
)
This recursion continues until all branches reach leaf nodes (stopping conditions met).
Complete Tree Example
                    Root (3016 images)
                   Feature[42] <= 0.6?
                    /               \
                  YES                NO
                  /                   \
     Left (1800 imgs)                  Right (1216 imgs)
   Feature[105] <= 0.3?              Feature[208] <= 0.7?
       /         \                       /         \
     YES          NO                   YES          NO
     /             \                   /             \
  Leaf1           Leaf2             Leaf3           Leaf4
(900 imgs)     (900 imgs)        (600 imgs)      (616 imgs)
70% glioma     60% tumor         80% menin       55% glioma
20% tumor      30% glioma        15% glioma      30% tumor
10% menin      10% menin          5% tumor       15% menin
The Adaptation
The key adaptation I’ve made in my version over the reference is that each decision node also maintains a record of class counts. This allows the labels at a leaf node to be ranked by turning their counts into probabilities.
Often, with appropriate max-depth and min-samples parameters, leaf nodes end up homogeneous and assign a probability of 1 (100%) to the single, correct label. But if a leaf node holds a more varied mix of classes, this adaptation still gives us a ranking of labels that can be used to classify the query.
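The prediction side isn’t shown above, but as a rough sketch (not the exact code I submitted), ranking a query comes down to walking the tree to a leaf and sorting that leaf’s class_counts. The function name and the normalization below are my own choices for illustration:

def rank_labels(node, x):
    """Walk the tree for a single query feature vector x and return
    labels ranked by their probability in the reached leaf."""
    # Traverse internal nodes until we hit a leaf (leaves have feature=None)
    while node.feature is not None:
        if x[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right

    # Turn the leaf's class counts into probabilities, highest first
    total = sum(node.class_counts.values())
    ranking = sorted(node.class_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, count / total) for label, count in ranking]

# Hypothetical example: rank_labels(root, X_query[0])
# might return [(0, 0.7), (2, 0.2), (1, 0.1)] for a query landing in Leaf1 above

Classification is then just a matter of taking the top-ranked label, while the full list gives the ranking of all labels for the query.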