Wednesday, 27 April 2011

An introduction to Decision Trees

Decision Trees belong to the class of supervised algorithms. They are among the most popular classification techniques and are widely used for cross / up / deep selling and churn modeling. A Decision Tree consists of a set of rules, each of which associates a set of conditions with a specific outcome. These rules can also be represented in an intuitive tree format, enabling the visualization of the relationships between the predictors and the output.

Decision Trees are often used for insight and for the profiling of target events / attributes due to their transparency and the explanations they provide for their predictions. They perform a kind of ‘supervised segmentation’: they recursively partition (split) the overall population into ‘pure’ sub-groups, that is, homogeneous sub-groups with respect to the target field, for instance sub-groups containing only churners or only members of a particular cluster.

They start from an initial or root node which contains the entire dataset. At each level, the effect of every predictor on the target field is evaluated and the population is split using the predictor that yields the best separation with respect to the target field. The process is applied recursively and the derived groups are split further into smaller and ‘purer’ sub-groups until the tree is fully grown into a set of ‘terminal’ nodes. In general, the tree-growing procedure stops when additional splits can no longer improve the separation or when one of the user-specified stopping criteria has been met.
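As a rough illustration of this recursive ‘supervised segmentation’, the Python sketch below grows a tree on a binary target by repeatedly picking the categorical predictor whose split yields the purest children. It is a simplified sketch under assumed field names and a naive purity score, not the implementation of any particular algorithm described later.

```python
# Minimal sketch of recursive 'supervised segmentation' on a binary target.
# Field names ("churn", etc.) and the purity measure are illustrative assumptions only.

def purity(records, target="churn"):
    """Proportion of the modal (most frequent) target category in a node."""
    positives = sum(1 for r in records if r[target])
    p = positives / len(records)
    return max(p, 1 - p)

def best_split(records, predictors, target="churn"):
    """Pick the categorical predictor whose split yields the purest children."""
    best, best_score = None, -1.0
    for pred in predictors:
        values = set(r[pred] for r in records)
        children = [[r for r in records if r[pred] == v] for v in values]
        if len(children) < 2:
            continue
        # weighted average purity of the candidate child nodes
        score = sum(len(c) / len(records) * purity(c, target) for c in children)
        if score > best_score:
            best, best_score = pred, score
    return best

def grow_tree(records, predictors, target="churn", min_size=50):
    """Recursively split until nodes are pure, too small, or no split is possible."""
    if purity(records, target) == 1.0 or len(records) < min_size or not predictors:
        return {"node": "terminal", "size": len(records)}
    pred = best_split(records, predictors, target)
    if pred is None:
        return {"node": "terminal", "size": len(records)}
    remaining = [p for p in predictors if p != pred]
    return {"split_on": pred,
            "children": {v: grow_tree([r for r in records if r[pred] == v],
                                      remaining, target, min_size)
                         for v in set(r[pred] for r in records)}}

# Example usage with invented fields
data = [{"tenure_band": "low", "region": "north", "churn": True},
        {"tenure_band": "high", "region": "north", "churn": False},
        {"tenure_band": "low", "region": "south", "churn": True},
        {"tenure_band": "high", "region": "south", "churn": False}]
print(grow_tree(data, ["tenure_band", "region"], min_size=2))
```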

When used for classification modeling, Decision Trees generate a rule set containing rules of the following format:

IF (PREDICTOR VALUES) THEN (PREDICTION=TARGET OUTCOME with a specific CONFIDENCE or PROPENSITY SCORE).

For example:

IF (Tenure <= 4 yrs and Number of different products <= 2 and Recency of last Transaction > 20 days) THEN (PREDICTION=Churner and CONFIDENCE =0.78)
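Expressed as code, a rule of this form might look like the sketch below. The field names (tenure_years, num_products, recency_days) and the 0.78 confidence are taken from the example above and are purely illustrative.

```python
def churn_rule(customer):
    """Illustrative translation of the example rule into a scoring function."""
    if (customer["tenure_years"] <= 4
            and customer["num_products"] <= 2
            and customer["recency_days"] > 20):
        return {"prediction": "Churner", "confidence": 0.78}
    return None  # rule does not fire; another terminal node's rule applies

# Example usage
print(churn_rule({"tenure_years": 3, "num_products": 1, "recency_days": 35}))
```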


When the tree is fully grown, each record lands in one of the terminal nodes. Each terminal node comprises a distinct rule which associates the predictors with the output. All records landing in the same terminal node get the same prediction and the same confidence (likelihood of the prediction).

The path of the successive splits from the root node to each terminal node indicates the rule conditions, in other words, the combination of predictor characteristics associated with a specific outcome. In general, if misclassification costs have not been defined, the modal (most frequent) outcome category of the terminal node denotes the prediction of the respective rule. The proportion of the modal category in each terminal node designates the confidence of the prediction. Thus, if 78% of the records in a terminal node fall in the category of non-churners, then the respective rule’s prediction would be ‘non-churner’ with a confidence of 78% or 0.78. Since the target field in this example is binary and simply separates churners from non-churners, the rule’s churn propensity would be (1 - 0.78), thus 0.22 or 22%. When the model scores new cases, all customers who satisfy the rule conditions, that is, all customers with the terminal node’s profile, will be assigned a churn propensity of 22%.
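A minimal sketch of how a terminal node’s class counts translate into a prediction, confidence and churn propensity; the counts are invented to match the 78% / 22% example above.

```python
def score_terminal_node(churners, non_churners):
    """Derive prediction, confidence and churn propensity from a node's class counts."""
    total = churners + non_churners
    prediction = "churner" if churners >= non_churners else "non-churner"
    confidence = max(churners, non_churners) / total   # share of the modal category
    churn_propensity = churners / total                 # likelihood of the churn outcome
    return prediction, confidence, churn_propensity

# Hypothetical terminal node with 22 churners and 78 non-churners
print(score_terminal_node(22, 78))   # ('non-churner', 0.78, 0.22)
```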

There are various Decision Tree algorithms with different tree-growth methods. All of them share the goal of maximizing the total purity by identifying sub-segments dominated by a specific outcome. However, they differ in the measure they use for selecting the optimal split.

Classification and Regression Trees (C&RT) produce splits of two child nodes, also referred to as binary splits. They typically use an impurity measure named Gini to evaluate candidate splits. Gini is a measure of dispersion that depends on the distribution of the outcome categories. It takes values between 0 and 1, reaching its maximum (worst case) when the outcome categories are evenly balanced and its minimum of 0 (best case) when all records of a node are concentrated in a single category.
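A small sketch of the Gini impurity of a node and the weighted impurity of a candidate binary split, as one way such a measure can be computed; the class counts used here are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when a node is pure, largest when categories are evenly mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left_labels, right_labels):
    """Weighted Gini impurity of a binary (two-child) split; lower is better."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# Illustrative example: a pure left child and an evenly mixed right child
print(gini(["churner"] * 10))                             # 0.0
print(split_gini(["churner"] * 10,
                 ["churner"] * 5 + ["non-churner"] * 5))  # 0.25
```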

The C5.0 algorithm can produce more than two sub-groups at each split, offering non-binary splits. The evaluation of candidate splits is based on the information gain ratio, an information-theoretic measure.
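The sketch below shows entropy-based information gain and the gain ratio in their textbook form; it is a generic illustration rather than C5.0’s own implementation, and the example counts are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a node's target distribution (in bits)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, child_label_lists):
    """Information gain of a split divided by the split's own entropy."""
    n = len(parent_labels)
    weighted_child_entropy = sum(len(c) / n * entropy(c) for c in child_label_lists)
    gain = entropy(parent_labels) - weighted_child_entropy
    # 'split information' penalizes splits into many small branches
    split_info = -sum((len(c) / n) * math.log2(len(c) / n) for c in child_label_lists)
    return gain / split_info if split_info > 0 else 0.0

# Illustrative three-way split of a mixed parent node
parent = ["churner"] * 40 + ["non-churner"] * 60
children = [["churner"] * 30 + ["non-churner"] * 10,
            ["churner"] * 10 + ["non-churner"] * 20,
            ["non-churner"] * 30]
print(round(gain_ratio(parent, children), 3))
```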

Both C5.0 and C&RT tend to produce bushy trees. That is why they incorporate an integrated pruning procedure for producing smaller trees of equivalent predictive performance. Analysts can specify options that control the extent and severity of the pruning.

The CHAID algorithm is a powerful and efficient Decision Tree technique which also produces multiple (more than two) splits and is based on the Chi-square statistical test of independence. CHAID stands for Chi-squared Automatic Interaction Detector. The Chi-square test is a statistical test for the independence of two categorical fields; in the CHAID model it is used to examine whether the output and the evaluated predictor are independent. At each branch, all predictors are evaluated for the split according to this test. The most significant predictor, that is the predictor with the smallest p-value (observed significance level) on the respective Chi-square test, is selected for the split, provided of course that the respective p-value is below a specified threshold (the significance level of the test). Before predictors are evaluated for the split, the following actions take place (both steps and the split-selection test are sketched in code after the list below):

· Continuous predictors are discretized into bands of roughly equal size, typically 10 groups of 10% each (deciles), and recoded into categorical fields with ordered categories

· Predictors are regrouped and categories that do not differ with respect to the outcome are merged. This regrouping of predictor categories is also based on relevant Chi-square tests of independence.
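
As a rough illustration of the decile binning and the Chi-square evaluation of a predictor, the sketch below bins a continuous predictor and tests its association with a binary target using a Chi-square test of independence. It relies on pandas and scipy for convenience, and the column names (‘tenure’, ‘recency’, ‘churn’) are assumptions, not part of any specific CHAID implementation.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_p_value(df, predictor, target, bins=10):
    """Bin a continuous predictor into (up to) 10 deciles and test its
    independence from the target; a smaller p-value means a stronger split candidate."""
    binned = pd.qcut(df[predictor], q=bins, duplicates="drop")
    contingency = pd.crosstab(binned, df[target])
    _, p_value, _, _ = chi2_contingency(contingency)
    return p_value

# Illustrative usage with assumed column names
# df = pd.read_csv("customers.csv")
# candidates = {col: chi_square_p_value(df, col, "churn") for col in ["tenure", "recency"]}
# best_predictor = min(candidates, key=candidates.get)   # smallest p-value wins the split
```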




In all the above models (CHAID, C5.0, C&RT), analysts can specify in advance the minimum number of records in the child nodes to ensure a minimum support level for the resulting rules.
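For instance, in an open-source CART-style implementation such as scikit-learn (used here purely as an illustration; the algorithms discussed in this post ship with other tools), this minimum-support constraint corresponds to a parameter like min_samples_leaf:

```python
from sklearn.tree import DecisionTreeClassifier

# Gini-based binary-split tree (CART-style), requiring at least 100 records
# per terminal node so that every resulting rule has a minimum support level.
tree = DecisionTreeClassifier(criterion="gini",
                              min_samples_leaf=100,
                              max_depth=5)
# tree.fit(X_train, y_train)                       # assumed training data
# propensities = tree.predict_proba(X_new)[:, 1]   # e.g. churn propensities
```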
