Welcome back to Data Science and Predictive Analytics, DSPA. This is a massive open online course offering. Once this lecture is recorded, it will also be available up here as an archived video. The class notes I will open in a new tab, as usual. Please do download the R code, because you will want to practice some of the examples that are within the notes. And once you are finished with the chapter, those of you who are enrolled and interested in obtaining the certificate of completion can complete and submit the assignments through the course management system.

Okay, so in this section we're going to start off with a simple example on the iris data set, illustrating classification using decision trees. Then we're going to go over the fundamentals of how a decision tree classifier is built, specifically the divide-and-conquer approach. We'll describe several metrics that are used for measuring the classification error, either during the learning phase on the training data, or later in the validation phase on the testing data; these include the misclassification error, entropy, and the Gini index. Then we're going to talk specifically about the C5.0 decision tree algorithm and about trimming, or pruning, decision trees. We will then close with a case study looking at quality of life in chronic disease. We'll have a step-by-step protocol that illustrates the decision tree mechanism of classification. Then we're going to talk about comparing impurity indices and about different classification rules. And we'll end with a second look at the quality of life in chronic conditions example, comparing alternative models.

We have already seen, in the previous chapters, k-nearest neighbors and Bayesian probabilistic classification. These may not be useful when we are required to end up with a protocol that has explicit, rigid rules for classification. That is what decision trees provide: very explicit, declarative rules that can be used for classifying data. All decision trees begin with a trunk, where all data are part of the same cohort, and then we iteratively split each node, at every step, into narrower and narrower branches. There is a kind of conservation principle here: if you start with 100 subjects at a parent node and split them into three branches, the subjects, cases, or instances allocated to the three branches add up to the 100 in the parent node. Most often we grow the trees in a binary fashion, bifurcating the children from every parent node; however, multinomial splits are possible as well. The terminal leaf nodes provide the final classes that every instance, case, or unit gets associated with. There are a number of R packages that provide decision tree classification.

So let's start off with the simple example that we saw earlier, in Chapter 2: the iris data set, for which we have the petal and sepal width and length for three different species of iris flowers. This data set is available by default in R. We can use ctree, a classification tree, essentially saying that we want to predict the species, one of three classes, based upon the four flower-shape morphometry measures that we have here. This computes the classification tree; printing the tree describes it in words, and you can plot the tree as well, to give a more intuitive representation of the classification.
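To make this concrete, here is a minimal sketch of the iris example, assuming the partykit implementation of ctree together with the rpart and rpart.plot packages; the class notes may use party::ctree or slightly different plotting calls.

```r
# Classification trees on the iris data: Species (3 classes) predicted
# from the four sepal/petal measurements.
library(partykit)     # ctree: conditional-inference classification tree
library(rpart)        # rpart: recursive partitioning
library(rpart.plot)

data(iris)

iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = iris)
print(iris_ctree)     # the splitting rules, in words
plot(iris_ctree)      # graphical view of the tree

# the same idea via recursive partitioning, discussed just below
iris_rpart <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(iris_rpart)
```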
So, for instance, at the first branching, where all 150 flowers are gathered, they are subdivided into two groups based upon the petal length: if it is less than 1.9, you're almost guaranteed to be in the setosa taxon. If the petal length exceeds 1.9, then you have a second branch that compares the petal width and, depending on its size, you branch into one of two nodes, this one here and the one on the right. The one on the right almost exclusively identifies the virginica taxon of the iris flowers. The one that goes in the middle is further subdivided into two nodes. You can see that we obtain a segmentation that almost perfectly identifies the three main species, except for a few cases that cannot be perfectly classified. Similarly, we can use the rpart package, for recursive partitioning, to do exactly the same thing. Species is again the outcome variable, and we're going to predict it using all the iris morphometry measures. The resulting iris partition is reported here, and you can draw a very similar graph to the one we saw before, again subdividing based upon the petal length and width.

Okay, so decision trees represent upside-down trees with lots of branches, where a series of logical operators and decisions are encoded as branches connecting nodes on the graph. These graphs do not have cycles, so they are hierarchical trees. There is always a node that's called the root; that's the base, where all the subjects represent the same population. And the leaf nodes are the final clusters. The typical organization of the data is as shown in this diagram: the features are typically represented as the variables, which are listed in the columns; the cases, instances, units, or subjects of study are in the corresponding rows. Decision tree algorithms proceed top-down, recursively, by dividing and conquering, which simply means that we evaluate a candidate split of the data set based upon how much information is gained by splitting a node into children nodes. The basic stopping conditions are the following three. First, all samples in a node belong to the same class, that is, they have the same label and the sample is already pure; in other words, if you reach a node whose members all carry the same label, there is no need to divide further. Second, you stop subdividing when the majority of the points are already in the same class, up to some error threshold. Or, a final criterion could be that there are no remaining attributes on which the sample may be further partitioned. These are possible endpoint criteria for terminating the recursive splitting.

There are three examples of indices used to evaluate the splits we get at every step of the process: misclassification error, the Gini index, and entropy. Let's go ahead and spend a little more time with entropy. Entropy is a concept that comes from physics; it has been around for a while. It is essentially a measure of stochasticity, or information: the higher the entropy, the more information is contained in the sample. Constant, static observations carry very little information, so they will have low entropy, and completely stochastic or random sequences or data sets will have high entropy.
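As a tiny, self-contained illustration (an added example, not code from the class notes), here is a hypothetical helper that computes the entropy of a vector of class labels; a pure node has entropy 0, while a maximally mixed two-class node has entropy 1.

```r
# Shannon entropy (base 2) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # observed class proportions
  -sum(p * log2(p))                     # -sum_i p_i * log2(p_i)
}

entropy(c("A", "A", "A", "A"))   # pure node            -> 0
entropy(c("A", "A", "B", "B"))   # 50/50 two-class node -> 1
```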
Now imagine that you are considering a node for potential splitting. If the node has high entropy, that typically means it contains observations that belong to different classes; that heterogeneity yields high entropy, which would suggest that a split might be appropriate. On the flip side, lower entropy is an indication that the sample is more homogeneous, or stationary. Just as an example, suppose you have a data set D with n features, or variables, where every one of them can take on k possible states. Then the number of distinct combinations that can be stored in the object D will be k to the n, because every one of these observations has k possible states. Now, p_1 through p_k represent the proportions of observations in each of the k states; every subject is in exactly one state, so the p_i's are non-negative and they add up to one. The entropy can then be defined as the negative sum of p_i times log base two of p_i: Entropy(D) = -Σ_i p_i log2(p_i). The negative sign simply relates to the fact that these p's are positive and less than or equal to 1, so their logarithms are non-positive; effectively, the negative sign makes the entropy non-negative.

Okay, so if each of the k states is equally likely to be observed, with probability 1/k, then the entropy is maximized; simply plug in 1/k for p_i and simple arithmetic shows that the entropy attains its maximum, and vice versa. At the other extreme, the entropy is minimized for a single class, where the probability of one class is one and the probabilities of the other classes are zero; you can see that the entropy in this situation is zero. So, for a binary outcome, the entropy will be a function that spans the range from zero to one; I'll show you a graph in a second. More generally, in classification settings, higher entropy, more disorder, corresponds to a sample that has a mixed collection of labels. So if p_i represents the probability of a data point in D being in class c_i, and k is the number of classes, then the entropy of D can be defined in those terms. The probability of a certain class, given the observed data, can be estimated empirically by counting the cases in the data set whose outcome label is c_i and dividing by the total number of cases in D. One little note for us to remember is that the base of the logarithm here doesn't really matter; there is a very easy change-of-base formula, so you can go from any base to any other base, including the natural log, by dividing by the log of the new base.

So that's the entropy. The misclassification error and the Gini index have similar representations. For example, the misclassification error is defined as 1 minus the maximum p_k; again, this is a value between zero and one, because each p_k is between zero and one. And the Gini index is the sum of the quadratic expressions p_k times the residual (1 - p_k). So, roughly speaking, the misclassification error corresponds to an L1-type measure and the Gini index to an L2-type measure. Now let's talk about the C5.0 decision tree algorithm. In this case we're going to be using the entropy as the measure that quantifies how well a split separates the classes in our data set. Okay, so here is one example: the entropy, when you have two states, is a concave, parabola-like curve that goes between 0 and 1, achieving its maximum right in the middle.
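Here is a short sketch (an added illustration, not code from the notes) that draws all three impurity measures for a binary outcome as a function of the class-1 proportion; each curve peaks at the 50/50 split described next.

```r
# Compare entropy, Gini index, and misclassification error for two classes
p <- seq(0.001, 0.999, length.out = 200)              # proportion in class 1
ent      <- -(p * log2(p) + (1 - p) * log2(1 - p))    # -sum_i p_i log2(p_i)
gini     <- 2 * p * (1 - p)                           # sum_k p_k (1 - p_k)
misclass <- 1 - pmax(p, 1 - p)                        # 1 - max_k p_k

plot(p, ent, type = "l", col = "blue", ylim = c(0, 1),
     xlab = "Proportion of cases in class 1", ylab = "Impurity")
lines(p, gini, col = "red")
lines(p, misclass, col = "darkgreen")
legend("topright", c("Entropy", "Gini", "Misclassification"),
       col = c("blue", "red", "darkgreen"), lty = 1)
```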
On the horizontal axis we have the proportion of cases in class one, and the entropy is on the vertical axis. Obviously, the proportion of cases in class one is exactly one minus the proportion of cases in class two. So this is a simple situation where you have just two possible outcomes, and the plot shows where the maximum is attained. The fifty-fifty split, a class-one proportion of 0.5, corresponds to the greatest entropy, so decision trees aim to split the data in ways that reduce the entropy, that is, that yield more homogeneous cases within the resulting children of the split. The information gained from a suggested partitioning of the data can be assessed by computing the difference between the entropy of the original node and the entropy of the suggested split branches. Oftentimes this leads to decision trees which, in the extreme case, may end up with leaf nodes representing individual cases, individual units, individual subjects. This is not very efficient, obviously, because the stratification is extreme and overfitting happens. So ideally we would like to prune the trees in some kind of clever way that preserves the essential structure yet yields something that is not too complex to be interpreted in the context of the specific study.

So let's look at one example here, quality of life in chronic disease; here is the link to the data set, it's a CSV file. The important variables are the Charlson comorbidity index, which ranges between 0 and 10 and represents the number of comorbid conditions that patients might have, and a second important feature, the chronic disease score, which is a summary score based on the presence and complexity of prescription medications for select chronic conditions. In step two we're going to do some quick data exploration and create some summaries. For example, look at the first quality of life question and obtain how many cases report each of the possible responses, 1, 2, 3, all the way up to 6; this is the corresponding frequency distribution. We can go ahead and remove missing or erroneous values: qol, quality of life, is our data object, and the rows with missing or -9 values are stripped out. If you run summary now, obviously we're not going to have any missing values left. And then we can dichotomize: we create a new variable for chronic disease, called CD, which essentially compares all the values against the mean. If the value of the chronic disease score is below the mean, we will call it minor disease; if it's above, we're going to call it severe disease; so we're dichotomizing that variable. We then order all the observations according to their indices and go ahead and remove the ID column. Now, this extra variable that we added, CD, automatically becomes column 41 of the data set; you can view the data to actually see how that works. So I can quickly open my RStudio and get the data; if you do Ctrl+Enter it's going to run the line in your console, and then of course I can do View(qol). Oftentimes it is very useful to print the dimensions of these data sets. This tells me that I have 2356 cases and 41 features; again, the ID variable is the first one, so we're left with 40 columns, 40 features, as listed over here.
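Here is a rough sketch of these preparation steps; the file name is a hypothetical placeholder for the CSV linked in the notes, and the column names (ID, CHRONICDISEASESCORE, the first QoL question) and the -9 missing-value coding are assumptions based on the description above.

```r
qol <- read.csv("qol_chronic_disease.csv")    # placeholder path for the linked CSV

table(qol$QOL_Q_01)                           # frequency of responses 1..6 (column name assumed)

# strip out cases with missing (-9) chronic disease scores
qol <- qol[!qol$CHRONICDISEASESCORE %in% c(-9, NA), ]

# dichotomize the chronic disease score against its mean
qol$CD <- factor(ifelse(qol$CHRONICDISEASESCORE > mean(qol$CHRONICDISEASESCORE),
                        "severe_disease", "minor_disease"))

qol <- qol[order(qol$ID), ]                   # order the observations by their index
qol <- qol[ , -1]                             # remove the ID column; CD is now column 41
dim(qol)                                      # number of cases x number of features
```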
Next, we are going to split the data into training and testing sets. You can either split them by selecting a certain number of the ordered cases to represent the training set and the residual to be the testing set; that's one way to do it, a non-random assignment, or split, of the data. You can also do this randomly, by percentage, if you want 80% of the data to train on. You construct the training index of these 2214 cases and then assign those rows to the training data. Again, the first index represents the rows and the second index represents the columns; leaving the column index empty keeps all the columns. The testing set is the residual, so using the negative of the training indices we get the remaining cases. And I'm printing the class proportions just to make sure that the approximate distributions of minor and severe disease are the same; they are, roughly speaking, proportional.

Next we're going to use the C5.0 algorithm to train a model. It takes several parameters; two of them are required, the training features and the class labels. The class labels come from our qol_train data, and that's our CD column. This method is available in the C50 library. Remember that the 40th column was the chronic disease score, the diagnostic variable which we converted to the binary, dichotomous CD variable; removing these two columns and printing the summary reports what kind of data we're dealing with. Here is where I call the C5.0 algorithm, providing the features of our training data, removing, obviously, the outcome classes, and giving as the second argument the actual labels associated with qol_train; the other arguments are left at their defaults in this case. We can go ahead and print the tree, which results in a decision tree of size 25 nodes with about 28% error. Here's your confusion matrix: 726 of the cases that were minor disease were classified as minor disease, 548 that were severe were classified as severe, and there were some that were misclassified. So this seems to be a fairly large error rate, almost 30%. Here is the graphical representation of the decision tree; you can see the branches of the tree, and the CHARLSONSCORE was the key variable that contributed to the classification, along with AGE, RACE_ETHNICITY, and other attributes.

Okay, now let's go on and evaluate the model performance. We're going to use the predict function, just like we did before in Chapters 2 and 7, where we discussed how to assess model prediction and classification quality. qol_pred is going to contain our predicted values, using the model that was learned earlier on the training data set; now we provide the features only, for the testing data set, and mind you, we are again excluding the original diagnostic variable and the derived binary class variable that goes with it, the chronic disease score and CD. The caret package has the confusionMatrix routine that we've also seen before. The confusion matrix is going to be important for us: an accuracy of 63%, give or take. Again, here is the confusion matrix; remember, the layout here is slightly different from what we had before. Sensitivity is about 0.67, specificity about 0.6. Not too bad for an untuned classifier.
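Here is a sketch of the random 80/20 split and the basic C5.0 fit and evaluation, continuing from the qol object above; the seed value is arbitrary, and the column positions (the raw score in column 40, the derived CD label in column 41) follow the description in the lecture.

```r
library(C50)      # C5.0 decision trees
library(caret)    # confusionMatrix

set.seed(1234)                                          # arbitrary seed for reproducibility
train_index <- sample(seq_len(nrow(qol)), size = round(0.8 * nrow(qol)))
qol_train <- qol[train_index, ]                         # 80% of the cases
qol_test  <- qol[-train_index, ]                        # the residual 20%
prop.table(table(qol_train$CD))                         # check that the class proportions
prop.table(table(qol_test$CD))                          #   are similar in both subsets

# fit: features only (drop the raw score and the label), labels as the second argument
qol_model <- C5.0(x = qol_train[ , -c(40, 41)], y = qol_train$CD)
qol_model                                               # tree size
summary(qol_model)                                      # splits + training confusion matrix
plot(qol_model)                                         # graphical tree (can be large)

# evaluate on the held-out cases
qol_pred <- predict(qol_model, qol_test[ , -c(40, 41)])
confusionMatrix(qol_pred, qol_test$CD)                  # accuracy, kappa, sensitivity, specificity
```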
Note that the reported accuracy, about 62% here, will vary from run to run: if you rerun this without setting the seed, you're likely to get a value anywhere between 0.58 and 0.68. Let's look at the trials option of the C5.0 function. This is an integer that specifies the number of boosting iterations that should be performed during the learning process. Here we're going to use trials = 6, pretty much the same call as before, and we'll call this new classifier boost6 because we're using trials = 6. Furthermore, we can plot the result. Again, you can see that the CHARLSONSCORE is the first thing that subdivides, that branches at the top level, then age, then race/ethnicity, MSA, and so on. Within each of these nodes you can see what the majority class is; in this case it's severe disease, 38 subjects. The error rate was 34.2%. Let me zoom back out a little bit. One little point of caution here: when you do decision trees, the algorithm can fail if some of your feature or variable names start with non-standard characters; anything that's not a standard letter is likely to generate a problem, so be very careful with this. We can also print the confusion matrix; you see a little bit of an improvement in the accuracy of the algorithm, not a huge improvement, but somewhat of an improvement over what we had previously.

Now, in many of these situations the confusion matrix has false positives and false negatives. If the algorithm predicted severe disease when in fact the disease is minor, that's called a false positive; and if the algorithm predicted minor disease when in fact the patient has a severe disease, that's called a false negative. Oftentimes the impacts of false positives and false negatives are not equal; it may be more costly to miss a severe case, because then you cannot intervene immediately. In this case you may want a weighted misclassification cost matrix, which can be specified as shown here, and then, in the call to the tree classification, you simply provide this error-cost matrix, which says that one type of error is going to be four times as costly as the other. It's a two-by-two matrix, just like we have seen before in Chapter 2. In this case our accuracy is compromised: we go from 64% down to 59% accuracy. However, mind you, we now have many fewer of the costly misclassifications of severe cases; we are pretty much eliminating most of the false negatives, of course at the expense of more false positives.

Parameter tuning is the next section. Let's look at some examples here; we are specifically using the rpart method with specific complexity parameter values. With this choice we get a very simple tree that splits the cases according to, again, the Charlson score below 0.5 or above 0.5, and then the second split is based on age. There are fancier plots that you can do; for example, the fancyRpartPlot function in the rattle package can be used as well. Now, if you use the predict function with this specific classification model that we just generated using the rpart package, with this specific complexity parameter, 0.01 (you should, by the way, change this and look at the results for different CP values), your classification accuracy has now increased slightly, to 65%, with a much better rate of false negatives than before. And obviously what you can do is plot the relative error rate for different CP values.
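Here is a sketch of these three ideas, boosting, cost-sensitive learning, and rpart complexity-parameter tuning, continuing from the training set above; the 4:1 cost ratio mirrors the discussion, but the exact matrices and parameter values in the notes may differ.

```r
library(C50)
library(rpart)
library(rpart.plot)

# (1) boosting: trials = number of boosting iterations
qol_boost6 <- C5.0(x = qol_train[ , -c(40, 41)], y = qol_train$CD, trials = 6)
plot(qol_boost6)

# (2) cost-sensitive learning: make one error type (assumed here to be the
#     false negatives, severe predicted as minor) four times as costly
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = list(predicted = levels(qol_train$CD),
                                     actual    = levels(qol_train$CD)))
qol_cost <- C5.0(x = qol_train[ , -c(40, 41)], y = qol_train$CD, costs = error_cost)

# (3) rpart with an explicit complexity parameter, plus its CP/error profile
qol_rpart <- rpart(CD ~ ., data = qol_train[ , -40], method = "class",
                   control = rpart.control(cp = 0.01))
rpart.plot(qol_rpart)
plotcp(qol_rpart)     # relative error vs. CP -- pick the CP with the smallest error
```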
From this plot you can essentially identify the CP value for which you achieve the smallest possible error rate. And then we can prune the tree according to this optimal CP, or complexity parameter, to which the rpart object will be trimmed. Pardon my typo here, this should be 1 minus R-squared, which represents the discrepancy between the observed labels and the model-predicted labels. By the way, in Chapter 13 we're going to talk about various strategies for assessing model quality, and later, in Chapter 20, we're going to discuss cross-validation, which is a statistical internal-validation strategy that produces much better results. The method for pruning the tree is simply prune: you provide a model, you provide a CP value, a complexity-parameter value, and it generates a pruned result. Once you have this truncated, pruned tree, you can use it to predict the test cases, print the confusion matrix, and your accuracy slightly improves, to 65%. A little bit better, but roughly speaking in the same ballpark. Okay, if you want to plot the unpruned tree, you can very easily get very complicated plots that are going to be difficult to interpret within the context of the study. That, of course, is a large image that you can zoom in on, and you can see that it carries very similar information to the previous, simpler plot where we actually did prune the tree.

All right, pressing on: how do we compare different impurity indices? We can change the split argument from entropy to misclassification error or Gini to use the corresponding impurity indices. Here is one example where, in the rpart model, we overwrite the previous quality of life model, and then a slightly different result is generated.

In the next section we're going to talk about classification rules. These utilize if-else logical statements to assign classes to unlabeled data. We are going to spend some time with three complementary classification-rule strategies. The first one is separate and conquer, which repeatedly splits the data, and the subsets of the data, by rules that cover a subset of examples. This is a similar strategy to divide and conquer, which is why it's called the separate-and-conquer approach; however, the difference is that each rule can be independent, whereas each decision node in a tree has to be linked to past decisions. The second is the One Rule (OneR) algorithm, which is an extension of something called the ZeroR rule; ZeroR simply assigns the modal class to unlabeled test observations regardless of their feature values. The OneR algorithm is an improved version of ZeroR that uses a single rule for classification: it splits the training data set into several segments based on the values of a single feature and assigns the mode of the classes in each segment to all related observations in the unlabeled test data. The third is something called the RIPPER algorithm, which stands for Repeated Incremental Pruning to Produce Error Reduction. It combines ideas from the previous two approaches and proceeds in three steps. First is the growing phase: it adds conditions to a rule until it cannot split the data into more segments. The second step is pruning: it deletes some of the conditions that have large error rates associated with them. And the last, third step is optimization, which repeats the previous two steps, in that order, until we cannot add or delete any of the remaining conditions.
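Here is a rough illustration (an added example, using the familiar iris data rather than the case-study data) of how these rule learners can be invoked through the RWeka package; the QoL application in the next part of the lecture follows the same pattern.

```r
library(RWeka)   # Weka wrappers: OneR and JRip (RIPPER)

iris_1R   <- OneR(Species ~ ., data = iris)    # single best splitting feature
iris_JRip <- JRip(Species ~ ., data = iris)    # grow, prune, and optimize rules

iris_1R      # the one rule that was selected
iris_JRip    # the RIPPER rule set
```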
Let's take another look now at the QoL in chronic disease data set; we're going to try all these algorithms on it. The OneR method is called with the class, the predictors, and the data set; we're going to be using the RWeka implementation for that. So, if you don't have the RWeka library, you can install it and then load it into the environment, and set a seed for reproducibility purposes. Here is where we construct our quality-of-life OneR model, predicting again the CD variable and removing only the 40th feature, which was the chronic disease score, the variable from which CD was derived. In other words, CD stays in here; however, it is used not as a feature but as the outcome that needs to be predicted. Printing the result tells you that One Rule only fits the data using the CHARLSONSCORE and its values: the magnitude of the CHARLSONSCORE determines the predicted class, as you can see up here. And you have 1353 of the 2214 training cases classified correctly; the summary reports roughly 65 to 66% accuracy from this very simple, single-split rule. Not a perfect classification, but not bad either. In the confusion matrix you do have quite a few of these mild, or minor, disease cases that are labeled as severe by the algorithm. I will encourage you also to change the seed and see how stable the OneR algorithm is with respect to the seed.

Let's look at alternative model1, using JRip for the RIPPER algorithm, pretty much the same invocation as before. Here are the corresponding rules; it reports the rules it found and again about 65 to 66% accuracy. Slightly better here, in fact quite a bit better on the false positives, if you compare the off-diagonal counts with the previous confusion matrix. So, in a way, JRip did perform a little bit better, I suppose.

Let's look at alternative model2. This method uses something called a random forest, where we repeat the generation of trees many, many times, predict according to each tree, and finally ensemble those weighted votes into a combined classification result. You're going to see more about random forests later, in Chapter 14. Here is the call to the random forest, pretty much the same invocation syntax as before, except that we are using a few additional parameters. And this is the result, the corresponding variable-importance plots according to two different criteria, the classification accuracy and the Gini index. You can see that there is fair agreement: CHARLSONSCORE is obviously the top dog here, with most of the impact, followed by age; and race, which we've seen before, also comes in, near the top of the Gini plot but much farther down the list in the accuracy plot. This algorithm has an error rate of 35 to 36%, with fairly high false positive and false negative rates. You can also plot the error rate for the rf.fit model, as well as for two alternatives where we manipulate some of the input parameters. You can see that as the number of trees increases, the error rate goes down, but it does not go down uniformly, and it does depend on the settings. The green curve, for example, represents the second model, which has mtry = 26, uses a lot of trees, and has a node size of 5, which indicates the minimum number of cases a node must hold before we stop subdividing it. Prediction is carried out just like before. Since it seems, at least from this error plot, that the random forest fit2 model is the best, let's go ahead and use it to predict the labels of the testing data and print the confusion matrix.
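Here is a sketch of that random forest alternative, continuing from the training and testing objects above; mtry = 26 and nodesize = 5 echo the parameters mentioned in the lecture, while ntree and the object name rf.fit2 are assumptions.

```r
library(randomForest)
library(caret)

set.seed(1234)
rf.fit2 <- randomForest(CD ~ ., data = qol_train[ , -40],   # drop the raw score column
                        importance = TRUE, ntree = 2000,
                        mtry = 26, nodesize = 5)

varImpPlot(rf.fit2, cex = 0.7)   # importance by accuracy decrease and by Gini index
plot(rf.fit2)                    # error rate as the number of trees grows

rf.pred <- predict(rf.fit2, qol_test[ , -40])
confusionMatrix(rf.pred, qol_test$CD)
```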
By the way, the error rate we saw is on the training data, so it is an internal, training-data error rate. When you apply the model to the testing data, you see pretty much the same result, about 65% accuracy, 35% error. Not too bad on the false positive and false negative rates here; kappa is pretty reasonable, and sensitivity and specificity are not great, but not extremely poor either. Okay, again, later in Chapter 14 we're going to see more details about model performance, along with similar ideas about the specific random forest parameters that you can play with.

A final thing here is a discussion of additional things you might want to try. For instance, you may want to try a tertiary, three-class, classification, where you take the CHRONICDISEASESCORE and find the corresponding quantiles, basically splitting its range into three regions containing one third of the cases each. R reports the corresponding quantiles, terciles in this case, or you can go ahead and split the chronic disease score based on any two cut-off values into three ranges. So now you have three labels: minor disease, mild disease, and severe disease. And you can try the very same classification and modeling. Again, you should try OneR, RIPPER, rpart, different approaches, and see which of these algorithms does the best job. Our confusion matrices are going to be three by three in this case, because you have three labels. Now, for C5.0 the accuracy is definitely reduced, and similarly for the OneR algorithm. It turns out that the best single splitting feature switches from the CHARLSONSCORE to the INTERVIEWDATE; interestingly, that is the best single rule to split the data into three categories, as shown here. And this accuracy is not too bad, almost 65%, almost as good as we had before. If you try to predict on the testing data, however, notice how this actually classifies almost everybody as mild disease; very, very few cases are in the severe category and none are in the minor one. So this accuracy of 65% or so does not reflect a good classifier; it is largely driven by the dominant class. The JRip result is similar; it reports the rules it found, and the accuracy is maybe a little bit better, 67% or so, which is more reasonable, except that, again, the bulk of the individuals are placed in the mild category, and you have quite a few that are labeled as mild when in fact they are severe. Okay, so have a look at these examples here.

Here's a little bit of a summary: you should try some of the other data sets that we have in our case-studies collection using the same methods. So, just to summarize, we looked at divide-and-conquer classification and at decision and regression trees. Next, what we're going to cover is support vector machines. dspa.predictive.space, or just predictive.space, gets you to the main course website. As we indicated before, once you're done looking at the class notes, viewing the videos that are available online, and playing with the R code, try to complete the assignments.
DSPA Chapter 8 Decision Tree Classification
In this chapter, we will (1) see a simple motivational example of decision trees based on the Iris data, (2) describe decision-tree divide and conquer methods, (3) examine certain measures quantifying classification accuracy, (4) show strategies for pruning decision trees, (5) work through a Quality of Life in Chronic Disease case-study, and (6) review the One Rule and RIPPER algorithms.
- Stewardship: SOCR Resource, University of Michigan
- Rights: CC-BY
- Credits: SOCR, MIDAS, HBBS, DCMB, University of Michigan
- Creation Date: August 1st, 2017