Trevor Stephens

Committed to Open Source… Again

2015-05-20T00:00:00+00:00

This one’s been in the hopper for a long time! But now, you can grab it on the bleeding-edge version and make your decision tree visualizations totally legit in scikit-learn!

As many of you who head to my blog know, I’m quite the fan of tree-based learning, but have you ever seen scikit-learn’s D-Tree rendering? It used to be pretty basic. Usable and good for introspection of your models, sure. But would you want to show this to a customer (let’s imagine that the customer is a cruise liner from the early 1900’s)?

Fear not! You can now draw some pretty nice trees in scikit-learn using the newly overhauled export_graphviz() function as documented here.

And, with a few options tweaked, the above plot becomes something like this:

Enjoy!

Committed to Open Source

2015-01-25T00:00:00+00:00

scikit-learn is one of my most-used tools, be it at work, or playing in ML competitions. I thought it was high time that I contribute back to this awesome project, and last week one of my pull requests was merged into the master branch!

In sklearn 0.16 (coming soon) you will now be able to automatically weight your samples based on class. Sure you could do this manually before, but now it is also grid-searchable:

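# GridSearchCV lived in sklearn.grid_search at the time of the 0.16 release;
# recent versions import it from sklearn.model_selection instead.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# An unbalanced two-class problem (roughly 80% / 20%)
X, y = make_classification(n_samples=1000,
                           n_features=20,
                           n_informative=10,
                           weights=[0.8, 0.2])
# Ten candidate weightings, from {0: 1, 1: 10} up to {0: 10, 1: 1}
parameters = {'class_weight': [{0: i + 1., 1: 10. - i} for i in range(10)]}
clf = RandomForestClassifier(n_estimators=100)
grid = GridSearchCV(clf, parameters)
grid.fit(X, y)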

This dataset is quite messy and unbalanced, so the best-performing weighting scheme may be a bit unclear. This particular grid-search steps the class weights (class 0 to class 1) from 1:10 to 10:1 and declares the winner:

'class_weight': {0: 7.0, 1: 4.0}

Of course, for the lazy or mega-huge-ensemble wielders, you also have a couple of presets to choose from: 'auto' and 'subsample' which will weight samples inversely proportional to the class frequencies. The 'auto' mode performs this (once) over the entire dataset, while the 'subsample' mode calculates the inverse frequencies of the classes in the bootstrap sample fed to the individual tree estimators (n_estimators times of course).
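To make the idea behind the presets a little more concrete, here is a minimal sketch of inverse-frequency weighting in plain Python. The helper name `inverse_frequency_weights` and the exact scaling are assumptions for illustration only; scikit-learn's presets may normalize the weights differently.

```python
from collections import Counter


def inverse_frequency_weights(y):
    """Return {class: weight}, with weights inversely proportional to class frequency.

    Hypothetical helper: the n_samples / (n_classes * count) scaling is an
    assumption, chosen so a perfectly balanced dataset gets weight 1.0 everywhere.
    """
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# An 80/20 split like the make_classification example above
y = [0] * 800 + [1] * 200
weights = inverse_frequency_weights(y)
print(weights)  # {0: 0.625, 1: 2.5}
```

Under this sketch, the 'subsample' idea would amount to calling the same function on the labels of each tree's bootstrap sample rather than once on the full dataset.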

My pull request had the great, albeit unusual, pleasure of three code reviews. No doubt that Random Forests are one of the go-to classifiers out there, so I don’t blame them for a bit of caution with a new feature! Having my code picked apart by some of the very talented core contributors was a great experience and I learnt a lot from those guys, but the idea that hundreds of thousands of users may one day be running some lines of code that I wrote is a whole other level. It’s an amazing feeling. Now to figure out how to split up time between committing-to and consumption-of the code-base!

If you can’t wait for the 0.16 public release, feel free to grab the development branch code. There’s a ton of other excellent goodies from other contributors in there too.

I truly hope that you can get a bit more out of your ensembles now. If you use it, let me know in the comments!

Box-Plots for Education Recap

2015-01-11T00:00:00+00:00

I stumbled across a brand new ML competition platform a couple of months ago, DrivenData, which describes itself as hosting “data science competitions to save the world”. Basically, think Kaggle for non-profits. They had launched their first prize-awarding comp, Box-Plots for Education, which aimed to automatically classify education expenses into various categories. I took a look at the data and decided to give it a crack.

The data and problem were rather complex! There were 104 labels to assign, split over 9 different categories. Each one required submitting a probability which made for some massive submission files in excess of 100MB each! As predictors we were given the total cost of an expense, and its full-time fraction if it was for an employee. Additionally there were 14 columns of free-text to play with, mostly being short descriptions of the expense.

To get a feel for how to break down this somewhat intimidating structure, I started with a naive baseline. The provided baseline applied a straight probability to every label. But some labels appeared a lot more than others, so I figured to get moving I would assign each label a probability based on its frequency in the category. This approach got to 1.3913 on the leaderboard, up from the 2.0455 of the provided solution. I was winning! But now to get more serious as I suspected some serious contenders were on their way, and they were…

The first task was to clean up the data. I split the two numeric columns from all of the text ones first, as I was planning on handling the two types separately. The target columns needed some work too. I re-encoded the targets into two forms: an integer-encoded version that just replaced the text labels with numbers, and a wide format that had 104 columns with one-hot encoding of the individual labels.

There were several different missing value types that I discovered in the data: a true missing value, i.e. no entry in the CSV; an explicit “n/a” text string; and “(blank)”. I decided that these different types of missings could potentially have different meanings, so I created some new columns to tack on to the two existing numeric columns, representing the original text as 0 for present, and (1, 2, 3) for the different missing value types. In the text itself, I replaced all three missing types with the literal string “missing”. For the remaining numeric columns I simply filled the missings with -999.

I had also noticed some of the text was capitalized while others were lower-case. I figured that this may give a clue as to what system had generated it. Some new numeric columns again represented each document as a 0 or 1 for whether it was capitalized or not.
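As a rough illustration of the missing-type codes and the capitalization flag, here is a small sketch in plain Python. The helper name `encode_text_cell`, the choice of which missing type gets which code, and the reading of “capitalized” as “entirely upper-case” are all assumptions; the real columns are not reproduced here.

```python
MISSING_CODES = {"n/a": 2, "(blank)": 3}  # code 1 is reserved for truly empty cells


def encode_text_cell(value):
    """Return (missing_code, is_upper, cleaned_text) for one raw text cell.

    Hypothetical helper. Assumptions: None or an empty string is a "true"
    missing value, matching is case-insensitive, and "capitalized" means the
    whole cell is upper-case.
    """
    if value is None or value.strip() == "":
        return 1, 0, "missing"
    text = value.strip()
    code = MISSING_CODES.get(text.lower(), 0)
    if code:
        return code, 0, "missing"
    return 0, int(text.isupper()), text


rows = [None, "n/a", "(blank)", "TEACHER SALARY", "bus driver"]
encoded = [encode_text_cell(r) for r in rows]
for row in encoded:
    print(row)
```

The first two items of each tuple would become new numeric columns, while the third replaces the original text before it goes to the vectorizer.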

For the raw text, I created 14 bag-of-words corpora, one for each of the original text columns, using scikit-learn’s TfidfVectorizer to gather both uni-grams and bi-grams. Each vectorizer was fit on the training corpus, and then used to transform the test corpus.
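For a single text column, the fit-on-train, transform-on-test pattern looks roughly like the sketch below. The toy strings are stand-ins rather than competition data, and every vectorizer option other than `ngram_range` is left at its default, which may not match what was actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for one free-text column (the real data is not reproduced here)
train_text = ["teacher salary", "bus driver salary", "missing"]
test_text = ["substitute teacher salary"]

# ngram_range=(1, 2) gathers both uni-grams and bi-grams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_text)  # learn vocabulary and IDF on train only
X_test = vectorizer.transform(test_text)        # reuse them on test

print(X_train.shape, X_test.shape)
print(sorted(vectorizer.vocabulary_))
```

Words that only appear in the test text (here “substitute”) are simply ignored, since the vocabulary was fixed when the vectorizer was fit.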

So now I have a dataframe filled with the original Total and FTE numeric data and all my binary engineered features, as well as a bunch of corpora representing the text. How to stitch this mess together?

I decided to apply a LogisticRegression model to each corpus, and then combine my numeric features with the predictions from each of these linear models as features for a second-stage non-linear RandomForestClassifier.

Since my model would use two stages, I didn’t want the training set prediction-features to be based on in-sample predictions while the test set would be out-of-sample. So I broke the training data into two folds and ran my logit model over each fold separately to create my training set features, and then ran it again for the full training set to create the test set features. I then applied the Random Forest with a few hundred trees to the new datasets and got a result.
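Here is a minimal sketch of that two-stage idea for a single binary target and a single “corpus”. The data is synthetic, the column split is arbitrary, and the hyper-parameters are placeholders, so treat it as an outline of the fold logic rather than the code I ran.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data: the first 15 columns play the role of one "corpus",
# the last 5 play the role of the numeric and engineered features.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)
X_train, y_train, X_test = X[:300], y[:300], X[300:]
corpus_train, numeric_train = X_train[:, :15], X_train[:, 15:]
corpus_test, numeric_test = X_test[:, :15], X_test[:, 15:]

# Stage 1 on the training set: each row is predicted by a model that never saw it
oof = np.zeros(len(X_train))
folds = KFold(n_splits=2, shuffle=True, random_state=0)
for fit_idx, pred_idx in folds.split(corpus_train):
    logit = LogisticRegression(C=10, max_iter=1000)
    logit.fit(corpus_train[fit_idx], y_train[fit_idx])
    oof[pred_idx] = logit.predict_proba(corpus_train[pred_idx])[:, 1]

# Stage 1 on the test set: refit on the full training set
logit_full = LogisticRegression(C=10, max_iter=1000).fit(corpus_train, y_train)
test_pred = logit_full.predict_proba(corpus_test)[:, 1]

# Stage 2: the forest sees the numeric features plus the stage-1 predictions
stage2_train = np.column_stack([numeric_train, oof])
stage2_test = np.column_stack([numeric_test, test_pred])
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(stage2_train, y_train)
probabilities = forest.predict_proba(stage2_test)
print(probabilities.shape)  # (100, 2)
```

With 14 corpora, the same loop would simply add 14 prediction columns to the second-stage dataset instead of one.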

As I had two different forms of my targets, I ran this model in two different ways.

The first couple of models I made treated each label as a separate problem: I trained a set of logit models for each of the 104 target columns… That’s right those of you paying attention… I trained 104 (targets) * 14 (corpora) * 3 (training & test) = 4,368 separate linear models, and also applied 104 (targets) * 3 (CV & test) = 312 Random Forest models to those predictions! This was computationally expensive (understatement of the year) and took one to two days to crunch through all the numbers to get a single prediction file out.

Using this method I made a couple of pretty decent submissions:

0.6355 using an L2 parameter of C=0.1, “auto” class weighting, and omitting my engineered features for the forest
0.5727 using an L2 parameter of C=10, “auto” class weighting, and pre-filtering the corpora using SelectKBest with the chi-squared statistic to use only the top 95% of the words in each corpus

I also trained a model that treated the problem more as the multi-label classification that it was, while still learning each target category separately. Now I would train the logistic regression to generate a column for each possible label in a category, for each corpus. While the number of linear models was lower with 9 (categories) * 14 (corpora) * 3 (training & test) = 378, the datasets that my Random Forest saw were much larger. This version also did pretty well:

0.6433 using an L2 parameter of C=10, “auto” class weighting, and pre-filtering the corpora as before

The big move though was taking an equally weighted blend of my three best models. Taking the output files from each of these and simply averaging the three brought me into fourth place:

0.5062 using an average of the three models above

A big gain from blending in this competition! While each model used a very similar representation of the data, the differences in the way I had structured the problem coupled with the fairly big differences in hyper-parameters seemed to make them err in different ways.
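The blend itself is nothing more than an element-wise mean of the submission files. A tiny sketch with made-up probabilities (not my actual predictions) shows the mechanics:

```python
import numpy as np

# Hypothetical predicted probabilities from three models
# (rows = expenses, columns = labels)
model_a = np.array([[0.90, 0.10], [0.20, 0.80]])
model_b = np.array([[0.70, 0.30], [0.40, 0.60]])
model_c = np.array([[0.80, 0.20], [0.30, 0.70]])

# Equal weights: a plain element-wise mean across the three submissions
blend = np.mean([model_a, model_b, model_c], axis=0)
print(blend)
```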

I made a couple of tiny gains after that by replacing predictions for rows that identically matched a good number of training-set observations, but the improvements were insufficient to take me to the top 3 in this competition unfortunately. What truly amazed me was that no one moved an inch after closing. The final standings were remarkably faithful to the public leaderboard’s indication:

Huge congrats to the winners and really looking forward to the next comp on DrivenData!

Armchair Particle Physicist

2014-09-23T00:00:00+00:00

The Higgs Boson Machine Learning Challenge was probably the most intense competition I’ve worked on, and most likely the one I’m most proud of my solution for too. Sure, there’s a lot that I could have done better, but that’s why I work on Kaggle competitions. I want to continuously improve on how I work this ML-mojo even more effectively, and to that end I feel that I’m getting there; but still have plenty to learn. And that’s a great thing.

I felt the glow of the top 10 for a few days towards the end of the competition. To have hit that point on the most popular competition ever run on Kaggle (Titanic aside) is a very nice thing to have experienced, I highly recommend it. But in the end, I probably overfit to the public LB a little bit, and my time ran out before I could feel truly comfortable with my cut-off choices. Not to take away from the winners, but I also believe there was a pretty hefty element of luck in this comp if you focused on improving your model rather than squaring away any cross-validation instability. Take a look at the massive fluctuations from a few folds of one of my model’s CV below and you can perhaps imagine my worries about where I would end up.

I am very impressed with the Kagglers that managed to CV this competition to a high degree of certainty. Some on the forums were talking about 40-70 folds to get a good idea of their model cut-off errors. I generally work comps locally on my little 4-core 8GB laptop, and at 12-36 hours run time per submission on this competition, I really couldn’t afford much more rigor on my already expensive models.

So anyhow, here’s a high-level look at what I did for Higgs, and how I briefly approached the top of the leaderboard… It basically comes down to my new favorite feature engineering pipeline:

Treated PRI_jet_num as categorical from the get-go
Automated feature engineering:
- Normalized all variables, applied +, -, * operators
- Normalized all variables, took their absolute value and summed
- Raw momentum/mass variables, applied +, -, *, / operators
- Raw angle-type variables, applied +, - operators
- Raw momentum/mass variables, *, / by number of jets
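A rough sketch of the first bullet, the normalized pairwise interactions, might look like the following. The helper name `pairwise_features` is hypothetical, and reading “normalized” as mean/standard-deviation scaling is an assumption. The naming of the new columns follows the `A_-_B` pattern.

```python
from itertools import combinations

import numpy as np


def pairwise_features(X, names):
    """Build +, -, * interactions for every pair in the upper triangle.

    Hypothetical helper: standardizing with the column mean and standard
    deviation is an assumption about what "normalized" means here.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    features = {}
    for i, j in combinations(range(X.shape[1]), 2):  # i < j only
        features[f"{names[i]}_+_{names[j]}"] = Z[:, i] + Z[:, j]
        features[f"{names[i]}_-_{names[j]}"] = Z[:, i] - Z[:, j]
        features[f"{names[i]}_*_{names[j]}"] = Z[:, i] * Z[:, j]
    return features


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # random stand-in for four of the Higgs columns
names = ["DER_mass_MMC", "DER_mass_vis", "PRI_tau_pt", "PRI_met_sumet"]
auto = pairwise_features(X, names)
print(len(auto))  # 4 choose 2 = 6 pairs, times 3 operators = 18
```

With 33 variables the same loop visits 33 × 32 / 2 = 528 pairs per operator, which is where the dimensionality trouble comes from.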

I rolled these variables out incrementally over the 2 months I worked on Higgs. Obviously I had a dimensionality problem here. With 33 variables, that’s several hundred new variables each time you look at a new set, most of which are garbage. So, here’s the pipeline I used to throw out the chaff:

Generate the automated variables for the upper triangle
Calculate the Spearman correlation (I was using tree-based models) between each pair of auto-variables and throw out those that are greater than 99% correlated with any other auto-variable
For each variable set, run scikit-learn’s GBM with 200 trees, depth of 6, and learning rate of 0.05 over two folds
Note the variables that are greater than 1% of the importance of the top variable for the fold
Keep only variables that meet this criterion for both folds
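The steps above can be sketched roughly as follows. Both helper names are hypothetical, the greedy order in which correlated columns are dropped is an assumption, and “over two folds” is read here as fitting one model on each half of the data. The demo uses smaller models than the 200-tree, depth-6 settings purely to keep it quick.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold


def drop_correlated(X, threshold=0.99):
    """Return column indices kept after dropping near-duplicates by Spearman correlation."""
    ranks = np.apply_along_axis(rankdata, 0, X)  # Spearman = Pearson on ranks
    corr = np.abs(np.corrcoef(ranks, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        # Greedy choice (an assumption): keep a column unless it is too
        # correlated with one that has already been kept.
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep


def important_in_both_folds(X, y, n_estimators=200, max_depth=6,
                            learning_rate=0.05, cutoff=0.01):
    """Return column indices whose importance beats cutoff * top importance in both folds."""
    survivors = None
    folds = KFold(n_splits=2, shuffle=True, random_state=0)
    for fit_idx, _ in folds.split(X):
        gbm = GradientBoostingClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                         learning_rate=learning_rate, random_state=0)
        gbm.fit(X[fit_idx], y[fit_idx])
        imp = gbm.feature_importances_
        passed = set(np.flatnonzero(imp > cutoff * imp.max()))
        survivors = passed if survivors is None else survivors & passed
    return sorted(int(i) for i in survivors)


rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
X = np.column_stack([X, X[:, 0] * 2.0])  # column 5 is a rescaled copy of column 0
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kept = drop_correlated(X)
print(kept)  # [0, 1, 2, 3, 4]
selected = important_in_both_folds(X[:, kept], y, n_estimators=50, max_depth=3)
print(selected)
```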

It was my hope that this would get rid of spurious variables. I think I probably could have cut deeper as the 1% thresholds are clearly arbitrarily chosen. This would have saved some training time for the next phases too… And thus I could have CV’d harder. Oh well.

Now, I created a new dataset with all the surviving variables from each auto-variable set. And then…

Calculate the Spearman correlation for the new dataset and throw out the 1% most correlated (between rather than within variable sets this time).
Run a GBM with 500 trees, depth of 6, and learning rate of 0.05 over two folds.
Note the variables that are greater than 0.5% of the importance of the top variable for the fold.
Keep only variables that meet this criterion for both folds.

After this process I usually had a list of 250-300 variables to run with depending on which sets I was combining. Here’s a glance at the top 10 variables from one fold. Note that the scaled interactions are simply marked with the operator, while raw quantities explicitly call out that fact with “/raw” for example:

feature	importance
PRI_tau_phi_-_PRI_met_phi	0.029825
DER_mass_MMC	0.029003
DER_mass_transverse_met_lep_/raw_PRI_tau_pt	0.023429
DER_mass_vis_abs_PRI_lep_eta	0.022919
DER_mass_vis_abs_PRI_tau_eta	0.021729
PRI_tau_eta_*_PRI_jet_leading_eta	0.019790
DER_mass_transverse_met_lep_-_PRI_tau_pt	0.019458
DER_mass_vis	0.018423
DER_pt_ratio_lep_tau_-_PRI_lep_pt	0.017572
DER_mass_vis_*_PRI_met_sumet	0.016613

I am no high-energy physicist, but maybe someone out there may care to comment as to why these might have been important to discriminating background from signal. As should be truly evident by now, I cared not why, just that it was working.

Glancing at my CV and LB scores, I’m thinking that most of the bump I saw was from the normalized variables. Why? Well, at a guess, trees are greedy and make decisions at each node without looking forward or back, so they will miss the importance of interactions, since those are never inspected at any single node. Generating scaled interaction variables seems to me to go hand-in-hand with tree-based learning. “Is this variable ‘big’ and that one ‘small’?” These automated variables pick that right up in a single node, where you might waste two or more to get there in a raw tree. It worked nicely for this comp and I’ll be keeping it in my toolkit for the next one.

The rest of my pipeline was pretty standard and rested on XGBoost’s shoulders with a 4-fold CV that was clearly too loose for this evaluation metric. I experimented with a fairly wide range of hyper-parameters, but it was hard to discern any significant difference between them due to the CV instabilities.

Anyhow, after Higgs ended I finally hit the top 1000 of Kagglers and scored my third top 10%. The next milestone is quite obvious now, the Kaggle Master golden jersey must be mine!

See you on the leaderboards.

Catch-22: Visualized

2014-05-18T00:00:00+00:00

Find the app here.

The Dataset

Catch 22 by Joseph Heller is my favorite novel. I recently finished reading it (again) and love the creative use of language and the ridiculous characters’ interactions throughout the book. For my visualization class, it was an easy choice to select the text as my final project “dataset”. The text has around 175,000 words, divided into 42 chapters. I found a raw text version of the book on the web and got to work.

I parsed the text in Python using a combination of regexes and simple string matches. The code for the IPython Notebook I built may be viewed here.

The script first extracts the text from each chapter, converting it to lower-case and stripping the punctuation. I then loop through a list of the main characters who appear more than 50 times and look for mentions of their names. I defined the “time” of each appearance as the percentile of the chapter, i.e. a character that occurs at the 10th percentile of chapter three would be stored as 3.10. This dataset formed the basis for both the “Character Appearances” and “Character Co-Occurrences” plots.
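To show the “time” encoding in miniature, here is a sketch using only the standard library. The helper name `mention_times` is hypothetical, and measuring position in characters (rather than words) is an assumption about how the percentile was computed.

```python
import re


def mention_times(chapters, name):
    """Return one chapter.percentile value per mention of `name`.

    Hypothetical helper: position is measured in characters from the start of
    the cleaned chapter text, which is an assumption.
    """
    times = []
    pattern = re.compile(r"\b" + re.escape(name.lower()) + r"\b")
    for number, text in enumerate(chapters, start=1):
        cleaned = re.sub(r"[^\w\s]", "", text.lower())  # lower-case, strip punctuation
        for match in pattern.finditer(cleaned):
            percentile = 100 * match.start() // len(cleaned)  # whole percent, 0 to 99
            times.append(round(number + percentile / 100, 2))
    return times


chapters = [
    "Yossarian was in the hospital.",
    "The chaplain met Yossarian. Then Yossarian left.",
]
times = mention_times(chapters, "Yossarian")
print(times)  # [1.0, 2.36, 2.69]
```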

A similar process was used to extract the occurrences of the various Mediterranean locations mentioned in the book. I found a fan site where all of the locations were listed, and searched for these names in the text. Once identified, the cities were geocoded using the geopy package in Python. This dataset was used for the “Mediterranean Travels” visualization.

The final dataset was created by scanning the chapters for mentions of Yossarian. A window of the 25 words either side of these mentions was then collected and tagged with the chapter locations where they occurred. The nltk package in Python was used to both clean the word list of stopwords and then do part-of-speech tagging on the remainder. This dataset formed the basis of the “Characteristic Words” visualization.
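The windowing step can be sketched in a few lines of plain Python. The helper name `word_windows` is hypothetical, and the nltk stopword removal and part-of-speech tagging that followed are deliberately left out.

```python
def word_windows(words, target="yossarian", size=25):
    """Collect up to `size` words either side of each mention of `target`."""
    windows = []
    for i, word in enumerate(words):
        if word == target:
            before = words[max(0, i - size):i]
            after = words[i + 1:i + 1 + size]
            windows.append(before + after)
    return windows


words = "the chaplain liked yossarian and yossarian liked the chaplain".split()
windows = word_windows(words, size=2)
print(windows)
```

Note that overlapping windows count the same word more than once, as the second “yossarian” does here. Whether to de-duplicate those is a design choice this sketch does not make.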

I visualized these data sets interactively with shiny in R. You can find the code at my GitHub repository. Wherever possible, I linked the interactivity between plots so that you can zoom into areas of interest.

Mediterranean Travels

This visualization maps the mentions of the locations around the Mediterranean that are found throughout the book. Aesthetically, I was very happy with the final product. The low-resolution border data from the maps package had an unexpected synergy with the original text, namely the jagged outline of a jumping soldier (Yossarian) on the cover. Hence I stuck with the low quality map data as I really liked the effect.

Clearly there is a temporal element to these journeys too. I indicate time by applying a linear fade in transparency, which gives a hint as to the sequence of the mentions. The edges are plotted along their great circles, which was really quite easy to implement using the geosphere package in R.

I feel that the visualization gives information about the book that may not have been totally clear during its reading. While I vividly remembered Milo’s frantic trips about various merchant ports in Chapter 22, it is extremely interesting to see exactly how frantic it was, and feel sorry for Yossarian and Orr who were dragged along and deprived of sleep for days.

Character Appearances

This plot essentially represents a time series of when different characters were mentioned in the book. It was inspired by the “Character Mentions” plot in Jeff Clark’s Novel Views: Les Miserables series.

I plotted the data as a standard scatterplot with chapter as x-axis (since it is similar to time) and the characters as a discrete y-axis with a vertical bar as the marker.

I feel that every pixel in this visualization works hard. The lack of a “dot” speaks just as loudly as its presence, especially for the prominent characters in the novel. I’m very happy to see patterns such as chapters heavy with their name-sake, and others almost devoid of the character they are named after. It was also interesting to see how some main characters really only showed up occasionally in the book, even if they left a lasting impression on the reader.

Character Co-Occurrences

This plot attempts to show which characters spend a lot of time together throughout the book. It was inspired by the Les Miserables Co-occurrence plot by Mike Bostock.

The data used to build this visualization was the exact same as that used in the previous one, but required substantial transformation to get it into a form that could represent these patterns. Many different methods were examined to find when characters were co-located, but the one I settled on was when two characters were both merely present in a chapter. Looking for when characters were prominent (i.e. both mentioned many times) ended up creating an extremely lop-sided view with more frequently appearing characters totally dominating the data. In fact, the co-location cell for Yossarian with himself was orders of magnitude higher than any other when other approaches were used.
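With presence encoded as 0 or 1 per chapter, the co-occurrence counts fall out of a single matrix product. The sketch below uses a made-up presence matrix for three characters and four chapters, not the real data.

```python
import numpy as np

characters = ["Yossarian", "Milo", "Orr"]
# Hypothetical presence matrix: rows are characters, columns are chapters (1 = present)
presence = np.array([
    [1, 1, 1, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
])

# Entry (i, j) counts the chapters in which characters i and j both appear
cooccurrence = presence @ presence.T
print(cooccurrence)
```

The diagonal holds each character's own chapter count, so with a 0/1 encoding no cell can exceed the number of chapters, which keeps the frequent characters from swamping the plot.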

Clustering added another dimension to this plot. A hierarchical clustering scheme is applied over the entire book to try and find communities among the characters. Again, presence in a chapter (1 for present, 0 for not) was used, and a 42-dimensional Euclidean distance was used to cluster the characters with a complete-link AGNES algorithm. Manual inspection of the dendrograms for different clustering schemes and distance measures found that this was the most “level”, in that more frequently appearing characters dominated the scheme the least. Here’s the dendrogram for six clusters:

When the user chooses to colour the plot by clusters, cells for the co-location of characters sharing the same cluster are filled with a unique colour while those cells showing co-location of characters from different communities are shaded grey. It should be noted that clustering is performed over the entire text, not the chapters being zoomed into by the user of the application. I felt that changing the clustering dynamically would be too distracting.
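For the curious, the clustering step described above could be sketched in Python as below. SciPy's `linkage` stands in for whatever AGNES implementation was actually used, and the presence matrix is random stand-in data, so the resulting communities mean nothing.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(22)
# Hypothetical presence matrix: 12 characters by 42 chapters (1 = present, 0 = not)
presence = rng.integers(0, 2, size=(12, 42))

# Complete-link agglomerative clustering on Euclidean distance between rows
tree = linkage(presence, method="complete", metric="euclidean")
# Cut the tree so that at most six clusters remain
labels = fcluster(tree, t=6, criterion="maxclust")
print(labels)
```

Because the labels come from one fit over all 42 chapters, they stay fixed no matter which chapter range is being viewed.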

Alphabetic or frequency sorting “explodes” the clusters into an unrecognizable space, but sorting by cluster brings them into their tight communities and lets the viewer see some of the interaction between clusters too.

My encoding of the co-location and mapping of shades applied to each cell would certainly be open to debate, and other methods of clustering result in very different communities being found. That said, qualitatively, I spent a lot of time evaluating the results with my own knowledge of the text and found that the current implementation was more satisfactory than any other I tested.

I found it incredibly interesting that every main character in the book at some time interacts with almost every other character. I wouldn’t have expected so much overlap. Compared to Les Mis, the plot is much more dense. I suspect this is due to the 10-fold difference in the number of chapters being clustered over.

Characteristic Words

This plot is probably the most conventional of the four plots, but potentially shows a lot of insights into the text all the same. As the word window surrounded mentions of Yossarian, I hoped that it would capture some of the changing emotions that our main character was feeling at different points in the book.

I used nltk’s simplified part-of-speech tagging in order to get the most rich binning of word types, and allow the user to select the part-of-speech that they want to examine. Only the top few words in the word-type are shown, but this is recalculated as a chapter range is zoomed into. In this way, the words change as the user explores.

The word frequencies were normalized by the number of mentions Yossarian received in a given chapter. For example, if the word “war” occurred 20 times in chapter 1, and Yossarian received 40 mentions in that chapter, a value of 0.5 would be assigned to that word. Thus, chapters with exceedingly high mentions of our main character would not receive a bias and more important words would rise to the top.

I give the option of choosing either a stacked bar or a stacked area plot for this visualization. I like how the stacked area plot better shows consecutive chapters where a word was prominent, but acknowledge that the triangular forms can distort the relationship when there is high variability between chapters. I do quite like the aesthetic of the landscape it presents though.

The dynamic recalculation of top words makes the plot different for every setting, and can give a wide range of insights depending on the selections. While many of the top words are uninteresting, and some are even misclassified by nltk, there were several times when I found some very interesting words being prominent.

Conclusion

I learnt a lot through this process, both in terms of using shiny, as well as about the book itself. I’m extremely happy with the diverse visualizations that I created here, and hope that they are able to help others appreciate the book even more!

Introducing the Kaggle Rank-O-Tron

2014-04-11T00:00:00+00:00

I had a rare couple of days with not (too) much work this past weekend. I had been sandbagging an idea of visualizing the entire Kaggle leaderboard for some time, instead of just user rankings within a single competition.

I began by scraping the Kaggle community pages using BeautifulSoup in Python; it was good to refresh my HTML scraping and regex skills to get all the data I wanted. Once implemented, this was a fairly trivial exercise and the whole process is completed in around 15 minutes. Unfortunately though, the user tiers (Kaggler, Master) are not encoded in these pages, and I wanted to show where the Kaggle elite stood on the curve as well. So I had to implement a second pass where I visited each user’s profile page and extracted their tier there. This meant visiting thousands of pages, which takes over 10 hours to perform.

My Data Visualization class led by the talented Sophie Engle has equipped me with a lot of fun new toys to play with, most notably Shiny from RStudio. Thus a mere graph and static blog comment was no longer enough for me. Instead, I decided to make an interactive tool that can show you (approximately) where you will rank in the future given an outcome of a pending competition deadline.

The Kaggle Rank-O-Tron is now live, and is hosted on one of RStudio’s beta servers. You can input your current Kaggle points and where you expect to finish in your next comp, and it will then show you roughly how you will progress.

This is based solely on the equation used by Kaggle to calculate your points. Right now, there is no consideration of several factors that can jostle the users around you. But as I update the database after each competition ends, I’ll try to look for ways to improve the predictions using, well, machine learning of course!

I hope you have some fun with it, but remember, submissions come first!

Kaggle’s Titanic: Getting Started With R - Addendum & Chocolate

2014-02-22T00:00:00+00:00

One of our MSAN professors, Nick Ross, just loves his trivia. Each time we have our Business Strategies class we get a little dose of fun facts at half-time, and last week we learnt that Milton S. Hershey, the founder of the famous chocolate company, had paid a pretty handsome deposit to board the Titanic with his wife (FWIW, I’m more of a Cadbury guy, being Australian and all… but still, chocolate is chocolate).

As it turned out, he never did board the unsinkable ship and gave up several thousand dollars in today’s money that he paid as a deposit for a plush first class cabin. Since this is a pretty remarkable coincidence given my recent posts, I thought it might be fun to see what the conditional inference tree model predicted would have happened to him if he had ended up on board.

So, what do we know? Well, we have the name “Milton S. Hershey”, or in the dataset’s terms, “Hershey, Mr. Milton S.”. We also know that he booked a first class cabin for himself and his wife and paid a $300 deposit:

Let’s assume that this was a 50% down-payment for two tickets, so $300 could be used for his fare. A little Wikipedia tells us he was born on September 13, 1857, which would mean he was 54 years old when the ship left port on April 10, 1912. He was also trying to get from England to the US, so let’s assume that he was sailing from Southampton, though I have been unable to find the exact port he was planning to embark at.

Since we never used the ticket number, or cabin number, for our predictions, we can just leave these as NA values. So let’s build a special Hershey dataframe and combine it with the combi dataframe we built in the tutorial (before we transformed it to build the engineered variables):

> # Sex is lower-case 'male' so that it matches the existing levels in train and test
> Hershey <- data.frame(Pclass=1, Sex='male', Age=54, SibSp=1, Parch=0,
                        Fare=300, Embarked='S', PassengerId=NA, Survived=NA,
                        Name='Hershey, Mr. Milton S.', Ticket=NA, Cabin=NA)
> combi <- rbind(train, test, Hershey)

Okay. So now we run through the rest of the tutorial and make our engineered variables, but this time, when we split it back up into the train and test sets, we also break out the Hershey dataset:

> Hershey <- combi[1310,]

We then train our model as before, and finally make our prediction on whether he was a lucky guy or not:

> predict(fit, Hershey, OOB=TRUE, type = "response")
[1] 0

Oh dear! Imagine a world without kisses!

So, sadly, our model tells us that Hershey would have perished in the Titanic disaster. Perhaps you would like to dig into whether some of the other famous people who were almost aboard the famous boat would have escaped or not?

Titanic: Getting Started With R - Part 5: Random Forests

2014-01-19T00:00:00+00:00

Tutorial index

Seems fitting to start with a definition,

en-sem-ble

A unit or group of complementary parts that contribute to a single effect, especially:

A coordinated outfit or costume.

A coordinated set of furniture.

A group of musicians, singers, dancers, or actors who perform together

While I won’t be teaching about how to best coordinate your work attire or living room, I think the musician metaphor works here. In an ensemble of talented instrumentalists, the issues one might have with an off-note are overpowered by the others in the group.

The same goes for machine learning. Take a large collection of individually imperfect models, and their one-off mistakes are probably not going to be made by the rest of them. If we average the results of all these models, we can sometimes find a superior model from their combination than any of the individual parts. That’s how ensemble models work, they grow a lot of different models, and let their outcomes be averaged or voted across the group.

We are now well aware of the overfitting problems with decision trees. But if we grow a whole lot of them and have them vote on the outcome, we can get past this limitation. Let’s build a very small ensemble of three simple decision trees to illustrate:

Each of these trees makes its classification decisions based on different variables. So let’s imagine a female passenger from Southampton who rode in first class. Trees one and two would vote that she survived, but tree three votes that she perishes. If we take a vote, it’s 2 to 1 in favour of her survival, so we would classify this passenger as a survivor.
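The tutorial itself is in R, but the vote is simple enough to sketch in a couple of lines of Python. The votes below just restate the example above, with 1 for survived and 0 for perished.

```python
from collections import Counter

# Votes from the three trees for one passenger (1 = survived, 0 = perished)
votes = [1, 1, 0]
prediction = Counter(votes).most_common(1)[0][0]
print(prediction)  # 1
```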

Random Forest models grow trees much deeper than the decision stumps above, in fact the default behaviour is to grow each tree out as far as possible, like the overfitting tree we made in lesson three. But since the formulas for building a single decision tree are the same every time, some source of randomness is required to make these trees different from one another. Random Forests do this in two ways.

The first trick is to use bagging, for bootstrap aggregating. Bagging takes a randomized sample of the rows in your training set, with replacement. This is easy to simulate in R using the sample function. Let’s say we wanted to perform bagging on a training set with 10 rows.

> sample(1:10, replace = TRUE)
 [1]  3  1  9  1  7 10 10  2  2  9

In this simulation, we would still have 10 rows to work with, but rows 1, 2, 9 and 10 are each repeated twice, while rows 4, 5, 6 and 8 are excluded. If you run this command again, you will get a different sample of rows each time. On average, around 37% of the rows will be left out of the bootstrapped sample. With these repeated and omitted rows, each decision tree grown with bagging would evolve slightly differently. If you have very strong features such as gender in our example though, that variable will probably still dominate the first decision in most of your trees.
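The “around 37%” figure can be checked with a little arithmetic: the chance that a given row is never picked in n draws with replacement is (1 - 1/n)^n. The sketch below is in Python rather than R, and the seed and trial count are arbitrary choices.

```python
import random

n_rows = 10
# Chance that a given row is never drawn in n_rows draws with replacement
analytic = (1 - 1 / n_rows) ** n_rows
# The same quantity for a much larger training set approaches 1/e
analytic_large = (1 - 1 / 1000) ** 1000

random.seed(0)
trials = 20000
left_out = 0
for _ in range(trials):
    drawn = {random.randint(1, n_rows) for _ in range(n_rows)}
    left_out += n_rows - len(drawn)
simulated = left_out / (trials * n_rows)

print(round(analytic, 3))        # 0.349
print(round(analytic_large, 3))  # 0.368
print(round(simulated, 3))
```

So with only 10 rows the left-out share is a touch under 35%, and it settles towards roughly 37% as the training set grows.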

The second source of randomness gets past this limitation though. Instead of looking at the entire pool of available variables, Random Forests take only a subset of them, typically the square root of the number available. In our case we have 10 variables, so using a subset of three variables would be reasonable. The selection of available variables is changed for each and every node in the decision trees. This way, many of the trees won’t even have the gender variable available at the first split, and might not even see it until several nodes deep.
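To make the per-node variable sampling concrete, here is a rough sketch in base R. It only mimics the idea of drawing candidate variables at each node; it is not how the randomForest package is implemented internally, and the node count and seed are arbitrary.

```r
set.seed(2)  # arbitrary seed for repeatability

# The ten predictors used later in this lesson
predictors <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare",
                "Embarked", "Title", "FamilySize", "FamilyID2")

# Square root of the number of variables, rounded down
mtry <- floor(sqrt(length(predictors)))

# Each node draws its own candidate set, without replacement
node_candidates <- lapply(1:5, function(node) sample(predictors, mtry))
node_candidates

# Which of these five nodes never got to consider Sex?
sapply(node_candidates, function(vars) !("Sex" %in% vars))
```

With three of ten variables on offer, any single node has a 3 in 10 chance of even seeing Sex, so it is left out of about 70% of splits.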

Through these two sources of randomness, the ensemble contains a collection of totally unique trees which all make their classifications differently. As with our simple example, each tree is called to make a classification for a given passenger, the votes are tallied (with perhaps many hundreds, or thousands of trees) and the majority decision is chosen. Since each tree is grown out fully, they each overfit, but in different ways. Thus the mistakes one makes will be averaged out over them all.
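The vote tallying can be sketched in a few lines. The votes below are made up for illustration (three hypothetical trees, three hypothetical passengers), with the first passenger matching the 2 to 1 example from earlier.

```r
# Rows are trees, columns are passengers; 1 = survived, 0 = perished
votes <- matrix(c(1, 0, 1,
                  1, 0, 0,
                  0, 1, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("tree", 1:3),
                                paste0("passenger", 1:3)))

# Share of trees voting for survival, per passenger
survival_share <- colMeans(votes)

# Majority decision; an odd number of trees avoids ties
majority <- as.integer(survival_share > 0.5)
majority
```

The first passenger gets two of three votes for survival and is classified as a survivor, while the other two are outvoted.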

R’s Random Forest algorithm has a few restrictions that we did not have with our decision trees. The big one has been the elephant in the room until now: we have to clean up the missing values in our dataset. rpart has a great advantage in that it can use surrogate variables when it encounters an NA value. In our dataset there are a lot of age values missing. If any of our decision trees split on age, the tree would search for another variable that split in a similar way to age, and use it instead. Random Forests cannot do this, so we need to find a way to manually replace these values.

A method we implicitly used in part 2 when we defined the adult/child age buckets was to assume that all missing values were the mean or median of the remaining data. Since then we’ve learned a lot of new skills though, so let’s use a decision tree to fill in those values instead. Let’s pick up where we left off last lesson, and take a look at the combined dataframe’s age variable to see what we’re up against:

> summary(combi$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.17   21.00   28.00   29.88   39.00   80.00     263

263 values out of 1309 were missing this whole time, that’s a whopping 20%! There are a few new pieces of syntax to use here. Instead of subsetting by boolean logic, we can use the R function is.na(), and its negation !is.na() (the bang symbol represents “not”). This subsets on whether a value is missing or not. We now also want to use the method="anova" version of our decision tree, as we are not trying to predict a category any more, but a continuous variable. So let’s grow a tree on the subset of the data with the age values available, and then replace those that are missing:

> Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,
                  data=combi[!is.na(combi$Age),], 
                  method="anova")
> combi$Age[is.na(combi$Age)] <- predict(Agefit, combi[is.na(combi$Age),])

I left off the family IDs here as I didn’t think they’d have much impact on predicting age. You can go ahead and inspect the summary again; all those NA values are gone.
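If the is.na() and !is.na() subsetting is new to you, here is a small standalone sketch on a made-up vector of ages. Note that it fills the gaps with the median rather than a tree prediction, purely to keep the example short.

```r
# Made-up ages with two missing values
ages <- c(22, NA, 38, 26, NA, 35)

is.na(ages)    # TRUE wherever a value is missing
!is.na(ages)   # TRUE wherever a value is present

# Keep only the known values
known <- ages[!is.na(ages)]

# Overwrite only the missing positions (median fill as a simple stand-in)
ages[is.na(ages)] <- median(known)
ages
```

The same pattern is at work in the tree version: !is.na() picks the rows to train on, and is.na() picks the rows to overwrite.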

Let’s take a look at the summary of the entire dataset now to see if there are any other problem variables that we hadn’t noticed before:

> summary(combi)

Two jump out as problems, though nowhere near as bad as Age: Embarked and Fare are both lacking values, in two different ways.

> summary(combi$Embarked)
      C   Q   S
  2 270 123 914

Embarked has a blank for two passengers. While a blank wouldn’t be a problem for our model like an NA would be, since we’re cleaning anyhow, let’s get rid of it. Because it’s so few observations and such a large majority boarded in Southampton, let’s just replace those two with “S”. First we need to find out who they are though! We can use which for this:

> which(combi$Embarked == '')
[1]  62 830

This gives us the indexes of the blank fields. Then we simply replace those two, and encode it as a factor:

> combi$Embarked[c(62,830)] <- "S"
> combi$Embarked <- factor(combi$Embarked)
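Here is the same find-and-replace pattern on a tiny made-up factor, as a sketch of why the final factor() call is worth having: it drops the now-unused blank level.

```r
# Made-up embarkation codes, two of them blank
embarked <- factor(c("S", "", "C", "S", "", "Q"))

# Positions of the blank entries
blank_rows <- which(embarked == "")
blank_rows

# "S" is already a level, so the assignment is allowed
embarked[blank_rows] <- "S"
levels(embarked)   # the empty level "" is still listed

# Re-encoding drops levels that no longer occur
embarked <- factor(embarked)
levels(embarked)
```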

The other naughty variable was Fare, let’s take a look:

> summary(combi$Fare)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   7.896  14.450  33.300  31.280 512.300       1

It’s only one passenger with an NA, so let’s find out which one it is and replace it with the median fare:

> which(is.na(combi$Fare))
[1] 1044
> combi$Fare[1044] <- median(combi$Fare, na.rm=TRUE)

Okay. Our dataframe is now cleared of NAs. Now on to restriction number two: Random Forests in R can only digest factors with up to 32 levels. Our FamilyID variable had almost double that. We could take two paths forward here: either change these levels to their underlying integers (using the unclass() function) and have the tree treat them as continuous variables, or manually reduce the number of levels to keep it under the threshold.

Let’s take the second approach. To do this we’ll copy the FamilyID column to a new variable, FamilyID2, and then convert it from a factor back into a character string with as.character(). We can then increase our cut-off to be a “Small” family from 2 to 3 people. Then we just convert it back to a factor and we’re done:

> combi$FamilyID2 <- combi$FamilyID
> combi$FamilyID2 <- as.character(combi$FamilyID2)
> combi$FamilyID2[combi$FamilySize <= 3] <- 'Small'
> combi$FamilyID2 <- factor(combi$FamilyID2)

Okay, we’re down to 22 levels so we’re good to split the test and train sets back up as we did last lesson and grow a Random Forest. Install and load the package randomForest:

> install.packages('randomForest')
> library(randomForest)

Because the process has the two sources of randomness that we discussed earlier, it is a good idea to set the random seed in R before you begin. This makes your results reproducible next time you load the code up, otherwise you can get different classifications for each run.

> set.seed(415)

The number inside isn’t important, you just need to ensure you use the same seed number each time so that the same random numbers are generated inside the Random Forest function.
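A quick way to convince yourself of this is to draw the same random sample twice. This sketch reuses the sample() call from the bagging discussion; the object names are just for illustration.

```r
# Same seed, same draw
set.seed(415)
first_draw <- sample(1:10, replace = TRUE)

set.seed(415)
second_draw <- sample(1:10, replace = TRUE)

identical(first_draw, second_draw)   # TRUE

# Without resetting the seed, the generator simply carries on
third_draw <- sample(1:10, replace = TRUE)
third_draw
```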

Now we’re ready to run our model. The syntax is similar to decision trees, but there are a few extra options.

> fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
                                            Embarked + Title + FamilySize + FamilyID2,
                      data=train, 
                      importance=TRUE, 
                      ntree=2000)

Instead of specifying method="class" as with rpart, we force the model to predict our classification by temporarily changing our target variable to a factor with only two levels using as.factor(). The importance=TRUE argument allows us to inspect variable importance as we’ll see, and the ntree argument specifies how many trees we want to grow.

If you were working with a larger dataset you may want to reduce the number of trees, at least for initial exploration, or restrict the complexity of each tree using nodesize as well as reduce the number of rows sampled with sampsize. You can also override the default number of variables to choose from with mtry, but the default is the square root of the total number available and that should work just fine. Since we only have a small dataset to play with, we can grow a large number of trees and not worry too much about their complexity, it will still run pretty fast.

So let’s look at what variables were important:

> varImpPlot(fit)

Remember with bagging how roughly 37% of our rows would be left out? Well Random Forests doesn’t just waste those “out-of-bag” (OOB) observations, it uses them to see how well each tree performs on unseen data. It’s almost like a bonus test set to determine your model’s performance on the fly.

There are two types of importance measures shown above. The accuracy one tests how much worse the model performs when each variable’s values are randomly shuffled in the out-of-bag data, so a high decrease in accuracy would be expected for very predictive variables. The Gini one digs into the mathematics behind decision trees, but essentially measures how much purer the nodes become after each split on that variable, averaged over all the trees in the forest. Again, a high score means the variable was important.

Unsurprisingly, our Title variable was at the top for both measures. We should be pretty happy to see that the remaining engineered variables are doing quite nicely too. Anyhow, enough delay, let’s see how it did!

The prediction function works similarly to decision trees, and we can build our submission file in exactly the same way. It will take a bit longer though, as all 2000 trees need to make their classifications and then discuss who’s right:

> Prediction <- predict(fit, test)
> submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
> write.csv(submit, file = "firstforest.csv", row.names = FALSE)

Hrmm, well this actually worked out exactly the same as Kaggle’s Python random forest tutorial. I wouldn’t take that as the expected result from any forest though, this may just be pure coincidence. Its relatively poor performance does go to show that on smaller datasets, sometimes a fancier model won’t beat a simple one. Besides that, there’s also the private leaderboard as only 50% of the test data is evaluated for our public scores.

But let’s not give up yet. There’s more than one ensemble model. Let’s try a forest of conditional inference trees. They make their decisions in slightly different ways, using a statistical test rather than a purity measure, but the basic construction of each tree is fairly similar.

So go ahead and install and load the party package.

> install.packages('party')
> library(party)

We again set the seed for consistent results and build a model in a similar way to our Random Forest:

> set.seed(415)
> fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
                                       Embarked + Title + FamilySize + FamilyID,
                 data = train, 
                 controls=cforest_unbiased(ntree=2000, mtry=3))

Conditional inference trees are able to handle factors with more levels than Random Forests can, so let’s go back to our original version of FamilyID. You may have also noticed a few new arguments. Now we have to specify the number of trees inside a more complicated command, as arguments are passed to cforest differently. We also have to manually set the number of variables to sample at each node as the default of 5 is pretty high for our dataset. Okay, let’s make another prediction:

> Prediction <- predict(fit, test, OOB=TRUE, type = "response")

The prediction function requires a few extra nudges for conditional inference forests as you see. Let’s write a submission and submit it!

Congratulations! At the time of writing you are now in the top 5% of a Kaggle competition!

You’ve come a long way, from the bottom of the Kaggle leaderboard to the top! There may be a few more insights to wring from this dataset yet though. We never did look at the ticket or cabin numbers, so take a crack at extracting some insights from them to see if any more gains are possible. Maybe extracting the cabin letter (deck) or number (location) and extrapolating to the rest of the passenger’s family if they’re missing might be worth a try?

While there’s just no way that I could introduce all the R syntax you’ll need to navigate dataframes in different situations, I hope that I’ve given you a good start. I linked to some good R guides way back in the introduction that should help you learn more, here and here, as well as an excellent book The Art of R Programming: A Tour of Statistical Software Design to continue learning about programming in R.

Well that’s it for the tutorial series. I really hope that you can exceed the benchmark I’ve posted here. If you find some new ideas that develop the base that I’ve presented, be sure to contribute back to the community through the Kaggle forums, or comment on the blog.

I hope you found the tutorials interesting and informative, and that they gave you a taste for machine learning that will spur you to compete in the prize-eligible Kaggle competitions! I hope to see you on the leaderboards out in the wild. Good luck and happy learning!

All code from this tutorial is available on my Github repository

Titanic: Getting Started With R - Part 4: Feature Engineering

2014-01-16T00:00:00+00:00

Tutorial index

Feature engineering is so important to how your model performs, that even a simple model with great features can outperform a complicated algorithm with poor ones. In fact, feature engineering has been described as easily the most important factor in determining the success or failure of your predictive model. Feature engineering really boils down to the human element in machine learning. How much you understand the data, with your human intuition and creativity, can make the difference.

So what is feature engineering? It can mean many things to different problems, but in the Titanic competition it could mean chopping, and combining different attributes that we were given by the good folks at Kaggle to squeeze a little bit more value from them. In general, an engineered feature may be easier for a machine learning algorithm to digest and make rules from than the variables it was derived from.

The initial suspects for gaining more machine learning mojo from are the three text fields that we never sent into our decision trees last time. While the ticket number, cabin, and name were all unique to each passenger; perhaps parts of those text strings could be extracted to build a new predictive attribute. Let’s start with the name field. If we take a glance at the first passenger’s name we see the following:

> train$Name[1]
[1] Braund, Mr. Owen Harris
891 Levels: Abbing, Mr. Anthony Abbott, Mr. Rossmore Edward ... Zimmerman, Mr. Leo

Previously we have only accessed passenger groups by subsetting, now we access an individual by using the row number, 1, as an index instead. Okay, no one else on the boat had that name, that’s pretty much certain, but what else might they have shared? Well, I’m sure there were plenty of Mr’s aboard. Perhaps the person’s title might give us a little more insight.

If we scroll through the dataset we see many more titles including Miss, Mrs, Master, and even the Countess! The title “Master” is a bit outdated now, but back in these days, it was reserved for unmarried boys. Additionally, the nobility such as our Countess would probably act differently to the lowly proletariat too. There seems to be a fair few possibilities of patterns in this that may dig deeper than the combinations of age, gender, etc that we looked at before.

In order to extract these titles to make new variables, we’ll need to perform the same actions on both the training and testing set, so that the features are available for growing our decision trees, and making predictions on the unseen testing data. An easy way to perform the same processes on both datasets at the same time is to merge them. In R we can use rbind, which stands for row bind, so long as both dataframes have the same columns as each other. Since we obviously lack the Survived column in our test set, let’s create one full of missing values (NAs) and then row bind the two datasets together:

> test$Survived <- NA
> combi <- rbind(train, test)

Now we have a new dataframe called “combi” with all the same rows as the original two datasets, stacked in the order in which we specified: train first, and test second.

If you look back at the output of our inquiry on Owen, his name is still encoded as a factor. As we mentioned earlier in the tutorial series, strings are automatically imported as factors in R, even if it doesn’t make sense. So we need to cast this column back into a text string. To do this we use as.character. Let’s do this and then take another look at Owen:

> combi$Name <- as.character(combi$Name)
> combi$Name[1]
[1] "Braund, Mr. Owen Harris"

Excellent, no more levels, now it’s just pure text. In order to break apart a string, we need some hooks to tell the program to look for. Nicely, we see that there is a comma right after the person’s last name, and a full stop after their title. We can easily use the function strsplit, which stands for string split, to break apart our original name over these two symbols. Let’s try it out on Mr. Braund:

> strsplit(combi$Name[1], split='[,.]')
[[1]]
[1] "Braund"       " Mr"          " Owen Harris"

Okay, good. Here we have sent strsplit the cell of interest, and given it some symbols to choose from when splitting the string up, either a comma or period. Those symbols in the square brackets are called regular expressions, though this is a very simple one, and if you plan on working with a lot of text I would certainly recommend getting used to using them!

We see that the title has been broken out on its own, though there’s a strange space before it begins because the comma occurred at the end of the surname. But how do we get to that title piece and clear out the rest of the stuff we don’t want? An index [[1]] is printed before the text portions. Let’s try to dig into this new type of container by appending all those square brackets to the original command:

> strsplit(combi$Name[1], split='[,.]')[[1]]
[1] "Braund"       " Mr"          " Owen Harris"

Getting there! String split returns a list, a doubly nested container, because it can never be sure that a given regex will produce the same number of pieces for every string. If there were more commas or periods in the name, it would create more segments, so it hides them a level deeper to maintain the rectangular types of containers that we are used to in things like spreadsheets, or now dataframes! Let’s go a level deeper into the indexing mess and extract the title. It’s the second item in this nested list, so let’s dig in to index number 2 of this new container:

> strsplit(combi$Name[1], split='[,.]')[[1]][2]
[1] " Mr"

Great. We have isolated the title we wanted at last. But how to apply this transformation to every row of the combined train/test dataframe? Luckily, R has some extremely useful functions that apply more complicated functions one row at a time. As we had to dig into this container to get the title, simply trying to run combi$Title <- strsplit(combi$Name, split='[,.]')[[1]][2] over the whole name vector would result in all of our rows having the same value of Mr., so we need to work a bit harder. Unsurprisingly applying a function to a lot of cells in a dataframe or vector uses the apply suite of functions of R:

> combi$Title <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})

R’s apply functions all work in slightly different ways, but sapply will work great here. We feed sapply our vector of names and our function that we just came up with. It runs through the rows of the vector of names, and sends each name to the function. The results of all these string splits are all combined up into a vector as output from the sapply function, which we then store to a new column in our original dataframe, called Title.

Finally, we may wish to strip off those spaces from the beginning of the titles. Here we can just substitute the first occurrence of a space with nothing. We can use sub for this (gsub would replace all spaces, poor “the Countess” would look strange then though):

> combi$Title <- sub(' ', '', combi$Title)
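Here is the whole title extraction on a three-name vector you can run without the Titanic data loaded. The first name is Mr. Braund from above; the other two are made up in the same format, and USE.NAMES = FALSE is an extra I’ve added only to keep the printed output tidy.

```r
# One real name from the tutorial, two made-up ones in the same format
passenger_names <- c("Braund, Mr. Owen Harris",
                     "Smith, Mrs. Jane",
                     "Doe, the Countess. of (Example)")

# Split each name on commas and periods, keep the second piece
titles <- sapply(passenger_names,
                 FUN = function(x) {strsplit(x, split = '[,.]')[[1]][2]},
                 USE.NAMES = FALSE)
titles   # each title still has a leading space

# sub() removes only the first space
titles <- sub(' ', '', titles)
titles

# gsub() would remove every space, mangling multi-word titles
gsub(' ', '', " the Countess")
```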

Alright, we now have a nice new column of titles, let’s have a look at it:

> table(combi$Title)
        Capt          Col          Don         Dona           Dr     Jonkheer         Lady
           1            4            1            1            8            1            1
       Major       Master         Miss         Mlle          Mme           Mr          Mrs
           2           61          260            2            1          757          197
          Ms          Rev          Sir the Countess
           2            8            1            1

Hmm, there are a few very rare titles in here that won’t give our model much to work with, so let’s combine a few of the most unusual ones. We’ll begin with the French. Mademoiselle and Madame are pretty similar (so long as you don’t mind offending) so let’s combine them into a single category:

> combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'

What have we done here? The %in% operator checks to see if a value is part of the vector we’re comparing it to. So here we are combining two titles, “Mme” and “Mlle”, into a new temporary vector using the c() operator and seeing if any of the existing titles in the entire Title column match either of them. We then replace any match with “Mlle”.
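To see what %in% hands back before it is used for replacement, here is a short sketch on a made-up vector of titles.

```r
# Made-up titles
titles <- c("Mr", "Mme", "Miss", "Mlle", "Mrs")

# One TRUE/FALSE per element: is it in the comparison vector?
is_french <- titles %in% c("Mme", "Mlle")
is_french

# Use that logical vector to overwrite only the matches
titles[is_french] <- "Mlle"
titles
```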

Let’s keep looking for redundancy. It seems the very rich are a bit of a problem for our set here too. For the men, we have a handful of titles that only one or two have been blessed with: Captain, Don, Major and Sir. All of these are either military titles, or rich fellas who were born with vast tracts of land.

For the ladies, we have Dona, Lady, Jonkheer (*see comments below), and of course our Countess. All of these are again the rich folks, and may have acted somewhat similarly due to their noble birth. Let’s combine these two groups and reduce the number of factor levels to something that a decision tree might make sense of:

> combi$Title[combi$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
> combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'

Our final step is to change the variable type back to a factor, as these are essentially categories that we have created:

> combi$Title <- factor(combi$Title)

Alright. We’re done with the passenger’s title now. What else can we think up? Well, there are those two variables SibSp and Parch that indicate the number of family members the passenger is travelling with. Seems reasonable to assume that a large family might have trouble tracking down little Johnny as they all scramble to get off the sinking ship, so let’s combine the two variables into a new one, FamilySize:

> combi$FamilySize <- combi$SibSp + combi$Parch + 1

Pretty simple! We just add the number of siblings, spouses, parents and children the passenger had with them, and plus one for their own existence of course, and have a new variable indicating the size of the family they travelled with.

Anything more? Well we just thought about a large family having issues getting to lifeboats together, but maybe specific families had more trouble than others? We could try to extract the Surname of the passengers and group them to find families, but a common last name such as Johnson might have a few extra non-related people aboard. In fact there are three Johnsons in a family with size 3, and another three probably unrelated Johnsons all travelling solo.

Combining the Surname with the family size though should remedy this concern. No two family-Johnson’s should have the same FamilySize variable on such a small ship. So let’s first extract the passengers’ last names. This should be a pretty simple change from the title extraction code we ran earlier, now we just want the first part of the strsplit output:

> combi$Surname <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})

We then want to append the FamilySize variable to the front of it, but as we saw with factors, string operations need strings. So let’s convert the FamilySize variable temporarily to a string and combine it with the Surname to get our new FamilyID variable:

> combi$FamilyID <- paste(as.character(combi$FamilySize), combi$Surname, sep="")

We used the function paste to bring two strings together, and told it to separate them with nothing through the sep argument. This was stored to a new column called FamilyID. But those three single Johnsons would all have the same Family ID. Given we were originally hypothesising that large families might have trouble sticking together in the panic, let’s knock out any family size of two or less and call it a “small” family. This would fix the Johnson problem too.

> combi$FamilyID[combi$FamilySize <= 2] <- 'Small'
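The two steps above, pasting and then relabelling, can be tried on a handful of made-up passengers. The vector names here are illustrative stand-ins for the dataframe columns.

```r
# Made-up family sizes and surnames
family_size <- c(3, 1, 11, 2)
surname <- c("Johnson", "Johnson", "Sage", "Braund")

# Glue the size onto the front of the surname, with no separator
family_id <- paste(as.character(family_size), surname, sep = "")
family_id

# Anyone in a family of two or fewer gets the catch-all label
family_id[family_size <= 2] <- 'Small'
family_id
```

The solo Johnson no longer shares an ID with the three-person Johnson family, which is exactly the fix we were after.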

Let’s see how we did for identifying these family groups:

> table(combi$FamilyID)
           11Sage           3Abbott         3Appleton         3Beckwith           3Boulos
               11                 3                 1                 2                 3
          3Bourke            3Brown         3Caldwell          3Christy          3Collyer
                3                 4                 3                 2                 3
         3Compton          3Cornell           3Coutts           3Crosby           3Danbom
                3                 1                 3                 3                 3 . . .

Hmm, a few seem to have slipped through the cracks here. There are plenty of FamilyIDs with only one or two members, even though we wanted only family sizes of 3 or more. Perhaps some families had different last names, but whatever the case, all these one or two person groups are what we sought to avoid with the three person cut-off. Let’s begin to clean this up:

> famIDs <- data.frame(table(combi$FamilyID))

Now we have stored the table above to a dataframe. Yep, you can store most tables to a dataframe if you want to, so let’s take a look at it by clicking on it in the explorer:

Here we see again all those naughty families that didn’t work well with our assumptions, so let’s subset this dataframe to show only those unexpectedly small FamilyID groups.

> famIDs <- famIDs[famIDs$Freq <= 2,]

We then need to overwrite any family IDs in our dataset for groups that were not correctly identified and finally convert it to a factor:

> combi$FamilyID[combi$FamilyID %in% famIDs$Var1] <- 'Small'
> combi$FamilyID <- factor(combi$FamilyID)
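The whole clean-up can be followed end to end on a toy dataframe. The IDs below are borrowed from the table above but the counts are made up, and the object names (toy, fam_counts, too_small) are mine.

```r
# Toy data: one proper family of three, two undersized groups
toy <- data.frame(
  FamilyID = c("3Abbott", "3Abbott", "3Abbott",
               "3Appleton",
               "3Beckwith", "3Beckwith",
               "Small", "Small", "Small"),
  stringsAsFactors = FALSE
)

# A table stored as a dataframe gets the columns Var1 and Freq
fam_counts <- data.frame(table(toy$FamilyID))
fam_counts

# Keep only the groups with one or two members
too_small <- fam_counts[fam_counts$Freq <= 2, ]

# Relabel those groups, then convert to a factor
toy$FamilyID[toy$FamilyID %in% too_small$Var1] <- 'Small'
toy$FamilyID <- factor(toy$FamilyID)
table(toy$FamilyID)
```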

We are now ready to split the test and training sets back into their original states, carrying our fancy new engineered variables with them. The nicest part of what we just did is how the factors are treated in R. Behind the scenes, factors are basically stored as integers, but masked with their text names for us to look at. If you create the above factors on the isolated test and train sets separately, there is no guarantee that both groups exist in both sets.

For instance, the family “3Johnson” previously discussed does not exist in the test set. We know that all three of them survive from the training set data. If we had built our factors in isolation, there would be no factor “3Johnson” for the test set. This would upset any machine learning model because the factors between the training set used to build the model and the test set it is asked to predict for are not consistent, i.e. R will throw errors at you if you try.

Because we built the factors on a single dataframe, and then split it apart after we built them, R will give all factor levels to both new dataframes, even if the factor doesn’t exist in one. It will still have the factor level, but no actual observations of it in the set. Neat trick right? Let me assure you that manually updating factor levels is a pain.
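You can watch the levels travel along with a subset in a small sketch. The dataframe below is made up, with a hypothetical “4Smith” family added purely for illustration.

```r
# Made-up combined data with the factor built once, up front
combined <- data.frame(
  id = 1:6,
  FamilyID = factor(c("3Johnson", "3Johnson", "3Johnson",
                      "Small", "Small", "4Smith"))
)

# Split by row ranges, as we are about to do with combi
train_part <- combined[1:4, ]
test_part  <- combined[5:6, ]

# The test part keeps every level, even ones with no rows in it
levels(test_part$FamilyID)
table(test_part$FamilyID)

# Built in isolation, the same rows would lose the missing level
isolated <- factor(c("Small", "4Smith"))
levels(isolated)
```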

So let’s break them apart and do some predictions on our new fancy engineered variables:

> train <- combi[1:891,]
> test <- combi[892:1309,]

Here we introduce yet another subsetting method in R; there are many depending on how you want to chop up your data. We have isolated certain ranges of rows of the combi dataset based on the sizes of the original train and test sets. The comma after that with no numbers following it indicates that we want to take ALL columns with this subset and store it to the assigned dataframe. This gives us back our original number of rows, as well as all our new variables including the consistent factor levels.

Time to do our predictions! We have a bunch of new variables, so let’s send them to a new decision tree. Last time the default complexity worked out pretty well, so let’s just grow a tree with the vanilla controls and see what it can do:

> fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID,
               data=train, 
               method="class")

Interestingly our new variables are basically governing our tree. Here’s another drawback with decision trees that I didn’t mention last time: they are biased to favour factors with many levels. Look at how our 61-level FamilyID factor is so prominent here, and the tree picked out all the families that are biased one way more than the others. This way the decision node can chop and change the data into the best possible combination for purity of the following nodes.

But all that aside, you should now know how to create a submission from a decision tree, so let’s see how it performed!

Awesome, we just almost halved our rank! All by squeezing a bit more value out of what we already had. And this is just a sample of what you might be able to find in this dataset.

Go ahead and try and create some more engineered variables! As before, I also really encourage you to play around with the complexity parameters and maybe try trimming some deeper trees to see if it helps or hinders your rank. You may even consider excluding some variables from the tree to see if that changes anything too.

In most cases though, the title or gender variables will govern the first decision due to the greedy nature of decision trees. The bias towards many-levelled factors won’t go away either, and the overfitting problem can be difficult to gauge without actually sending in submissions, but good judgement can help.

Next lesson, we will overcome these limitations by building an ensemble of decision trees with the powerful Random Forest algorithm. Go there now!

All code from this tutorial is available on my Github repository

Titanic: Getting Started With R - Part 3: Decision Trees

2014-01-13T00:00:00+00:00

Tutorial index

Last lesson we sliced and diced the data to try and find subsets of the passengers that were more, or less, likely to survive the disaster. We climbed up the leaderboard a great deal, but it took a lot of effort to get there. To find more fine-grained subsets with predictive ability would require a lot of time to adjust our bin sizes and look at the interaction of many different variables. Luckily there is a simple and elegant algorithm that can do this work for us. Today we’re going to use machine learning to build decision trees to do the heavy lifting for us.

Decision trees have a number of advantages. They are what’s known as a glass-box model: after the model has found the patterns in the data you can see exactly what decisions will be made for unseen data that you want to predict. They are also intuitive and can be read by people with little experience in machine learning after a brief explanation. Finally, they are the basis for some of the most powerful and popular machine learning algorithms.

I won’t get into the mathematics here, but conceptually, the algorithm starts with all of the data at the root node (drawn at the top) and scans all of the variables for the best one to split on. The way it measures this is to make the split on the variable that results in the most pure nodes below it, i.e. with either the most 1’s or the most 0’s in the resulting buckets. But let’s look at something more familiar to get the idea. Here we draw a decision tree for only the gender variable, and some familiar numbers jump out:

Let’s decode the numbers shown on this new representation of our original manual gender-based model. The root node, at the top, shows our tutorial one insights, 62% of passengers die, while 38% survive. The number above these proportions indicates the way that the node is voting (recall we decided at this top level that everyone would die, or be coded as zero) and the number below indicates the proportion of the population that resides in this node, or bucket (here at the top level it is everyone, 100%).

So far, so good. Now let’s travel down the tree branches to the next nodes down the tree. If the passenger was a male, indicated by the boolean choice below the node, you move left, and if female, right. The survival proportions exactly match those we found in tutorial two through our proportion tables. If the passenger was male, only 19% survive, so the bucket votes that everyone here (65% of passengers) perish, while the female bucket votes in the opposite manner, most of them survive as we saw before. In fact, the above decision tree is an exact representation of our gender model from last lesson.
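
To put a number on what “pure” means, here is a minimal sketch using Gini impurity, one common purity measure and the default splitting criterion for classification trees in rpart. The passenger counts are my assumption, taken from the Kaggle training set (891 passengers), and the gini helper is a name I made up for illustration:

```r
# Gini impurity of a node holding `survived` survivors out of `total` passengers.
# It is 0 for a perfectly pure node and 0.5 for a 50/50 node.
gini <- function(survived, total) {
  p <- survived / total
  2 * p * (1 - p)
}

# Counts assumed from the Kaggle training set.
root   <- gini(342, 891)   # everyone
male   <- gini(109, 577)   # left branch
female <- gini(233, 314)   # right branch

# Impurity after the split: weight each child by its share of the passengers.
after_split <- (577 / 891) * male + (314 / 891) * female

round(c(root = root, after_split = after_split), 3)
```

The impurity after the gender split is lower than at the root, which is exactly what the algorithm is hunting for when it scans the variables.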

The final nodes at the bottom of the decision tree are known as terminal nodes, or sometimes as leaf nodes. After all the boolean choices have been made for a given passenger, they will end up in one of the leaf nodes, and the majority vote of all passengers in that bucket determines how we will predict for new passengers with unknown fates.

But you can keep going, and this is what I alluded to at the end of the last lesson. We can grow this tree until every passenger is classified and all the nodes are marked with either 0% or 100% chance of survival… All that chopping and comparing of subsets is taken care of for us in the blink of an eye!

Decision trees do have some drawbacks though: they are greedy. They make the decision at the current node that appears to be the best at the time, but are unable to change their minds as they grow new nodes. Perhaps a better, purer tree would have been grown if the gender split occurred later? It is really hard to tell; there are a huge number of decisions that could be made, and exploring every possible version of a tree is extremely computationally expensive. This is why the greedy algorithm is used.

As an example, imagine a cashier in a make-believe world with a currency including 25c, 15c and 1c coins. The cashier must make change for 30c using the smallest number of coins possible. A greedy algorithm would start with the coin that leaves the smallest amount of change left to pay:

Greedy: 25 + 1 + 1 + 1 + 1 + 1 = 30c, with 6 coins
Optimal: 15 + 15 = 30c, with 2 coins

Clearly the greedy cashier algorithm failed to find the best solution here, and the same is true with decision trees. Though they usually do a great job given their speed and the other advantages we already mentioned, the optimal solution is not guaranteed. Decision trees are also prone to overfitting which requires us to use caution with how deep we grow them as we’ll see later.
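
To see the cashier’s dilemma in code, here’s a minimal sketch in R. The function names greedy_change and optimal_change are my own for illustration, and the optimal version uses dynamic programming, a technique this tutorial doesn’t otherwise cover:

```r
# Greedy change-making: always hand over the largest coin that still fits.
greedy_change <- function(amount, coins) {
  coins <- sort(coins, decreasing = TRUE)
  used <- c()
  for (coin in coins) {
    while (amount >= coin) {
      used <- c(used, coin)
      amount <- amount - coin
    }
  }
  used
}

# Optimal change: work out the fewest coins for every amount up to the target.
optimal_change <- function(amount, coins) {
  best <- c(0, rep(Inf, amount))   # best[a + 1] = fewest coins for amount a
  last <- rep(NA, amount + 1)      # last[a + 1] = a coin used to reach amount a
  for (a in 1:amount) {
    for (coin in coins) {
      if (coin <= a && best[a - coin + 1] + 1 < best[a + 1]) {
        best[a + 1] <- best[a - coin + 1] + 1
        last[a + 1] <- coin
      }
    }
  }
  # Walk back from the target to recover the coins themselves.
  used <- c()
  while (amount > 0) {
    used <- c(used, last[amount + 1])
    amount <- amount - last[amount + 1]
  }
  used
}

coins <- c(25, 15, 1)
greedy  <- greedy_change(30, coins)    # 25 1 1 1 1 1
optimal <- optimal_change(30, coins)   # 15 15
```

The optimal version has to consider every amount on the way to 30c, which is the kind of exhaustive work that becomes far too expensive when the “coins” are all the possible splits in a tree.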

So, let’s get started with our first real algo! Now we start to open up the power of R: its packages. R is extremely extensible; you’d be hard pressed to find a task that some package doesn’t already handle. There are thousands of options out there written by people who needed the functionality and published their work. You can easily add these packages within R with just a couple of commands.

The one we’ll need for this lesson comes with R. It’s called rpart for “Recursive Partitioning and Regression Trees” and uses the CART decision tree algorithm. While rpart ships with the standard R installation, you still need to load it each time you want to use it. Go ahead:

> library(rpart)

Now let’s build our first model. Let’s take a quick review of the possible variables we could look at. Last time we used aggregate and proportion tables to compare gender, age, class and fare. But we never did investigate SibSp, Parch or Embarked. The remaining variables of passenger name, ticket number and cabin number are all unique identifiers for now; they don’t give any new subsets that would be interesting for a decision tree. So let’s build a tree off everything else.

The format of the rpart command works similarly to the aggregate function we used in tutorial 2. You feed it the formula, headed up by the variable of interest and followed by the variables used for prediction. You then point it at the data, and for now, follow with the type of prediction you want to run (see ?rpart for more info). If you wanted to predict a continuous variable, such as age, you may use method="anova". This would generate decimal quantities for you. But here, we just want a one or a zero, so method="class" is appropriate:

> fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
               data=train,
               method="class")
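
If you want to see the difference between the two methods without the Titanic files at hand, here is a small sketch on data frames that ship with R. The kyphosis data (bundled with rpart) and mtcars are stand-ins I picked for illustration; they are not part of this tutorial:

```r
library(rpart)

# Classification: the response is a factor, so each leaf votes for a class.
class_fit <- rpart(Kyphosis ~ Age + Number + Start,
                   data = kyphosis,
                   method = "class")

# Regression: the response is numeric, so each leaf predicts an average value.
anova_fit <- rpart(mpg ~ wt + hp,
                   data = mtcars,
                   method = "anova")

class_pred <- predict(class_fit, type = "class")  # factor labels
anova_pred <- predict(anova_fit)                  # decimal quantities

head(class_pred)
head(anova_pred)
```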

Let’s examine the tree. There are a lot of ways to do this, and the built-in version requires running:

> plot(fit)
> text(fit)

Hmm, not very pretty or insightful. To get some more informative graphics, you will need to install some external packages. As I mentioned, tons of world-class developers donate their time and energy to the R project by contributing powerful packages to CRAN, free of charge. You can install them from within R using install.packages(), and load them as before with library(). Here are the ones we need for some better graphics for rpart:

> install.packages('rattle')
> install.packages('rpart.plot')
> install.packages('RColorBrewer')
> library(rattle)
> library(rpart.plot)
> library(RColorBrewer)

Let’s try rendering this tree a bit nicer with fancyRpartPlot (of course).

> fancyRpartPlot(fit)

Okay, now we’ve got somewhere readable. The decisions that have been found go a lot deeper than what we saw last time when we looked for them manually. Decisions have been found for the SibSp variable, as well as the port of embarkation one that we didn’t even look at. And on the male side, the kids younger than 6 years old have a better chance of survival, even if there weren’t too many aboard. That resonates with the famous naval law we mentioned earlier. It all looks very promising, so let’s send another submission into Kaggle!

Making a prediction from this tree doesn’t require all the subsetting and overwriting we did last lesson; it’s actually a lot easier.

> Prediction <- predict(fit, test, type = "class")
> submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
> write.csv(submit, file = "myfirstdtree.csv", row.names = FALSE)

Here we have called rpart’s predict function. We point it at the model’s fit object, which contains all of the decisions we see above, and tell it to work its magic on the test dataframe. No need to tell it which variables we originally used in the model-building phase; it automatically looks for them and will certainly let you know if something is wrong. Finally we tell it to return classes with type = "class" (for ones and zeros output) and, as before, write the output to a dataframe and submission file.
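
If you’re curious what else predict can hand back, here is a minimal sketch on the kyphosis data frame that ships with rpart. It is a stand-in for the Titanic files, so the variable names and the arbitrary 60/21 row split are my assumptions. It contrasts type = "class", which returns hard labels, with type = "prob", which returns the class proportions of the leaf each row lands in:

```r
library(rpart)

# Stand-in data: kyphosis ships with rpart; the 60/21 row split is arbitrary.
train_part <- kyphosis[1:60, ]
test_part  <- kyphosis[61:81, ]

demo_fit <- rpart(Kyphosis ~ Age + Number + Start,
                  data = train_part,
                  method = "class")

# Hard labels, one per row of the new data.
labels <- predict(demo_fit, test_part, type = "class")

# Class probabilities, one column per class.
probs <- predict(demo_fit, test_part, type = "prob")

head(data.frame(label = labels, probs))
```

For a Kaggle submission that wants ones and zeros, the class labels are what you need; the probabilities are handy when you want to see how confident each leaf is.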

Let’s send it in and see how our algorithm performed!

Nice! We just jumped hundreds of spots with only an extra 0.5% increase in accuracy! Are you getting the picture here? The higher you climb in a Kaggle leaderboard, the more important these fractional percentage bumps become.

The rpart package automatically caps the depth that the tree grows by using a metric called complexity which stops the resulting model from getting too out of hand. But we already saw that a more complex model than what we made ourselves did a bit better, so why not go all out and override the defaults? Let’s do it.

You can find the default limits by typing ?rpart.control. The first one we want to unleash is the cp parameter; this is the metric that stops splits that aren’t deemed important enough. The other one we want to open up is minsplit, which governs how many passengers must sit in a bucket before even looking for a split. Let’s max both out by reducing cp to zero and minsplit to 2 (no split would obviously be possible for a single passenger in a bucket):

> fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
               data=train,
               method="class", 
               control=rpart.control(minsplit=2, cp=0))
> fancyRpartPlot(fit)

Okay, I can’t even see what’s going on here, but with that much subsetting and mining for tiny nuggets of truth, how could we go wrong? Let’s make a sub from this model and get to the top of the leaderboard!

Even our simple gender model did better! What went wrong? Welcome to overfitting.

Overfitting is technically defined as a model that performs better on a training set than another simpler model, but does worse on unseen data, as we saw here. We went too far and grew our decision tree out to encompass massively complex rules that may not generalize to unknown passengers. Perhaps that 34 year old female in third class who paid $20.17 for a ticket from Southampton with a sister and mother aboard may have been a bit of a rare case after all.

The point of this exercise was that you must use caution with decision trees. While this particular tree may have been 100% accurate on the data that you trained it on, even a trivial tree with only one rule could beat it on unseen data. You just overfit big time!

Use caution with decision trees, and any other algorithm actually, or you can find yourself making rules from the noise you’ve mistaken for signal!
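
You can check for overfitting before spending a Kaggle submission by holding back some rows yourself. Here is a minimal sketch, again on the kyphosis stand-in data that ships with rpart. The 20-row holdout, the seed and the accuracy helper are my assumptions for illustration, and the held-out numbers will change with the split:

```r
library(rpart)

set.seed(42)  # arbitrary seed; a different split gives different numbers
holdout    <- sample(nrow(kyphosis), 20)
train_part <- kyphosis[-holdout, ]
test_part  <- kyphosis[holdout, ]

# A tree grown with the default limits...
default_fit <- rpart(Kyphosis ~ Age + Number + Start,
                     data = train_part,
                     method = "class")

# ...and one with the limits removed, as we did above.
full_fit <- rpart(Kyphosis ~ Age + Number + Start,
                  data = train_part,
                  method = "class",
                  control = rpart.control(minsplit = 2, cp = 0))

# Share of rows where the predicted class matches the true class.
accuracy <- function(fit, data) {
  mean(predict(fit, data, type = "class") == data$Kyphosis)
}

data.frame(tree     = c("default", "full"),
           train    = c(accuracy(default_fit, train_part),
                        accuracy(full_fit, train_part)),
           held_out = c(accuracy(default_fit, test_part),
                        accuracy(full_fit, test_part)))
```

The unrestricted tree should score at least as well on the rows it was trained on. The held_out column is the one to trust, and a large gap between the two columns is the warning sign.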

Before moving on, I encourage you to have a play with the various control parameters we saw in the rpart.control help file. Perhaps you can find a tree that does a little better by either growing it out further, or reining it in. You can also manually trim trees in R with these commands:

> fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
               data=train,
               method="class",
               control=rpart.control( your controls ))
> new.fit <- prp(fit,snip=TRUE)$obj
> fancyRpartPlot(new.fit)

An interactive version of the decision tree will appear in the plot tab where you simply click on the nodes that you want to kill. Once you’re satisfied with the tree, hit “quit” and it will be stored to the new.fit object. Try to look for overly complex decisions being made, and kill the nodes that appear to go too far.
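
If you’d rather not click through the plot, rpart can also trim a tree for you. The sketch below is an alternative to the interactive snipping, not the same technique: it grows an unrestricted tree on the kyphosis stand-in data, reads the cross-validated error from the fit’s cptable, and prunes back to the cp value with the lowest xerror. The dataset, the seed and the count_leaves helper are my assumptions for illustration:

```r
library(rpart)

set.seed(1)  # rpart's internal cross-validation is random; the seed is arbitrary

full_fit <- rpart(Kyphosis ~ Age + Number + Start,
                  data = kyphosis,
                  method = "class",
                  control = rpart.control(minsplit = 2, cp = 0))

# Each row of cptable is a candidate tree size with its cross-validated error.
cp_table <- full_fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cut the tree back to the size that cross-validated best.
pruned_fit <- prune(full_fit, cp = best_cp)

# Count the leaf nodes before and after pruning.
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(full = count_leaves(full_fit), pruned = count_leaves(pruned_fit))
```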

Next lesson, we will push the envelope further by introducing some feature engineering concepts. Go there now!

All code from this tutorial is available on my GitHub repository.