Armchair Particle Physicist

The Higgs Boson Machine Learning Challenge was probably the most intense competition I’ve worked on, and most likely the one I’m most proud of my solution for too. Sure, there’s a lot that I could have done better, but that’s why I work on Kaggle competitions: I want to keep improving how effectively I work this ML-mojo, and to that end I feel that I’m getting there, but I still have plenty to learn. And that’s a great thing.

I felt the glow of the top 10 for a few days towards the end of the competition. To have hit that point on the most popular competition ever run on Kaggle (Titanic aside) is a very nice thing to have experienced; I highly recommend it. But in the end, I probably overfit to the public LB a little bit, and my time ran out before I could feel truly comfortable with my cut-off choices. Not to take away from the winners, but I also believe there was a pretty hefty element of luck in this comp if you focused on improving your model rather than squaring away any cross-validation instability. Take a look at the massive fluctuations across a few folds of one of my models’ CV below and you can perhaps imagine my worries about where I would end up.

I am very impressed with the Kagglers that managed to CV this competition to a high degree of certainty. Some on the forums were talking about 40-70 folds to get a good idea of their model cut-off errors. I generally work comps locally on my little 4-core 8GB laptop, and at 12-36 hours run time per submission on this competition, I really couldn’t afford much more rigor on my already expensive models.

So anyhow, here’s a high-level look at what I did for Higgs, and how I briefly approached the top of the leaderboard… It basically comes down to my new favorite feature engineering pipeline:

  • Treated PRI_jet_num as categorical from the get-go
  • Automated feature engineering:
    • Normalized all variables, applied +, -, * operators
    • Normalized all variables, took their absolute value and summed
    • Raw momentum/mass variables, applied +, -, *, / operators
    • Raw angle-type variables, applied +, - operators
    • Raw momentum/mass variables, *, / by number of jets
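
To make the recipe concrete, here’s a minimal sketch of this style of pairwise feature generation in Python with numpy (the column names and operator set are illustrative, not my exact production code):

```python
import numpy as np

def pairwise_features(X, names, ops=('+', '-', '*')):
    """Generate pairwise interaction features over the upper triangle."""
    funcs = {'+': np.add, '-': np.subtract,
             '*': np.multiply, '/': np.divide}
    cols, new_names = [], []
    n_cols = X.shape[1]
    for i in range(n_cols):
        for j in range(i + 1, n_cols):    # upper triangle: each pair once
            for op in ops:
                cols.append(funcs[op](X[:, i], X[:, j]))
                new_names.append(f'{names[i]}_{op}_{names[j]}')
    return np.column_stack(cols), new_names
```

With 33 columns and three operators, the upper triangle alone yields 33 × 32 / 2 × 3 = 1,584 candidate features, which is exactly why a pruning pipeline was necessary.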

I rolled these variables out incrementally over the two months I worked on Higgs. Obviously I had a dimensionality problem here. With 33 variables, that’s several hundred new variables each time you look at a new set, most of which are garbage. So, here’s the pipeline I used to throw out the chaff:

  • Generate the automated variables for the upper triangle.
  • Calculate the Spearman correlation (I was using tree-based models) between each pair of auto-variables and throw out those that are greater than 99% correlated with any other auto-variable.
  • For each variable set, run SKLearn’s GBM with 200 trees, depth of 6, and learning rate of 0.05 over two folds.
  • Note the variables whose importance is greater than 1% of the top variable’s for the fold.
  • Keep only variables that meet this criterion for both folds.
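
The correlation-pruning step can be sketched like this (pure numpy; the 0.99 threshold is from the text, the rest is illustrative). Spearman correlation is just Pearson correlation computed on the rank-transformed columns, so no extra library is needed:

```python
import numpy as np

def spearman_prune(X, threshold=0.99):
    """Return indices of columns to keep, dropping any column whose
    absolute Spearman correlation with an already-kept column exceeds
    the threshold."""
    # double argsort gives 0-based ranks per column (ties are not
    # averaged as in true Spearman, fine for continuous features)
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    rho = np.corrcoef(ranks, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(rho[j, k]) <= threshold for k in keep):
            keep.append(j)
    return keep
```

Greedy first-come-first-kept ordering is a judgment call; it simply keeps the earliest-generated of any highly correlated pair.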

It was my hope that this would get rid of spurious variables. I think I probably could have cut deeper as the 1% thresholds are clearly arbitrarily chosen. This would have saved some training time for the next phases too… And thus I could have CV’d harder. Oh well.

Now, I created a new dataset with all the surviving variables from each auto-variable set. And then…

  • Calculate the Spearman correlation for the new dataset and throw out the 1% most correlated (between rather than within variable sets this time).
  • Run a GBM with 500 trees, depth of 6, and learning rate of 0.05 over two folds.
  • Note the variables whose importance is greater than 0.5% of the top variable’s for the fold.
  • Keep only variables that meet this criterion for both folds.

After this process I usually had a list of 250-300 variables to run with depending on which sets I was combining. Here’s a glance at the top 10 variables from one fold. Note that the scaled interactions are simply marked with the operator, while raw quantities explicitly call out that fact with ‘/raw’ for example:

feature importance
PRI_tau_phi_-_PRI_met_phi 0.029825
DER_mass_MMC 0.029003
DER_mass_transverse_met_lep_/raw_PRI_tau_pt 0.023429
DER_mass_vis_abs_PRI_lep_eta 0.022919
DER_mass_vis_abs_PRI_tau_eta 0.021729
PRI_tau_eta_*_PRI_jet_leading_eta 0.019790
DER_mass_transverse_met_lep_-_PRI_tau_pt 0.019458
DER_mass_vis 0.018423
DER_pt_ratio_lep_tau_-_PRI_lep_pt 0.017572
DER_mass_vis_*_PRI_met_sumet 0.016613

I am no high-energy physicist, but perhaps someone out there may care to comment as to why these might have been important for discriminating background from signal. As should be truly evident by now, I cared not why, just that it was working.

Glancing at my CV and LB scores, I’m thinking that most of the bump I saw was from the normalized variables. Why? Well, at a guess, trees are greedy and make decisions at each node without looking forward or back, so they will miss the importance of interactions, which are never inspected at any single node. Generating scaled interaction variables seems to me to go hand-in-hand with tree-based learning: “Is this variable ‘big’ and that one ‘small’?” These automated variables pick that right up in a single node, where you might waste two or more splits to get there in a raw tree. It worked nicely for this comp and I’ll be keeping it in my toolkit for the next one.
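
A toy demonstration of why this helps, with entirely synthetic data: when the class depends on a comparison between two variables, no single-variable split can separate it cleanly, but a stump on the difference feature can:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1000)
y = rng.uniform(0, 1, 1000)
label = (x > y).astype(int)   # class is purely a comparison of x and y

def best_stump_accuracy(feature, label):
    """Best accuracy any single threshold split on one feature achieves."""
    best = 0.0
    for t in np.unique(feature):
        for pred in (feature > t, feature <= t):
            best = max(best, float((pred.astype(int) == label).mean()))
    return best

print(best_stump_accuracy(x, label))      # stuck well below 1.0
print(best_stump_accuracy(x - y, label))  # a single split is perfect: 1.0
```

The raw feature tops out around 75% accuracy for a single split, while the engineered difference gets there in one node.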

The rest of my pipeline was pretty standard and rested on XGBoost's shoulders with a 4-fold CV that was clearly too loose for this evaluation metric. I experimented with a fairly wide range of hyper-parameters, but it was hard to discern any significant difference between them due to the CV instabilities.

Anyhow, after Higgs ended I finally hit the top 1000 of Kagglers and scored my third top 10%. The next milestone is quite obvious now, the Kaggle Master golden jersey must be mine!

See you on the leaderboards.

Catch-22: Visualized

Find the app here.

The Dataset

Catch-22 by Joseph Heller is my favorite novel. I recently finished reading it (again) and love the creative use of language and the ridiculous characters’ interactions throughout the book. For my visualization class, it was an easy choice to select the text as my final project ‘dataset’. The text has around 175,000 words, divided into 42 chapters. I found a raw text version of the book on the web and got to work.

I parsed the text in Python using a combination of regexes and simple string matches. The code for the IPython Notebook I built may be viewed here.

The script first extracts the text from each chapter, converted to lower-case and punctuation-stripped. I then loop through a list of the main characters who appear more than 50 times and look for mentions of their names. I defined the ‘time’ of each appearance as the percentile of the chapter, i.e. a character that occurs at the 10th percentile of chapter three would be stored as 3.10. This dataset formed the basis for both the ‘Character Appearances’ and ‘Character Co-Occurrences’ plots.
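
A rough sketch of that encoding in Python (the chapter list and character name are stand-ins for the real parsed text, and punctuation handling is omitted):

```python
def appearance_times(chapters, character):
    """Encode each mention as chapter.percentile: a mention at the 10th
    percentile of chapter three is stored as 3.10."""
    times = []
    for number, text in enumerate(chapters, start=1):
        words = text.lower().split()
        for i, word in enumerate(words):
            if word == character.lower():
                percentile = int(100 * i / len(words))
                times.append(number + percentile / 100)
    return times
```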

A similar process was used to extract the occurrences of the various Mediterranean locations mentioned in the book. I found a fan site where all of the locations were listed, and searched for these names in the text. Once identified, the cities were geocoded using the geopy package in Python. This dataset was used for the ‘Mediterranean Travels’ visualization.

The final dataset was created by scanning the chapters for mentions of Yossarian. A window of the 25 words either side of these mentions was then collected and tagged with the chapter locations where they occurred. The nltk package in Python was used to both clean the word list of stopwords and then do part-of-speech tagging on the remainder. This dataset formed the basis of the ‘Characteristic Words’ visualization.
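
The window extraction can be sketched like so (Python; the stopword set here is a tiny illustrative stand-in for nltk’s list):

```python
import re

STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'in'}  # tiny stand-in list

def context_windows(text, target='yossarian', width=25):
    """Collect the words within `width` positions either side of each
    mention of `target`, minus the target itself and stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    windows = []
    for i, word in enumerate(words):
        if word == target:
            neighbourhood = words[max(0, i - width):i + width + 1]
            windows.append([w for w in neighbourhood
                            if w != target and w not in STOPWORDS])
    return windows
```

Each returned window would then be tagged with its chapter location before the part-of-speech step.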

I visualized these data sets interactively with shiny in R. You can find the code at my GitHub repository. Wherever possible, I linked the interactivity between plots so that you can zoom into areas of interest.

Mediterranean Travels

This visualization maps the Mediterranean locations that are mentioned throughout the book. Aesthetically, I was very happy with the final product: the low-resolution border data from the maps package had an unexpected synergy with the original text, namely the jagged outline of a jumping soldier (Yossarian) on the cover. Hence I stuck with the low-quality map data, as I really liked the effect.

Clearly there is a temporal element to these journeys too; I indicate time by applying a linear fade in transparency, which gives a hint as to the sequence of the mentions. The edges are plotted along their great circles, which was really quite easy to implement using the geosphere package in R.

I feel that the visualization gives information about the book that may not have been totally clear during its reading. While I vividly remembered Milo’s frantic trips about various merchant ports in Chapter 22, it is extremely interesting to see exactly how frantic it was, and feel sorry for Yossarian and Orr who were dragged along and deprived of sleep for days.

Character Appearances

This plot essentially represents a time series of when different characters were mentioned in the book. It was inspired by the ‘Character Mentions’ plot from Jeff Clark’s Novel Views: Les Misérables series.

I plotted the data as a standard scatterplot with chapter as x-axis (since it is similar to time) and the characters as a discrete y-axis with a vertical bar as the marker.

I feel that every pixel in this visualization works hard. The lack of a ‘dot’ speaks just as loudly as its presence, especially for the prominent characters in the novel. I’m very happy to see patterns such as chapters heavy with their name-sake, and others almost devoid of the character they are named after. It was also interesting to see how some main characters really only showed up occasionally in the book, even if they left a lasting impression on the reader.

Character Co-Occurrences

This plot attempts to show which characters spend a lot of time together throughout the book. It was inspired by the Les Misérables Co-occurrence plot by Mike Bostock.

The data used to build this visualization was exactly the same as that used in the previous one, but required substantial transformation to get it into a form that could represent these patterns. Many different methods were examined to find when characters were co-located, but the one I settled on was simply whether two characters were both present in a chapter. Looking for when characters were prominent (i.e. both mentioned many times) ended up creating an extremely lop-sided view with more frequently appearing characters totally dominating the data. In fact, the co-location cell for Yossarian with himself was orders of magnitude higher than any other when other approaches were used.

Clustering added another dimension to this plot. A hierarchical clustering scheme was applied over the entire book to try and find communities among the characters. Again, presence in a chapter (1 for present, 0 for not) was used, and a 42-dimensional Euclidean distance was used to cluster the characters with a complete-link AGNES algorithm. Manual inspection of the dendrograms for different clustering schemes and distance measures found that this was the most ‘level’, in that more frequently appearing characters dominated the scheme the least. Here’s the dendrogram for six clusters:

When the user chooses to colour the plot by clusters, cells for the co-location of characters sharing the same cluster are filled with a unique colour while those cells showing co-location of characters from different communities are shaded grey. It should be noted that clustering is performed over the entire text, not the chapters being zoomed into by the user of the application. I felt that changing the clustering dynamically would be too distracting.
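
For the curious, the clustering step can be sketched in pure Python (a bare-bones complete-link agglomeration over binary presence vectors; the real analysis used R’s AGNES, so treat this purely as an illustration of the idea):

```python
import math

def complete_link_clusters(vectors, k):
    """Agglomerate items into k clusters; at each step merge the two
    clusters whose *farthest* pair of members is closest (complete link)."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # cluster-to-cluster distance = max pairwise distance
                d = max(dist(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

Here each character would be a 42-long 0/1 presence vector, and stopping at six clusters corresponds to cutting the dendrogram described above.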

Alphabetic or frequency sorting ‘explodes’ the clusters into an unrecognizable space, but sorting by cluster brings them into their tight communities and lets the viewer see some of the interaction between clusters too.

My encoding of the co-location and mapping of shades applied to each cell would certainly be open to debate, and other methods of clustering result in very different communities being found. That said, qualitatively, I spent a lot of time evaluating the results with my own knowledge of the text and found that the current implementation was more satisfactory than any other I tested.

I found it incredibly interesting that every main character in the book at some time interacts with almost every other character. I wouldn’t have expected so much overlap. Compared to Les Mis, the plot is much denser; I suspect this is due to the 10-fold difference in the number of chapters being clustered over.

Characteristic Words

This plot is probably the most conventional of the four plots, but potentially shows a lot of insights into the text all the same. As the word window surrounded mentions of Yossarian, I hoped that it would capture some of the changing emotions that our main character was feeling at different points in the book.

I used nltk's simplified POS tagging to get the richest binning of word types, and to allow the user to select the part of speech they want to examine. Only the top few words in the word-type are shown, but this is recalculated as a chapter range is zoomed into. In this way, the words change as the user explores.

The word frequencies were normalized by the number of mentions Yossarian received in a given chapter. For example, if the word ‘war’ occurred 20 times in chapter 1, and Yossarian received 40 mentions in that chapter, a value of 0.5 would be assigned to that word. Thus, chapters with exceedingly high mentions of our main character would not receive a bias and more important words would rise to the top.
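
That normalization is a one-liner, shown here with the worked example from the text (Python; the word lists are made up):

```python
from collections import Counter

def chapter_word_scores(window_words, mentions):
    """Normalize each word's count in a chapter by the chapter's number
    of Yossarian mentions, so mention-heavy chapters aren't favoured."""
    counts = Counter(window_words)
    return {word: count / mentions for word, count in counts.items()}

# the worked example from the text: 'war' 20 times, 40 mentions -> 0.5
scores = chapter_word_scores(['war'] * 20 + ['home'] * 4, mentions=40)
print(scores['war'])   # 0.5
```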

I give the option of choosing either a stacked bar or a stacked area plot for this visualization. I like how the stacked area plot better shows consecutive chapters where a word was prominent, but acknowledge that the triangular forms can distort the relationship when there is high variability between chapters. I do quite like the aesthetic of the landscape it presents though.

The dynamic recalculation of top words makes the plot different for every setting, and can give a wide range of insights depending on the selections. While many of the top words are uninteresting, and some are even misclassified by nltk, there were several times when I found some very interesting words being prominent.

Conclusion

I learnt a lot through this process, both in terms of using shiny, as well as about the book itself. I’m extremely happy with the diverse visualizations that I created here, and hope that they are able to help others appreciate the book even more!

Introducing the Kaggle Rank-O-Tron

I had a rare couple of days with not (too) much work this past weekend. I had been sandbagging an idea of visualizing the entire Kaggle leaderboard for some time, instead of just user rankings within a single competition.

I began by scraping the Kaggle community pages using BeautifulSoup in Python; it was good to refresh my HTML scraping and regex skills to get all the data I wanted. Once implemented, this was a fairly trivial exercise and the whole process completes in around 15 minutes. Unfortunately though, the user tiers (Kaggler, Master) are not encoded in these pages, and I wanted to show where the Kaggle elite stood on the curve as well. So I had to implement a second pass where I visited each user’s profile page and extracted their tier there; visiting thousands of pages this way takes over 10 hours.

My Data Visualization class led by the talented Sophie Engle has equipped me with a lot of fun new toys to play with, most notably Shiny from RStudio. Thus a mere graph and static blog comment was no longer enough for me. Instead, I decided to make an interactive tool that can show you (approximately) where you will rank in the future given an outcome of a pending competition deadline.

The Kaggle Rank-O-Tron is now live, and is hosted on one of RStudio’s beta servers. You can input your current Kaggle points and where you expect to finish in your next comp, and it will then show you roughly how you will progress.

This is based solely on the equation used by Kaggle to calculate your points. Right now, there is no consideration of several factors that can jostle the users around you. But as I update the database after each competition ends, I’ll try to look for ways to improve the predictions using, well, machine learning of course.

I hope you have some fun with it, but remember, submissions come first!

Titanic: Getting Started With R - Addendum & Chocolate

One of our MSAN professors, Nick Ross, just loves his trivia. Each time we have our Business Strategies class we get a little dose of fun facts at half-time, and last week we learnt that Milton S. Hershey, the founder of the famous chocolate company, had paid a pretty handsome deposit to board the Titanic with his wife (FWIW, I’m more of a Cadbury guy, being Australian and all… but still, chocolate is chocolate).

As it turned out, he never did board the unsinkable ship, giving up a deposit worth several thousand dollars in today’s money that he had paid for a plush first class cabin. Given the pretty remarkable coincidence with my recent posts, I thought it might be fun to see what the conditional inference tree model predicted would have happened to him if he had ended up on board.

So, what do we know? Well, we have the name “Milton S. Hershey”, or in the dataset’s terms, “Hershey, Mr. Milton S.”. We also know that he booked a first class cabin for himself and his wife and paid a $300 deposit:

Let’s assume that this was a 50% down-payment for two tickets, so $300 could be used for his fare. A little wikipedia tells us he was born on September 13, 1857, which would mean he was 54 years old when the ship left port on April 10, 1912. He was also trying to get from England to the US, so let’s assume that he was sailing from Southampton, though I have been unable to find the exact port he was planning to embark at.

Since we never used the ticket number, or cabin number, for our predictions, we can just leave these as NA values. So let’s build a special Hershey dataframe and combine it to the combi dataframe we built in the tutorial (before we transformed it to build the engineered variables):

Hershey <- data.frame(Pclass=1, Sex='male', Age=54, SibSp=1, Parch=0, 
                      Fare=300, Embarked='S',
                      PassengerId=NA, Survived=NA,
                      Name='Hershey, Mr. Milton S.', Ticket=NA, Cabin=NA)

combi <- rbind(train, test, Hershey)


Okay. So now we run through the rest of the tutorial and make our engineered variables, but this time, when we split it back up into the train and test sets, we also break out the Hershey dataset:

Hershey <- combi[1310,]


We then train our model as before, and finally make our prediction on whether he was a lucky guy or not:

predict(fit, Hershey, OOB=TRUE, type = "response")
[1] 0


Oh dear! Imagine a world without kisses!

So, sadly, our model tells us that Hershey would have perished in the Titanic disaster. Perhaps you would like to dig into whether some of the other famous people who were almost aboard the famous boat would have escaped or not?

Titanic: Getting Started With R - Part 5: Random Forests

Tutorial index

Seems fitting to start with a definition,

en·sem·ble

A unit or group of complementary parts that contribute to a single effect, especially:

  1. A coordinated outfit or costume.
  2. A coordinated set of furniture.
  3. A group of musicians, singers, dancers, or actors who perform together.

While I won’t be teaching about how to best coordinate your work attire or living room, I think the musician metaphor works here. In an ensemble of talented instrumentalists, the issues one might have with an off-note are overpowered by the others in the group.

The same goes for machine learning. Take a large collection of individually imperfect models, and their one-off mistakes are probably not going to be made by the rest of them. If we average the results of all these models, we can sometimes find a superior model from their combination than any of the individual parts. That’s how ensemble models work, they grow a lot of different models, and let their outcomes be averaged or voted across the group.

We are now well aware of the overfitting problems with decision trees. But if we grow a whole lot of them and have them vote on the outcome, we can get past this limitation. Let’s build a very small ensemble of three simple decision trees to illustrate:

Each of these trees makes its classification decisions based on different variables. So let’s imagine a female passenger from Southampton who rode in first class. Trees one and two would vote that she survived, but tree three votes that she perishes. If we take a vote, it’s 2 to 1 in favour of her survival, so we would classify this passenger as a survivor.
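
That vote can be sketched in a few lines of Python (the three ‘trees’ are hard-coded stand-ins for the simple stumps described above):

```python
def tree_one(p):    # splits on sex
    return 1 if p['sex'] == 'female' else 0

def tree_two(p):    # splits on passenger class
    return 1 if p['pclass'] == 1 else 0

def tree_three(p):  # splits on port of embarkation
    return 1 if p['embarked'] == 'C' else 0

def majority_vote(passenger, trees):
    votes = [tree(passenger) for tree in trees]
    return 1 if sum(votes) > len(votes) / 2 else 0

# the female, first-class passenger from Southampton: votes are 1, 1, 0
p = {'sex': 'female', 'pclass': 1, 'embarked': 'S'}
print(majority_vote(p, [tree_one, tree_two, tree_three]))   # 1 (survived)
```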

Random Forest models grow trees much deeper than the decision stumps above, in fact the default behaviour is to grow each tree out as far as possible, like the overfitting tree we made in lesson three. But since the formulas for building a single decision tree are the same every time, some source of randomness is required to make these trees different from one another. Random Forests do this in two ways.

The first trick is to use bagging, short for bootstrap aggregating. Bagging takes a randomized sample of the rows in your training set, with replacement. This is easy to simulate in R using the sample function. Let’s say we wanted to perform bagging on a training set with 10 rows.

> sample(1:10, replace = TRUE)
 [1]  3  1  9  1  7 10 10  2  2  9


In this simulation, we would still have 10 rows to work with, but rows 1, 2, 9 and 10 are each repeated twice, while rows 4, 5, 6 and 8 are excluded. If you run this command again, you will get a different sample of rows each time. On average, around 37% of the rows will be left out of the bootstrapped sample. With these repeated and omitted rows, each decision tree grown with bagging would evolve slightly differently. If you have very strong features such as gender in our example though, that variable will probably still dominate the first decision in most of your trees.
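
The roughly 37% figure is easy to verify by simulation (in Python here for illustration); it approaches 1/e ≈ 0.368 as the number of rows grows:

```python
import random

def oob_fraction(n_rows, n_trials=2000, seed=42):
    """Average fraction of rows left out of a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # sample n_rows row indices with replacement; the set keeps
        # only the distinct rows that made it into the sample
        sampled = {rng.randrange(n_rows) for _ in range(n_rows)}
        total += (n_rows - len(sampled)) / n_rows
    return total / n_trials

print(oob_fraction(100))   # hovers around (1 - 1/100)**100, i.e. ~0.366
```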

The second source of randomness gets past this limitation though. Instead of looking at the entire pool of available variables, Random Forests take only a subset of them, typically the square root of the number available. In our case we have 10 variables, so using a subset of three variables would be reasonable. The selection of available variables is changed for each and every node in the decision trees. This way, many of the trees won’t even have the gender variable available at the first split, and might not even see it until several nodes deep.
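
Per-node feature subsetting is equally simple to sketch (Python; the feature names are just our Titanic variables for flavour):

```python
import math
import random

def candidate_features(all_features, rng=random):
    """Draw roughly sqrt(p) features to consider at one split."""
    k = max(1, round(math.sqrt(len(all_features))))
    return rng.sample(all_features, k)

features = ['pclass', 'sex', 'age', 'sibsp', 'parch',
            'fare', 'embarked', 'title', 'familysize', 'familyid2']
print(candidate_features(features))   # e.g. a random trio of the ten
```

A fresh draw like this happens at every node, which is what keeps a dominant variable such as sex from heading every tree.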

Through these two sources of randomness, the ensemble contains a collection of totally unique trees which all make their classifications differently. As with our simple example, each tree is called to make a classification for a given passenger, the votes are tallied (with perhaps many hundreds, or thousands of trees) and the majority decision is chosen. Since each tree is grown out fully, they each overfit, but in different ways. Thus the mistakes one makes will be averaged out over them all.

R’s Random Forest algorithm has a few restrictions that we did not have with our decision trees. The big one has been the elephant in the room until now: we have to clean up the missing values in our dataset. rpart has a great advantage in that it can use surrogate variables when it encounters an NA value. In our dataset there are a lot of age values missing. If any of our decision trees split on age, the tree would search for another variable that split in a similar way to age, and use it instead. Random Forests cannot do this, so we need to find a way to manually replace these values.

A method we implicitly used in part 2 when we defined the adult/child age buckets was to assume that all missing values were the mean or median of the remaining data. Since then we’ve learned a lot of new skills though, so let’s use a decision tree to fill in those values instead. Let’s pick up where we left off last lesson, and take a look at the combined dataframe’s age variable to see what we’re up against:

> summary(combi$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.17   21.00   28.00   29.88   39.00   80.00     263


263 values out of 1309 were missing this whole time; that’s a whopping 20%! A few new pieces of syntax to use. Instead of subsetting by boolean logic, we can use the R function is.na(), and its negation !is.na() (the bang symbol represents ‘not’). This subsets on whether a value is missing or not. We now also want to use the method="anova" version of our decision tree, as we are no longer trying to predict a category, but a continuous variable. So let’s grow a tree on the subset of the data with the age values available, and then replace those that are missing:

Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,
                data=combi[!is.na(combi$Age),], method="anova")
combi$Age[is.na(combi$Age)] <- predict(Agefit, combi[is.na(combi$Age),])


I left off the family size and family IDs here as I didn’t think they’d have much impact on predicting age. You can go ahead and inspect the summary again, all those NA values are gone.

Let’s take a look at the summary of the entire dataset now to see if there are any other problem variables that we hadn’t noticed before:

summary(combi)


Two jump out as a problem, though nowhere near as bad as Age: Embarked and Fare are both lacking values, in two different ways.

> summary(combi$Embarked)
      C   Q   S
  2 270 123 914


Embarked has a blank for two passengers. While a blank wouldn’t be a problem for our model like an NA would be, since we’re cleaning anyhow, let’s get rid of it. Because it’s so few observations and such a large majority boarded in Southampton, let’s just replace those two with ‘S’. First we need to find out who they are though! We can use which for this:

> which(combi$Embarked == '')
[1]  62 830


This gives us the indexes of the blank fields. Then we simply replace those two, and encode it as a factor:

combi$Embarked[c(62,830)] <- "S"
combi$Embarked <- factor(combi$Embarked)


The other naughty variable was Fare, let’s take a look:

> summary(combi$Fare)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   7.896  14.450  33.300  31.280 512.300       1


It’s only one passenger with an NA, so let’s find out which one it is and replace it with the median fare:

> which(is.na(combi$Fare))
[1] 1044
> combi$Fare[1044] <- median(combi$Fare, na.rm=TRUE)


Okay. Our dataframe is now cleared of NAs. Now on to restriction number two: Random Forests in R can only digest factors with up to 32 levels. Our FamilyID variable had almost double that. We could take two paths forward here: either change these levels to their underlying integers (using the unclass() function) and have the tree treat them as continuous variables, or manually reduce the number of levels to keep it under the threshold.

Let’s take the second approach. To do this we’ll copy the FamilyID column to a new variable, FamilyID2, and then convert it from a factor back into a character string with as.character(). We can then increase our cut-off to be a “Small” family from 2 to 3 people. Then we just convert it back to a factor and we’re done:

combi$FamilyID2 <- combi$FamilyID
combi$FamilyID2 <- as.character(combi$FamilyID2)
combi$FamilyID2[combi$FamilySize <= 3] <- 'Small'
combi$FamilyID2 <- factor(combi$FamilyID2)


Okay, we’re down to 22 levels so we’re good to split the test and train sets back up as we did last lesson and grow a Random Forest. Install and load the package randomForest:

install.packages('randomForest')
library(randomForest)


Because the process has the two sources of randomness that we discussed earlier, it is a good idea to set the random seed in R before you begin. This makes your results reproducible next time you load the code up, otherwise you can get different classifications for each run.

set.seed(415)


The number inside isn’t important, you just need to ensure you use the same seed number each time so that the same random numbers are generated inside the Random Forest function.

Now we’re ready to run our model. The syntax is similar to decision trees, but there’s a few extra options.

fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize +
                    FamilyID2, data=train, importance=TRUE, ntree=2000)


Instead of specifying method="class" as with rpart, we force the model to predict our classification by temporarily changing our target variable to a factor with only two levels using as.factor(). The importance=TRUE argument allows us to inspect variable importance as we’ll see, and the ntree argument specifies how many trees we want to grow.

If you were working with a larger dataset you may want to reduce the number of trees, at least for initial exploration, or restrict the complexity of each tree using nodesize as well as reduce the number of rows sampled with sampsize. You can also override the default number of variables to choose from with mtry, but the default is the square root of the total number available and that should work just fine. Since we only have a small dataset to play with, we can grow a large number of trees and not worry too much about their complexity, it will still run pretty fast.

So let’s look at what variables were important:

varImpPlot(fit)



Remember with bagging how roughly 37% of our rows would be left out? Well, a Random Forest doesn’t just waste those “out-of-bag” (OOB) observations; it uses them to see how well each tree performs on unseen data. It’s almost like a bonus test set to determine your model’s performance on the fly.

There are two types of importance measures shown above. The accuracy one tests how much worse the model performs without each variable, so a high decrease in accuracy would be expected for very predictive variables. The Gini one digs into the mathematics behind decision trees, but essentially measures how pure the nodes are at the end of the tree: it tallies the decrease in node impurity from each variable’s splits, and again a high score means the variable was important.

Unsurprisingly, our Title variable was at the top for both measures. We should be pretty happy to see that the remaining engineered variables are doing quite nicely too. Anyhow, enough delay, let’s see how it did!

The prediction function works similarly to decision trees, and we can build our submission file in exactly the same way. It will take a bit longer though, as all 2000 trees need to make their classifications and then discuss who’s right:

Prediction <- predict(fit, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "firstforest.csv", row.names = FALSE)



Hrmm, well this actually worked out exactly the same as Kaggle’s Python random forest tutorial. I wouldn’t take that as the expected result from any forest though; this may just be pure coincidence. Its relatively poor performance does go to show that on smaller datasets, sometimes a fancier model won’t beat a simple one. Besides that, there’s also the private leaderboard, as only 50% of the test data is evaluated for our public scores.

But let’s not give up yet. There’s more than one ensemble model. Let’s try a forest of conditional inference trees. They make their decisions in slightly different ways, using a statistical test rather than a purity measure, but the basic construction of each tree is fairly similar.

So go ahead and install and load the party package.

install.packages('party')
library(party)


We again set the seed for consistent results and build a model in a similar way to our Random Forest:

set.seed(415)
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID,
               data = train, controls=cforest_unbiased(ntree=2000, mtry=3))


Conditional inference trees are able to handle factors with more levels than Random Forests can, so let’s go back to our original version of FamilyID. You may have also noticed a few new arguments. Now we have to specify the number of trees inside a more complicated command, as arguments are passed to cforest differently. We also have to manually set the number of variables to sample at each node, as the default of 5 is pretty high for our dataset. Okay, let’s make another prediction:

Prediction <- predict(fit, test, OOB=TRUE, type = "response")


The prediction function requires a few extra nudges for conditional inference forests, as you can see. Let’s write a submission file and submit it!

Congratulations! At the time of writing you are now in the top 5% of a Kaggle competition!

You’ve come a long way, from the bottom of the Kaggle leaderboard to the top! There may be a few more insights to wring from this dataset yet though. We never did look at the ticket or cabin numbers, so take a crack at extracting some insights from them to see if any more gains are possible. Maybe extracting the cabin letter (deck) or number (location) and extrapolating to the rest of the passengers’ family if they’re missing might be worth a try?
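To get you started on that cabin idea, here is a hypothetical sketch (the toy values and variable names are mine, not from the tutorial) of pulling the deck letter out of a Cabin-style column:

```r
# Hypothetical sketch of the cabin suggestion above: take the first character
# of Cabin as the deck, treating blanks as missing
cabins <- c("C85", "", "E46", "C123")   # toy values in the Cabin column's style
deck <- substr(cabins, 1, 1)
deck[deck == ""] <- NA                  # blank cabins become missing
deck <- factor(deck)
table(deck)
```

From there you could try extrapolating the deck to family members with missing cabins, as suggested above.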

While there’s just no way that I could introduce all the R syntax you’ll need to navigate dataframes in different situations, I hope that I’ve given you a good start. I linked to some good R guides way back in the introduction that should help you learn more, here and here, as well as an excellent book The Art of R Programming: A Tour of Statistical Software Design to continue learning about programming in R.

Well that’s it for the tutorial series. I really hope that you can exceed the benchmark I’ve posted here. If you find some new ideas that develop the base that I’ve presented, be sure to contribute back to the community through the Kaggle forums, or comment on the blog.

I hope you found the tutorials interesting and informative, and that they gave you a taste for machine learning that will spur you to compete in the prize-eligible Kaggle competitions! I hope to see you on the leaderboards out in the wild. Good luck and happy learning!

All code from this tutorial is available on my Github repository.

Titanic: Getting Started With R - Part 4: Feature Engineering

Tutorial index

Feature engineering is so important to how your model performs, that even a simple model with great features can outperform a complicated algorithm with poor ones. In fact, feature engineering has been described as “easily the most important factor” in determining the success or failure of your predictive model. Feature engineering really boils down to the human element in machine learning. How much you understand the data, with your human intuition and creativity, can make the difference.

So what is feature engineering? It can mean many things to different problems, but in the Titanic competition it could mean chopping, and combining different attributes that we were given by the good folks at Kaggle to squeeze a little bit more value from them. In general, an engineered feature may be easier for a machine learning algorithm to digest and make rules from than the variables it was derived from.

The initial suspects for extracting more machine learning mojo are the three text fields that we never sent into our decision trees last time. While the ticket number, cabin, and name were all unique to each passenger, perhaps parts of those text strings could be extracted to build a new predictive attribute. Let’s start with the name field. If we take a glance at the first passenger’s name we see the following:

> train$Name[1]
[1] Braund, Mr. Owen Harris
891 Levels: Abbing, Mr. Anthony Abbott, Mr. Rossmore Edward ... Zimmerman, Mr. Leo


Previously we have only accessed passenger groups by subsetting; now we access an individual by using the row number, 1, as an index instead. Okay, no one else on the boat had that name, that’s pretty much certain, but what else might they have shared? Well, I’m sure there were plenty of Mr’s aboard. Perhaps the person’s title might give us a little more insight.

If we scroll through the dataset we see many more titles, including Miss, Mrs, Master, and even the Countess! The title ‘Master’ is a bit outdated now, but back in those days it was reserved for unmarried boys. Additionally, the nobility, such as our Countess, would probably act differently to the lowly proletariat too. There seem to be quite a few possible patterns here that dig deeper than the combinations of age, gender, etc. that we looked at before.

In order to extract these titles to make new variables, we’ll need to perform the same actions on both the training and testing set, so that the features are available for growing our decision trees, and making predictions on the unseen testing data. An easy way to perform the same processes on both datasets at the same time is to merge them. In R we can use rbind, which stands for row bind, so long as both dataframes have the same columns as each other. Since we obviously lack the Survived column in our test set, let’s create one full of missing values (NAs) and then row bind the two datasets together:

test$Survived <- NA
combi <- rbind(train, test)


Now we have a new dataframe called combi with all the same rows as the original two datasets, stacked in the order in which we specified: train first, and test second.

If you look back at the output of our inquiry on Owen, his name is still encoded as a factor. As we mentioned earlier in the tutorial series, strings are automatically imported as factors in R, even if it doesn’t make sense. So we need to cast this column back into a text string. To do this we use as.character. Let’s do this and then take another look at Owen:

> combi$Name <- as.character(combi$Name)
> combi$Name[1]
[1] "Braund, Mr. Owen Harris"


Excellent, no more levels, now it’s just pure text. In order to break apart a string, we need some hooks to tell the program what to look for. Nicely, we see that there is a comma right after the person’s last name, and a full stop after their title. We can easily use the function strsplit, which stands for string split, to break apart our original name over these two symbols. Let’s try it out on Mr. Braund:

> strsplit(combi$Name[1], split='[,.]')
[[1]]
[1] "Braund"       " Mr"          " Owen Harris"


Okay, good. Here we have sent strsplit the cell of interest, and given it some symbols to choose from when splitting the string up: either a comma or a period. Those symbols in the square brackets form a regular expression, though a very simple one, and if you plan on working with a lot of text I would certainly recommend getting used to them!

We see that the title has been broken out on its own, though there’s a strange space before it begins because the comma occurred at the end of the surname. But how do we get to that title piece and clear out the rest of the stuff we don’t want? An index [[1]] is printed before the text portions. Let’s try to dig into this new type of container by appending all those square brackets to the original command:

> strsplit(combi$Name[1], split='[,.]')[[1]]
[1] "Braund"       " Mr"          " Owen Harris"


Getting there! strsplit returns a nested list rather than a simple vector because it can never be sure that a given regex will produce the same number of pieces for every string. If there were more commas or periods in the name, it would create more segments, so the pieces are hidden a level deeper rather than forced into the rectangular containers we are used to from spreadsheets, or now dataframes! Let’s go a level deeper into the indexing mess and extract the title. It’s the second item in this nested list, so let’s dig in to index number 2 of this new container:

> strsplit(combi$Name[1], split='[,.]')[[1]][2]
[1] " Mr"


Great. We have isolated the title we wanted at last. But how do we apply this transformation to every row of the combined train/test dataframe? Luckily, R has some extremely useful functions that apply more complicated functions one row at a time. As we had to dig into this container to get the title, simply running combi$Title <- strsplit(combi$Name, split='[,.]')[[1]][2] over the whole name vector would result in all of our rows having the same value of Mr., so we need to work a bit harder. Unsurprisingly, applying a function to a lot of cells in a dataframe or vector uses the apply suite of functions in R:

combi$Title <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})


R’s apply functions all work in slightly different ways, but sapply will work great here. We feed sapply our vector of names and our function that we just came up with. It runs through the rows of the vector of names, and sends each name to the function. The results of all these string splits are all combined up into a vector as output from the sapply function, which we then store to a new column in our original dataframe, called Title.

Finally, we may wish to strip off those spaces from the beginning of the titles. Here we can just substitute the first occurrence of a space with nothing. We can use sub for this (gsub would replace all spaces, poor ‘the Countess’ would look strange then though):

combi$Title <- sub(' ', '', combi$Title)
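The difference between sub and gsub is easy to see on the Countess example from the text:

```r
# sub() replaces only the first match; gsub() replaces every match
title <- " the Countess"
sub(' ', '', title)    # leading space stripped, inner space kept
gsub(' ', '', title)   # ALL spaces stripped, mangling the title
```

The first call gives "the Countess", the second "theCountess", which is why sub is the right tool here.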


Alright, we now have a nice new column of titles, let’s have a look at it:

> table(combi$Title)
        Capt          Col          Don         Dona           Dr     Jonkheer         Lady
           1            4            1            1            8            1            1
       Major       Master         Miss         Mlle          Mme           Mr          Mrs
           2           61          260            2            1          757          197
          Ms          Rev          Sir the Countess
           2            8            1            1


Hmm, there are a few very rare titles in here that won’t give our model much to work with, so let’s combine a few of the most unusual ones. We’ll begin with the French. Mademoiselle and Madame are pretty similar (so long as you don’t mind offending) so let’s combine them into a single category:

combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'


What have we done here? The %in% operator checks to see if a value is part of the vector we’re comparing it to. So here we are combining two titles, ‘Mme’ and ‘Mlle’, into a new temporary vector using the c() operator and seeing if any of the existing titles in the entire Title column match either of them. We then replace any match with ‘Mlle’.
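You can see the %in% mechanics on a small toy vector (the values here are mine, just for illustration):

```r
# %in% returns a logical vector: TRUE where the left value appears in the
# vector on the right
titles <- c('Mr', 'Mme', 'Miss', 'Mlle')
titles %in% c('Mme', 'Mlle')              # FALSE TRUE FALSE TRUE

# Using that logical vector to subset and assign replaces only the matches
titles[titles %in% c('Mme', 'Mlle')] <- 'Mlle'
titles                                    # "Mr" "Mlle" "Miss" "Mlle"
```

This subset-and-assign pattern is exactly what the combi$Title line above is doing on the full column.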

Let’s keep looking for redundancy. It seems the very rich are a bit of a problem for our set here too. For the men, we have a handful of titles that only one or two have been blessed with: Captain, Don, Major and Sir. All of these are either military titles, or rich fellas who were born with vast tracts of land.

For the ladies, we have Dona, Lady, Jonkheer (*see comments below), and of course our Countess. All of these are again the rich folks, and may have acted somewhat similarly due to their noble birth. Let’s combine these two groups and reduce the number of factor levels to something that a decision tree might make sense of:

combi$Title[combi$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'


Our final step is to change the variable type back to a factor, as these are essentially categories that we have created:

combi$Title <- factor(combi$Title)


Alright. We’re done with the passenger’s title now. What else can we think up? Well, there’s those two variables SibSb and Parch that indicate the number of family members the passenger is travelling with. Seems reasonable to assume that a large family might have trouble tracking down little Johnny as they all scramble to get off the sinking ship, so let’s combine the two variables into a new one, FamilySize:

combi$FamilySize <- combi$SibSp + combi$Parch + 1


Pretty simple! We just add the number of siblings, spouses, parents and children the passenger had with them, and plus one for their own existence of course, and have a new variable indicating the size of the family they travelled with.

Anything more? Well we just thought about a large family having issues getting to lifeboats together, but maybe specific families had more trouble than others? We could try to extract the Surname of the passengers and group them to find families, but a common last name such as Johnson might have a few extra non-related people aboard. In fact there are three Johnsons in a family with size 3, and another three probably unrelated Johnsons all travelling solo.

Combining the Surname with the family size though should remedy this concern. No two Johnson families should have the same FamilySize on such a small ship. So let’s first extract the passengers’ last names. This should be a pretty simple change from the title extraction code we ran earlier, now we just want the first part of the strsplit output:

combi$Surname <- sapply(combi$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})


We then want to append the FamilySize variable to the front of it, but as we saw with factors, string operations need strings. So let’s convert the FamilySize variable temporarily to a string and combine it with the Surname to get our new FamilyID variable:

combi$FamilyID <- paste(as.character(combi$FamilySize), combi$Surname, sep="")


We used the function paste to bring two strings together, and told it to separate them with nothing through the sep argument. This was stored to a new column called FamilyID. But those three single Johnsons would all have the same Family ID. Given we were originally hypothesising that large families might have trouble sticking together in the panic, let’s knock out any family size of two or less and call it a “small” family. This would fix the Johnson problem too.

combi$FamilyID[combi$FamilySize <= 2] <- 'Small'


Let’s see how we did for identifying these family groups:

> table(combi$FamilyID)
           11Sage           3Abbott         3Appleton         3Beckwith           3Boulos
               11                 3                 1                 2                 3
          3Bourke            3Brown         3Caldwell          3Christy          3Collyer
                3                 4                 3                 2                 3
         3Compton          3Cornell           3Coutts           3Crosby           3Danbom
                3                 1                 3                 3                 3 . . .


Hmm, a few seem to have slipped through the cracks here. There are plenty of FamilyIDs with only one or two members, even though we wanted only family sizes of 3 or more. Perhaps some families had different last names, but whatever the case, all these one- or two-person groups are what we sought to avoid with the three-person cut-off. Let’s begin to clean this up:

famIDs <- data.frame(table(combi$FamilyID))


Now we have stored the table above to a dataframe. Yep, you can store most tables to a dataframe if you want to, so let’s take a look at it by clicking on it in the explorer:

Here we see again all those naughty families that didn’t work well with our assumptions, so let’s subset this dataframe to show only those unexpectedly small FamilyID groups.

famIDs <- famIDs[famIDs$Freq <= 2,]


We then need to overwrite any family IDs in our dataset for groups that were not correctly identified and finally convert it to a factor:

combi$FamilyID[combi$FamilyID %in% famIDs$Var1] <- 'Small'
combi$FamilyID <- factor(combi$FamilyID)


We are now ready to split the test and training sets back into their original states, carrying our fancy new engineered variables with them. The nicest part of what we just did is how the factors are treated in R. Behind the scenes, factors are basically stored as integers, but masked with their text names for us to look at. If you create the above factors on the isolated test and train sets separately, there is no guarantee that both groups exist in both sets.

For instance, the family ‘3Johnson’ previously discussed does not exist in the test set. We know that all three of them survive from the training set data. If we had built our factors in isolation, there would be no factor ‘3Johnson’ for the test set. This would upset any machine learning model because the factor levels between the training set used to build the model and the test set it is asked to predict for are not consistent, i.e. R will throw errors at you if you try.

Because we built the factors on a single dataframe, and then split it apart after we built them, R will give all factor levels to both new dataframes, even if the factor doesn’t exist in one. It will still have the factor level, but no actual observations of it in the set. Neat trick right? Let me assure you that manually updating factor levels is a pain.
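A tiny toy example (values are mine) shows this level-preserving behaviour in action:

```r
# Factor levels survive subsetting: build the factor once, then split
ids <- factor(c('3Johnson', 'Small', 'Small'))
train_ids <- ids[1:2]   # observations: '3Johnson', 'Small'
test_ids  <- ids[3]     # observation: 'Small' only

levels(test_ids)        # still "3Johnson" "Small" - no observations needed
```

Even though test_ids contains no '3Johnson' rows, the level is still registered, which is what keeps the model happy.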

So let’s break them apart and do some predictions on our new fancy engineered variables:

train <- combi[1:891,]
test <- combi[892:1309,]


Here we introduce yet another subsetting method in R; there are many depending on how you want to chop up your data. We have isolated certain ranges of rows of the combi dataset based on the sizes of the original train and test sets. The comma after that with no numbers following it indicates that we want to take ALL columns with this subset and store it to the assigned dataframe. This gives us back our original number of rows, as well as all our new variables including the consistent factor levels.
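The row-range-with-trailing-comma pattern is easy to try on a throwaway dataframe (the toy data here is mine):

```r
# df[rows, ] keeps the listed rows and ALL columns; the empty spot after the
# comma means "every column"
df <- data.frame(x = 1:5, y = letters[1:5])
top <- df[1:3, ]   # rows 1-3, every column
nrow(top)          # 3
ncol(top)          # 2
```

Swapping the sides, df[, "x"] would instead take every row of a single column.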

Time to do our predictions! We have a bunch of new variables, so let’s send them to a new decision tree. Last time the default complexity worked out pretty well, so let’s just grow a tree with the vanilla controls and see what it can do:

fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID,
             data=train, method="class")



Interestingly, our new variables are basically governing our tree. Here’s another drawback with decision trees that I didn’t mention last time: they are biased to favour factors with many levels. Look at how prominent our 61-level FamilyID factor is here; the tree picked out all the families whose outcomes lean one way more than the others. With so many levels to choose from, a decision node can chop and change the data into almost any combination that maximizes the purity of the following nodes.

But all that aside, you should now know how to create a submission from a decision tree, so let’s see how it performed!

Awesome, we just almost halved our rank! All by squeezing a bit more value out of what we already had. And this is just a sample of what you might be able to find in this dataset.

Go ahead and try and create some more engineered variables! As before, I also really encourage you to play around with the complexity parameters and maybe try trimming some deeper trees to see if it helps or hinders your rank. You may even consider excluding some variables from the tree to see if that changes anything too.

In most cases though, the title or gender variables will govern the first decision due to the greedy nature of decision trees. The bias towards many-levelled factors won’t go away either, and the overfitting problem can be difficult to gauge without actually sending in submissions, but good judgement can help.

Next lesson, we will overcome these limitations by building an ensemble of decision trees with the powerful Random Forest algorithm. Go there now!

All code from this tutorial is available on my Github repository.