While there could hardly be a more chaotic event than frightened people scrambling to escape a sinking ship, the disaster is famous for its “women and children first” evacuation. With an inadequate number of lifeboats available, only a fraction of the passengers survived, and through this series of lessons we’ll try to predict who they were.
As with most Kaggle competitions, you are given two datasets:
As this is a beginner’s competition, Kaggle has provided a couple of excellent tutorials to get you moving in the right direction: one in Excel, and another using more powerful tools in the Python programming language. Ah, but you would feel (justifiably) embarrassed to use Excel, and Python seems a little heavy right now? Well, you’ve come to the right place. I’m going to introduce you to a free and powerful statistical programming language called R and get you started with predictive analytics.
Over the next few weeks I’ll ease you into R and its syntax, piece-by-piece, and step you through a selection of algorithms, from the trivial to the powerful. I’ll also introduce some feature engineering concepts that will start to push the envelope.
In fact, by the time we’re done, you’ll have climbed past much of the leaderboard by improving your accuracy by only a few percentage points. That alone is a good lesson for Kaggle: those few points, or even fractions thereof, can translate to massive ranking swings and, once you’re ready for the big leagues, mean the difference between earning a top 10% badge on your profile (or even getting paid) and missing out.
The guide is intended for people with zero experience in R, and probably very little programming experience as well. I won’t be able to cover all of the syntax, but if you get through the lessons, you may wish to expand your horizons by checking out some broader tutorials here and here. Or if you’re more of a book person, this is one that I can recommend highly: The Art of R Programming: A Tour of Statistical Software Design.
If you have any questions about these lessons, I’d encourage you to post them to the Kaggle forums, where many others may have already come across the same issue and can jump in to help you out. If you notice any bugs or typos, or have any suggestions for making the tutorial easier to follow, please send me a direct message through Twitter. All code is available in my GitHub repository.
I will be dividing this series of tutorials into five parts:
So go ahead and get started with part 1.