I am very impressed with the Kagglers who managed to CV this competition to a high degree of certainty. Some on the forums were talking about 40-70 folds to get a good idea of their model cut-off errors. I generally work comps locally on my little 4-core 8GB laptop, and at 12-36 hours of run time per submission on this competition, I really couldn’t afford much more rigor on my already expensive models.
So anyhow, here’s a high-level look at what I did for Higgs, and how I briefly approached the top of the leaderboard… It basically comes down to my new favorite feature engineering pipeline:
I rolled these variables out incrementally over the 2 months I worked on Higgs. Obviously I had a dimensionality problem here. With 33 variables, that’s several hundred new variables each time you look at a new set, most of which are garbage. So, here’s the pipeline I used to throw out the chaff:
It was my hope that this would get rid of spurious variables. I think I probably could have cut deeper as the 1% thresholds are clearly arbitrarily chosen. This would have saved some training time for the next phases too… And thus I could have CV’d harder. Oh well.
Now, I created a new dataset with all the surviving variables from each auto-variable set. And then…
After this process I usually had a list of 250-300 variables to run with depending on which sets I was combining. Here’s a glance at the top 10 variables from one fold. Note that the scaled interactions are simply marked with the operator, while raw quantities explicitly call out that fact with “/raw” for example:
I am no high-energy physicist, but maybe someone out there might care to comment on why these were important for discriminating background from signal. As should be truly evident by now, I cared not why, just that it was working.
Glancing at my CV and LB scores, I’m thinking that most of the bump I saw was from the normalized variables. Why? Well, at a guess, trees are greedy: they make a decision at each node without looking forward or back, so they can miss an interaction between two variables because no single split ever inspects it. Generating scaled interaction variables seems to me to go hand-in-hand with tree-based learning. “Is this variable ‘big’ and that one ‘small’?” These automated variables pick that right up in a single node, where a tree on the raw variables might spend two or more splits to get there. It worked nicely for this comp and I’ll be keeping it in my toolkit for the next one.
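To make the idea concrete, here is a minimal sketch of one way such a variable set could be generated. It is an illustration under assumptions, not the exact pipeline used for Higgs: the z-score scaling, the choice of subtraction and multiplication as operators, and the helper names `zscore` and `pairwise_interactions` are all hypothetical. Only the naming convention (operator alone for scaled, “/raw” for unscaled) follows the description above.

```python
from itertools import combinations
from statistics import mean, pstdev


def zscore(column):
    """Scale a column to zero mean and unit variance (assumed scaling choice)."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def pairwise_interactions(table, scale=True):
    """Build one new column per variable pair and operator.

    `table` maps a variable name to a list of floats. With scale=True the
    inputs are z-scored first and the new names carry only the operator;
    with scale=False the raw values are used and the names end in "/raw".
    """
    if scale:
        cols = {name: zscore(vals) for name, vals in table.items()}
    else:
        cols = {name: list(vals) for name, vals in table.items()}
    suffix = "" if scale else "/raw"

    out = {}
    for a, b in combinations(sorted(cols), 2):
        # Difference: "is a big while b is small?" becomes a single split.
        out[f"{a}-{b}{suffix}"] = [x - y for x, y in zip(cols[a], cols[b])]
        # Product: large in magnitude when both are far from their means.
        out[f"{a}*{b}{suffix}"] = [x * y for x, y in zip(cols[a], cols[b])]
    return out


# Toy data with hypothetical column names.
toy = {"x": [1.0, 2.0, 3.0], "y": [3.0, 2.0, 1.0]}
scaled = pairwise_interactions(toy)
raw = pairwise_interactions(toy, scale=False)
```

With 33 input variables there are 528 distinct pairs, so each operator adds 528 columns, which is in line with the several hundred new variables per set mentioned earlier. Scaling first means a difference like `x-y` compares the two variables on a common footing, so one threshold on it can stand in for a pair of splits on the raw inputs.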
The rest of my pipeline was pretty standard and rested on XGBoost’s shoulders with a 4-fold CV that was clearly too loose for this evaluation metric. I experimented with a fairly wide range of hyper-parameters, but it was hard to discern any significant difference between them due to the CV instabilities.
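For readers who have not seen it, the evaluation metric in question is the approximate median significance (AMS), computed from the weighted signal and background that land above the chosen cut-off. The sketch below is a minimal version of it, assuming the challenge's regularisation term of 10; the helper names `ams_at_cutoff` and `fold_spread` are hypothetical and this is not the code I ran during the competition.

```python
import math

B_REG = 10.0  # regularisation term from the challenge's AMS definition


def ams(s, b, b_reg=B_REG):
    """Approximate median significance for weighted signal s and background b."""
    radicand = 2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(0.0, radicand))


def ams_at_cutoff(scores, is_signal, weights, cutoff):
    """AMS when every event scoring at or above `cutoff` is called signal."""
    s = 0.0  # weighted true positives
    b = 0.0  # weighted false positives
    for score, signal, weight in zip(scores, is_signal, weights):
        if score >= cutoff:
            if signal:
                s += weight
            else:
                b += weight
    return ams(s, b)


def fold_spread(fold_values):
    """Gap between the best and worst fold, a crude instability measure."""
    return max(fold_values) - min(fold_values)


# Toy example: one selected signal event of weight 10, no selected background.
example = ams_at_cutoff(
    scores=[0.9, 0.8, 0.2],
    is_signal=[True, False, True],
    weights=[10.0, 0.0, 5.0],
    cutoff=0.5,
)
```

Because only the events above the cut-off contribute, a handful of heavily weighted background events falling on one side or the other can move the score noticeably, which is one plausible reason a 4-fold CV looked so noisy. When scoring a single fold, the weights would typically also need rescaling so that `s` and `b` stay comparable with the full dataset; that step is left out here.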
Anyhow, after Higgs ended I finally hit the top 1000 of Kagglers and scored my third top 10%. The next milestone is quite obvious now: the Kaggle Master golden jersey must be mine!
See you on the leaderboards.