The tails of this distribution were also of interest, though. I was shocked to find that almost every industry that LinkedIn lets you choose from was represented. In fact, it will be much faster for me to tell you which 13 industries were NOT on the final list:
Dairy, fishery, furniture, judiciary, law enforcement, legislative office, plastics, ranching, recreational facilities and services, supermarkets, textiles, tobacco, and warehousing.
134 different industries are represented at these MeetUp events about data; this just goes to show how prevalent the hysteria over understanding your company’s data has become.
Another point to note: perhaps it’s time to include a “data science” industry on LinkedIn?
I already mentioned the search terms used to mine the MeetUp members above, but I thought a little more detail on the problems we encountered would help in understanding the data. Below is a plot of the actual yield of our experiments, along with the various sources of data loss along the way:
From the MeetUp members list, we passed the name strings through a regular expression to extract first and last names. A large number of people on MeetUp use only their first name, or some combination of initials to identify themselves. These names were thrown out and are marked “Invalid Name” above.
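To make the name filter concrete, here is a minimal sketch of what such a check could look like in Python. The pattern and the `extract_name` helper are hypothetical, not the exact expression we ran: this version assumes a valid name is a first and last name of at least two letters each, optionally separated by middle initials, and it only handles ASCII letters.

```python
import re

# Hypothetical pattern (an assumption, not the original expression):
# first name, optional middle initials, last name. Each of the first and
# last names must be at least two characters, so bare initials are rejected.
# ASCII letters only; accented names would need a broader character class.
NAME_RE = re.compile(
    r"^\s*([A-Za-z][A-Za-z'\-]+)"   # first name
    r"\s+(?:[A-Za-z]\.?\s+)*"       # zero or more middle initials
    r"([A-Za-z][A-Za-z'\-]+)\s*$"   # last name
)


def extract_name(raw):
    """Return (first, last) for a usable name, or None for an invalid one."""
    match = NAME_RE.match(raw)
    if match is None:
        return None  # would be counted as "Invalid Name"
    return match.group(1), match.group(2)
```

With this sketch, a string such as "Jane Q. Doe" yields a first and last name, while "Jane", "J. D." or "Jane D." come back as `None` and would land in the “Invalid Name” bucket.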
LinkedIn had its own set of challenges too. Searches frequently returned duplicate names, perhaps for a very common name or a person with multiple accounts. In this case we added the keyword “data” to the search and retried; if the name remained a duplicate after this, it was thrown into the “Duplicate” bucket. Other issues were people who could not be found on LinkedIn (“No Match”) and those with private LinkedIn profiles (“Private”).
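The bucketing logic described here can be sketched roughly as follows. This is an illustration only: `search` is a hypothetical stand-in for whatever performs the LinkedIn lookup (it is not a real LinkedIn API call), each result is assumed to be a dict with an optional `"private"` flag, and the "Matched" label is our own name for the successful case. The sketch also assumes that a retry which returns nothing counts as “No Match”.

```python
from collections import Counter


def classify(name, search):
    """Assign one person to a bucket using the retry rule described above.

    `search(name, keyword=None)` is a hypothetical callable returning a
    list of profile dicts; it stands in for the real lookup.
    """
    results = search(name)
    if len(results) > 1:
        # Ambiguous name: retry once with the extra keyword "data".
        results = search(name, keyword="data")
    if not results:
        return "No Match"   # assumption: an empty retry also lands here
    if len(results) > 1:
        return "Duplicate"  # still ambiguous after the retry
    if results[0].get("private"):
        return "Private"
    return "Matched"


def tally(names, search):
    """Count how many names fall into each bucket."""
    return Counter(classify(name, search) for name in names)
```

The yield is then the "Matched" count divided by the total number of MeetUp members considered, including those dropped earlier for an invalid name.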
After all was said and done, we had generated summary statistics on the industries of over 11,000 data scientists, a yield of 37%. Perhaps the dairy farmers and fishermen were in the other 63%?