Friday, April 6, 2012

Human Guided Forests - HGF

Yesterday I posted some slides about an idea I had recently, which I call Human Guided Forests, or HGF for short. This is an attempt to marry crowdsourcing with machine learning to produce better class predictors for datasets with very large feature spaces. Specifically, the idea is to replace the 'random' in the Random Forest algorithm with 'human'.

Random Forests basically work like this:
Given a labeled dataset with M input variables and N samples,
  • Choose m as the number of input variables allowed per tree in the forest
  • For X iterations:
    1. Choose a subset of n samples from the training set
    2. Select m random input variables
    3. Build a decision tree using the randomly selected variables, all n samples (the 'bootstrap' or 'in bag' sample), and standard induction techniques (e.g. C4.5)
    4. Measure the error rate for that tree on the samples not used to train it (the 'out of bag' or 'oob' samples)
    5. Save the tree
After the forest of decision trees has been constructed, classify new samples by running them through all the trees and choosing the class that is predicted most frequently. (This is a very successful kind of 'ensemble classifier', similar to one that I, because of my ignorance, reinvented as one of my first projects in bioinformatics.)
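For the concretely-minded, here is a minimal sketch of that loop in Python, assuming a numeric feature matrix X (N samples by M variables) and integer class labels y. Scikit-learn's CART trees stand in for C4.5, and the names (grow_forest, predict_forest) are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, m=10, seed=0):
    """Grow a forest as described above: one bootstrap sample and m random
    variables per tree, plus an out-of-bag (OOB) error estimate per tree."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    forest = []
    for _ in range(n_trees):
        in_bag = rng.choice(N, size=N, replace=True)         # 1. bootstrap / 'in bag' sample
        oob = np.setdiff1d(np.arange(N), in_bag)             #    samples left 'out of bag'
        feats = rng.choice(M, size=m, replace=False)         # 2. m random input variables
        tree = DecisionTreeClassifier(random_state=0)        # 3. induce a tree (CART here, not C4.5)
        tree.fit(X[np.ix_(in_bag, feats)], y[in_bag])
        oob_error = 1 - tree.score(X[np.ix_(oob, feats)], y[oob])  # 4. error on the oob samples
        forest.append((tree, feats, oob_error))              # 5. save the tree
    return forest

def predict_forest(forest, X_new):
    """Classify new samples by majority vote across all trees
    (assumes integer class labels 0, 1, ...)."""
    votes = np.stack([tree.predict(X_new[:, feats]) for tree, feats, _ in forest])
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])
```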

This algorithm has been shown to be very effective at extracting good classifiers from datasets automatically.  However, as the random forest authors say:
 "But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem."  Leo Breiman and Adele Cutler
So, the question I'm posing is this: by inserting humans into the learning process, can we improve it using their reasoning and background knowledge?

For HGF we replace the random input variable selection at step 2 above with expert-guided variable selection. (We may also let people guide the inference of the decision tree itself.) The hypothesis is that the experts will choose feature sets that are better than the randomly selected ones: sets that generalize to novel datasets with less error and produce more easily understood classifiers.
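A sketch of that swap, reusing the loop above: the only change is that the feature set for each tree comes from an expert-supplied 'hand' instead of a random draw, while the bootstrap and OOB bookkeeping stay random. The expert_hands argument is hypothetical - however the game collects them, each one is just a set of feature indices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_hgf(X, y, expert_hands, seed=0):
    """Like grow_forest above, but step 2 is human-guided: `expert_hands`
    is a (hypothetical) list of feature-index arrays, one proposed by each
    expert or player."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    forest = []
    for feats in expert_hands:
        feats = np.asarray(feats)
        in_bag = rng.choice(N, size=N, replace=True)   # the bootstrap stays random
        oob = np.setdiff1d(np.arange(N), in_bag)
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[np.ix_(in_bag, feats)], y[in_bag])
        oob_error = 1 - tree.score(X[np.ix_(oob, feats)], y[oob])
        forest.append((tree, feats, oob_error))
    return forest
```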

As with standard RF, we need the HGF to produce many trees for each dataset, with each tree having high classification performance and low overlap with the rest of the forest. Ideally, each bootstrap of the training data would be converted into a tree by a different expert, with the experts drawn from a pool of highly diverse expertise. This would require a significant investment of work from a large collection of expensive people.

Now, the next question is: how on earth are we going to get a very large pool of skilled professionals to contribute their expertise to this project? The answer we have been gravitating towards is games. We hope to translate the feature selection problem into a game that knowledgeable biologists and interested lay people will play for fun.

The formulation of the game(s) that will be used to drive an HGF implementation is very much a work in progress.  At the moment, the basic structure of our candidate games is that of a card game.  One way or another, players compose 'hands' of cards that correspond to features in a particular dataset.  For example, cards might correspond to genes from a gene expression dataset.  Hands are scored by testing the predictive performance of classifier trees inferred using the features in the hand and the training data (like one cycle of a random forest run).
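One plausible way to score a hand, assuming the dataset is a samples-by-genes expression table: the function name, the example gene symbols, and the use of cross-validation rather than a single bootstrap/OOB cycle are all illustrative choices here, not a settled design.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def score_hand(expression_df, labels, hand):
    """Score a player's hand of gene 'cards' by the predictive performance
    of a decision tree restricted to those genes. `expression_df` is a
    samples x genes pandas DataFrame and `labels` is the class of each sample."""
    X_hand = expression_df[list(hand)]   # keep only the columns the player laid down
    return cross_val_score(DecisionTreeClassifier(), X_hand, labels, cv=5).mean()

# e.g. score_hand(expression_df, prognosis_labels, hand=["BRCA1", "TP53", "ESR1"])
```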

Relation to Network Guided Forests


This idea is closely related to the concept of 'Network Guided Forests' (NGF) described by Dutkowski and Ideker in a PLoS paper last fall. In that approach, the features used to build decision trees are constrained to related nodes within protein-protein interaction networks. Features are selected for a given tree by picking one at random and then walking out along the network to bring in others in close proximity. The algorithm did not improve classification performance over standard Random Forest as measured by cross-validation, but it did result in much more stable and coherent feature selection across several datasets. It tended to choose genes already known to relate to the phenotype of interest (e.g. breast cancer prognosis) much more often than random selection did. Compared to HGF, NGF has the huge advantage that it can be used immediately, based on data already in databases, without any dependence on human intelligence. HGF has the theoretical advantage of tapping into a much broader collection of knowledge that is not limited to interaction data.
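For comparison, the 'seed and expand' flavor of the NGF feature draw could be sketched roughly like this. The real procedure in the paper differs in its details; this only illustrates the idea of growing a feature set outward from a random seed in the network.

```python
import random
import networkx as nx

def network_guided_features(ppi_graph, m, rng=random):
    """Pick one gene at random, then walk outward through the
    protein-protein interaction network, keeping the first m genes reached."""
    seed = rng.choice(list(ppi_graph.nodes))
    # breadth-first walk from the seed node
    neighborhood = [seed] + [v for _, v in nx.bfs_edges(ppi_graph, seed)]
    return neighborhood[:m]
```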

Call for comments


At this point, this is just a nascent idea. I have no evidence beyond intuition that it will succeed, and there is quite a bit of difficult work ahead to find out. Any thoughts on it at this early stage are most welcome!


4 comments:

Anonymous said...

Benny,

This is really cool and sounds interesting. Good to see that you're still at it.

One issue that I have with the card game, or any type of game for that matter, is that there's a good chance you will introduce extra variables to the equation. These could skew the results. Plus you also have to create and market the game, too much extra work.

Why not design the human expert part to be as straightforward as possible? If it turns out boring for them, no problem, just pay the testers. Take advantage of Amazon Mechanical Turk or something like that.

Your old college roommate

Benjamin Good said...

Matthew, thanks for checking this out!

Your comment about the games being a lot of work to make and to market and to keep bias free is a great one. I think for large classes of problems you are completely correct that there are much easier, more cost-effective approaches to crowdsourcing. Here is some great evidence for this from the folks at Freebase.

The reason that we are gravitating towards games for this particular domain is that the tasks we need accomplished are generally very difficult - too difficult to translate into HITs for Mechanical Turk that we could expect people to solve for the amount of money we could pay them (which is very little). Could you reliably tell me which of two genes is more likely to relate to cancer? I certainly couldn't. Tasks like this require either a serious investment of time or a substantial amount of expertise. Believe me, if I could spend $1,000 on the Turk and build a better predictor for cancer prognosis I would do it.

I think that games open up the possibility of getting people working on a problem who would never otherwise participate. While I couldn't pay a professor enough to spend a couple of hours on my problem, I might be able to build a purposeful game that they would play just for fun (or to show how much smarter they are than their other professor friends).

Besides, what could be more fun (in terms of work) for me than building these kinds of games? ;) (BTW you have to read this book if you haven't already: "Reality Is Broken")

obiwan said...

Ben,

We have discussed this in person, but I want to go on record that I think this is a very clever idea. It's a very interesting question which parts of the forest building to hand over to human effort versus what (if any) parts to leave as random. It seems best to leave the random division used for OOB testing untouched. That leaves you with some kind of built-in cross-validation, although depending on the downstream steps it may become less reliable as an estimate of future performance. But what about the random choice of m variables at each node? Should that still be left random? Then the human picks up from there, choosing the best splitter variable and how to use it? That would ensure the wide variety of trees that you want.

Benjamin Good said...

Obi, it's really hard to say, but it seems like people might have more to bring to the table in finding good feature combinations (choosing the m variables) than in optimizing the split structure of a particular tree, given the much larger search space for the former.

But... I'm not sure about that, and it's certainly vital that we don't end up with a clone forest... Ideally we can design the system to reward contributions of both good, novel feature sets and quality trees. The current 3-axis scoring function for the prototype expert tree builder is an attempt to do exactly that.