Thursday, April 4, 2013

Machine Learning and Perl

----EDIT TO POST----
To the people who find this post by an internet search: I suggest you give up on machine learning in Perl and learn to do it in Python. That's what I did.
---------------------------

Yep, that's correct. I started to do some basic machine learning programming in Perl.

See, I had this large dataset that I generated and wanted to understand it better. We were working with someone who was using random forest modeling, and I wanted a better grasp of that, so I decided to write a decision tree learning program that takes a general dataset and maps it to a binary effect. I chose a binary effect because non-binary effects are much harder to do, especially on a small dataset like mine with ~75 values.

Basically, the program I wrote takes the whole dataset and calculates, for each value of each feature, a binomial probability of the outcome, then uses those probabilities to predict. This has led me down many streams of inquiry.
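As a rough illustration of the per-value probabilities described above, here is a minimal Python sketch (not the original Perl). The row layout, the function name `outcome_rates`, and the toy `charge` feature are all made up for the example.

```python
from collections import defaultdict

def outcome_rates(rows, feature):
    """Return {feature value: (positives, total, positives / total)}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [positives, total]
    for features, outcome in rows:
        tally = counts[features[feature]]
        tally[0] += outcome  # outcome is 0 or 1
        tally[1] += 1
    return {value: (p, n, p / n) for value, (p, n) in counts.items()}

# Hypothetical toy data: (feature dict, binary outcome)
rows = [
    ({"charge": "pos"}, 1),
    ({"charge": "pos"}, 1),
    ({"charge": "neg"}, 0),
    ({"charge": "neg"}, 1),
]
rates = outcome_rates(rows, "charge")
```

With only a handful of rows per value, these estimated rates are very noisy, which is part of why small datasets are hard.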

How to bin features?
Been thinking of using k-means clustering.
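A minimal sketch of what k-means binning of a single numeric feature could look like, in plain Python. This is one possible approach, not something the program above actually does; the function names, the choice of starting centers, and the sample numbers are assumptions for illustration.

```python
def kmeans_bins(values, k, iterations=100):
    """Cluster 1-D values into k bins with Lloyd's algorithm; return sorted centers."""
    ordered = sorted(values)
    # Deterministic start (an assumption): k evenly spaced points from the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)

def assign_bin(value, centers):
    """Index of the closest center, used as the bin label."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))

# Made-up numeric feature values
feature_values = [-4.5, -3.9, -3.5, -0.4, 0.2, 1.8, 3.8, 4.2, 4.5]
centers = kmeans_bins(feature_values, k=3)
bins = [assign_bin(v, centers) for v in feature_values]
```

The number of bins `k` still has to be picked by hand, so this moves the binning question rather than answering it.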

How/If to calculate feature entropy?
I have been having trouble with information gain because some features I use have lots of possible values, and information gain is biased toward features with many values. I could iterate over all possible mutual information values, but that could become very problematic with datasets with many features. But probably not.
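To make the many-values problem concrete, here is a small Python sketch of entropy and information gain using the standard textbook definitions. The function names and toy data are made up. A feature that takes a different value on every row scores the maximum possible gain even though it could never generalize.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = [1, 1, 0, 0, 1, 0]
coarse = ["a", "a", "a", "b", "b", "b"]  # two values, weakly informative
unique_id = [0, 1, 2, 3, 4, 5]           # a different value on every row

gain_coarse = information_gain(coarse, labels)
gain_unique = information_gain(unique_id, labels)
```

Here `gain_unique` comes out at the full entropy of the labels, while `gain_coarse` is small, so ranking features by raw gain would pick the useless one.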

Is it possible to generalize machine learning?
Kind of, but not really. There are so many caveats with every machine learning technique. It cracks me up. I will spend time thinking about why something shouldn't work, only to read about it and find out that it shouldn't work or that there is a bias in the method.


So I wrote a basic machine learning program in Perl that takes binary effects and tries to predict them using the features and states of features that you give it. I used it to try to find any information as to how or why mutations I made to a protein did what they did. It didn't tell me much besides the fact that many characteristics of each residue determine its effect on function. Surprisingly, BLOSUM62 scores are a really good measure of the effect of a mutation on a protein's function.

The code is a total kludge but I like Perl because of hashes. You can put anything in a hash and do anything with a hash. Very nice. However, the way Perl handles array values vs array references is not very nice.

The code will take a training set of data and find the best predictors for your binary effect. It will then attempt to predict the data in your prediction set, comparing the top 3 predictors for each datapoint to come up with a prediction. It will tell you how often it predicted correctly. Mutual information code is in there, but it doesn't get used because my data doesn't have enough points to make mutual information useful.
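For anyone who wants the gist without reading the Perl, here is a loose Python sketch of that train-then-vote flow. It is a guess at the general shape, not a port: scoring each feature by its training accuracy and combining the top 3 by majority vote are assumptions, and every name and data value below is made up.

```python
from collections import Counter, defaultdict

def fit_feature(rows, feature):
    """Learn value -> majority outcome for one feature, plus its training accuracy."""
    by_value = defaultdict(list)
    for features, outcome in rows:
        by_value[features[feature]].append(outcome)
    rule = {v: Counter(o).most_common(1)[0][0] for v, o in by_value.items()}
    correct = sum(rule[f[feature]] == y for f, y in rows)
    return rule, correct / len(rows)

def fit_top_predictors(rows, top_n=3):
    """Rank features by training accuracy (an assumed criterion) and keep the best."""
    fitted = {name: fit_feature(rows, name) for name in rows[0][0]}
    ranked = sorted(fitted, key=lambda name: fitted[name][1], reverse=True)
    return [(name, fitted[name][0]) for name in ranked[:top_n]]

def predict(predictors, features, default=0):
    """Majority vote over the chosen predictors; unseen values vote `default`."""
    votes = [rule.get(features[name], default) for name, rule in predictors]
    return Counter(votes).most_common(1)[0][0]

FIELDS = ("size", "charge", "polar", "buried")  # hypothetical feature names

def row(values, outcome):
    return dict(zip(FIELDS, values.split())), outcome

training = [
    row("big pos yes no", 1), row("big pos yes no", 1), row("small pos yes no", 1),
    row("small neg yes no", 0), row("small neg no no", 0), row("big neg no no", 0),
]
prediction_set = [row("small pos yes no", 1), row("big neg no no", 0)]

predictors = fit_top_predictors(training)
predictions = [predict(predictors, features) for features, _ in prediction_set]
accuracy = sum(p == y for p, (_, y) in zip(predictions, prediction_set)) / len(prediction_set)
```

Using an odd number of binary voters means the vote can never tie, which is one practical reason to compare 3 predictors rather than 2 or 4.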


Decision Tree Code


0-6 on Post Doc emails (1 response)
I think I will blame it on sequestration.