Monday, July 30, 2012

The Chromochord in Europe

It has been pretty crazy being over here in Amsterdam. I came over to do some science with Dr. Tilo Mathes, who works with Dr. John Kennis at Vrije Universiteit. We worked on doing Fourier Transform Infrared Spectroscopy on a bunch of mutant proteins I created of the AsLOV2 domain. We actually acquired a fair number of results on over 10 proteins. I am pretty happy. I travel to Berlin tomorrow and I am going to give a talk titled "Protein Engineering and BioElectronics using light activated LOV domains." I am really excited about Berlin and giving a talk. I plan on playing the Chromochord a little bit for them. I will be really glad to make it back to America. It is really difficult to leave home for 3 weeks in another country. Things that were normal and simple in your old life, such as getting cash from an ATM, become difficult. It is also stressful to the body, which makes it difficult for the mind to function at its peak. And I forgot that my soldering iron doesn't work with 220V, so now when I need to solder something I have to heat it up on the stove, hah. Yes, I brought some electronics things to work on while I am here. I have not been as productive as I hoped, but still OK. Hopefully I will have some good posts next week on electromyography.

Tuesday, July 10, 2012

Google Ngrams part 2

I originally thought, probably like most people, that Google ngrams was pretty fun and amazing until I figured out that most of their data is completely wrong. I am kind of impressed at how bad the actual dataset is. I think there are a few problems which could most likely be easily fixed. One is with fixing errors from the Optical Character Recognition (OCR). There are some consistent errors, such as English letters being recognized as non-English characters, that could be fixed with some simple parsing. I know Google is also using a very loose interpretation of what a word is, and what they actually mean is a string of characters. I don't think many (any?) languages consider $0.00 or 2& a word. One of the other problems is proper nouns and uncommon, non-English words, for example words in books that are actively translating within the text. See here for example. Figure 2 in the original paper would be completely wrong by all normal standards because it focuses so much on frequency, yet the frequencies would be all wrong. The reason so many of the words in ngrams are not in the dictionary is that they either are not words or they are not English words!!! Further, there is a clear dependence between the number of books for a given year and the number of ngrams, and most likely spurious ngrams (though I have not verified the spurious ngram part). While writing this I stumbled upon Microsoft's Web N-grams http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx which I will look at soon. Hopefully in the next two weeks I will put up the basic information I found from my language analysis. I am going to the Netherlands to perform some ultrafast laser spectroscopy, so hopefully I will have lots of free time to code and write, as I heard that they don't work on the weekends overseas. haha.
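
To make the "string of characters versus word" complaint above concrete, here is a rough sketch of the kind of simple parsing I mean. It is not what Google does or anything from my actual analysis. The names `is_wordlike` and `split_ngrams` are made up for illustration, the counts are invented, and the rule itself (ASCII letters with optional internal apostrophes or hyphens) is just one assumption about what counts as a word.

```python
import re

# Assumption: a "word" is a run of ASCII letters, optionally joined by
# internal apostrophes or hyphens (e.g. "don't", "well-known").
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def is_wordlike(token: str) -> bool:
    """Return True if the whole token matches the word pattern above."""
    return WORD_RE.fullmatch(token) is not None


def split_ngrams(counts: dict[str, int]) -> tuple[dict[str, int], dict[str, int]]:
    """Split 1-gram counts into word-like and non-word-like entries."""
    words: dict[str, int] = {}
    rejects: dict[str, int] = {}
    for token, count in counts.items():
        if is_wordlike(token):
            words[token] = count
        else:
            rejects[token] = count
    return words, rejects


# Invented counts for illustration only; not taken from the real dataset.
sample = {"protein": 120, "$0.00": 7, "2&": 3, "don't": 45, "tbe": 9}
words, rejects = split_ngrams(sample)
print(sorted(words))    # tokens that pass the character-level check
print(sorted(rejects))  # tokens such as "$0.00" and "2&"
```

Note that a made-up OCR-style typo like "tbe" still passes, because it is all letters. A character-level filter like this only removes things like $0.00 and 2&. Catching OCR mistakes and non-English words would need something more, probably a dictionary lookup.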