Friday, June 29, 2012

Google Ngrams Part 1

So I have been spending some time analyzing language usage from a bunch of different sources, including chats, Twitter, email, phone conversations, and books. I have started analyzing the raw data from Google Ngrams, and the raw data sucks ass. Come on, Google, you can't write a 5-minute piece of Perl code to parse the grams? Don't believe me? Look here:
As you can tell, there has been a drop in the usage of vi for editing books, as it probably moved on to be a programmer's tool around the 1830s. And that is not even to mention the words with non-alphanumeric characters. I guess I will parse my happy heart out. Just removing words that contain numbers or non-alpha characters other than the apostrophe (') has me at about 1/3 of the original size for the 1-grams. I also removed rare words: each word has to occur at least 5 times in at least 2 books. I'll post the scripts when I am done.
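
In the meantime, here is a minimal sketch of that kind of filtering in Python. It is not the script promised above, just an illustration under a few assumptions: each raw line is tab-separated with the word first, the year second, the match count third, and the volume (book) count last; "alpha" means ASCII letters; and the 5-occurrence and 2-book thresholds apply to totals summed over all years. The function name `filter_1grams` and its parameters are made up for this example.

```python
import re
from collections import defaultdict

# Assumption: a "clean" word is ASCII letters and apostrophes only.
WORD_RE = re.compile(r"[A-Za-z']+")


def filter_1grams(lines, min_count=5, min_books=2):
    """Return {word: (total_matches, total_books)} for words that pass the filters.

    Assumed line layout (tab-separated): word, year, match count, ..., volume count.
    """
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        word = fields[0]
        if not WORD_RE.fullmatch(word):
            continue  # drop words containing digits or other symbols
        try:
            matches = int(fields[2])
            books = int(fields[-1])
        except ValueError:
            continue  # skip lines whose counts are not integers
        totals[word][0] += matches
        totals[word][1] += books
    # Assumption: thresholds are applied to the totals across all years.
    return {
        word: (matches, books)
        for word, (matches, books) in totals.items()
        if matches >= min_count and books >= min_books
    }


if __name__ == "__main__":
    # Made-up sample lines, not real Google Ngrams data.
    sample = [
        "vi\t1820\t3\t1",
        "vi\t1821\t4\t2",
        "abc123\t1900\t100\t50",
        "don't\t1900\t10\t5",
    ]
    for word, (matches, books) in sorted(filter_1grams(sample).items()):
        print(word, matches, books)
```

Reading the book count from the last column rather than a fixed position is a deliberate choice, so the sketch still works if the file has an extra column in between. If the thresholds were meant per year instead of over all years, the check would move inside the loop.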