This weekend happened to include one more day than usual, since we had that archaic “holiday” of Columbus Day off. So, what did I decide to do with those three days of school-less bliss? Attempt to calculate the average distribution of letters in the English language, of course! To begin the study, I started off by counting the letters in this paragraph (you don’t have to actually read this part, just be aware that it’s there):
Language is a particularly curious thing. Why do we write as we do? How did language come to be the way it is? Why are some letters used more than others? It spurns on many questions, most of which I can’t answer. One of them I will attempt, however: what is the distribution of letters in English? So, let’s try and see if we can find out. I will count all the letters present on this page, find their percentages, and graph them. Time for math!
So, with all the letters counted, I created this graph. The x-axis shows each letter, and the y-axis shows how many times it appears. From a quick glance at the graph, it appears that the three most used letters, at least in the above paragraph, are “e,” “t,” and “i.” While this is great to know for a paragraph of 347 letters, it doesn’t really help much with my overall goal. If I really want to find the average distribution, I’ll have to go much bigger, and by bigger I mean using multiple books.
Of course, there are certain practical problems with using texts of over 500 letters, mainly the fact that I would have to count each individual letter one by one. Books can easily have 50,000 words, let alone letters! To get around this, I decided to write a simple Java program to count the letters for me. If for some strange reason you want to view the code I used to accomplish this, I humbly direct you to this link.
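For the curious, here is a minimal sketch of what a letter counter like this might look like. It is not the program linked above; the names `LetterCounter` and `countLetters` are made up for illustration, and it assumes that only the 26 unaccented letters count, with upper and lower case lumped together.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not the original program from the post.
class LetterCounter {

    // Returns 26 counts: index 0 is 'a', index 25 is 'z'.
    // Assumption: anything that is not an unaccented letter is skipped.
    static long[] countLetters(String text) {
        long[] counts = new long[26];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 'A' && c <= 'Z') {
                c = (char) (c - 'A' + 'a'); // fold upper case into lower case
            }
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: java LetterCounter path/to/book.txt
        String text = Files.readString(Path.of(args[0]));
        long[] counts = countLetters(text);
        for (int i = 0; i < counts.length; i++) {
            System.out.println((char) ('a' + i) + ": " + counts[i]);
        }
    }
}
```

Using a plain 26-slot array keeps things simple, since each letter maps straight to an index by subtracting `'a'`.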
Now that my letter counting problem was solved, it was time to select the texts I would be using for the program itself. I settled on eight books, two of which would be of my own authorship. Why? Well, because I’m selfish and narcissistic, that’s why. The first two books, I decided, would be Blood on the Golden Horn and The Heist, both written by myself. The next would be The Hitchhiker’s Guide to the Galaxy by the infinitely hilarious Douglas Adams. H2G2, as it’s been abbreviated, is one of my favorite books of all time, and represents modern speech and diction well. Next up came The Fellowship of the Ring, the first novel in J.R.R. Tolkien’s epic Lord of the Rings series. I thought this book was an appropriate choice, considering how much it shaped the modern fantasy genre. Following that was Ulysses by James Joyce. Considered by many the best book ever written in the English language, it was all too perfect for a study like this. To try and spice things up a bit and get a better range of data, I decided that the last books would all be something other than just pure fiction. For my scientific book, I chose On the Origin of Species by Charles Darwin, a landmark book. Paradise Lost, the epic poem by John Milton, came next. And what better to top off a study on the English language than a play by the man who essentially made the language what it is today? My last text was Hamlet by William Shakespeare.
With my books chosen, I plugged them into my Java program and recorded the results of each. Then, I graphed each and every book. You can view all eight graphs here. (The image is too big to post here.) With all the books’ totals counted, I added them together to get a grand total across all eight works. With this grand total, I then worked out what percentage of the total letters (which happened to be 3,910,041) each letter made up. With that, I created the final graph.
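The adding-up and percentage step can be sketched in a few lines of Java as well. Again, this is only an illustration under my own assumptions, not the code or spreadsheet actually used: the names `LetterPercentages`, `addCounts`, and `toPercentages` are hypothetical, and the numbers in `main` are placeholders rather than real results.

```java
// Hypothetical sketch of combining per-book counts into percentages.
class LetterPercentages {

    // Adds two count arrays of the same length, slot by slot.
    static long[] addCounts(long[] a, long[] b) {
        long[] sum = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            sum[i] = a[i] + b[i];
        }
        return sum;
    }

    // Converts raw counts into each slot's share of the grand total, in percent.
    static double[] toPercentages(long[] counts) {
        long total = 0;
        for (long n : counts) {
            total += n;
        }
        double[] percentages = new double[counts.length];
        if (total == 0) {
            return percentages; // nothing counted, so every share is 0
        }
        for (int i = 0; i < counts.length; i++) {
            percentages[i] = 100.0 * counts[i] / total;
        }
        return percentages;
    }

    public static void main(String[] args) {
        // Placeholder counts for two imaginary books, covering 'a' to 'c' only.
        long[] bookOne = {5, 1, 2};
        long[] bookTwo = {3, 1, 4};
        double[] shares = toPercentages(addCounts(bookOne, bookTwo));
        for (int i = 0; i < shares.length; i++) {
            System.out.println(String.format("%c: %.2f%%", (char) ('a' + i), shares[i]));
        }
    }
}
```

Each percentage is just a letter's count divided by the grand total and multiplied by 100, so the shares always add up to 100.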
As you can see, this graph looks eerily similar to the first graph. It seems that the distribution seen in a sample as small as 347 letters holds all the way up to almost 4 million letters. To my surprise, the most used letters are not the five vowels. The top five are “e,” “t,” “a,” “o,” and “i.” “U” is missing from the top, replaced by “t.” But even if “t” were eliminated, “u” would still be nowhere near the top. What a curious little language we have. I hope that this little experiment of sorts has given you a bit more knowledge about the language that we all use each and every day!
NOTE: If you want to see the full data table of the number of letters in each book, all the charts, and the like, you can download my report here (in .xls).