Word Frequency for YouTube Videos



Wednesday, November 17, 2010

YouTube has a feature where you can browse the most viewed videos over a specific time frame (today, this week, this month, or all time). I thought it would be interesting to see which words (if any) pop up more often than others. Just glancing at the list, I guessed that "justin" and "bieber" would come out on top, so I wrote some quick Groovy code to see if I was right.

The Stats:

Here is what I found when I looked at the top 160 most viewed videos of all time (as of today):

Top 25 Words:

Word        Count   Freq
official    19      3%
music       12      1%
song        7       1%
cyrus       7       1%
miley       7       1%
version     6       0%
gaga        6       0%
lady        6       0%
bieber      6       0%
justin      6       0%
jason       5       0%
feat        5       0%
dance       5       0%
love        5       0%
baby        5       0%
high        4       0%
david       4       0%
best        4       0%
this        4       0%
sean        3       0%
nuki        3       0%
iglesias    3       0%
enrique     3       0%
goes        3       0%
like        3       0%

Other Stats:
Total words: 619
Total unique words: 424
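
The Freq column looks like each word's count as a share of the 619 total words, cut down to a whole percent. That is an inference from the numbers rather than something taken from the source code: 12 out of 619 is about 1.9%, which would round to 2%, so truncation is what matches the table. Here is a minimal sketch under that assumption, using a handful of counts from the table (the variable names are made up for illustration):

```groovy
// Counts copied from the table above; total taken from "Other Stats".
def totalWords = 619
def counts = [official: 19, music: 12, song: 7, version: 6, love: 5, high: 4, like: 3]

// Assumption: the percentage is truncated to a whole number, not rounded.
def freq = counts.collectEntries { word, count ->
    [(word): (count * 100).intdiv(totalWords)]
}

// Print word, count, and truncated percentage, tab separated.
freq.each { word, pct -> println "${word}\t${counts[word]}\t${pct}%" }
```

With only 619 words in total, anything that appears fewer than 7 times falls below 1% and shows up as 0%.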

The Code:

All the source code is located here (box.net)

Here are the guts of the program:
def html = new XmlSlurper(new SAXParser()).parse(urlString)
html.'**'.findAll{ it.@class == 'video-title'}.each {nextVideo ->
    //split the title using regex on non-word characters
    nextVideo.text().split(/\W/).each{nextWord ->
        def lowerCase = nextWord.toLowerCase()
        //limit the results to "interesting" words
        if(lowerCase.length() >= minWordLength && !(lowerCase in ignoreList)){
            wordFreq[lowerCase] = wordFreq[lowerCase] == null ? 1 : wordFreq[lowerCase] + 1
        }
    }
}


Here is what it does:
1. Parse the YouTube URL using XmlSlurper
2. Find all the titles on the page
3. Split each title to get the individual words
4. Convert each word to lower case to make comparisons easier
5. Limit the words by a minimum length (to get rid of stuff like "of", "and", "the") and ignore other words (like "video")
6. Update the wordFreq map
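
To try the counting steps without hitting YouTube, here is a rough, self-contained sketch that runs the same split/lower-case/filter/count logic over a hard-coded list of titles. The sample titles are invented, and the values of minWordLength and ignoreList are guesses (the table only shows words of four or more letters, so 4 seemed a reasonable stand-in); the real values live in the full source.

```groovy
// Hypothetical sample titles; the real program reads these from the YouTube page.
def titles = [
    'Justin Bieber - Baby ft. Ludacris',
    'Lady Gaga - Bad Romance (Official Video)',
    'Justin Bieber - One Time (Official Music Video)'
]

// Assumed settings; the actual values are not given in the post.
def minWordLength = 4
def ignoreList = ['video']

def wordFreq = [:]
titles.each { title ->
    // split on non-word characters, as in the original code
    title.split(/\W/).each { nextWord ->
        def lowerCase = nextWord.toLowerCase()
        // keep only "interesting" words
        if (lowerCase.length() >= minWordLength && !(lowerCase in ignoreList)) {
            wordFreq[lowerCase] = (wordFreq[lowerCase] ?: 0) + 1
        }
    }
}

// sort by descending count and print word/count pairs
def sorted = wordFreq.sort { a, b -> b.value <=> a.value }
sorted.each { word, count -> println "${word}\t${count}" }
```

Splitting on a single non-word character leaves empty strings wherever two separators sit next to each other (like " - "), but the minimum length check throws those away along with the short words.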

I am not very comfortable with the minimum length and the ignored words, but without them the top 10 words were: video, the, official, ft, music, i, t, you, in, on. That is much less interesting than the filtered list, in my opinion.
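
The lone "t" in that unfiltered list most likely comes from contractions: splitting on non-word characters breaks a word like "Don't" at the apostrophe. I have not checked which titles were responsible, so treat this as a guess. A quick sketch with a made-up title:

```groovy
// Hypothetical title containing apostrophes
def title = "Don't Stop Believin'"

// Same split as the main program, with no length or ignore-list filtering;
// findAll { it } only drops empty strings.
def words = title.split(/\W/).collect { it.toLowerCase() }.findAll { it }

println words
```

Here "Don't" is split into "don" and "t", which is how a single letter can end up near the top of the unfiltered counts.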
