Here at isotoma, we have a company IRC channel that is used for general communication, chatter and link sharing.
Everyone joins it at the start of the day, and keeps up to date with what’s going on, and who’s talking about what.
Lolcats are occasionally mentioned.
Now that the logs of the channel run to many megabytes, I was curious about the text statistics it produces: who has what reading age, and how much each person has talked in comparison to everyone else.
While I won’t release the actual statistics I’ve gathered for the channel, I did think it’d be cool to release the script I wrote to do the analysis itself.
It uses the Natural Language Toolkit (NLTK), and the readability contrib module for it. It’s not particularly nice code (inline HTML generation and other nastiness), but it does work. I’ll attempt to release a cleaned-up version when I get some more time to work on it.
Currently, it expects a log in the format produced by ZNC, the IRC bouncer software that I use, although it can easily be adapted to other formats by setting timestamp_count to the correct number to skip the timestamp. It also expects nicks to be surrounded by ‘<’ and ‘>’. I _did_ say it wasn’t particularly nice code.
However, code style issues aside, it is a demonstration and example of using NLTK and the readability module on real-world data, and the output is kind of cool. Especially when you find out that the IRC bot has a higher reading age than you.
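For a feel of what a readability score involves, here is a rough standalone sketch. It deliberately does not use NLTK or the readability module, and it is not the attached script. Instead it computes the Automated Readability Index, a standard formula that approximates a US school grade level from characters per word and words per sentence. The tokenising here is a crude assumption and will be less careful than a proper library.

```python
import re


def automated_readability_index(text):
    """Return the ARI score for text, or None if there is nothing to score."""
    # Crude tokenising (an assumption): words are runs of letters, digits
    # and apostrophes; sentences end at '.', '!' or '?'.
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return None
    # ARI counts letters and digits only, so apostrophes are excluded.
    chars = sum(ch.isalnum() for word in words for ch in word)
    return (4.71 * chars / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


simple = automated_readability_index("The cat sat. The dog ran.")
dense = automated_readability_index(
    "Institutional infrastructure necessitates comprehensive documentation."
)
```

Feeding each nick's collected messages through a function like this gives one score per person. Short, punctuation-free chat lines tend to skew any such formula, so the numbers are best treated as a bit of fun.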
Find the source attached.