Put on your tin-foil hats, fellas…
In my ongoing development of the Akismet plugin, I needed to figure out exactly what data one of their functions was receiving (so I knew what pieces I needed to steal to check for whitelist / blacklist).
The easiest way to do this was to simply spit out the data right before it’s sent to the Akismet server to be processed there. I load up my test blog, put in a cheeky comment, hit the big red button, then wait for snoopy goodness to get dumped to my newly created logging table in the WP database.
The results? Way more than I expected…
Array
(
[comment_post_ID] => 7
[comment_author] => MellerTime
[comment_author_email] => chris@doesnthaveone.com
[comment_author_url] => http://chrismeller.com
[comment_content] => more commenty goodness!!!
[comment_type] =>
[user_ID] => 2
[user_ip] => 127.0.0.1
[user_agent] => Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
[referrer] => http://localhost/noteblog/?p=7
[blog] => http://localhost/noteblog
[HTTP_HOST] => localhost
[HTTP_USER_AGENT] => Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
[HTTP_ACCEPT] => text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
[HTTP_ACCEPT_LANGUAGE] => en-us,en;q=0.5
[HTTP_ACCEPT_ENCODING] => gzip,deflate
[HTTP_ACCEPT_CHARSET] => ISO-8859-1,utf-8;q=0.7,*;q=0.7
[HTTP_KEEP_ALIVE] => 300
[HTTP_CONNECTION] => keep-alive
[HTTP_REFERER] => http://localhost/noteblog/?p=7
[HTTP_COOKIE] => [snipped for brevity]
[CONTENT_TYPE] => application/x-www-form-urlencoded
[CONTENT_LENGTH] => 79
[PATH] => C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\Program Files\Common Files\Adobe\AGL;C:\php;;C:\Program Files\QuickTime\QTSystem\;C:\Program Files\MySQL\MySQL Server 4.1\bin;C:\Program Files\Bitvise Tunnelier
[SystemRoot] => C:\WINDOWS
[COMSPEC] => C:\WINDOWS\system32\cmd.exe
[PATHEXT] => .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH
[WINDIR] => C:\WINDOWS
[SERVER_SIGNATURE] => Apache/2.0.54 (Win32) PHP/5.0.5 Server at localhost Port 80
[SERVER_SOFTWARE] => Apache/2.0.54 (Win32) PHP/5.0.5
[SERVER_NAME] => localhost
[SERVER_ADDR] => 127.0.0.1
[SERVER_PORT] => 80
[REMOTE_ADDR] => 127.0.0.1
[DOCUMENT_ROOT] => C:/htdocs
[SERVER_ADMIN] => chris@doesnthaveone.com
[SCRIPT_FILENAME] => C:/htdocs/noteblog/wp-comments-post.php
[REMOTE_PORT] => 4751
[GATEWAY_INTERFACE] => CGI/1.1
[SERVER_PROTOCOL] => HTTP/1.1
[REQUEST_METHOD] => POST
[QUERY_STRING] =>
[REQUEST_URI] => /noteblog/wp-comments-post.php
[SCRIPT_NAME] => /noteblog/wp-comments-post.php
[PHP_SELF] => /noteblog/wp-comments-post.php
)
Needless to say, I was a bit surprised… Why exactly is every $_SERVER[] variable needed to process my blog’s spam? You just manually grabbed the necessary values (as I see them) a few lines previously:
$comment['user_ip'] = $_SERVER['REMOTE_ADDR'];
$comment['user_agent'] = $_SERVER['HTTP_USER_AGENT'];
$comment['referrer'] = $_SERVER['HTTP_REFERER'];
$comment['blog'] = get_option('home');
So why do you need to know the rest? Even if we ignore any possible privacy concerns here, if nothing else, it looks to me like we’re wasting a LOT of bandwidth… Let’s do some quick math, shall we?
All that crap, when saved to a text file, totals 2,639 bytes (2.58 KB). If we keep only the relevant stuff at the beginning (everything after “blog” is removed), we’re down to 437 bytes.
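For the curious, here is a rough sketch of how that trimming could be done in PHP. The function name trim_comment_payload() and the sample array are made up for illustration and are not part of the Akismet plugin; the list of keys is simply everything up to and including “blog” in the dump above. Measuring with http_build_query() is just one rough way to compare sizes, not necessarily how the plugin encodes its request.

```php
<?php
// Hypothetical helper (not from the Akismet plugin): keep only the fields
// that are set explicitly and drop everything merged in from $_SERVER.
function trim_comment_payload(array $comment) {
    $keep = array(
        'comment_post_ID', 'comment_author', 'comment_author_email',
        'comment_author_url', 'comment_content', 'comment_type',
        'user_ID', 'user_ip', 'user_agent', 'referrer', 'blog',
    );
    // array_flip() turns the list into keys so array_intersect_key() can match on them.
    return array_intersect_key($comment, array_flip($keep));
}

// Small made-up sample, shaped like the dump above.
$comment = array(
    'comment_post_ID' => 7,
    'comment_content' => 'more commenty goodness!!!',
    'user_ip'         => '127.0.0.1',
    'blog'            => 'http://localhost/noteblog',
    'SERVER_SOFTWARE' => 'Apache/2.0.54 (Win32) PHP/5.0.5',
    'DOCUMENT_ROOT'   => 'C:/htdocs',
);

$trimmed = trim_comment_payload($comment);
print_r(array_keys($trimmed));

// Rough size comparison of the two payloads as URL-encoded form data.
echo strlen(http_build_query($comment)), ' -> ',
     strlen(http_build_query($trimmed)), " bytes\n";
```

A whitelist seemed safer to me than a blacklist here: anything new that shows up in $_SERVER gets dropped by default instead of leaking through.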
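After checking the Akismet Homepage, we see from their Zeitgeist that they’ve caught a total of 302,974 SPAMs, which represents 82% of all comments. Every comment gets sent off for checking, spam or not, so what we really want is the total number of comments. If I try and remember some of my high school Algebra classes, that means: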
302974 = .82(x)
x = 369480.4878
We’ll use 369,480 for simplicity. Time for a little more math:
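If you'd rather not trust my algebra, a few lines of PHP will check it. This is only a sketch using the two Zeitgeist figures quoted above; the variable names are my own.

```php
<?php
// Figures quoted from the Akismet Zeitgeist in this post.
$spam_caught = 302974;
$spam_share  = 0.82;   // spam as a fraction of all comments checked

// spam_caught = spam_share * total  =>  total = spam_caught / spam_share
$total_comments = $spam_caught / $spam_share;

printf("%.4f\n", $total_comments);        // 369480.4878
echo (int) floor($total_comments), "\n";  // 369480
```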
369480 x 2639 = 975,057,720
You checking me as we go along? Good… So that’s 975 million bytes of data, give or take some gzip compression here and there, some header information, and a few random character sets.
975057720 / 1024 = 952204.8046875 (kbytes)
952204.8046875 / 1024 = 929.8875 (mbytes)
So that’s 929.8875 megabytes of data hitting their servers. In the grand scheme of things, that’s not much, but let’s look at what it would have been with our smaller set of data:
369480 x 437 = 161,462,760
So now we’ve got 161 million bytes…
161462760 / 1024 = 157678.4765625 (kbytes)
157678.4765625 / 1024 = 153.983 (mbytes)
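The same arithmetic for both payload sizes, as a quick PHP sketch so you can check me. All the numbers come from this post; bytes_to_mb() is a throwaway helper I made up, and none of this accounts for gzip or HTTP headers.

```php
<?php
// Numbers taken from the post above.
$total_comments  = 369480;
$full_payload    = 2639;  // bytes, everything including $_SERVER
$trimmed_payload = 437;   // bytes, fields up to "blog" only

// Throwaway helper: bytes -> megabytes (1 MB = 1024 * 1024 bytes).
function bytes_to_mb($bytes) {
    return $bytes / 1024 / 1024;
}

$full_total    = $total_comments * $full_payload;
$trimmed_total = $total_comments * $trimmed_payload;

printf("full:    %d bytes (%.4f MB)\n", $full_total, bytes_to_mb($full_total));
printf("trimmed: %d bytes (%.4f MB)\n", $trimmed_total, bytes_to_mb($trimmed_total));
printf("saved:   %.1f%%\n", 100 * (1 - $trimmed_payload / $full_payload));
```

The last line works out the saving per request, which comes to a bit over 83% with these figures.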
So we've gone from almost a gig of data down to roughly 150 MB, about a sixth of the size... Seems like a pretty damn sizeable difference to me, how about you?
Hmm, maybe I should offer a neutered Akismet plugin option?