While I was at the beach earlier this week, I started getting hammered with emails saying that one of my crons was generating errors. I eventually got out my laptop, cleared out the hundreds of error emails, and turned off the cron, but I only just now figured out the actual problem.
The script in question parsed the Word of the Day page to see if the word had changed. In the past I’ve run into problems with the WOTD being broken or updated at odd times of the day for no obvious reason, so I had a script set up to check it every minute for debugging purposes. For sure that wasn’t nice of me, but it wasn’t exactly my idea to have to go through all that effort in the first place - there is no API.
After adding some extra error checking to my script, which I should have done originally anyway, I checked the output it was downloading every time and found that it was a JPEG image being served in place of the page. Adding an extra Content-Type header displayed the image… their standard “currently unavailable, back soon” sign.
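The kind of error checking I mean looks something like this. It is only a sketch, not the original script: the function name `looks_like_html` is made up for illustration, and it assumes the server labels the placeholder image with an honest Content-Type header.

```python
from typing import Optional


def looks_like_html(content_type: Optional[str]) -> bool:
    """Return True if a Content-Type header value describes an HTML page.

    Hypothetical helper: the name and the accepted media types are
    assumptions for this sketch, not something the site documents.
    """
    if not content_type:
        # A missing header tells us nothing, so treat it as "not the page".
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in {"text/html", "application/xhtml+xml"}
```

With a check like that in place, the cron can log one clear message when it gets `image/jpeg` back instead of choking on the parse step every minute.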
Since I could clearly view the page in my browser without a problem, my first suspicion was right: they’re doing User-Agent sniffing and killing anything that doesn’t look like a web browser. Pretty much any HTTP library will let you set arbitrary headers on a request, and a simple copy and paste had the script making requests that looked identical to the ones from my Google Chrome beta.
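For the curious, here is roughly what that looks like with Python's standard `urllib.request`. This is a minimal sketch under a few assumptions: the header values are illustrative placeholders rather than the exact ones I pasted, the names `build_request` and `fetch` are hypothetical, and any URL you pass in is your own.

```python
import urllib.request

# Illustrative browser-style headers. The exact strings are an assumption;
# copy the real values from your own browser's developer tools.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Build a GET request that carries browser-like headers."""
    return urllib.request.Request(url, headers=dict(headers))


def fetch(url, timeout=10):
    """Fetch the URL and return (Content-Type header value, body bytes)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.headers.get("Content-Type"), resp.read()
```

Without an explicit User-Agent, `urllib` identifies itself as `Python-urllib/<version>`, which is exactly the sort of string a sniffing check can pick out.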
Now, I really do understand why they do this: they’re trying to limit the number of hits to their page. Why it’s such a big deal is beyond me, and they could easily solve it by providing a simple API endpoint for developers to use, since they were obviously targeting bots scraping the page when they made this change. I didn’t want to have to parse their invalid HTML in the first place; a simple REST endpoint would have been much preferable.
The real moral of this rant is that they’ve made an enemy rather than a friend. If there had been an API that clearly set limits for each IP / user and told you when it was updated (maybe a TTL field like Yahoo! uses on their APIs), they’d have gotten a very loyal fan who told everyone how easy it was to use. Instead they decided to repeatedly make it difficult for me, and that doesn’t make me want to recommend anything they do to anyone I know.