Alexa Thumbnail Service

Amazon offers some pretty cool services: S3, EC2, Alexa Site Thumbnail, and others. A while back I wanted to use AST with Django, so ended up writing the Python bindings to the REST API (they didn’t previously exist. I even wrote up a quick tutorial.

Update: Amazon no longer maintains AST. I’ve decided to archive a few of the old sites, so no longer need to take thumbnails. However, a few other thumbnail services seem to have crept up, including SnapCasa", and WebSnapr.

The Risk in Risk Mitigation

Back in the day the barrier to entry for the Internet was quite high. The technology used required a steep learning curve, the equipment extremely expensive, and sometimes even hard to acquire. Fast forward to 2007 and things have certainly changed. If you know any tech people you can likely get free hosting for a small website, and even more demanding websites can be hosted for not much. The cost of dedicated servers has dropped even more. And the final kicker: web services. I’ve started to think of some web services not as a service, but more like outsourcing requirements.

This very dependency adds risk for a multitude of reasons, and when your entire web application platform revolves around a third party, such as is the case with mashups, you incur great risk.

One of the nice things when requirements are outsourced is the fact that risk is mitigated. I’ll use SmugMug as an example. In summary, they moved their storage to Amason’s S3 network, which is something I will be utilizing as well. Amazon’s S3 (and other web services) continue to drive down the barrier of entry – now you don’t even need to purchase hugely expensive servers for the sole purchase of storage! If you don’t need to purchase them, you also don’t need to manage them. Risk mitigated.

However, continuing the slight allusion from The Other Blog’s article on mashups, I see a slight problem with the outsourcing of requirements. While the following thought isn’t particularly innovative: mitigating risk and outsourcing requirements creates a dependency on the third-party. This very dependency adds risk for a multitude of reasons, and when your entire web application platform revolves around a third party, such as is the case with mashups, you incur great risk.

But, as is evident by the fact that I’ve had stitches nine different times, I’m still going to do some cool mashups anyways, so stay tuned.

Python, AST and SOAP

For one of my projects I need to generate thumbnails for a page. And lots and lots and lots of them. Even though I can generate them via a python script and a very light “gtk browser”, I would prefer to mitigate the server load. To do this I’ve decided to tap into the Alexa Thumbnail Service. They allow two methods: REST and SOAP. After several hours of testing things out, I’ve decided to toss in the towel and settle on REST. If you can spot the error with my SOAP setup, I owe you a beer.
I’m using the ZSI module for python.

1. wsdl2py

I pull in the needed classes by using wsdl2py.

wsdl2py -b http://ast.amazonaws.com/doc/2006-05-15/AlexaSiteThumbnail.wsdl

2. Look at the code generated.

See AlexaSiteThumbnail_types.py and AlexaSiteThumbnail_client.py.

3. Write python code to access AST over SOAP.


#!/usr/bin/env python
import sys
import datetime
import hmac
import sha
import base64
from AlexaSiteThumbnail_client import *

print 'Starting...'

AWS_ACCESS_KEY_ID = 'super-duper-access-key'
AWS_SECRET_ACCESS_KEY = 'super-secret-key'

print 'Generating signature...'

def generate_timestamp(dtime):
    return dtime.strftime("%Y-%m-%dT%H:%M:%SZ")

def generate_signature(operation, timestamp, secret_access_key):
    my_sha_hmac = hmac.new(secret_access_key, operation + timestamp, sha)
    my_b64_hmac_digest = base64.encodestring(my_sha_hmac.digest()).strip()
    return my_b64_hmac_digest

timestamp_datetime = datetime.datetime.utcnow()
timestamp_list = list(timestamp_datetime.timetuple())
timestamp_list[6] = 0
timestamp_tuple = tuple(timestamp_list)
timestamp_str = generate_timestamp(timestamp_datetime)

signature = generate_signature('Thumbnail', timestamp_str, AWS_SECRET_ACCESS_KEY)

print 'Initializing Locator...'

locator = AlexaSiteThumbnailLocator()
port = locator.getAlexaSiteThumbnailPort(tracefile=sys.stdout)

print 'Requesting thumbnails...'

request = ThumbnailRequestMsg()
request.Url = "alexa.com"
request.Signature = signature
request.Timestamp = timestamp_tuple
request.AWSAccessKeyId = AWS_ACCESS_KEY_ID
request.Request = [request.new_Request()]

resp = port.Thumbnail(request)

4. Run, and see error.


ZSI.EvaluateException: Got None for nillable(False), minOccurs(1) element 
(http://ast.amazonaws.com/doc/2006-05-15/,Url), 



 xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" 
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" 
xmlns:ZSI="http://www.zolera.com/schemas/ZSI/" 
xmlns:ns1="http://ast.amazonaws.com/doc/2006-05-15/" 
xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

[Element trace: /SOAP-ENV:Body/ns1:ThumbnailRequest]

55. Conclusion

I’m not entirely certain what I’m doing wrong. I’ve also written another version but actually with NPBinding connecting to the wsdl file. It seems to work much better, as it fully connects, and I get a 200, but it doesn’t return the thumbnail location in the response, and I get a:

TypeError: Response is "text/plain", not "text/xml"

So, while I have things working fine with REST, I would like to get the SOAP calls working. One beer reward.

AWS in Python (REST)

As some of you may know, I have some projects cooked up. I don’t expect to make a million bucks (wish me luck!), but a few extra bills in the pocket wouldn’t hurt. Plus, I’m highly considering further education, which will set me back a few-thirty grand. That said, one of my projects will rely heavily on Amazon Web Services. Amazon has, for quite some time now, opened up their information via REST and SOAP. I’ve been trying (virtually the entire day) to get SOAP to work, but seem to get snagged on a few issues. Stay tuned.
However, in my quest to read every RTFM I stumbled upon a post regarding Python+REST to access Alexa Web Search. After staring at Python code, especially trying to grapple why SOAP isn’t working, updating the outdated REST code was a 5 minute hack. So, if you are interested in using Alexa Web Search with Python via Rest, look below:

websearch.py


#!/usr/bin/python

"""
Test script to run a WebSearch query on AWS via the REST interface.  Written
 originally by Walter Korman ([email protected]), based on urlinfo.pl script from 
  AWIS-provided sample code, updated to the new API by  
Kelvin Nicholson ([email protected]). Assumes Python 2.4 or greater.
"""

import base64
import datetime
import hmac
import sha
import sys
import urllib
import urllib2

AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-super-secret-key'

def get_websearch(searchterm):
    def generate_timestamp(dtime):
        return dtime.strftime("%Y-%m-%dT%H:%M:%SZ")
    
    def generate_signature(operation, timestamp, secret_access_key):
        my_sha_hmac = hmac.new(secret_access_key, operation + timestamp, sha)
        my_b64_hmac_digest = base64.encodestring(my_sha_hmac.digest()).strip()
        return my_b64_hmac_digest
    
    timestamp_datetime = datetime.datetime.utcnow()
    timestamp_list = list(timestamp_datetime.timetuple())
    timestamp_list[6] = 0
    timestamp_tuple = tuple(timestamp_list)
    timestamp = generate_timestamp(timestamp_datetime)
    
    signature = generate_signature('WebSearch', timestamp, AWS_SECRET_ACCESS_KEY)
    
    def generate_rest_url (access_key, secret_key, query):
        """Returns the AWS REST URL to run a web search query on the specified
        query string."""
    
        params = urllib.urlencode(
            { 'AWSAccessKeyId':access_key,
              'Timestamp':timestamp,
              'Signature':signature,
              'Action':'WebSearch',
              'ResponseGroup':'Results',
              'Query':searchterm, })
        return "http://websearch.amazonaws.com/?%s" % (params)
    
    # print "Querying '%s'..." % (query)
    url = generate_rest_url(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, searchterm)
    # print "url => %s" % (url)
    print urllib2.urlopen(url).read()

You run it like this:

>>> from websearch import get_websearch
>>> get_websearch('python')