Google App Engine Meetup Notes
A couple of nights ago I went to the Google App Engine (GAE) Meetup sponsored by the Silicon Valley Cloud Computing group and the SF Bay Area Google App Engine Developers group (Updated: Thanks for catching the omission, Bill). I've been watching GAE since it launched almost a year ago because it's a very cool idea in cloud computing, but not without a few issues that needed to be addressed. Luckily, the event was really impressive, and let's just say I'm seriously into GAE now. It definitely rose to the top of my to-do list.
As the first speaker, Mike Repass, put it, other cloud services like EC2 offer "Scalable Infrastructure as a Service", whereas GAE offers a "Scalable Platform as a Service". In other words, if you code your app to GAE, you have to do zero (I mean it, zero) sysadmin work to set up, maintain and scale your site. Further, because it shards out your data automatically across Google's BigTable "database" that they use for all their products, and the platform handles spinning up app instances for you behind the scenes, if you get your app working for 1 million users, it will work for 11 million users with little to no changes. Incredible.
What's also incredibly exciting about GAE is that they are really following a Blogger-type model with it. They are giving away the service totally for free up to approximately 5 million page views per month. This is an insanely huge amount of traffic to give away for free. I would seriously suggest that anyone who has designs in mind for small one-off web app projects drop everything and at least give GAE a shot.
I'm a really big fan of Amazon's offerings (AWS), in particular S3, EC2 and SQS, and I use them every day for my day job. At first AWS and GAE seem to overlap, but I think they are pretty orthogonal to each other. GAE targets the heart of the bell curve of what web apps out there need, no more, no less. This means that out of the box, you have to considerably change the way you think about developing your application, but it's probably something you should be doing anyhow if you plan to be able to scale your app in the future.
One of the big things you lose in GAE is access to a shell on the machine you are running on. This means you can't do traditional sysadmin things like cron jobs to automate scheduled tasks (send out a newsletter, sync data to an external service, etc). Also, you can't exec outside processes to do some of your work. For instance, at my day job we are building a backend service on EC2, S3 and SQS that autoscales to do massive bulk conversion of video (and soon other) content. What it all boils down to in the end is a very fancy wrapper around FFMPEG, which does the actual conversions. This is out of reach for GAE, so our company will be using both AWS and GAE for the foreseeable future.
I went to the meeting hoping to find answers to a couple of problems I envision with the service, and they have pretty much addressed all of them to my liking. I'll be taking the plunge and doing a test implementation of a project for work with it now. Very exciting.
On a personal note, I had a nice moment of "fanboy dream come true". I've been pretty obsessed with Python for the last few years, since I started working with it professionally at Outspark.com. I work with a whole pile of different languages all the time, but still, after all these years, every time I wish I was working in Python. At this event I had been hoping that I would get a chance to talk directly to some of the GAE engineers about my problems with the product, and wouldn't you know it, right before the first talk starts Guido van Rossum (the creator of Python) sits right next to me. He was really friendly and excited to talk to anyone who had questions about Python or GAE, so I had the rare opportunity to pose my concerns with their product directly to the source. Awesome.
My concerns were:
Vendor lock-in. If you code to GAE, how can you break your app out of their system and run it on a competitor's system if for some reason GAE's terms, price or other features just aren't cutting it for you? The Django wrappers I discuss here pretty much handle that for me. I'm going to go one further and see if I can write an Amazon SimpleDB backend for one of them to really let you port over to a similar service without too much pain. Not sure this is realistic yet, just an idea.
Importing massive amounts of data. Since you pay per transaction, and for data transferred and stored, the thought of trying to import my 10-million-record events database was a bit scary. Of course the GAE guys thought of this and have written a bulk importer which will take a giant CSV file, split it into chunks, transfer it to GAE and import it into your datastore, all automated. I'll post back my experiences with it later. Guido also mentioned that there would be a much improved feature "soon".
Refactoring your data. So let's say you've gotten your 15 million records happily into the Google datastore, but suddenly business requirements change and you need to significantly refactor the way your data is structured. Now, this is a hard situation in any database system, but the thought of only having external API access to do this amount of work is daunting. Looks like GAE will have something called "workflows" in a not too distant release that should allow this sort of bulk work to be done in a sane fashion. Guido suggested a really nice short-term solution, which is to version every record in your datastore, then make any objects which deal with that data aware of how to update a record, and have them update each record before making use of it. This way the datastore will magically heal itself up to the latest revision over time.
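To make the version-every-record idea concrete, here is a minimal sketch of how I picture it. Everything in it is my own assumption, not Guido's code or a GAE API: the `schema_version` field, the two example migrations and the plain dict standing in for the datastore are all made up for illustration.

```python
CURRENT_VERSION = 3  # hypothetical latest schema version


def _v1_to_v2(record):
    # Hypothetical change: v2 splits "name" into first and last name.
    first, _, last = record.pop("name", "").partition(" ")
    record["first_name"] = first
    record["last_name"] = last
    return record


def _v2_to_v3(record):
    # Hypothetical change: v3 adds a "country" field with a default.
    record.setdefault("country", "unknown")
    return record


# Maps a version to the step that upgrades a record to the next version.
MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}


def upgrade(record):
    """Return (record, changed) with the record brought up to CURRENT_VERSION."""
    changed = False
    # Records written before versioning existed are treated as version 1.
    version = record.get("schema_version", 1)
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version += 1
        record["schema_version"] = version
        changed = True
    return record, changed


def load(store, key):
    """Fetch a record, upgrading it and writing it back if it was stale."""
    record, changed = upgrade(dict(store[key]))
    if changed:
        store[key] = record  # the store "heals" one record at a time
    return record
```

Calling `load(store, key)` on an old record returns it in the newest shape and saves the upgraded copy, so records that get read often are migrated first and nothing ever needs a stop-the-world batch job.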
This is pretty symbolic of how you just have to think a little bit differently about attacking problems in something like GAE. A similar approach was demo'd by Gee from Rotzy in order to get around the lack of any facility to do scheduled batch processing with GAE. In short, Rotzy has a task queue, and any requests that are simply serving static content (like the profile pictures on a comment page) first check the task queue to do a little bit of batch work, then serve their content and die. A pretty damned inventive solution.
(Photo of Mike Repass on Rotzy, and yes, that's my green shoulder in the foreground ;)
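The piggyback trick Rotzy demo'd could look roughly like the sketch below. This is my guess at the shape of it, not their code: `PiggybackQueue`, `serve_static` and the `max_tasks` limit are hypothetical names, and the in-memory deque is only a stand-in, since on GAE the queue would presumably have to live in the datastore to be shared between requests.

```python
from collections import deque


class PiggybackQueue:
    """A queue of small jobs that get run as a side effect of cheap requests."""

    def __init__(self):
        self._tasks = deque()

    def add(self, task):
        # A task is any zero-argument callable doing a small slice of batch work.
        self._tasks.append(task)

    def __len__(self):
        return len(self._tasks)

    def run_some(self, max_tasks=1):
        """Run up to max_tasks queued tasks and return how many actually ran."""
        done = 0
        while self._tasks and done < max_tasks:
            task = self._tasks.popleft()
            task()
            done += 1
        return done


def serve_static(queue, content, max_tasks=1):
    """Do a little queued work first, then serve the static content."""
    queue.run_some(max_tasks)
    return content
```

Capping the work per request keeps a cheap image request from turning into a slow one, and the more traffic the site gets, the faster the backlog drains.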
A couple of questions I still have lingering:
One suggested way to share one data source with 3 frontend user apps was to have 1 backend app that does the data hoarding, give it a REST API and have the 3 frontend apps fetch their data over that API. This was suggested by someone giving a talk that night... but isn't that a violation of the TOS? Wouldn't that be how you can "shard" your app across several instances to get over their free quota limits?
Currently there are two overlapping projects that let you develop to Django running on GAE (Google App Engine Helper for Django and App Engine Patch). The biggest distinction is that Helper actually wraps up the GAE datastore APIs in a Django-compatible way, whereas Patch claims that that is impossible, and so doesn't even bother. It would be interesting to see some deeper comparisons of the two.