O'Reilly Strata Making Data Work Conference

February 2, 2011

Data Without Limits

Werner Vogels (Amazon.com)

Alistair Croll (Bitcurrent)

Our next speaker, Werner Vogels, is behind all of the technology at Amazon, and in fact, as you probably know, Amazon offers a huge set of platform tools for anybody who wants to play with data of any size. Always interesting, with lots of things to say about big data. Werner.

Werner Vogels (Amazon.com)

Actually, first an apology, because there are many folks in this room, especially new, young businesses, that are all building on top of the Amazon Web Services, and I'm not going to mention you today. That's pretty hard, because so much stuff happens on top of the Amazon cloud these days. It's extremely exciting. Today I just want to try and give a bit of an overview of how we, especially in the web services business, look at the kind of data processing that is going on, and the kind of innovation that I think we at Amazon need to do to make sure that we can serve those who are interested in really managing, storing, and processing data much better.

Most of my examples actually come out of the e-commerce space, mainly because I think that speaks most to people's imagination, but there's a lot of other stuff happening — whether it is oil or pharma or sky data. Many, many new data sets are being created, also by the government, publicly available for everyone to use and to match up against. There's lots of cool stuff happening.

First, actually — I wish this one would work. First of all, I work for a book shop, so I have to recommend you a book to read. This book is called "The Fourth Paradigm." It's a collection put together by researchers as a tribute to Jim Gray. Jim, a famous researcher in data and databases, had many of the ideas that are the foundation for how we now talk about building very large data systems, how to analyze them, and their impact on society, and he also had a lot of great ideas about how to share data. They are all put down in that book. I recommend you read it. The Kindle version costs you $0.99. If you want a short link, just note this one down and you'll get taken to that particular page.

Another thing I want to do, before I forget it, is make a push for the education program of the Amazon Web Services. If you want to use cloud computing in your classroom, if you want to use it in your data research or in any research that you're doing, or if you are a student — and it doesn't necessarily need to be a university student; you can be a high school student — and you have a great idea for a project for which you need massive storage and massive computation, go to that link and submit your project proposal. The likelihood that you get funded for it is actually pretty high.

I'm an infrastructure guy. For me, big data is this: all of the pieces around how you collect it, how you store it, how you manage it, and how you process it can no longer be done by traditional means. That's really bottom-up; that's an engineering view of what big data, for me at least, means. In reality, for most businesses, this is what big data means: they feel that in their data there is a competitive advantage. If only they could process all the data they have, or that they feel they are sitting on, that could actually give them a competitive advantage — with the idea that bigger is better.

At Amazon, given that we've pioneered some of the recommendation things especially, we know that bigger is better, because the more data you can collect, the finer-grained your analysis can often be, and you can avoid things like this, where your recommendation for a duck is just six other ducks. Actually, there's a whole range of these. Just type "funny Amazon recommendations" into Google. Most of those are actually from our adult goods category, and especially if the number of sales for that particular product isn't that high, you will find very surprising recommendations. That's all driven by the fact that there isn't enough data to make a really solid recommendation.

But there is one caveat on bigger is better. There are a number of categories of data where the quality of the data is actually much more important than the amount of data that you have. We'll get to which particular categories those are later.

Now, given that bigger is better, I'd like to point out something else that sets this new style of data analysis apart from traditional business intelligence, especially if you look at it from an infrastructure point of view. In the old days, you knew beforehand what kind of questions you wanted to ask, and the questions you wanted to ask drove the data model. The data model drove how you were going to store the data, and it also largely drove how you were going to collect that particular data.

Most of what is happening now around data analysis, especially with collecting as much data as possible, is bottom-up. You collect as much data as possible without knowing beforehand which questions you're going to ask. More importantly, you don't know which algorithms you're going to run, and you don't know how you are going to refine those algorithms. With all of that uncertainty — how much data, how much processing power — you don't really know how many resources you need to support the kind of business intelligence or the kind of analysis that you want to do.
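
To make that bottom-up, schema-last idea concrete, here is a minimal sketch in Python. It is purely illustrative — the file name and fields are made up — but it shows the pattern: capture raw events with whatever attributes happen to be available, and defer the questions, the schema, and the algorithms until later.

    import json
    import time
    import uuid

    def record_event(log_path, **fields):
        # Append one raw event as a JSON line; no schema is imposed up front.
        event = {"event_id": str(uuid.uuid4()), "timestamp": time.time()}
        event.update(fields)  # whatever attributes happen to be available
        with open(log_path, "a") as log:
            log.write(json.dumps(event) + "\n")

    # Collect everything now; decide which questions to ask later.
    record_event("events.jsonl", user="u-123", action="view", product="B0000123")
    record_event("events.jsonl", user="u-456", action="search", query="blue dress", results=42)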

So this uncertainty around resource usage drives a very close marriage between big data analysis and cloud computing. Because, mainly, if you're really serious about this new style of data analysis, you should not be worried about data storage, and you should not be worried about the amount of computation you can do. You should be completely free of those constraints. That's what we try to pursue with the Amazon Web Services — one of the pioneering pieces of cloud computing — in helping big data scientists become successful.

So I believe, from my point of view, this is the pipeline for most big data analysis — or maybe these are just challenge areas, or opportunity areas, or the things you would like to be able to do while sitting at home in your underwear: collect, store, organize, analyze, and share. I think those are the overall pieces of the analysis process, and each of them, from an infrastructure point of view, needs to be solved in a different way, or deserves attention as to how it gets done.

So let's start off with collect. If you look at how our customers are actually moving data into the cloud, there is a whole variety, but in general what makes them different is the timeline on which the data arrives. There are some customers who are streaming their data into their systems in real time, and at the same time reading and analyzing it in real time; they're basically continuously streaming and appending data. Others take different timelines: they'll move the log files from their e-commerce site into Amazon on an hourly basis. And then there are others with datasets so large that they only make sense once the complete data set has arrived in our environment.

For example, with the US Census data, does it make sense to deliver half the data? The data needs to be there all or nothing. And of course there is still the speed of light, and there is still network degradation and things like that. This is actually the Ocean Observatories Initiative — UC San Diego and the University of Washington working together — putting sensors into the ocean in many different places around the world, and those sensors will be live-streaming data continuously about the movements of the ocean and things like that. They stream it directly into Amazon S3; there are no intermediary stations anymore. To do that, we set up a collaboration with CENIC as well as the Pacific Northwest GigaPoP, such that the whole California research network, which is a 10-gigabit network, terminates directly in the Amazon storage systems, in the Amazon compute environment. So there's a direct 10-gigabit link from each academic institution — and actually also each K-12 institution — in California into the cloud.
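
What continuous ingestion into S3 can look like in practice — a minimal sketch using the boto 2 Python library, with a made-up bucket name and key layout. S3 objects are immutable, so "appending" a stream really means uploading each batch of readings as a new, timestamped object:

    import json
    import time

    import boto  # boto 2.x; credentials come from the environment or ~/.boto

    def stream_readings_to_s3(readings, bucket_name="ooi-sensor-archive"):
        conn = boto.connect_s3()
        bucket = conn.get_bucket(bucket_name)
        batch = []
        for reading in readings:          # e.g. an iterator over live sensor samples
            batch.append(reading)
            if len(batch) >= 1000:        # flush every 1000 samples
                key = bucket.new_key("raw/%d.json" % int(time.time() * 1000))
                key.set_contents_from_string(json.dumps(batch))
                batch = []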

But of course that's not suitable for everything. Some things can't easily be streamed; some things are just too big to be streamed. For that, you shouldn't underestimate the bandwidth of a FedEx box. AWS Import/Export is a service that we've set up so that you can actually FedEx your data into Amazon. You create disks, you put your data on those disks, and you ship them to us. When you have hundreds of terabytes or petabytes of data, this is the preferred way of getting that data to us. It's fast and it's high bandwidth.
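
The "bandwidth of a FedEx box" is easy to put numbers on. A back-of-the-envelope calculation — all the figures below are illustrative assumptions, not AWS numbers:

    TERABYTE_BITS = 8e12

    def transfer_days(terabytes, link_mbps, utilization=0.8):
        # Days to move a dataset over a network link at the given utilization.
        seconds = terabytes * TERABYTE_BITS / (link_mbps * 1e6 * utilization)
        return seconds / 86400.0

    def shipping_days(terabytes, copy_mb_per_s=100.0, transit_days=1.0):
        # Days to copy the data onto disks and overnight them to the provider.
        copy_seconds = terabytes * 1e6 / copy_mb_per_s  # 1 TB ~ 1e6 MB
        return copy_seconds / 86400.0 + transit_days

    print(transfer_days(100, link_mbps=100))  # 100 TB over a 100 Mbps line: ~116 days
    print(shipping_days(100))                 # the same 100 TB on shipped disks: ~12.6 days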

How do you store that data once you have collected it, and how do you move it into an environment? This is actually the biggest challenge companies have, whether they are doing the big data analysis themselves or have intermediary partners do it for them. This is, in general, when people start to consider using the cloud, because they realize that the bigger the dataset grows, the more they are forced to become data storage experts. You'll find many companies that have moved to the cloud not necessarily for cost reasons, but because storing all this data yourself is a nightmare, especially if it keeps growing.

Razorfish is a company that does a lot of clickstream analysis for their customers; they are processing terabytes and terabytes of data a day. They had a pretty good predictive model of how things would grow, and they thought they were ahead of it in terms of ordering the network storage to put all that data on. Then, in 2009, it turned out that the holiday season for the e-commerce companies — many of whom they serve — was going to be much more successful than anticipated. At that moment, they basically needed to start driving trucks to Fry's to buy more disks just so that they could continue to support their customers. They couldn't sign up more customers because they were maxed out on their storage. Your business should not be in any sense constrained by the amount of hardware you can buy. Those are the old days. These days, your data storage should be unconstrained.
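
The "store" step itself is deliberately boring. A minimal boto 2 sketch of pushing a directory of hourly log files into S3 — the bucket name and key prefix are made up — with no capacity to provision up front, however fast the data grows:

    import os

    import boto  # boto 2.x

    def upload_logs(local_dir, bucket_name="clickstream-archive", prefix="logs/"):
        conn = boto.connect_s3()
        bucket = conn.get_bucket(bucket_name)
        for name in os.listdir(local_dir):
            key = bucket.new_key(prefix + name)
            key.set_contents_from_filename(os.path.join(local_dir, name))

    upload_logs("/var/log/clickstream")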

There are many pieces to the organize phase. Not only how do you manage data and how do you bring datasets together — this is also the phase where quality is really important. One of the tools we actually use at Amazon is Amazon Mechanical Turk. I don't know if you know it. There are a number of tasks where humans are much better equipped than computers. For that whole range of tasks we've set up a system where you can insert work into the system as if it were a computer program. Then, in the middle, there are a few hundred thousand workers who pick up this work and actually do it for you. You can do very great things with it.

One of them is quality control of data. For example, if you have user-generated content, you can put some control on what arrives in your data set and what does not, and you can correct it. Sometimes — quite often, actually, especially when you merge datasets — there is a lot of mess in it; it's really dirty data. Workers are really good at making these judgment calls and at validating data: is this really true? Is this really a dress, or is it a pair of jeans? Those things happen a lot. Especially at Amazon, if you've seen the merging of catalogues, it's a nightmare.

It's also easy to enrich data. If people can look at products, or at raw information, they can start making metadata enrichments of that data.

This is an example from a very large provider of business listings. They have about 20 million listings, and they take in about a million pieces of new information a day. Quite a bit of that is either genuinely new data or data that may be a duplicate. There's a whole process around it: they're using Mechanical Turk — they're using humans — to make that judgment call about whether something is actually a new data item or not.
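
A minimal sketch of what pushing such a judgment call out to Mechanical Turk can look like with the boto 2 library. The external form URL, reward, and wording are made up for illustration; a real listings pipeline would wrap much more process around this:

    from boto.mturk.connection import MTurkConnection
    from boto.mturk.question import ExternalQuestion

    conn = MTurkConnection(host="mechanicalturk.sandbox.amazonaws.com")  # sandbox endpoint

    # The worker sees a small web form (hosted by you) showing the two listings side by side.
    question = ExternalQuestion(
        external_url="https://example.com/turk/compare?new=L-1001&existing=L-0042",
        frame_height=600,
    )

    conn.create_hit(
        question=question,
        title="Is this business listing a duplicate?",
        description="Compare a new listing against an existing one and decide "
                    "whether they describe the same business.",
        keywords=["data", "deduplication", "listings"],
        reward=0.05,        # dollars per assignment
        max_assignments=3,  # ask three workers and take the majority vote
        duration=10 * 60,   # ten minutes to complete the task
    )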

On analysis: a lot of you in this room are actually in this business, so I'll skip this slide as quickly as possible, because it's probably an insult to the many other companies that are also doing analysis — you know who you are. For the companies that are in the analysis phase, we provide easy tools at Amazon, especially if they're working with Hadoop. The reality is that it's still pretty hard to run Hadoop jobs yourself. With Elastic MapReduce, our goal was to make it easy for everybody to start up hundreds of nodes and process hundreds of terabytes with little effort, with tight integration with the other services, and to make sure that you always have the latest version of Hadoop, fully tested for the cloud, available on this platform.
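
What "start up hundreds of nodes with little effort" looks like from code — a minimal sketch of launching a Hadoop streaming job on Elastic MapReduce with boto 2. The bucket names, scripts, and instance count are illustrative assumptions:

    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    conn = EmrConnection()

    step = StreamingStep(
        name="Clickstream sessionization",
        mapper="s3n://my-analysis-code/mapper.py",
        reducer="s3n://my-analysis-code/reducer.py",
        input="s3n://clickstream-archive/logs/",
        output="s3n://clickstream-archive/sessions/",
    )

    jobflow_id = conn.run_jobflow(
        name="Daily clickstream analysis",
        log_uri="s3n://clickstream-archive/emr-logs/",
        steps=[step],
        num_instances=10,                 # grow or shrink this as the data demands
        master_instance_type="m1.large",
        slave_instance_type="m1.large",
    )
    print(jobflow_id)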

Again, Razorfish is a good example. They are big users of Elastic MapReduce. They do a lot of contextual ads based on continuous analysis of the data their customers send to them, and they're really successful with that.

Netflix has been mentioned before. I want you to take a look at Adrian Cockcroft's presentation, "Crunch Your Data in the Cloud" — really a hands-on presentation on how you can use all these different tools to go through massive data sets about customers as well as other things. Most of our customers don't do it for just one thing; they'll have 20 or 30 different analyses going on.

Etsy and Yelp are probably among your favorite sites. Both of them are big users of Elastic MapReduce. Yelp, I think, runs about 200 MapReduce jobs a day, processing about three or four terabytes of data. But it's not only the Hadoop world, of course. Big companies like SAP also run their analytics this way. There is a great website, the Carbon Impact website from SAP, where you can feed in your ERP data, and they then do the analysis and evaluate for you what the carbon impact of all your processes is. It also runs on the cloud, of course, because all of this needs to be able to grow and shrink as much as you like.

We're almost finished here. Sharing is still a very important part, I think, of all these processes. In the past, most of what we shared was the datasets themselves: you would do a massive analysis, that would result in a dataset, and often those datasets would then be visualized. But these days the resulting datasets are almost as large as the original datasets, so you see a whole new area of visualization and new ways of sharing emerging. A good example is NASDAQ. Just because they wanted to build a market replay app, they had started to store all the data they were collecting from the market in Amazon S3. And now they've taken the next step, understanding that, yes, many people like this app, but there could be so much more usage of this data. So they're putting an API on top of the data and making it available for everyone to use. If you want to build applications on top of market replay data, go to data.nasdaq.com.
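
Consuming a shared dataset over an API then becomes a few lines of code. This is only a hypothetical sketch — the talk doesn't describe the actual data.nasdaq.com interface, so the path, parameters, and response shape below are invented for illustration:

    import requests  # third-party HTTP client

    # Hypothetical endpoint and parameters, purely for illustration.
    resp = requests.get(
        "https://data.nasdaq.com/api/replay",
        params={"symbol": "AMZN", "date": "2011-02-02"},
    )
    resp.raise_for_status()

    for tick in resp.json()[:10]:  # assumed shape: a list of tick records
        print(tick)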

More public datasets are available as well — go and have a look at them. But most importantly, for all of this infrastructure stuff, this is still day one, and we really depend on you to feed back to us what the kinds of things are that the cloud doesn't do well for you at this moment, and what we should be doing differently, so that we can serve you better and we can put data in the hands of everybody. Thank you.