hn-classics/_stories/2010/13762510.md


created_at: 2017-03-01T09:36:18.000Z
title: Lessons We've Learned Using AWS (2010)
url: http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
author: denzil_correa
points: 80
num_comments: 16
created_at_i: 1488360978
_tags: story, author_denzil_correa, story_13762510
objectID: 13762510
year: 2010

Source: 5 Lessons We've Learned Using AWS (Netflix TechBlog, Medium)


Dec 16, 2010


5 Lessons We've Learned Using AWS

In my last post I talked about some of the reasons we chose AWS as our computing platform. We're about one year into our transition to AWS from our own data centers. We've learned a lot so far, and I thought it might be helpful to share with you some of the mistakes we've made and some of the lessons we've learned.

1. Dorothy, you're not in Kansas anymore.

If you're used to designing and deploying applications in your own data centers, you need to be prepared to unlearn a lot of what you know. Seek to understand and embrace the differences of operating in a cloud environment.

Many examples come to mind, such as hardware reliability. In our own data centers, session-based memory management was a fine approach, because any single hardware instance failure was rare. Managing state in volatile memory was reasonable, because it was rare that we would have to migrate from one instance to another. I knew to expect higher rates of individual instance failure in AWS, but I hadn't thought through some of these sorts of implications.
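The shift described above can be sketched in a few lines. In this minimal, hypothetical example, session state is written to a shared external store (a plain dict stands in for a distributed cache such as memcached) rather than held in one instance's memory, so any instance can serve the next request after a failure:

```python
import json

class CacheClient:
    """Stand-in for a distributed cache; in production this would be a
    networked store (e.g. memcached) reachable from every instance."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = json.dumps(value)

    def get(self, key):
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

cache = CacheClient()

def handle_request(session_id, update):
    # Load session state from the shared store, not from process memory,
    # merge in this request's changes, and write it back.
    state = cache.get(f"session:{session_id}") or {}
    state.update(update)
    cache.set(f"session:{session_id}", state)
    return state
```

Because no request handler keeps state in local memory, an instance can die between two requests without the session being lost.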

Another example: in the Netflix data centers, we have a high capacity, super fast, highly reliable network. This has afforded us the luxury of designing around chatty APIs to remote systems. AWS networking has more variable latency. We've had to be much more structured about "over the wire" interactions, even as we've transitioned to a more highly distributed architecture.
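One common way to make "over the wire" interactions less chatty, sketched here with hypothetical names and a counter standing in for the network: replace a per-item remote call with a single bulk call, so the number of round trips exposed to latency variance drops from N to one.

```python
ROUND_TRIPS = {"count": 0}

def remote_call(title_ids):
    """Placeholder for one network round trip to a remote metadata service."""
    ROUND_TRIPS["count"] += 1
    return [{"id": t, "name": f"title-{t}"} for t in title_ids]

def fetch_titles_chatty(title_ids):
    # One call per title: N round trips, each one exposed to latency variance.
    results = []
    for t in title_ids:
        results.extend(remote_call([t]))
    return results

def fetch_titles_batched(title_ids):
    # One call for the whole batch: a single round trip.
    return remote_call(title_ids)
```

On a fast, reliable data-center network the chatty version is tolerable; with variable latency, the batched version pays the round-trip cost only once.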

2. Co-tenancy is hard.

When designing customer-facing software for a cloud environment, it is all about managing down expected overall latency of response. AWS is built around a model of sharing resources: hardware, network, storage, etc. Co-tenancy can introduce variance in throughput at any level of the stack. You've got to either be willing to abandon any specific subtask, or manage your resources within AWS to avoid co-tenancy where you must.

Your best bet is to build your systems to expect and accommodate failure at any level, which introduces the next lesson.
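"Being willing to abandon any specific subtask" can be expressed as a hard deadline on each remote piece of work. This is a generic sketch, not Netflix's actual code: if a subtask overruns its budget (here simulating a co-tenancy latency spike), we stop waiting and serve a degraded result instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

pool = ThreadPoolExecutor(max_workers=4)

def slow_subtask():
    time.sleep(0.5)  # simulate a co-tenant-induced latency spike
    return "personalized row"

def with_deadline(fn, timeout_s, fallback):
    """Run fn with a hard deadline; abandon it and degrade if it overruns."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # stop waiting; overall response latency is protected

result = with_deadline(slow_subtask, timeout_s=0.05, fallback="default row")
```

The key property is that the deadline bounds what the customer sees, regardless of what co-tenancy does to any single dependency.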

3. The best way to avoid failure is to fail constantly.

We've sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We're designing each distributed system to expect and tolerate failure from other systems on which it depends.

If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We'll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
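The recommendations example above is the classic shape of graceful degradation. A minimal sketch, with hypothetical function names and placeholder titles: catch the dependency failure and answer with popular titles rather than failing the whole response.

```python
def personalized_recommendations(user_id):
    # Placeholder for the recommendations service; assume it is down.
    raise ConnectionError("recommendations service unavailable")

POPULAR_TITLES = ["The Social Network", "Inception", "Toy Story 3"]

def recommendations_for(user_id):
    """Degrade rather than fail: if personalization is down, still respond,
    just with popular titles instead of personalized picks."""
    try:
        return personalized_recommendations(user_id)
    except ConnectionError:
        return POPULAR_TITLES
```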

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey's job is to randomly kill instances and services within our architecture. If we aren't constantly testing our ability to succeed despite failure, then it isn't likely to work when it matters most: in the event of an unexpected outage.
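The idea behind the Chaos Monkey can be sketched in a few lines. This toy version operates on an in-memory set of instance names; the real tool would discover and terminate actual instances through the AWS APIs.

```python
import random

# Hypothetical inventory of running instances; a stand-in for what the
# real Chaos Monkey would discover via the AWS APIs.
fleet = {"api-1", "api-2", "recs-1", "search-1"}

def chaos_monkey_step(instances, rng=random):
    """Randomly kill one instance, so failure handling is exercised
    constantly rather than only during a real outage."""
    victim = rng.choice(sorted(instances))
    instances.discard(victim)  # simulate terminating the instance
    return victim
```

Running this continuously in production forces every service to prove, every day, that it tolerates the loss of any single instance.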

4. Learn with real scale, not toy models.

Before we committed ourselves to AWS, we spent time researching the platform and building test systems within it. We tried hard to simulate realistic traffic patterns against these research projects.

This was critical in helping us select AWS, but not as helpful as we expected in thinking through our architecture. Early in our production build out, we built a simple repeater and started copying full customer request traffic to our AWS systems. That is what really taught us where our bottlenecks were, and some design choices that had seemed wise on the whiteboard turned out foolish at big scale.
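The repeater pattern described above is often called traffic shadowing. A minimal sketch, with invented backend names: serve the customer from production, and copy each request to the shadow stack while ignoring its response, so shadow failures can never hurt customers.

```python
def production_backend(request):
    # The system of record; its response is what the customer sees.
    return {"status": 200, "body": f"served {request}"}

shadow_log = []

def shadow_backend(request):
    # Stand-in for the AWS deployment receiving copied traffic.
    shadow_log.append(request)
    return {"status": 200}

def repeater(request):
    """Serve from production; fire-and-forget a copy to the shadow stack."""
    try:
        shadow_backend(request)  # response and errors are deliberately ignored
    except Exception:
        pass  # a shadow failure must never affect the customer's request
    return production_backend(request)
```

Because the shadow stack sees full, real traffic, it reveals bottlenecks that no synthetic load test would.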

We continue to research new technologies within AWS, but today we're doing it at full scale with real data. If we're thinking about new NoSQL options, for example, we'll pick a real data store and port it full scale to the options we want to learn about.

5. Commit yourself.

When I look back at what the team has accomplished this year in our AWS migration, I'm truly amazed. But it didn't always feel this good. AWS is only a few years old, and building at a high scale within it is a pioneering enterprise today. There were some dark days as we struggled with the sheer size of the task we'd taken on, and some of the differences between how AWS operates vs. our own data centers.

As you run into the hurdles, have the grit and the conviction to fight through them. Our CEO, Reed Hastings, has not only been fully on board with this migration, he is the person who motivated it! His commitment, and the commitment of the technology leaders across the company, helped us push through to success when we could have chosen to retreat instead.

AWS is a tremendous suite of services, getting better all the time, and some big technology companies are running successfully there today. You can too! We hope some of our mistakes and the lessons we've learned can help you do it well.

— John Ciancutti.

See Also:

Four Reasons We Choose Amazon's Cloud as Our Computing Platform
_We think cloud computing is the future._ (medium.com)


Originally published at _techblog.netflix.com_ on December 16, 2010.
