hn-classics/_stories/2010/13762510.md


created_at: 2017-03-01T09:36:18.000Z
title: Lessons We've Learned Using AWS (2010)
url: http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
author: denzil_correa
points: 80
num_comments: 16
created_at_i: 1488360978
_tags: story, author_denzil_correa, story_13762510
objectID: 13762510
year: 2010

Source: 5 Lessons We've Learned Using AWS (Netflix TechBlog, Medium)


Dec 16, 2010


5 Lessons We've Learned Using AWS

In my last post I talked about some of the reasons we chose AWS as our computing platform. We're about one year into our transition to AWS from our own data centers. We've learned a lot so far, and I thought it might be helpful to share with you some of the mistakes we've made and some of the lessons we've learned.

1. Dorothy, you're not in Kansas anymore.

If you're used to designing and deploying applications in your own data centers, you need to be prepared to unlearn a lot of what you know. Seek to understand and embrace the differences of operating in a cloud environment.

Many examples come to mind, such as hardware reliability. In our own data centers, session-based memory management was a fine approach, because any single hardware instance failure was rare. Managing state in volatile memory was reasonable, because it was rare that we would have to migrate from one instance to another. I knew to expect higher rates of individual instance failure in AWS, but I hadn't thought through some of these sorts of implications.
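The shift described above can be sketched in a few lines. In this minimal, hypothetical example, session state is written to a shared external store (a plain dict stands in for a distributed cache such as memcached) rather than held in one instance's memory, so any instance can serve the next request after a failure:

```python
import json

class CacheClient:
    """Stand-in for a distributed cache; in production this would be a
    networked store (e.g. memcached) reachable from every instance."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = json.dumps(value)

    def get(self, key):
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

cache = CacheClient()

def handle_request(session_id, update):
    # Load session state from the shared store, not from process memory,
    # merge in this request's changes, and write it back.
    state = cache.get(f"session:{session_id}") or {}
    state.update(update)
    cache.set(f"session:{session_id}", state)
    return state
```

Because no request handler keeps state in local memory, an instance can die between two requests without the session being lost.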

Another example: in the Netflix data centers, we have a high capacity, super fast, highly reliable network. This has afforded us the luxury of designing around chatty APIs to remote systems. AWS networking has more variable latency. We've had to be much more structured about "over the wire" interactions, even as we've transitioned to a more highly distributed architecture.
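One common way to make "over the wire" interactions less chatty, sketched here with hypothetical names and a counter standing in for the network: replace a per-item remote call with a single bulk call, so the number of round trips exposed to latency variance drops from N to one.

```python
ROUND_TRIPS = {"count": 0}

def remote_call(title_ids):
    """Placeholder for one network round trip to a remote metadata service."""
    ROUND_TRIPS["count"] += 1
    return [{"id": t, "name": f"title-{t}"} for t in title_ids]

def fetch_titles_chatty(title_ids):
    # One call per title: N round trips, each one exposed to latency variance.
    results = []
    for t in title_ids:
        results.extend(remote_call([t]))
    return results

def fetch_titles_batched(title_ids):
    # One call for the whole batch: a single round trip.
    return remote_call(title_ids)
```

On a fast, reliable data-center network the chatty version is tolerable; with variable latency, the batched version pays the round-trip cost only once.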

2. Co-tenancy is hard.

When designing customer-facing software for a cloud environment, it is all about managing down expected overall latency of response. AWS is built around a model of sharing resources: hardware, network, storage, etc. Co-tenancy can introduce variance in throughput at any level of the stack. You've got to either be willing to abandon any specific subtask, or manage your resources within AWS to avoid co-tenancy where you must.

Your best bet is to build your systems to expect and accommodate failure at any level, which introduces the next lesson.
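"Being willing to abandon any specific subtask" can be expressed as a hard deadline on each remote piece of work. This is a generic sketch, not Netflix's actual code: if a subtask overruns its budget (here simulating a co-tenancy latency spike), we stop waiting and serve a degraded result instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

pool = ThreadPoolExecutor(max_workers=4)

def slow_subtask():
    time.sleep(0.5)  # simulate a co-tenant-induced latency spike
    return "personalized row"

def with_deadline(fn, timeout_s, fallback):
    """Run fn with a hard deadline; abandon it and degrade if it overruns."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # stop waiting; overall response latency is protected

result = with_deadline(slow_subtask, timeout_s=0.05, fallback="default row")
```

The key property is that the deadline bounds what the customer sees, regardless of what co-tenancy does to any single dependency.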

3. The best way to avoid failure is to fail constantly.

We've sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We're designing each distributed system to expect and tolerate failure from other systems on which it depends.

If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We'll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
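The recommendations example above is the classic shape of graceful degradation. A minimal sketch, with hypothetical function names and placeholder titles: catch the dependency failure and answer with popular titles rather than failing the whole response.

```python
def personalized_recommendations(user_id):
    # Placeholder for the recommendations service; assume it is down.
    raise ConnectionError("recommendations service unavailable")

POPULAR_TITLES = ["The Social Network", "Inception", "Toy Story 3"]

def recommendations_for(user_id):
    """Degrade rather than fail: if personalization is down, still respond,
    just with popular titles instead of personalized picks."""
    try:
        return personalized_recommendations(user_id)
    except ConnectionError:
        return POPULAR_TITLES
```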

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey's job is to randomly kill instances and services within our architecture. If we aren't constantly testing our ability to succeed despite failure, then it isn't likely to work when it matters most: in the event of an unexpected outage.
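The idea behind the Chaos Monkey can be sketched in a few lines. This toy version operates on an in-memory set of instance names; the real tool would discover and terminate actual instances through the AWS APIs.

```python
import random

# Hypothetical inventory of running instances; a stand-in for what the
# real Chaos Monkey would discover via the AWS APIs.
fleet = {"api-1", "api-2", "recs-1", "search-1"}

def chaos_monkey_step(instances, rng=random):
    """Randomly kill one instance, so failure handling is exercised
    constantly rather than only during a real outage."""
    victim = rng.choice(sorted(instances))
    instances.discard(victim)  # simulate terminating the instance
    return victim
```

Running this continuously in production forces every service to prove, every day, that it tolerates the loss of any single instance.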

4. Learn with real scale, not toy models.

Before we committed ourselves to AWS, we spent time researching the platform and building test systems within it. We tried hard to simulate realistic traffic patterns against these research projects.

This was critical in helping us select AWS, but not as helpful as we expected in thinking through our architecture. Early in our production build out, we built a simple repeater and started copying full customer request traffic to our AWS systems. That is what really taught us where our bottlenecks were, and some design choices that had seemed wise on the whiteboard turned out foolish at big scale.
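The repeater pattern described above is often called traffic shadowing. A minimal sketch, with invented backend names: serve the customer from production, and copy each request to the shadow stack while ignoring its response, so shadow failures can never hurt customers.

```python
def production_backend(request):
    # The system of record; its response is what the customer sees.
    return {"status": 200, "body": f"served {request}"}

shadow_log = []

def shadow_backend(request):
    # Stand-in for the AWS deployment receiving copied traffic.
    shadow_log.append(request)
    return {"status": 200}

def repeater(request):
    """Serve from production; fire-and-forget a copy to the shadow stack."""
    try:
        shadow_backend(request)  # response and errors are deliberately ignored
    except Exception:
        pass  # a shadow failure must never affect the customer's request
    return production_backend(request)
```

Because the shadow stack sees full, real traffic, it reveals bottlenecks that no synthetic load test would.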

We continue to research new technologies within AWS, but today we're doing it at full scale with real data. If we're thinking about new NoSQL options, for example, we'll pick a real data store and port it full scale to the options we want to learn about.

5. Commit yourself.

When I look back at what the team has accomplished this year in our AWS migration, I'm truly amazed. But it didn't always feel this good. AWS is only a few years old, and building at a high scale within it is a pioneering enterprise today. There were some dark days as we struggled with the sheer size of the task we'd taken on, and some of the differences between how AWS operates vs. our own data centers.

As you run into the hurdles, have the grit and the conviction to fight through them. Our CEO, Reed Hastings, has not only been fully on board with this migration, he is the person who motivated it! His commitment, and the commitment of the technology leaders across the company, helped us push through to success when we could have chosen to retreat instead.

AWS is a tremendous suite of services, getting better all the time, and some big technology companies are running successfully there today. You can too! We hope some of our mistakes and the lessons we've learned can help you do it well.

— John Ciancutti.

See Also:

Four Reasons We Choose Amazon's Cloud as Our Computing Platform
_We think cloud computing is the future._ (medium.com)


Originally published at _techblog.netflix.com_ on December 16, 2010.
