hn-classics/_stories/2008/7156179.md

443 lines
21 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
created_at: '2014-01-31T11:13:18.000Z'
title: The Google technology stack (2008)
url: http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/
author: erbdex
points: 61
story_text: ''
comment_text:
num_comments: 14
story_id:
story_title:
story_url:
parent_id:
created_at_i: 1391166798
_tags:
- story
- author_erbdex
- story_7156179
objectID: '7156179'
year: 2008
---
[Source](http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/ "Permalink to The Google Technology Stack | Michael Nielsen")
# The Google Technology Stack | Michael Nielsen
[Skip to content][1]
* ![RSS][2]
[Michael Nielsen][3]
* [About][4]
* [Michaels main blog][5]
* [Data-driven intelligence blog][6]
* [Writing][7]
# The Google Technology Stack
### Posts
#### PageRank and MapReduce
* [Introduction to PageRank][8] ([video][9])
* [Building our PageRank intuition][10] ([video][11])
* [The PageRank distribution for the web][12] (no video, short supplement extending the last post)
* [Using your laptop to compute PageRank for millions of webpages][13]
* [Write your first MapReduce program in 20 minutes][14]
* [Using MapReduce to compute PageRank][15]
* [Consistent Hashing][16]
* [A Number-Theoretic Approach to Consistent Hashing][17]
* **To be added:** Well put MapReduce on a cluster, use it to compute PageRank, and much more.
#### Data mining and machine learning
* [Introduction to statistical machine translation][18]. Good background is [Peter Norvigs introduction to spelling correction][19].
* [Implementing statistical machine translation using MapReduce][20]
* **Much more to be added**
### About
Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data. According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: “Google uses Bayesian filtering the way Microsoft uses the if statement” ([Credit: Joel Spolsky][21]). This series of posts describes some of the technologies that make this high level of abstraction possible. The technologies Ill describe include:
* **The Google File System:** a simple way of accessing enormous amounts of data spread across a large cluster of machines. Acts as a “virtual file system” that makes the cluster look to developers more like a single machine, and eliminates the need to think about details like what happens when a machine fails.
* **Bigtable:** Building on the Google File System, Bigtable provides a simple database model which can be run across large clusters. This allows developers to ignore many of the underlying details of the cluster, and concentrate on getting things done.
* **MapReduce:** A powerful programming model for processing and generating very large data sets on large clusters, MapReduce makes it easy to automatically parallelize many programming tasks. It is used internally by Google developers to process many petabytes of data every day, enabling developers to code and run simple but large parallel jobs in minutes, instead of days.
Together, these technologies make it easy to run large parallel jobs on very big data sets. Running on top of this technology stack are many powerful data mining and machine learning algorithms. Ill describe at least two of these:
* Web Crawling
* PageRank
and may describe more, including statistical approaches to spell checking and machine translation, and recommendation algorithms.
In addition to understanding how these technologies work, well also:
* Develop toy implementations of some of the technologies.
* Investigate some related open source technologies, such as Hadoop, CouchDB, Nutch, Lucene, and others.
Ive never worked at Google. The posts are based on the published literature, especially some key technical publications released by Google describing the Google File System, Bigtable, MapReduce, and PageRank. Im sure there are some significant differences between the material I will describe, and the current internal state-of-the-art. Still, we should be able to build up a pretty good picture of a data-intensive web company, even if its not accurate about Googles particulars.
**Exercises and problems:** The posts will contain exercises for you to do to master the material, and also some more challenging and often open-ended problems. Note that while Ive worked on many of the problems, I by no means have full solutions.
**FriendFeed room:** There is a [FriendFeed room][22] for the series.
**Accompanying lectures:** Im basing a lecture series in Waterloo, Ontario, on the posts. Heres some details about the lectures.
**Audience:** The audience will include both theoretical physicists and developers; the lectures should be accessible to both sets of people.
**Time:** Lectures will be held every Tuesday, 7:00-8:30pm. The exception is the first lecture, which will be Thursday, December 4, 2008, 7:00-8:30pm.
**Location:** [AideRSS][23] have kindly offered us a room for the course, at 505-180 King Street South, Waterloo its opposite the Brick Brewery, and near the Hospital. _To get into the building, you need someone from AideRSS to let you in. If no one from AideRSS is by the entrance, call (519) 498 1476._
## Background reading
The lectures are based primarily on the following sources:
1. **Original PageRank paper:** “[The PageRank Citation Ranking: Bringing Order to the Web”][24], by Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd (1998).
2. **Textbook on PageRank and related ideas:** “[Googles PageRank and Beyond: The Science of Search Engine Rankings”][25], by Amy N. Langville and Carl~D.~Meyer (2006).
3. **Overview of Googles early architecture:** “[The Anatomy of a Large-Scale Hypertextual Web Search Engine”][26], by Sergey Brin and Lawrence Page (1998).
4. **More recent overview of Googles architecture:** “[Web Search for a Planet: The Google Cluster Architecture”][27], by Luiz Barroso, Jeffrey Dean, and Urs Hoelze (2003).
5. **Original Google File System paper:** “[The Google File System”][28], by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (2003).
6. **Original Bigtable paper:** “[Bigtable: A distributed storage system for structured data”][29], by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber (2006).
7. **Original MapReduce paper:** “[MapReduce: Simplified Data Processing on Large Clusters”][30], by Jeffrey Dean and Sanjay Ghemawat (2004).
Related resources include:
1. [First of five lectures on MapReduce][31].
2. Lecture course on [scalable systems][32] at the University of Washington, covering many similar topics. Videos of the lectures are included.
Many more resources will be posted in the [FriendFeed room][22] for the course.
14 Comments [Leave one →][33]
### Trackbacks and Pingbacks
1. [Michael Nielsen » Lectures on the Google Technology Stack 1: Introduction to PageRank][8]
2. [Michael Nielsen » Lectures on the Google Technology Stack 2: Building our PageRank Intuition][10]
3. [Michael Nielsen » Write your first MapReduce program in 20 minutes][14]
4. [Michael Nielsen » Using MapReduce to compute PageRank][15]
5. [map and reduce @ rvg][34]
6. [Michael Nielsen » Introduction to Statistical Machine Translation][18]
7. [Michael Nielsen » Implementing Statistical Machine Translation Using MapReduce][20]
8. [Michael Nielsen » Consistent hashing][16]
9. [Michael Nielsen » A number-theoretic approach to consistent hashing][35]
10. [Os segredos do Google « Charles Cadé Blog / Comunicação, tecnologia e cultura digital][36]
11. [Link list for october 2009 « Mixotricha][37]
12. [links for 2009-11-02 « Donghai Ma][38]
13. [Data-driven intelligence | Michael Nielsen][39]
14. [Tab sweep January 2014 | Colin Charles Agenda][40]
### Leave a Reply [Cancel reply][41]
Comment
**Note:** HTML is allowed. Your email address will not be published.
[Subscribe to this comment feed via RSS][42]
Name (required) Email (required, will not be published) Website
* ## Follow Michael
* ![][2][ Subscribe to this blog][43]
* ![][44][Subscribe to this blog by email][45]
* [Michael's blog on data-driven intelligence][46]
* ![][47][ Follow Michael on Twitter][48]
* ## Other places on the web
* [GitHub][49]
* [Delicious][50]
* [Delicious research links for book project][51]
* [Delicious research links for last book][52]
* [Google Plus][53]
* ## Books
![][54]![][55]
[Reinventing Discovery: The New Era of Networked Science][56]![][55] [(Errata)][57]
![][58]![][59]
[Quantum Computation and Quantum Information][60]![][59]
* ## Other projects
* [Essays][61]
* [TEDxWaterloo video about open science][62]
* [Quantum computing for the determined][63]
* [Polymath wiki][64]
* ## Recent Posts
* [Is there a tension between creativity and accuracy?][65]
* [Where will the key ideas shaping the future of scientific publishing come from?][66]
* [How the backpropagation algorithm works][67]
* [Reinventing Explanation][68]
* [Neural Networks and Deep Learning: first chapter goes live][69]
* ## Recent Comments
* [Siva][70] on [Is there a tension between creativity and accuracy?][71]
* [Reading into the landscape drossbucket][72] on [Is there a tension between creativity and accuracy?][73]
* kBite on [Is there a tension between creativity and accuracy?][74]
* Siname on [Is there a tension between creativity and accuracy?][75]
* Florian Seidl-Schulz on [Is there a tension between creativity and accuracy?][76]
* ## ![RSS][77] [Linklog][78]
* ## ![RSS][77] [Documentaries][78]
* ## Search
* ## Archives
* [April 2017][79]
* [May 2015][80]
* [April 2014][81]
* [January 2014][82]
* [November 2013][83]
* [February 2013][84]
* [February 2012][85]
* [January 2012][86]
* [October 2011][87]
* [September 2011][88]
* [August 2011][89]
* [June 2011][90]
* [May 2011][91]
* [April 2011][92]
* [March 2011][93]
* [February 2011][94]
* [January 2011][95]
* [December 2010][96]
* [November 2010][97]
* [August 2010][98]
* [May 2010][99]
* [April 2010][100]
* [February 2010][101]
* [January 2010][102]
* [December 2009][103]
* [November 2009][104]
* [October 2009][105]
* [September 2009][106]
* [August 2009][107]
* [July 2009][108]
* [June 2009][109]
* [May 2009][110]
* [April 2009][111]
* [March 2009][112]
* [February 2009][113]
* [January 2009][114]
* [December 2008][115]
* [November 2008][116]
* [October 2008][117]
* [September 2008][118]
* [August 2008][119]
* [July 2008][120]
* [June 2008][121]
* [May 2008][122]
* [April 2008][123]
* [March 2008][124]
* [February 2008][125]
* [January 2008][126]
* [November 2007][127]
* [October 2007][128]
* [September 2007][129]
* [August 2007][130]
* [July 2007][131]
* [June 2007][132]
* [May 2007][133]
* [April 2007][134]
* [August 2006][135]
* [July 2006][136]
* [June 2006][137]
* [April 2006][138]
* [March 2006][139]
* [October 2005][140]
* [August 2005][141]
* [July 2005][142]
* [June 2005][143]
* [May 2005][144]
* [April 2005][145]
* [March 2005][146]
* [February 2005][147]
* [January 2005][148]
* [December 2004][149]
* [November 2004][150]
* [October 2004][151]
* [September 2004][152]
* [August 2004][153]
* [July 2004][154]
* [June 2004][155]
* [May 2004][156]
* [April 2004][157]
* [March 2004][158]
* [February 2004][159]
* [January 2004][160]
* [November 2003][161]
* [October 2003][162]
* [September 2003][163]
* [August 2003][164]
* * * Copyright © 2018 . [Titan Theme][165] by [The Theme Foundry][166]
[1]: http://michaelnielsen.org#content
[2]: http://michaelnielsen.org/blog/wp-content/themes/titan_pro/images/flw-rss.png
[3]: http://michaelnielsen.org/blog/
[4]: http://michaelnielsen.org/blog/michael-a-nielsen/
[5]: http://michaelnielsen.org/blog
[6]: http://michaelnielsen.org/ddi/
[7]: http://michaelnielsen.org/blog/writing/
[8]: http://michaelnielsen.org/blog/?p=506
[9]: http://michaelnielsen.org/blog/?p=509
[10]: http://michaelnielsen.org/blog/?p=511
[11]: http://michaelnielsen.org/blog/?p=520
[12]: http://michaelnielsen.org/blog/?p=516
[13]: http://michaelnielsen.org/blog/?p=523
[14]: http://michaelnielsen.org/blog/?p=529
[15]: http://michaelnielsen.org/blog/?p=534
[16]: http://michaelnielsen.org/blog/?p=613
[17]: http://michaelnielsen.org/blog/a-number-theoretic-approach-to-consistent-hashing/
[18]: http://michaelnielsen.org/blog/?p=577
[19]: http://norvig.com/spell-correct.html
[20]: http://michaelnielsen.org/blog/?p=587
[21]: http://www.joelonsoftware.com/items/2005/10/17.html
[22]: http://friendfeed.com/rooms/lecture-course-on-the-google-techno
[23]: http://www.postrank.com
[24]: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.5427
[25]: http://pagerankandbeyond.com/
[26]: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.109.4049
[27]: http://research.google.com/archive/googlecluster.html
[28]: http://labs.google.com/papers/gfs.html
[29]: http://labs.google.com/papers/bigtable.html
[30]: http://labs.google.com/papers/mapreduce.html
[31]: http://www.youtube.com/watch?v=yjPBkvYh-ss
[32]: http://www.cs.washington.edu/education/courses/cse490h/08au/
[33]: http://michaelnielsen.org#respond "Leave one →"
[34]: http://rvg.blogs.way2p.org/2009/02/map-and-reduce/
[35]: http://michaelnielsen.org/blog/?p=618
[36]: http://charlescade.com.br/2009/10/05/os-segredos-do-google/
[37]: http://zyxo.wordpress.com/2009/11/01/link-list-for-october-2009/
[38]: http://donghaima.wordpress.com/2009/11/03/links-for-2009-11-02/
[39]: http://michaelnielsen.org/blog/data-driven-intelligence/
[40]: http://www.bytebot.net/blog/archives/2014/02/01/tab-sweep-january-2014
[41]: http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/#respond
[42]: http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/feed/
[43]: http://michaelnielsen.org/blog/feed/
[44]: http://michaelnielsen.org/blog/wp-content/themes/titan_pro/images/flw-email.png
[45]: http://feedburner.google.com/fb/a/mailverify?uri=michaelnielsen/wmna&loc=en_US
[46]: http://michaelnielsen.org/ddi
[47]: http://michaelnielsen.org/blog/wp-content/themes/titan_pro/images/flw-twitter.png
[48]: http://twitter.com/#!/michael_nielsen
[49]: https://github.com/mnielsen
[50]: http://delicious.com/nielsen
[51]: http://delicious.com/nielsen/nnftd
[52]: http://delicious.com/nielsen/tfos
[53]: https://plus.google.com/116973436260300431929/posts
[54]: http://ws.assoc-amazon.com/widgets/q?_encoding=UTF8&Format=_SL110_&ASIN=0691148902&MarketPlace=US&ID=AsinImage&WS=1&tag=michaniels-20&ServiceVersion=20070822
[55]: http://www.assoc-amazon.com/e/ir?t=michaniels-20&l=as2&o=1&a=0691148902&camp=217145&creative=399373
[56]: http://www.amazon.com/gp/product/0691148902/ref=as_li_tf_tl?ie=UTF8&tag=michaniels-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=0691148902
[57]: http://michaelnielsen.org/blog/errata-for-reinventing-discovery/
[58]: http://ws.assoc-amazon.com/widgets/q?_encoding=UTF8&ASIN=1107002176&Format=_SL110_&ID=AsinImage&MarketPlace=US&ServiceVersion=20070822&WS=1&tag=michaniels-20
[59]: http://www.assoc-amazon.com/e/ir?t=michaniels-20&l=as2&o=1&a=1107002176
[60]: http://www.amazon.com/gp/product/1107002176/ref=as_li_tf_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1107002176&linkCode=as2&tag=michaniels-20
[61]: http://michaelnielsen.org/blog/essays/
[62]: http://www.youtube.com/watch?v=DnWocYKqvhw&feature=player_embedded
[63]: http://michaelnielsen.org/blog/quantum-computing-for-the-determined/
[64]: http://michaelnielsen.org/polymath1/index.php?title=Main_Page
[65]: http://michaelnielsen.org/blog/is-there-a-tension-between-creativity-and-accuracy/
[66]: http://michaelnielsen.org/blog/where-will-the-key-ideas-shaping-the-future-of-scientific-publishing-come-from/
[67]: http://michaelnielsen.org/blog/how-the-backpropagation-algorithm-works/
[68]: http://michaelnielsen.org/blog/reinventing-explanation/
[69]: http://michaelnielsen.org/blog/neural-networks-and-deep-learning-first-chapter-goes-live/
[70]: http://sivark.me
[71]: http://michaelnielsen.org/blog/is-there-a-tension-between-creativity-and-accuracy/comment-page-1/#comment-76394
[72]: https://drossbucket.wordpress.com/2017/05/31/reading-into-the-landscape/
[73]: http://michaelnielsen.org/blog/is-there-a-tension-between-creativity-and-accuracy/comment-page-1/#comment-76383
[74]: http://michaelnielsen.org/blog/is-there-a-tension-between-creativity-and-accuracy/comment-page-1/#comment-76338
[75]: http://michaelnielsen.org/blog/is-there-a-tension-between-creativity-and-accuracy/comment-page-1/#comment-76336
[76]: http://michaelnielsen.org/blog/is-there-a-tension-between-creativity-and-accuracy/comment-page-1/#comment-76324
[77]: http://michaelnielsen.org/blog/wp-includes/images/rss.png
[78]:
[79]: http://michaelnielsen.org/blog/2017/04/
[80]: http://michaelnielsen.org/blog/2015/05/
[81]: http://michaelnielsen.org/blog/2014/04/
[82]: http://michaelnielsen.org/blog/2014/01/
[83]: http://michaelnielsen.org/blog/2013/11/
[84]: http://michaelnielsen.org/blog/2013/02/
[85]: http://michaelnielsen.org/blog/2012/02/
[86]: http://michaelnielsen.org/blog/2012/01/
[87]: http://michaelnielsen.org/blog/2011/10/
[88]: http://michaelnielsen.org/blog/2011/09/
[89]: http://michaelnielsen.org/blog/2011/08/
[90]: http://michaelnielsen.org/blog/2011/06/
[91]: http://michaelnielsen.org/blog/2011/05/
[92]: http://michaelnielsen.org/blog/2011/04/
[93]: http://michaelnielsen.org/blog/2011/03/
[94]: http://michaelnielsen.org/blog/2011/02/
[95]: http://michaelnielsen.org/blog/2011/01/
[96]: http://michaelnielsen.org/blog/2010/12/
[97]: http://michaelnielsen.org/blog/2010/11/
[98]: http://michaelnielsen.org/blog/2010/08/
[99]: http://michaelnielsen.org/blog/2010/05/
[100]: http://michaelnielsen.org/blog/2010/04/
[101]: http://michaelnielsen.org/blog/2010/02/
[102]: http://michaelnielsen.org/blog/2010/01/
[103]: http://michaelnielsen.org/blog/2009/12/
[104]: http://michaelnielsen.org/blog/2009/11/
[105]: http://michaelnielsen.org/blog/2009/10/
[106]: http://michaelnielsen.org/blog/2009/09/
[107]: http://michaelnielsen.org/blog/2009/08/
[108]: http://michaelnielsen.org/blog/2009/07/
[109]: http://michaelnielsen.org/blog/2009/06/
[110]: http://michaelnielsen.org/blog/2009/05/
[111]: http://michaelnielsen.org/blog/2009/04/
[112]: http://michaelnielsen.org/blog/2009/03/
[113]: http://michaelnielsen.org/blog/2009/02/
[114]: http://michaelnielsen.org/blog/2009/01/
[115]: http://michaelnielsen.org/blog/2008/12/
[116]: http://michaelnielsen.org/blog/2008/11/
[117]: http://michaelnielsen.org/blog/2008/10/
[118]: http://michaelnielsen.org/blog/2008/09/
[119]: http://michaelnielsen.org/blog/2008/08/
[120]: http://michaelnielsen.org/blog/2008/07/
[121]: http://michaelnielsen.org/blog/2008/06/
[122]: http://michaelnielsen.org/blog/2008/05/
[123]: http://michaelnielsen.org/blog/2008/04/
[124]: http://michaelnielsen.org/blog/2008/03/
[125]: http://michaelnielsen.org/blog/2008/02/
[126]: http://michaelnielsen.org/blog/2008/01/
[127]: http://michaelnielsen.org/blog/2007/11/
[128]: http://michaelnielsen.org/blog/2007/10/
[129]: http://michaelnielsen.org/blog/2007/09/
[130]: http://michaelnielsen.org/blog/2007/08/
[131]: http://michaelnielsen.org/blog/2007/07/
[132]: http://michaelnielsen.org/blog/2007/06/
[133]: http://michaelnielsen.org/blog/2007/05/
[134]: http://michaelnielsen.org/blog/2007/04/
[135]: http://michaelnielsen.org/blog/2006/08/
[136]: http://michaelnielsen.org/blog/2006/07/
[137]: http://michaelnielsen.org/blog/2006/06/
[138]: http://michaelnielsen.org/blog/2006/04/
[139]: http://michaelnielsen.org/blog/2006/03/
[140]: http://michaelnielsen.org/blog/2005/10/
[141]: http://michaelnielsen.org/blog/2005/08/
[142]: http://michaelnielsen.org/blog/2005/07/
[143]: http://michaelnielsen.org/blog/2005/06/
[144]: http://michaelnielsen.org/blog/2005/05/
[145]: http://michaelnielsen.org/blog/2005/04/
[146]: http://michaelnielsen.org/blog/2005/03/
[147]: http://michaelnielsen.org/blog/2005/02/
[148]: http://michaelnielsen.org/blog/2005/01/
[149]: http://michaelnielsen.org/blog/2004/12/
[150]: http://michaelnielsen.org/blog/2004/11/
[151]: http://michaelnielsen.org/blog/2004/10/
[152]: http://michaelnielsen.org/blog/2004/09/
[153]: http://michaelnielsen.org/blog/2004/08/
[154]: http://michaelnielsen.org/blog/2004/07/
[155]: http://michaelnielsen.org/blog/2004/06/
[156]: http://michaelnielsen.org/blog/2004/05/
[157]: http://michaelnielsen.org/blog/2004/04/
[158]: http://michaelnielsen.org/blog/2004/03/
[159]: http://michaelnielsen.org/blog/2004/02/
[160]: http://michaelnielsen.org/blog/2004/01/
[161]: http://michaelnielsen.org/blog/2003/11/
[162]: http://michaelnielsen.org/blog/2003/10/
[163]: http://michaelnielsen.org/blog/2003/09/
[164]: http://michaelnielsen.org/blog/2003/08/
[165]: http://thethemefoundry.com/titan/
[166]: http://thethemefoundry.com/