FIT5 Ch. 5, CIS 110 13F
- 2. How a Search Engine Works
A. The Web Crawler
• software robots (called spiders or bots)
=> spiders crawl the web to build an index (keywords & web pages)
TOKEN    URL
cat      www.cat.com, icanhascheezburger.com
Copyright © 2013 Pearson Education, Inc. Publishing as Pearson Addison-Wesley
Wednesday, October 16, 13
- 3. How a Search Engine Works:
the Web Crawler
• Web crawler: a program that indexes
content on the web
• Algorithm:
– Start from one "seed" page
– Extract all links on that page
– Follow each link to find new pages
– Extract all links from new pages
– keep going ...
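A minimal sketch of the crawl loop above, assuming a tiny in-memory "web" instead of real HTTP requests. The TOY_WEB dictionary, the example URLs, and the names LinkExtractor and crawl are all hypothetical and chosen only for illustration; a real crawler would also handle robots.txt, relative URLs, and politeness delays.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical web: URL -> HTML text (stands in for fetching pages).
TOY_WEB = {
    "a.example": '<a href="b.example">B</a> <a href="c.example">C</a>',
    "b.example": '<a href="c.example">C</a> <a href="a.example">A</a>',
    "c.example": '<a href="d.example">D</a>',
    "d.example": "no links here",
}


def crawl(seed, web):
    """Breadth-first crawl from the seed; returns pages in visit order."""
    visited = []
    seen = {seed}              # pages already queued, so none is visited twice
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(web.get(url, ""))   # extract all links on this page
        for link in parser.links:       # follow each link to find new pages
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("a.example", TOY_WEB))
# ['a.example', 'b.example', 'c.example', 'd.example']
```

The seen set is what keeps "keep going ..." from looping forever when pages link back to each other.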
- 5. How a Search Engine Works:
B. The Query Processor
• user enters search terms (keywords)
• query processor looks up word in index
• returns hit list
• create index in advance
• store in RAM,
=> fast query response
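The lookup step can be sketched as a dictionary from token to URLs, built once before any query arrives. This is only an illustration: the page texts in PAGES are made up, and build_index and lookup are hypothetical names, not part of any real search engine.

```python
from collections import defaultdict

# Hypothetical pages; the text is invented for illustration.
PAGES = {
    "www.cat.com": "cat machines and equipment",
    "icanhascheezburger.com": "funny cat pictures",
    "example.org/dogs": "funny dog pictures",
}


def build_index(pages):
    """Map each token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in text.lower().split():
            index[token].add(url)
    return index


def lookup(index, term):
    """Return the hit list for one search term, sorted for stable output."""
    return sorted(index.get(term.lower(), set()))


INDEX = build_index(PAGES)    # created in advance, kept in memory
print(lookup(INDEX, "cat"))   # ['icanhascheezburger.com', 'www.cat.com']
```

Answering a query is then a single dictionary lookup rather than a scan of every page, which is why the index is prepared ahead of time.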
- 8. Power of Indexed Search
• Search engines can look at billions of Web
pages and return an answer in less than a
fifth of a second
- 9. Data Centers
• Search Index is RAM-resident
– RAM 100,000x faster than disk
– Hennessy/Patterson (4ed) memory access times:
» Register: 250ps
» L1 Cache: 1ns
» RAM: 100ns
» Hard Disk: 10ms (SSD Flash: 100µs)
=> Data Centers: a growth industry in
Oregon
• Why?
Data Centers as Information Substations
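The 100,000x figure follows directly from the access times listed on this slide. The short calculation below is a rough sketch: it uses only the register, cache, RAM, and hard disk numbers, and the 0.2 second budget is taken from the "fifth of a second" claim on the earlier slide. The names ACCESS_TIME and speedup are hypothetical.

```python
# Access times in seconds, as listed on the slide (SSD omitted).
ACCESS_TIME = {
    "register": 250e-12,
    "l1_cache": 1e-9,
    "ram": 100e-9,
    "hard_disk": 10e-3,
}


def speedup(faster, slower):
    """How many times quicker one level is than another."""
    return ACCESS_TIME[slower] / ACCESS_TIME[faster]


print(round(speedup("ram", "hard_disk")))   # 100000

# Sequential accesses that fit in a 0.2 second response budget.
budget = 0.2
print(round(budget / ACCESS_TIME["ram"]))        # 2000000
print(round(budget / ACCESS_TIME["hard_disk"]))  # 20
```

With roughly 20 disk accesses per query there is little room to consult an index; with millions of RAM accesses there is plenty.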
- 10. Google’s Data Centers
– Google’s facility in The Dalles is only one of two
dozen, which stretch from Silicon Valley to
Dublin.
– #servers: 1,000,000 - 2,000,000
• 2 exabytes of hard disk storage – enough to copy
the web
• “The Indexed Web contains at least 3.59 billion
pages (Tuesday, 15 October, 2013).”
• 8 petabytes of RAM
– Field Trip: Google’s Data Centers
- 11. datacenterknowledge.com
• rapid growth in data center electricity use from 2000 to 2005
• slowed significantly from 2005 to 2010
• 2010: total electricity use by all data centers
about 1.3% of all electricity use for the world
(2% for the US)
=> Google’s entire global data center network:
220 megawatts
- 12. Data Center Energy Efficiency
• PUE (power usage effectiveness)
• standard from Green Grid consortium
• measures how much power goes directly to
computing vs. cooling, lighting, etc.
• Score of 1: no power goes to the extra costs
• 1.5 means that ancillary services consume
half as much power as the computing itself
(one third of total power)
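As a worked example, PUE is the ratio of total facility power to the power used by the computing equipment. The sketch below assumes that definition; the function names and the 150 kW / 100 kW figures are hypothetical and chosen only to reproduce a score of 1.5.

```python
def pue(total_power, it_power):
    """Power usage effectiveness: total facility power / computing power."""
    return total_power / it_power


def overhead_share_of_total(pue_value):
    """Fraction of total power that goes to cooling, lighting, etc."""
    return (pue_value - 1) / pue_value


# Hypothetical facility: 100 kW of computing plus 50 kW of cooling and lighting.
score = pue(total_power=150, it_power=100)
print(score)                                      # 1.5
print(round(overhead_share_of_total(score), 3))   # 0.333
print(overhead_share_of_total(1.0))               # 0.0
```

So at 1.5 the overhead equals half of the computing power, which is one third of everything the facility draws.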
- 13. Data Center Energy Efficiency
• Google PUE: 1.1
=> cooling, etc. add about 10% on top of computing power
• 6 Things You’d Never Guess About Google’s
Energy Use
- 14. What Search Engines Look At
– Title— <title> element contains key words
– Anchor text— <a> element, describes the
page it links to
– Landing page— <a> element, the page it
connects to
– Meta—A <meta> tag in the head section often
used for key words
– Alt attributes— <img> element attribute gives
a textual description
– Content— text on the page
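To make the list above concrete, here is a rough sketch that pulls several of these signals out of a small HTML page with Python's standard html.parser module. The SAMPLE page and the SignalExtractor class are hypothetical; how a real search engine weighs each signal is not shown here.

```python
from html.parser import HTMLParser


class SignalExtractor(HTMLParser):
    """Collects title, anchor text + landing page, meta keywords, alt text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.anchors = []          # (anchor text, landing page URL) pairs
        self.meta_keywords = ""
        self.alt_texts = []
        self._in_title = False
        self._href = None          # set while inside an <a href=...> element
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._anchor_text = []
        elif tag == "meta" and attrs.get("name") == "keywords":
            self.meta_keywords = attrs.get("content", "")
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor_text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            text = "".join(self._anchor_text).strip()
            self.anchors.append((text, self._href))
            self._href = None


# Hypothetical page used only for this illustration.
SAMPLE = """<html><head><title>Cat Pictures</title>
<meta name="keywords" content="cats, kittens">
</head><body>
<img src="cat.jpg" alt="a sleeping cat">
<a href="http://www.cat.com">heavy equipment</a>
</body></html>"""

extractor = SignalExtractor()
extractor.feed(SAMPLE)
print(extractor.title)          # Cat Pictures
print(extractor.meta_keywords)  # cats, kittens
print(extractor.alt_texts)      # ['a sleeping cat']
print(extractor.anchors)        # [('heavy equipment', 'http://www.cat.com')]
```

Note that the anchor text describes the landing page (www.cat.com), not the page the link sits on.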
- 15. PageRank Algorithm:
Pioneered by Google
• PageRank works like a voting system
– If page A links to page B, A’s link adds to B’s
importance
– Pages linked to by many pages have a high
PageRank
– Links from pages with a high PageRank
count for more
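The voting idea can be sketched with the simplified textbook iteration below. This is an illustration under stated assumptions, not Google's production algorithm: the damping factor of 0.85 is a commonly quoted value, the four-page LINKS graph is hypothetical, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; assumes every page has at least one outgoing link."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}       # start with equal rank
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page splits its vote evenly among the pages it links to.
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A", "B"],
    "D": ["B"],
}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph B is linked to by three pages and ends up with the highest rank, D has no incoming links and ends up lowest, and C ranks above A even though each has a single incoming link, because C's link comes from the highly ranked B.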
- 16. Field Trip: Basic Search
• Google Search Education
http://bit.ly/16ZW6Ow
- 17. Advanced Search: Logic Ops
• logic operator: AND
– human AND powered AND flight
– AND-queries: hits have all the words
• logic operator: OR
– marshmallow OR strawberry OR chocolate
– OR-queries: hits have at least one word
• logic operator: NOT
– tigers AND NOT baseball
– NOT-queries: hits do not have the word
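One way to picture these operators is as set operations on the hit lists that come back from the index. The sketch below assumes a hypothetical index that maps each term to a set of page IDs; the IDs are invented.

```python
# Hypothetical index: search term -> set of page IDs containing the term.
INDEX = {
    "human": {1, 2, 3},
    "powered": {2, 3, 4},
    "flight": {3, 4, 5},
    "tigers": {6, 7, 8},
    "baseball": {7, 9},
}


def hits(term):
    """Pages containing the term (empty set if the term is not indexed)."""
    return INDEX.get(term, set())


# AND: intersection, pages must contain all the words.
and_hits = hits("human") & hits("powered") & hits("flight")

# OR: union, pages must contain at least one word.
or_hits = hits("human") | hits("powered") | hits("flight")

# AND NOT: difference, pages with the first word but not the second.
not_hits = hits("tigers") - hits("baseball")

print(sorted(and_hits))   # [3]
print(sorted(or_hits))    # [1, 2, 3, 4, 5]
print(sorted(not_hits))   # [6, 8]
```

AND narrows a search, OR widens it, and NOT removes an unwanted sense of a word.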
- 18. Combining Logical Operators
(marshmallow OR strawberry) AND sundae
• parentheses group logic operators, just as
they group operations in arithmetic
• Google also uses a minus sign (-) as an
abbreviation for NOT
– http://www.powersearchingwithgoogle.com/course/ps/assets/PowerSearchingQuickReference.pdf
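A small sketch of why the grouping matters, again treating hit lists as sets. The page IDs in SUNDAE_INDEX are hypothetical; moving the parentheses changes which pages match.

```python
# Hypothetical index: search term -> set of page IDs containing the term.
SUNDAE_INDEX = {
    "marshmallow": {1, 2},
    "strawberry": {2, 3},
    "sundae": {3, 4},
}

m = SUNDAE_INDEX["marshmallow"]
s = SUNDAE_INDEX["strawberry"]
d = SUNDAE_INDEX["sundae"]

# (marshmallow OR strawberry) AND sundae
grouped = (m | s) & d

# marshmallow OR (strawberry AND sundae)
regrouped = m | (s & d)

print(sorted(grouped))     # [3]
print(sorted(regrouped))   # [1, 2, 3]
```

The first query returns only sundae pages; the second also returns every marshmallow page, sundae or not.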
- 19. Site Search
• Many sites offer the opportunity to perform
a site search
• e.g., try this Google search:
Google chief economist Hal Varian,
site:uoregon.edu
- 20. Field Trip: Power Search
• Google Search Education
http://www.powersearchingwithgoogle.com/
- 21. Alternatives to the Search Giant
How Wolfram|Alpha Works
- 22. Cloud Storage
• Facebook: 300 petabytes (PB)
• Microsoft Hotmail: 100 PB
• Microsoft SkyDrive: 10 PB
• Amazon S3: 900 PB
• Dropbox: 40 PB
- 23. Ch. 5: Assessment
Learning Outcomes - Know the following