:)
tiny   :projects
Tesseract OCR



1985: HP
2006: Google
Tesseract OCR



2006: TIFF
2011: *
Tesseract OCR



2009: Text
2010: layout
Tesseract OCR



2007: 6
2011: 33
Tesseract OCR
Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified and Traditional), Danish (standard and Fraktur script), German, Greek, Finnish, French, Hebrew, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak (standard and Fraktur script), Slovenian, Spanish, Serbian, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese
Tesseract OCR

Officially supported:




 Probably runs on:
Image processing
Google Refine
Runs on:
Runs in:
Major features:

Import from anywhere
Faceting
Clustering
Split/create custom columns
GREL transformations
Export/etc
google protocol buffers

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}

>

Person person;
person.set_id(123);
person.set_name("Bob");
person.set_email("bob@example.com");

fstream out("person.pb", ios::out ...
person.SerializeToOstream(&out);
out.close();
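
As a rough illustration of what a call like person.SerializeToOstream(&out) would put in person.pb, the sketch below hand-encodes the same Person fields following the protocol buffers wire format: integers are base-128 varints, and each field starts with a key equal to the field number shifted left by 3 plus the wire type. AppendVarint, AppendKey and EncodePerson are hypothetical helper names made up for this example. Real code would use the generated Person class rather than anything like this.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append v as a base-128 varint: 7 payload bits per byte, least significant
// group first, high bit set on every byte except the last one.
void AppendVarint(std::uint64_t v, std::string* out) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));
}

// Append a field key: (field_number << 3) | wire_type.
void AppendKey(std::uint32_t field_number, std::uint32_t wire_type,
               std::string* out) {
  assert(wire_type < 8);  // the wire type occupies the low 3 bits
  AppendVarint((static_cast<std::uint64_t>(field_number) << 3) | wire_type, out);
}

// Hypothetical helper: encodes the three Person fields from the slide.
// For simplicity this sketch always writes the optional email field.
std::string EncodePerson(std::int32_t id, const std::string& name,
                         const std::string& email) {
  std::string out;
  AppendKey(1, 0, &out);  // id = 1, wire type 0 (varint)
  // int32 values are sign-extended to 64 bits before varint encoding.
  AppendVarint(static_cast<std::uint64_t>(static_cast<std::int64_t>(id)), &out);
  AppendKey(2, 2, &out);  // name = 2, wire type 2 (length-delimited)
  AppendVarint(name.size(), &out);
  out += name;
  AppendKey(3, 2, &out);  // email = 3, wire type 2 (length-delimited)
  AppendVarint(email.size(), &out);
  out += email;
  return out;
}
```

Under these assumptions, EncodePerson(123, "Bob", "bob@example.com") comes to 24 bytes, beginning 08 7B 12 03 42 6F 62.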
          512   bytes / tweet
  340,000,000   tweets / day (2012)
7,253,333,333   bytes / hour
    2,014,814   bytes / second
        1.921   Mbytes / second
       15.371   Mbits / second

           8    Tbytes / day (2011)

  Google: ~ 377M searches/day
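
The tweet figures above can be re-derived with a few lines of arithmetic. This is a sketch only: the Throughput struct and TweetThroughput function are names invented here, and 1 Mbyte is taken as 1024 * 1024 bytes, which is an assumption chosen because it reproduces the figure of about 1.92 Mbytes per second.

```cpp
#include <cassert>
#include <cstdint>

// Derived rates for a stream of fixed-size tweets (hypothetical struct).
struct Throughput {
  std::uint64_t bytes_per_day;
  std::uint64_t bytes_per_hour;    // truncated to whole bytes
  std::uint64_t bytes_per_second;  // truncated to whole bytes
  double mbytes_per_second;        // assumes 1 Mbyte = 1024 * 1024 bytes
  double mbits_per_second;         // 8 bits per byte
};

Throughput TweetThroughput(std::uint64_t bytes_per_tweet,
                           std::uint64_t tweets_per_day) {
  assert(bytes_per_tweet > 0 && tweets_per_day > 0);
  Throughput t{};
  t.bytes_per_day = bytes_per_tweet * tweets_per_day;
  t.bytes_per_hour = t.bytes_per_day / 24;
  t.bytes_per_second = t.bytes_per_hour / 3600;
  t.mbytes_per_second =
      static_cast<double>(t.bytes_per_day) / 86400.0 / (1024.0 * 1024.0);
  t.mbits_per_second = t.mbytes_per_second * 8.0;
  return t;
}
```

With 512 bytes per tweet and 340,000,000 tweets per day this gives 174,080,000,000 bytes per day, 7,253,333,333 bytes per hour and 2,014,814 bytes per second, which is roughly 1.92 Mbytes or 15.37 Mbits per second.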
?

    MapReduce
snappy
http://code.google.com/p/snappy/
snappy


Fast
Stable
Robust
Free and BSD
[Chart: Size, compression ratio (%), less is better. Compared: lzjb 2010, lzo 2.04 1x, fastlz 0.1 -1, fastlz 0.1 -2, lzf 3.6 vf, lzf 3.6 uf, lzrw1, lzrw1-a, lzrw2, lzrw3, lzrw3-a, snappy 1.0, quicklz 1.5.0 -1, quicklz 1.5.0 -2]
[Chart: Data types. Compression ratio (axis 0 to 6) of snappy and zlib for plain text, html and jpeg]
Size



from 20% to 100% bigger

                :(


     ...not for amazon glacier
[Chart: Speed, compression (MB/s), more is better. Compared: lzjb 2010, lzo 2.04 1x, fastlz 0.1 -1, fastlz 0.1 -2, lzf 3.6 vf, lzf 3.6 uf, lzrw1, lzrw1-a, lzrw2, lzrw3, lzrw3-a, snappy 1.0, quicklz 1.5.0 -1, quicklz 1.5.0 -2]
[Chart: Speed, decompression (MB/s), more is better. Compared: lzjb 2010, lzo 2.04 1x, fastlz 0.1 -1, fastlz 0.1 -2, lzf 3.6 vf, lzf 3.6 uf, lzrw1, lzrw1-a, lzrw2, lzrw3, lzrw3-a, snappy 1.0, quicklz 1.5.0 -1, quicklz 1.5.0 -2]
On 1 core of 64-bit Core i7 processor:

  • Compression:        250MB/s

  • Decompression: 500MB/s

                   :P
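
To put those single-core rates in perspective, here is a small back-of-the-envelope helper. It is only a sketch: SecondsAtRate is a made-up name, 1 MB is taken as 10^6 bytes, and the 8 Tbytes/day figure quoted earlier in the deck is read as 8 * 10^12 bytes. Both unit conventions are assumptions, since the slides do not state them.

```cpp
#include <cassert>

// Seconds needed to push `bytes` through one core at `mb_per_second`.
// Assumes 1 MB = 10^6 bytes (the slide does not say which convention it uses).
double SecondsAtRate(double bytes, double mb_per_second) {
  assert(mb_per_second > 0.0);
  return bytes / (mb_per_second * 1e6);
}
```

Under these assumptions, SecondsAtRate(8e12, 250.0) is about 32,000 seconds (roughly 8.9 hours) to compress a day's worth of data on one core, and SecondsAtRate(8e12, 500.0) is about 16,000 seconds (roughly 4.4 hours) to decompress it.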
Portable, but...
Portable, but primarily optimized for 64-bit x86-compatible processors
Used in:

BigTable
MapReduce
Google RPC
Hadoop
Bindings:
@TarasRoshko

HTTP headers here:

http://code.google.com/p/snappy/source/browse/trunk/framing_format.txt
Q&A?

Ostap Andrusiv
Software Engineer
Eleks software
@p1f

Tiny Google Projects


Editor's Notes

  1-11. http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/
  12. In-memory test (compression and decompression) with ENWIK8 using 1 core of Intel Xeon X5355 @ 2.66GHz (64-bit compilation under gcc 4.1.1 (Linux) -O3 -fomit-frame-pointer -fstrict-aliasing -fforce-addr -ffast-math --param inline-unit-growth=999 -DNDEBUG)
  13. Compression ratio by data type:
        data type     snappy     zlib
        plain text    1.5-1.7    2.7
        html          2-4        3-7
        jpeg          1          1
  14. http://aws.amazon.com/glacier/
  15. http://pastebin.com/SFaNzRuf
  16. http://encode.ru/threads/1255-Google-released-Snappy-compression-decompression-library
  17-18. http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/