Information Retrieval Challenges
Bruno Pedro
March 2010
Bruno Pedro
An experienced Web developer and
entrepreneur. Has extensive background in
large scale projects and technical writing.

http://tarpipe.com/user/bpedro
What is tarpipe?




3 Challenges
• Real-Time Retrieval
• Understanding Context
• Inferring Identity
Real-Time Retrieval




             http://www.flickr.com/photos/josephrobertson/127758523/
WordPress


                     source: wordpress.com




• Average ~10K posts/hour
• ~3 new posts every second
twitter


                      source: mashable.com




• Average ~1.1M tweets/hour
• ~300 new tweets every second
Challenge
• ~300 reads/second
• 160 bytes X 300 = 48 KB/second ≈ 4 GB/day
  (approximate calculation)




• How to process all this information?
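The approximate calculation can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the slide's "160" means bytes per tweet and uses decimal units (1 KB = 1,000 bytes); the constant names are illustrative only.

```python
# Back-of-the-envelope check of the slide's figures.
# Assumption: "160" is read as bytes per tweet; decimal KB/GB are used.
BYTES_PER_TWEET = 160
TWEETS_PER_SECOND = 300
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = BYTES_PER_TWEET * TWEETS_PER_SECOND
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

kb_per_second = bytes_per_second / 1_000
gb_per_day = bytes_per_day / 1_000_000_000

print(f"{kb_per_second:.0f} KB/second")  # 48 KB/second
print(f"{gb_per_day:.2f} GB/day")        # 4.15 GB/day
```

The daily figure comes out slightly above 4 GB, which matches the slide's rounded estimate.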
Strategy
• Read and store immediately:
 • High performance write storage
 • No locks allowed
 • Prepare for lots of reading errors
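One way to picture "read and store immediately" is a single writer appending raw messages to a log and skipping anything it cannot read. The sketch below is an assumption-laden illustration, not the storage tarpipe used: the function name `store_raw`, the JSON-lines format and the append-only file are all hypothetical stand-ins for a high performance write store.

```python
import json
import os
import tempfile
from typing import Iterable, Tuple


def store_raw(messages: Iterable[bytes], path: str) -> Tuple[int, int]:
    """Append each readable message to `path` as one JSON line.

    Returns (stored, skipped). A single writer appending to a file needs
    no explicit locks; unreadable input is counted and skipped, never raised.
    """
    stored = skipped = 0
    with open(path, "a", encoding="utf-8") as log:
        for raw in messages:
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                skipped += 1  # expect reading errors; keep going
                continue
            log.write(json.dumps({"raw": text}) + "\n")
            stored += 1
    return stored, skipped


# Hypothetical input: two readable messages and one with invalid bytes.
log_path = os.path.join(tempfile.mkdtemp(), "raw.log")
result = store_raw([b"hello world", b"\xff\xfe broken", b"second post"], log_path)
print(result)  # (2, 1)
```

Nothing is parsed at this stage; the messages are stored as they arrive so that heavier processing can happen later.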
Strategy
• Process later:
 • Regular expressions
 • Term extraction
 • Machine learning
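The first two "process later" steps can be sketched with the standard library alone. The patterns, the tiny stopword list and the `extract` helper below are illustrative assumptions; real term extraction would need far more care, and the machine learning step is left out entirely.

```python
import re
from collections import Counter

# Illustrative patterns; real-world handles and URLs are messier.
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")
URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[a-z']+")
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in", "on", "at", "with", "my"}


def extract(text):
    """Return the mentions, URLs and term counts found in one stored message."""
    mentions = MENTION_RE.findall(text)
    urls = URL_RE.findall(text)
    # Remove URLs and mentions before counting the remaining words.
    stripped = MENTION_RE.sub(" ", URL_RE.sub(" ", text)).lower()
    terms = Counter(w for w in WORD_RE.findall(stripped) if w not in STOPWORDS)
    return {"mentions": mentions, "urls": urls, "terms": terms}


sample = "Walking the dog with @user in the park, the dog loves it http://example.com/p/1"
info = extract(sample)
print(info["mentions"])              # ['user']
print(info["terms"].most_common(1))  # [('dog', 2)]
```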
Context




     For me context is the key - from
     that comes the understanding of
     everything. — Kenneth Noland
source: joelonsoftware.com




Dogs?
source: Google Reader Play




Dogs with unmatched title?
source: Google Buzz




Still doesn’t make a lot of sense...
This is the worst-case scenario
Challenge
• Find context from associated content:
 • Pictures
 • Comments
 • Location information
 • Timelines
 • Authors
Strategy
• Associate content through common
  identifiers
• Establish timeline of different pieces
• Group pieces by same author
• Present in a comprehensible fashion
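The grouping and timeline steps might look like the following minimal sketch. The `Piece` record, its field names and the sample data are hypothetical; the sketch also assumes the author identifier has already been normalised across services.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Piece:
    author: str       # common identifier shared across services (assumed)
    source: str       # where the piece came from
    created: datetime
    text: str


def timelines_by_author(pieces):
    """Group pieces by author and order each group chronologically."""
    grouped = defaultdict(list)
    for piece in pieces:
        grouped[piece.author].append(piece)
    return {
        author: sorted(items, key=lambda p: p.created)
        for author, items in grouped.items()
    }


pieces = [
    Piece("user", "twitter", datetime(2010, 3, 1, 12, 5), "Still doesn't make a lot of sense..."),
    Piece("user", "flickr", datetime(2010, 3, 1, 12, 0), "Dogs?"),
    Piece("other", "wordpress", datetime(2010, 3, 1, 11, 0), "Unrelated post"),
]
timelines = timelines_by_author(pieces)
for author, items in timelines.items():
    print(author)
    for p in items:
        print(f"  {p.created:%H:%M} [{p.source}] {p.text}")
```

Read in order, the photo and the later comment by the same author sit next to each other, which is where the context comes from.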
Identity




           source: abc Australia
Many Identifiers
• E-mail: user@example.com
• facebook: @User Name
• flickr: user or User Name (?)
• Google Buzz: @user@example.com
• twitter: @user
 ...
Addressable
• http://facebook.com/user
• http://flickr.com/user
• http://www.google.com/profiles/user
• http://twitter.com/user
  ...
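Turning an identifier into an address can be sketched as a lookup table built from the URL patterns listed above. The patterns are the ones on the slide and may no longer match the live sites. Taking the part before the domain of a "@user@example.com" identifier as the profile name is an assumption, and display names such as "@User Name" are not handled.

```python
# URL patterns copied from the slide; the helper itself is hypothetical.
PROFILE_URL_TEMPLATES = {
    "facebook": "http://facebook.com/{user}",
    "flickr": "http://flickr.com/{user}",
    "google": "http://www.google.com/profiles/{user}",
    "twitter": "http://twitter.com/{user}",
}


def profile_url(service, identifier):
    """Return a profile URL for `identifier` on `service`, or None if unknown."""
    template = PROFILE_URL_TEMPLATES.get(service.lower())
    if template is None:
        return None
    # Drop a leading "@" and, for "@user@example.com", the domain part (assumption).
    user = identifier.lstrip("@").split("@", 1)[0]
    return template.format(user=user)


print(profile_url("twitter", "@user"))             # http://twitter.com/user
print(profile_url("google", "@user@example.com"))  # http://www.google.com/profiles/user
```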
How to make sense?
Challenge
• Parse every message, tweet or post
• Find possible user identifiers
• Substitute for meaningful information:
 • A link to the original profile
 • Equivalent identity on destination
Strategy
• Decentralized processing:
 • Browser based (plugin)
 • Extract identities from page
 • Process
 • Replace with meaningful information
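The extract and replace steps can be sketched on plain text as follows. The slides place this logic in a browser plugin; the Python below only illustrates the matching and substitution, and the pattern, the `link_mentions` helper and the default base URL are assumptions.

```python
import re

# "@user" counts as a mention only when not preceded by a word character
# or another "@", so e-mail addresses are left untouched.
MENTION_RE = re.compile(r"(?<![\w@])@(\w+)")


def link_mentions(text, base_url="http://twitter.com/"):
    """Replace each @mention in `text` with a link to the assumed profile URL."""
    def to_link(match):
        user = match.group(1)
        return f'<a href="{base_url}{user}">@{user}</a>'
    return MENTION_RE.sub(to_link, text)


linked = link_mentions("thanks @user, mail me at user@example.com")
print(linked)
# thanks <a href="http://twitter.com/user">@user</a>, mail me at user@example.com
```

Running the replacement where the page is displayed spreads the work across readers instead of concentrating it on one server, which is the point of the decentralized strategy.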
Food for thought
• PubSubHubbub
  http://code.google.com/p/pubsubhubbub/

• Activity Streams
  http://activitystrea.ms/
• WebFinger
  http://webfinger.org/
"tarpipe streamlines your updates to various social web sites, creating simple or complex workflows to update several buckets in one fell swoop."
Adam Pash, lifehacker

"tarpipe is one of the most curious experiments in social media that I've seen lately. The service has the potential to be the answer to the lament I first talked about in The looming crisis: Personal syndication overload."
Rafe Needleman, CNET news

"Today I had a chance to spend time experimenting with tarpipe and I have to say that I am intrigued by the concept and impressed by the implementation."
Jeff Barr, Amazon.com

thank you
share your life


Editor's notes

  1. - Google Chrome plugin
     - Get identity information from the Google Social Graph API or other means