SlideShare una empresa de Scribd logo
1 de 29
Descargar para leer sin conexión
: Analyzing Large-Scale
                User Data with Hadoop and HBase

                 Aaron Kimball – CTO




                                             Odiago, Inc.
Developed By:
is…
      • A large-scale storage, serving, and analysis
        platform
      • For user- or other entity-centric data




Developed By:
use cases
      • Product/content recommendations
                – “Because you liked book X, you may like book Y”
      • Ad targeting
                – “Because of your interest in sports, check out…”
      • Social network analysis
                – “You may know these people…”
      • Fraud detection, anti-spam, search
        personalization…

Developed By:
Use case characteristics
      • Have a large number of users
      • Want to store (large) transaction data as well
        as derived data (e.g., recommendations)
      • Need to serve recommendations interactively
      • Require a combination of offline and on-the-
        fly computation



Developed By:
A typical workflow




Developed By:
Challenges
      • Support real-time retrieval of profile data
      • Store a long transactional data history
      • Keep related data logically and physically close
      • Update data in a timely fashion without
        wasting computation
      • Fault tolerance
      • Data schema changes over time


Developed By:
architecture




                          Certified Technology product
Developed By:
HBase data model
      • Data in cells, addressed by four “coordinates”
                – Row Id (primary key)
                – Column family
                – Column “qualifier”
                – Timestamp




Developed By:
Schema free: not what you want
      • HBase may not impose a schema, but your
        data still has one
      • Up to the application to determine how to
        organize & interpret data
      • You still need to pick a serialization system




Developed By:
Schemas = trade-offs
      • Different schemas enable efficient
        storage/retrieval/analysis of different types of
        data
      • Physical organization of data still makes a big
        difference
                – Especially with respect to read/write patterns




Developed By:
WibiData workloads
      • Large number of fat rows (one per user)
      • Each row updated relatively few times/day
                – Though updates may involve large records
      • Raw data written to one set of columns
      • Processed results read from another
                – Often with an interactive latency requirement
      • Needs to support complex data types


Developed By:
Serialization with
      • Apache Avro provides flexible serialization
      • All data written along with its “writer schema”
      • Reader schema may differ from the writer’s

      {
                "type": "record",
                "name": "LongList",
                "fields" : [
                  {"name": "value", "type": "long"},
                  {"name": "next", "type": ["LongList", "null"]}
                ]
      }


Developed By:
Serialization with
      • No code generation required
      • Producers and consumers of data can migrate
        independently
      • Data format migrations do not require
        structural changes to underlying data




Developed By:
WibiData: An extended data model



                <column>
                  <name>email</name>
                  <description>Email address</description>
                  <schema>"string"</schema>
                </column>



      • Columns or whole families have common Avro
        schemas for evolvable storage and retrieval

Developed By:
WibiData: An extended data model




      • Column families are a logical concept
      • Data is physically arranged in locality groups
      • Row ids are hashed for uniform write pressure


Developed By:
WibiData: An extended data model




      • Wibi uses 3-d storage
      • Data is often sorted by timestamp



Developed By:
Analyzing data: Producers




      • Producers create derived column values
      • Produce operator works on one row at a time
                – Can be run in MapReduce, or on a one-off basis
      • Produce is a row mutation operator

Developed By:
Analyzing data: Gatherers




   • Gatherers aggregate data across all rows
   • Always run within MapReduce
   • A bridge between rows and (key, value) pairs


Developed By:
Interactive access: REST API
                PUT request                GET request




      • REST API provides interactive access
      • Producers can be triggered “on demand” to
        create fresh recommendations


Developed By:
Example: Ad Targeting




Developed By:
Example: Ad Targeting



                       Producer




Developed By:
Example: Ad Targeting



                       Producer




Developed By:
Gathering Category Associations
      • Gather observed behavior




Developed By:
Gathering Category Associations
      • Associate interests, clicks




Developed By:
Gathering Category Associations
      • … for all pairs




Developed By:
Gathering Category Associations
      • And aggregate across all users
                Map phase:        Reduce phase:




Developed By:
Conclusions
      • Hadoop, HBase, Avro form the core of a large-
        scale machine learning/analysis platform
      • How you set up your schema matters
      • Producer/gatherer programming model allows
        computations over tables to be expressed
        naturally; works with MapReduce




Developed By:
www.wibidata.com / @wibidata
                    Aaron Kimball – aaron@odiago.com




Developed By:

Más contenido relacionado

La actualidad más candente

Oracle Week 2016 - Modern Data Architecture
Oracle Week 2016 - Modern Data ArchitectureOracle Week 2016 - Modern Data Architecture
Oracle Week 2016 - Modern Data ArchitectureArthur Gimpel
 
BDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaBDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaDavid Lauzon
 
Summer Shorts: Big Data Integration
Summer Shorts: Big Data IntegrationSummer Shorts: Big Data Integration
Summer Shorts: Big Data Integrationibi
 
MongoDB - An Agile NoSQL Database
MongoDB - An Agile NoSQL DatabaseMongoDB - An Agile NoSQL Database
MongoDB - An Agile NoSQL DatabaseGaurav Awasthi
 
Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatternsAnurag S
 
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseBDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseDavid Lauzon
 
Hadoop foundation for analytics
Hadoop foundation for analyticsHadoop foundation for analytics
Hadoop foundation for analyticsHariniA7
 
SQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The MoveSQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The MoveIBM Cloud Data Services
 
Introduction to ArangoDB (nosql matters Barcelona 2012)
Introduction to ArangoDB (nosql matters Barcelona 2012)Introduction to ArangoDB (nosql matters Barcelona 2012)
Introduction to ArangoDB (nosql matters Barcelona 2012)ArangoDB Database
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsHisham Arafat
 
Drupal Training Topics
Drupal Training TopicsDrupal Training Topics
Drupal Training Topicsvibrantuser
 
Data Modeling for NoSQL
Data Modeling for NoSQLData Modeling for NoSQL
Data Modeling for NoSQLTony Tam
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...Yahoo Developer Network
 

La actualidad más candente (20)

Oracle Week 2016 - Modern Data Architecture
Oracle Week 2016 - Modern Data ArchitectureOracle Week 2016 - Modern Data Architecture
Oracle Week 2016 - Modern Data Architecture
 
BDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaBDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using Impala
 
Practical Use of a NoSQL Database
Practical Use of a NoSQL DatabasePractical Use of a NoSQL Database
Practical Use of a NoSQL Database
 
IBM Visualization Data Explorer
IBM Visualization Data ExplorerIBM Visualization Data Explorer
IBM Visualization Data Explorer
 
Summer Shorts: Big Data Integration
Summer Shorts: Big Data IntegrationSummer Shorts: Big Data Integration
Summer Shorts: Big Data Integration
 
MongoDB - An Agile NoSQL Database
MongoDB - An Agile NoSQL DatabaseMongoDB - An Agile NoSQL Database
MongoDB - An Agile NoSQL Database
 
Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatterns
 
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseBDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
 
Hadoop foundation for analytics
Hadoop foundation for analyticsHadoop foundation for analytics
Hadoop foundation for analytics
 
SQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The MoveSQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The Move
 
Introduction to ArangoDB (nosql matters Barcelona 2012)
Introduction to ArangoDB (nosql matters Barcelona 2012)Introduction to ArangoDB (nosql matters Barcelona 2012)
Introduction to ArangoDB (nosql matters Barcelona 2012)
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platforms
 
MahoutNew
MahoutNewMahoutNew
MahoutNew
 
Drupal Training Topics
Drupal Training TopicsDrupal Training Topics
Drupal Training Topics
 
Data Modeling for NoSQL
Data Modeling for NoSQLData Modeling for NoSQL
Data Modeling for NoSQL
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Database
DatabaseDatabase
Database
 
OLTP-Bench
OLTP-BenchOLTP-Bench
OLTP-Bench
 
Hadoop..
Hadoop..Hadoop..
Hadoop..
 
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
Apache Hadoop India Summit 2011 talk "Online Content Optimization using Hadoo...
 

Similar a Analyzing Large-Scale User Data with Hadoop and HBase

Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
Building Recommendation Platforms with Hadoop
Building Recommendation Platforms with HadoopBuilding Recommendation Platforms with Hadoop
Building Recommendation Platforms with HadoopJayant Shekhar
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Miguel Pastor
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
How companies use NoSQL and Couchbase - NoSQL Now 2013
How companies use NoSQL and Couchbase - NoSQL Now 2013How companies use NoSQL and Couchbase - NoSQL Now 2013
How companies use NoSQL and Couchbase - NoSQL Now 2013Dipti Borkar
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewRiccardo Zamana
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...DataWorks Summit
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsGeorge Stathis
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Apache Con 2021 Structured Data Streaming
Apache Con 2021 Structured Data StreamingApache Con 2021 Structured Data Streaming
Apache Con 2021 Structured Data StreamingShivji Kumar Jha
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
New big data architecture in hadoop.pptx
New big data architecture in hadoop.pptxNew big data architecture in hadoop.pptx
New big data architecture in hadoop.pptxVanshGupta597842
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web developmentTung Nguyen
 
Graph databases for SQL Server profesionnals
Graph databases for SQL Server profesionnalsGraph databases for SQL Server profesionnals
Graph databases for SQL Server profesionnalsMSDEVMTL
 

Similar a Analyzing Large-Scale User Data with Hadoop and HBase (20)

Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Building Recommendation Platforms with Hadoop
Building Recommendation Platforms with HadoopBuilding Recommendation Platforms with Hadoop
Building Recommendation Platforms with Hadoop
 
Apache drill
Apache drillApache drill
Apache drill
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
How companies use NoSQL and Couchbase - NoSQL Now 2013
How companies use NoSQL and Couchbase - NoSQL Now 2013How companies use NoSQL and Couchbase - NoSQL Now 2013
How companies use NoSQL and Couchbase - NoSQL Now 2013
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overview
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
NoSQL-Overview
NoSQL-OverviewNoSQL-Overview
NoSQL-Overview
 
Apache Con 2021 Structured Data Streaming
Apache Con 2021 Structured Data StreamingApache Con 2021 Structured Data Streaming
Apache Con 2021 Structured Data Streaming
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
New big data architecture in hadoop.pptx
New big data architecture in hadoop.pptxNew big data architecture in hadoop.pptx
New big data architecture in hadoop.pptx
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 
Graph databases for SQL Server profesionnals
Graph databases for SQL Server profesionnalsGraph databases for SQL Server profesionnals
Graph databases for SQL Server profesionnals
 

Más de WibiData

Data Evolution on HBase with Kiji
Data Evolution on HBase with KijiData Evolution on HBase with Kiji
Data Evolution on HBase with KijiWibiData
 
Exploring the Enron Email Dataset with Kiji and Hive
Exploring the Enron Email Dataset with Kiji and HiveExploring the Enron Email Dataset with Kiji and Hive
Exploring the Enron Email Dataset with Kiji and HiveWibiData
 
Performing Data Science with HBase
Performing Data Science with HBasePerforming Data Science with HBase
Performing Data Science with HBaseWibiData
 
Analyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseAnalyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseWibiData
 
Building Personalized Applications at Scale
Building Personalized Applications at ScaleBuilding Personalized Applications at Scale
Building Personalized Applications at ScaleWibiData
 
Building Personalized Applications with HBase
Building Personalized Applications with HBaseBuilding Personalized Applications with HBase
Building Personalized Applications with HBaseWibiData
 

Más de WibiData (6)

Data Evolution on HBase with Kiji
Data Evolution on HBase with KijiData Evolution on HBase with Kiji
Data Evolution on HBase with Kiji
 
Exploring the Enron Email Dataset with Kiji and Hive
Exploring the Enron Email Dataset with Kiji and HiveExploring the Enron Email Dataset with Kiji and Hive
Exploring the Enron Email Dataset with Kiji and Hive
 
Performing Data Science with HBase
Performing Data Science with HBasePerforming Data Science with HBase
Performing Data Science with HBase
 
Analyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBaseAnalyzing Large-Scale User Data with Hadoop and HBase
Analyzing Large-Scale User Data with Hadoop and HBase
 
Building Personalized Applications at Scale
Building Personalized Applications at ScaleBuilding Personalized Applications at Scale
Building Personalized Applications at Scale
 
Building Personalized Applications with HBase
Building Personalized Applications with HBaseBuilding Personalized Applications with HBase
Building Personalized Applications with HBase
 

Último

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 

Analyzing Large-Scale User Data with Hadoop and HBase

  • 1.
  • 2. : Analyzing Large-Scale User Data with Hadoop and HBase Aaron Kimball – CTO Odiago, Inc. Developed By:
  • 3. is… • A large-scale storage, serving, and analysis platform • For user- or other entity-centric data Developed By:
  • 4. use cases • Product/content recommendations – “Because you liked book X, you may like book Y” • Ad targeting – “Because of your interest in sports, check out…” • Social network analysis – “You may know these people…” • Fraud detection, anti-spam, search personalization… Developed By:
  • 5. Use case characteristics • Have a large number of users • Want to store (large) transaction data as well as derived data (e.g., recommendations) • Need to serve recommendations interactively • Require a combination of offline and on-the- fly computation Developed By:
  • 7. Challenges • Support real-time retrieval of profile data • Store a long transactional data history • Keep related data logically and physically close • Update data in a timely fashion without wasting computation • Fault tolerance • Data schema changes over time Developed By:
  • 8. architecture Certified Technology product Developed By:
  • 9. HBase data model • Data in cells, addressed by four “coordinates” – Row Id (primary key) – Column family – Column “qualifier” – Timestamp Developed By:
  • 10. Schema free: not what you want • HBase may not impose a schema, but your data still has one • Up to the application to determine how to organize & interpret data • You still need to pick a serialization system Developed By:
  • 11. Schemas = trade-offs • Different schemas enable efficient storage/retrieval/analysis of different types of data • Physical organization of data still makes a big difference – Especially with respect to read/write patterns Developed By:
  • 12. WibiData workloads • Large number of fat rows (one per user) • Each row updated relatively few times/day – Though updates may involve large records • Raw data written to one set of columns • Processed results read from another – Often with an interactive latency requirement • Needs to support complex data types Developed By:
  • 13. Serialization with • Apache Avro provides flexible serialization • All data written along with its “writer schema” • Reader schema may differ from the writer’s { "type": "record", "name": "LongList", "fields" : [ {"name": "value", "type": "long"}, {"name": "next", "type": ["LongList", "null"]} ] } Developed By:
  • 14. Serialization with • No code generation required • Producers and consumers of data can migrate independently • Data format migrations do not require structural changes to underlying data Developed By:
  • 15. WibiData: An extended data model <column> <name>email</name> <description>Email address</description> <schema>"string"</schema> </column> • Columns or whole families have common Avro schemas for evolvable storage and retrieval Developed By:
  • 16. WibiData: An extended data model • Column families are a logical concept • Data is physically arranged in locality groups • Row ids are hashed for uniform write pressure Developed By:
  • 17. WibiData: An extended data model • Wibi uses 3-d storage • Data is often sorted by timestamp Developed By:
  • 18. Analyzing data: Producers • Producers create derived column values • Produce operator works on one row at a time – Can be run in MapReduce, or on a one-off basis • Produce is a row mutation operator Developed By:
  • 19. Analyzing data: Gatherers • Gatherers aggregate data across all rows • Always run within MapReduce • A bridge between rows and (key, value) pairs Developed By:
  • 20. Interactive access: REST API PUT request GET request • REST API provides interactive access • Producers can be triggered “on demand” to create fresh recommendations Developed By:
  • 22. Example: Ad Targeting Producer Developed By:
  • 23. Example: Ad Targeting Producer Developed By:
  • 24. Gathering Category Associations • Gather observed behavior Developed By:
  • 25. Gathering Category Associations • Associate interests, clicks Developed By:
  • 26. Gathering Category Associations • … for all pairs Developed By:
  • 27. Gathering Category Associations • And aggregate across all users Map phase: Reduce phase: Developed By:
  • 28. Conclusions • Hadoop, HBase, Avro form the core of a large- scale machine learning/analysis platform • How you set up your schema matters • Producer/gatherer programming model allows computations over tables to be expressed naturally; works with MapReduce Developed By:
  • 29. www.wibidata.com / @wibidata Aaron Kimball – aaron@odiago.com Developed By: