Numerical Recipes in  Hadoop Jake Mannix linkedin/in/jakemannix twitter/pbrane jake.mannix@gmail.com jmannix@apache.org Principal SDE, LinkedIn Committer, Apache Mahout, Zoie,  Bobo-Browse, Decomposer Author, Lucene in Depth (Manning MM/DD/2010)
A Mathematician’s Apology What mathematical structure describes all of these? Full-text search: Score documents matching “query string” Collaborative filtering recommendation: Users who liked {those} also liked {these} (Social/web)-graph proximity: People/pages “close” to {this} are {these}
Matrix Multiplication!
Full-text Search Vector Space Model of IR Corpus as term-document matrix Query as bag-of-words vector Full-text search is just:
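The formula on this slide did not survive extraction. As a minimal sketch of the idea (scoring as a matrix-vector product), the Python below multiplies a toy docs-by-terms matrix by a bag-of-words query vector. The vocabulary, weights, and helper names are assumptions made up for illustration, not taken from the talk.

```python
# Toy corpus matrix: rows are documents, columns are terms.
# Vocabulary and weights are invented for illustration.
vocab = ["hadoop", "matrix", "search", "mahout"]
corpus = [
    [1.0, 0.0, 2.0, 0.0],  # doc 0
    [0.0, 3.0, 0.0, 1.0],  # doc 1
    [1.0, 1.0, 1.0, 0.0],  # doc 2
]


def query_vector(words, vocab):
    """Bag-of-words vector: how often each vocabulary term occurs in the query."""
    return [float(words.count(term)) for term in vocab]


def mat_vec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


q = query_vector(["matrix", "search"], vocab)
scores = mat_vec(corpus, q)  # one score per document
# Rank documents by descending score (ties keep document order).
ranking = sorted(range(len(scores)), key=lambda d: -scores[d])
```

Each entry of `scores` is the dot product of one document row with the query, so ranking is a single matrix-vector multiplication followed by a sort.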
Collaborative Filtering User preference matrix  (and item-item similarity matrix                 ) Input user as vector of preferences  (simple) Item-based CF recommendations are: T
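A small sketch of simple item-based recommendations, assuming a users-by-items preference matrix A, an item-item similarity matrix A^T A, and recommendations computed as (A^T A) u for an input preference vector u. The data and function names are hypothetical.

```python
# Hypothetical users-by-items preference matrix A (0.0 means "no rating").
prefs = [
    [5.0, 3.0, 0.0],  # user 0
    [4.0, 0.0, 1.0],  # user 1
    [0.0, 2.0, 5.0],  # user 2
]


def transpose(m):
    """Swap rows and columns of a row-major matrix."""
    return [list(col) for col in zip(*m)]


def mat_mul(a, b):
    """Dense matrix product a @ b."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def mat_vec(m, v):
    """Multiply a row-major matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


# Item-item similarity: A^T A (items-by-items).
item_sim = mat_mul(transpose(prefs), prefs)

# A new user who has only expressed a preference for item 0.
user = [1.0, 0.0, 0.0]
recs = mat_vec(item_sim, user)  # one recommendation score per item
```

In practice the items the user has already rated would be filtered out of `recs` before ranking; that step is omitted here.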
Graph Proximity Adjacency matrix: 2nd degree adjacency matrix:   Input all of a user’s “friends” or page links: (weighted) distance measure of 1st – 3rd degree connections is then:
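The matrices on this slide were lost in extraction. The sketch below assumes the weighted measure has the form (w1*A + w2*A^2 + w3*A^3) v for an adjacency matrix A and an input vector v, and evaluates it with repeated matrix-vector products so that A^2 and A^3 are never formed. The graph, weights, and names are illustrative assumptions.

```python
# Adjacency matrix of a small undirected path graph 0 - 1 - 2 - 3.
adjacency = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]


def mat_vec(m, v):
    """Multiply a row-major matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def proximity(adjacency, start, weights):
    """Weighted sum of A^k @ start for k = 1..len(weights)."""
    scores = [0.0] * len(start)
    frontier = start
    for w in weights:
        frontier = mat_vec(adjacency, frontier)  # one more hop
        scores = [s + w * f for s, f in zip(scores, frontier)]
    return scores


start = [1.0, 0.0, 0.0, 0.0]  # indicator vector for node 0
scores = proximity(adjacency, start, [1.0, 0.5, 0.25])
```

Decaying weights make first-degree connections count more than second- and third-degree ones; the particular values 1.0, 0.5, 0.25 are arbitrary.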
Dictionary Applications                  Linear Algebra
How does this help? In Search: Latent Semantic Indexing (LSI) probabilistic LSI Latent Dirichlet Allocation In Recommenders: Singular Value Decomposition Layered Restricted Boltzmann Machines  (Deep Belief Networks) In Graphs: PageRank Spectral Decomposition / Spectral Clustering
Often use “Dimensional Reduction” To alleviate the sparse Big Data problem of “the curse of dimensionality” Used to improve recall and relevance  in general: smooth the metric on your data set
New applications with Matrices If Search is finding doc-vector by:  and users query with data represented: Q =  Giving implicit feedback based on click-through per session: C =
… continued Then               has the form (docs-by-terms) for search! Approach has been used by Ted Dunning at Veoh (and probably others)
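The product named on this slide was lost in extraction. One product with the stated docs-by-terms shape is C^T Q, assuming Q is sessions-by-terms and C is sessions-by-docs; that choice is an assumption here, not a quotation from the talk. A toy check of the shapes:

```python
# Q: sessions-by-terms (which query terms each session used). Toy data.
Q = [
    [1.0, 1.0, 0.0],  # session 0
    [0.0, 1.0, 1.0],  # session 1
]
# C: sessions-by-docs (which documents each session clicked). Toy data.
C = [
    [1.0, 0.0],  # session 0 clicked doc 0
    [1.0, 1.0],  # session 1 clicked docs 0 and 1
]


def transpose(m):
    """Swap rows and columns of a row-major matrix."""
    return [list(col) for col in zip(*m)]


def mat_mul(a, b):
    """Dense matrix product a @ b."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# docs-by-terms: each document row accumulates the query terms
# of the sessions in which it was clicked.
doc_terms = mat_mul(transpose(C), Q)
shape = (len(doc_terms), len(doc_terms[0]))
```

Because `doc_terms` has the same layout as a corpus matrix, it can be scored against a new query vector in the same way as in full-text search.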
Linear Algebra performance tricks Naïve item-based recommendations: Calculate item similarity matrix: Calculate item recs: Express in one step: In matrix notation: Re-writing as:      is the vector of preferences for user “v”,       is the vector of preferences of item “i” The result is the matrix sum of the outer (tensor) products of these vectors, scaled by the entry they intersect at.
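The equations on this slide did not survive extraction. As a sketch of the outer-product idea only, the code below checks the standard identity that A^T A equals the sum, over the rows of A, of each row's outer product with itself. The matrix and helper names are hypothetical.

```python
# Hypothetical users-by-items preference matrix; each row is one user's vector.
prefs = [
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 2.0, 5.0],
]


def outer(u, v):
    """Outer (tensor) product of two vectors."""
    return [[a * b for b in v] for a in u]


def mat_add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gram_by_outer_products(rows):
    """A^T A computed as a sum of per-row outer products (one pass over rows)."""
    n = len(rows[0])
    total = [[0.0] * n for _ in range(n)]
    for row in rows:
        total = mat_add(total, outer(row, row))
    return total


def gram_naive(rows):
    """A^T A computed entry by entry as dot products of columns."""
    cols = list(zip(*rows))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]


by_rows = gram_by_outer_products(prefs)
by_cols = gram_naive(prefs)
```

The row-wise form suits MapReduce because each row can be processed independently and the partial matrices simply added.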
Item Recommender via Hadoop
Apache Mahout Apache Mahout currently on release 0.3 http://lucene.apache.org/mahout Will be a “Top Level Project” soon (before 0.4) ( http://mahout.apache.org ) “Scalable Machine Learning with commercially friendly licensing”
Mahout Features  Recommenders  absorbed the Taste project Classification (Naïve Bayes, C-Bayes, more) Clustering (Canopy, fuzzy-K-means, Dirichlet, etc…) Fast non-distributed linear mathematics  absorbed the classic CERN Colt project Distributed Matrices and decomposition absorbed the Decomposer project mahout shell-script analogous to $HADOOP_HOME/bin/hadoop $MAHOUT_HOME/bin/mahout kmeans -i "in" -o "out" -k 100 $MAHOUT_HOME/bin/mahout svd -i "in" -o "out" -k 300 etc… Taste web-app for real-time recommendations
DistributedRowMatrix Wrapper around a SequenceFile<IntWritable,VectorWritable> Distributed methods like: Matrix transpose(); Matrix times(Matrix other); Vector times(Vector v); Vector timesSquared(Vector v); To get SVD: pass into DistributedLanczosSolver: LanczosSolver.solve(Matrix input, Matrix eigenVectors, List<Double> eigenValues, int rank);
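A rough Python sketch of what a timesSquared-style operation can compute, assuming it means A^T (A v). This is not Mahout's code; the function name and data are illustrative. The point is that the result can be accumulated in one pass over the rows, without materializing A^T A.

```python
def times_squared(rows, v):
    """Compute A^T (A v) in a single pass over the rows of A.

    Each row contributes (row . v) * row, so rows can be processed
    independently and their contributions summed.
    """
    result = [0.0] * len(v)
    for row in rows:
        weight = sum(a * x for a, x in zip(row, v))  # row . v
        if weight != 0.0:
            result = [r + weight * a for r, a in zip(result, row)]
    return result


# Toy matrix: three rows, three columns.
rows = [
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 2.0, 5.0],
]
first = times_squared(rows, [1.0, 0.0, 0.0])
second = times_squared(rows, [0.0, 1.0, 0.0])
```

Iterative solvers such as Lanczos only need this kind of matrix-vector operation, which is why a row-oriented distributed matrix is enough to drive them.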
Questions? Contact:  jake.mannix@gmail.com jmannix@apache.org http://twitter.com/pbrane http://www.decomposer.org/blog http://www.linkedin.com/in/jakemannix
Appendix There are lots of ways to deal with sparse Big Data, and many (not all) need to deal with the dimensionality of the feature-space growing beyond reasonable limits, and techniques to deal with this depend heavily on your data… That having been said, there are some general techniques
Dealing with Curse of Dimensionality Sparseness means fast, but overlap is too small Can we reduce the dimensionality (from “all possible text tokens” or “all userIds”) while keeping the nice aspects of the search problem? If possible, collapse “similar” vectors (synonymous terms, userIds with high overlap, etc…) towards each other while keeping “dissimilar” vectors far apart…
Solution A: Matrix decomposition Singular Value Decomposition (truncated) “best” approximation to your matrix Used in Latent Semantic Indexing (LSI) For graphs: spectral decomposition Collaborative filtering (Netflix leaderboard) Issues: very computation intensive; no parallelized open-source packages (see Apache Mahout); makes things too dense
SVD: continued Hadoop impl. in Mahout (Lanczos) O(N*d*k) for rank-k SVD on N docs, d elements each Density can be dealt with by doing Canopy Clustering offline But only extracting linear feature mixes Also, still very computation intensive and I/O intensive (k passes over data set). Are there better dimensional reduction methods?
Solution B: Stochastic Decomposition co-occurrence-based kernel + online Random Projection + SVD
Co-occurrence-based kernel Extract bigram phrases / pairs of items rated by the same person (using Log-Likelihood Ratio test to pick the best) “Disney on Ice was Amazing!” -> {“disney”, “disney on ice”, “ice”, “was”, “amazing”} {item1:4, item2:5, item5:3, item9:1} -> {item1:4, (items1+2):4.5, item2:5, item5:3,…} Dim(features) goes from 10^5 to 10^8+ (yikes!)
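A minimal sketch of scoring candidate bigrams with a log-likelihood ratio test, assuming the usual 2x2 contingency-table form of the statistic (the entropy-based formulation). The tokenization, the sample sentence, and every function name are assumptions for illustration.

```python
import math
from collections import Counter


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: both events together, k12: first only,
    k21: second only, k22: neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp rounding noise


def bigram_scores(tokens):
    """Score every adjacent token pair by how surprising its co-occurrence is."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)
    total = len(pairs)
    scores = {}
    for (a, b), k11 in pair_counts.items():
        k12 = first_counts[a] - k11   # a followed by something else
        k21 = second_counts[b] - k11  # b preceded by something else
        k22 = total - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


tokens = "disney on ice was amazing and disney on ice was cold".split()
scores = bigram_scores(tokens)
```

Pairs with a high score would be kept as extra features; the cut-off is a tuning choice and is not specified in the slides.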
Online Random Projection Randomly project kernelized text vectors down to “merely” 10^3 dimensions with a Gaussian matrix  Or project each nGram down to a random (but sparse) 10^3-dim vector: V = {123876244 => 1.3}    (tf-IDF of “disney”) V’ = c*{h(i) => 1, h(h(i)) => 1, h(h(h(i))) => 1}     (c = 1.3 / sqrt(3))
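The sparse variant can be sketched as follows, assuming h is any deterministic hash into the 10^3 target slots. The MD5-based `h` and the function names are stand-ins chosen for illustration; the slides do not say which hash was used.

```python
import hashlib
import math

DIM = 1000  # target dimensionality ("merely" 10^3)


def h(i, dim=DIM):
    """Deterministic hash of an integer index into [0, dim). Stand-in choice."""
    digest = hashlib.md5(str(i).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dim


def project_feature(index, value, dim=DIM, probes=3):
    """Spread one feature over h(i), h(h(i)), h(h(h(i))), scaled by 1/sqrt(probes)."""
    c = value / math.sqrt(probes)
    projected = {}
    slot = index
    for _ in range(probes):
        slot = h(slot, dim)
        # Accumulate so that a collision between probes is not lost.
        projected[slot] = projected.get(slot, 0.0) + c
    return projected


def project_vector(sparse, dim=DIM, probes=3):
    """Project a sparse vector (dict of index -> value) feature by feature."""
    out = {}
    for index, value in sparse.items():
        for slot, weight in project_feature(index, value, dim, probes).items():
            out[slot] = out.get(slot, 0.0) + weight
    return out


v = {123876244: 1.3}  # tf-IDF weight of one term, as on the slide
v_projected = project_vector(v)
```

With three distinct slots the projected vector has the same Euclidean norm as the original weight, since 3 * (1.3 / sqrt(3))^2 = 1.3^2.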
Outer-product and Sum Take the 10^3-dim projected vectors and outer-product with themselves, result is 10^3 x 10^3-dim matrix. All results go to single Reducer, where you compute…
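A small sketch of the accumulate-and-sum step, using dictionaries keyed by (row, column) so that sparse projected vectors only touch the entries they populate. The function names and the two toy vectors are hypothetical; in a MapReduce job the per-document outer products would be emitted by mappers and summed in a combiner or reducer.

```python
from collections import defaultdict


def outer_sparse(vec):
    """Outer product of a sparse vector (dict of index -> value) with itself."""
    return {(i, j): a * b for i, a in vec.items() for j, b in vec.items()}


def sum_partials(partials):
    """Add up partial matrices, as a reducer would."""
    total = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Two already-projected documents (toy values, tiny dimension).
projected_docs = [
    {0: 1.0, 2: 2.0},
    {2: 1.0, 3: 3.0},
]
gram = sum_partials(outer_sparse(doc) for doc in projected_docs)
```

The summed matrix is symmetric and has at most 10^3 x 10^3 entries regardless of how many documents were folded in, which is what lets it fit in one reducer's memory.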
SVD  SVD-them quickly (they fit in memory)  Over and over again (as new data comes in) Use the most recent SVD to project your (already randomly projected) text still further (now encoding “semantic” similarity). SVD-projected vectors can be assigned immediately to nearest clusters if desired
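To illustrate the in-memory step on a small symmetric matrix, the sketch below uses power iteration, a swapped-in stand-in that recovers only the leading component rather than a full SVD. A real pipeline would more likely call a linear algebra library; the names and the 2x2 matrix are assumptions.

```python
import math


def mat_vec(m, v):
    """Multiply a row-major matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_component(matrix, iterations=100):
    """Leading eigenvalue and eigenvector of a symmetric matrix by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    vec = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(iterations):
        vec = normalize(mat_vec(matrix, vec))
    eigenvalue = sum(x * y for x, y in zip(vec, mat_vec(matrix, vec)))
    return eigenvalue, vec


def project(component, x):
    """Coordinate of x along a unit-length component."""
    return sum(a * b for a, b in zip(component, x))


# Toy stand-in for the summed outer-product matrix.
summed = [
    [2.0, 1.0],
    [1.0, 2.0],
]
eigenvalue, component = top_component(summed)
coordinate = project(component, [1.0, 1.0])
```

Re-running `top_component` as new data is folded into the summed matrix mirrors the "over and over again" step on the slide, and `project` is the further projection applied to each already randomly projected vector.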
References Randomized matrix decomposition review: http://arxiv.org/abs/0909.4061 Sparse hashing/projection: John Langford et al. “Vowpal Wabbit” http://hunch.net/~vw/

Seattle Scalability Mahout

  • 1. Numerical Recipes in Hadoop Jake Mannix linkedin/in/jakemannix twitter/pbrane jake.mannix@gmail.com jmannix@apache.org Principal SDE, LinkedIn Committer, Apache Mahout, Zoie, Bobo-Browse, Decomposer Author, Lucene in Depth (Manning MM/DD/2010)
  • 2. A Mathematician’s Apology What mathematical structure describes all of these? Full-text search: Score documents matching “query string” Collaborative filtering recommendation: Users who liked {those} also liked {these} (Social/web)-graph proximity: People/pages “close” to {this} are {these}
  • 4. Full-text Search Vector Space Model of IR Corpus as term-document matrix Query as bag-of-words vector Full-text search is just:
  • 5. Collaborative Filtering User preference matrix (and item-item similarity matrix ) Input user as vector of preferences (simple) Item-based CF recommendations are: T
  • 6. Graph Proximity Adjacency matrix: 2nd degree adjacency matrix: Input all of a user’s “friends” or page links: (weighted) distance measure of 1st – 3rd degree connections is then:
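Slides 4 to 6 all reduce to multiplying a matrix by an input vector. The formulas themselves did not survive in this transcript, so the following is only a minimal in-memory sketch of that idea, not the slides' own notation. The class name, the toy corpus, and the toy graph are hypothetical; item-based collaborative filtering would be the same `times` call with an item-item similarity matrix and a user preference vector.

```java
import java.util.Arrays;

final class MatrixScoring {
    // Dense matrix-vector product: out[i] = sum over j of m[i][j] * v[j].
    static double[] times(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) {
                out[i] += m[i][j] * v[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: 3 documents (rows) over 4 terms (columns).
        double[][] corpus = {
            {1, 0, 2, 0},
            {0, 1, 0, 1},
            {1, 1, 0, 0},
        };
        // Bag-of-words query containing terms 0 and 2.
        double[] query = {1, 0, 1, 0};
        // One score per document.
        System.out.println(Arrays.toString(times(corpus, query)));

        // Hypothetical undirected path graph 0-1-2-3 as an adjacency matrix.
        double[][] adjacency = {
            {0, 1, 0, 0},
            {1, 0, 1, 0},
            {0, 1, 0, 1},
            {0, 0, 1, 0},
        };
        // Node 0's only friend is node 1.
        double[] friendsOfNode0 = {0, 1, 0, 0};
        // Friends of friends; this includes the path back to node 0 itself.
        System.out.println(Arrays.toString(times(adjacency, friendsOfNode0)));
    }
}
```

A weighted proximity measure over several degrees would sum such products with decreasing weights; the weights are a modelling choice that the transcript does not specify.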
  • 7. Dictionary Applications Linear Algebra
  • 8. How does this help? In Search: Latent Semantic Indexing (LSI), probabilistic LSI, Latent Dirichlet Allocation. In Recommenders: Singular Value Decomposition, Layered Restricted Boltzmann Machines (Deep Belief Networks). In Graphs: PageRank, Spectral Decomposition / Spectral Clustering
  • 9. Often use “Dimensional Reduction” To alleviate the sparse Big Data problem of “the curse of dimensionality” Used to improve recall and relevance in general: smooth the metric on your data set
  • 10. New applications with Matrices If Search is finding doc-vector by: and users query with data represented: Q = Giving implicit feedback based on click-through per session: C =
  • 11. … continued Then has the form (docs-by-terms) for search! Approach has been used by Ted Dunning at Veoh (and probably others)
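The expression on slide 11 was lost in extraction. One reading consistent with the stated docs-by-terms shape, assuming $Q$ is a sessions-by-terms matrix of queries and $C$ is a sessions-by-docs matrix of clicks (both shapes are assumptions, since the definitions on slide 10 are also missing), is:

```latex
\[
  C^{T} Q \in \mathbb{R}^{\text{docs} \times \text{terms}},
  \qquad
  \left(C^{T} Q\right)_{d,t} = \sum_{s \in \text{sessions}} C_{s,d}\, Q_{s,t}
\]
```

Under that reading, entry $(d, t)$ accumulates how often term $t$ was queried in sessions that clicked document $d$, so the product can stand in for a corpus matrix in the search formulation.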
  • 12. Linear Algebra performance tricks Naïve item-based recommendations: Calculate item similarity matrix: Calculate item recs: Express in one step: In matrix notation: Re-writing as: is the vector of preferences for user “v”, is the vector of preferences of item “i” The result is the matrix sum of the outer (tensor) products of these vectors, scaled by the entry they intersect at.
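The rewriting described on slide 12 can be checked numerically. The sketch below assumes the one-step expression is the triple product R = A (AᵀA) for a users-by-items preference matrix A, which is one reading of the lost formulas rather than a confirmed transcription. Under that assumption, R equals the sum over entries A[v][i] of A[v][i] times the outer product of item column i with user row v. Class and method names are hypothetical.

```java
import java.util.Arrays;

final class OuterProductRecs {
    // Naive route: S = A^T A (items-by-items), then R = A S (users-by-items).
    static double[][] naive(double[][] a) {
        int users = a.length, items = a[0].length;
        double[][] s = new double[items][items];
        for (int i = 0; i < items; i++)
            for (int j = 0; j < items; j++)
                for (int v = 0; v < users; v++)
                    s[i][j] += a[v][i] * a[v][j];
        double[][] r = new double[users][items];
        for (int u = 0; u < users; u++)
            for (int j = 0; j < items; j++)
                for (int i = 0; i < items; i++)
                    r[u][j] += a[u][i] * s[i][j];
        return r;
    }

    // One-step route: for every nonzero entry A[v][i], add
    // A[v][i] * (column i of A) outer (row v of A) into the result.
    static double[][] outerProductSum(double[][] a) {
        int users = a.length, items = a[0].length;
        double[][] r = new double[users][items];
        for (int v = 0; v < users; v++) {
            for (int i = 0; i < items; i++) {
                double weight = a[v][i];
                if (weight == 0) continue; // sparse entries contribute nothing
                for (int u = 0; u < users; u++)
                    for (int j = 0; j < items; j++)
                        r[u][j] += weight * a[u][i] * a[v][j];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Hypothetical ratings: 3 users (rows) by 3 items (columns).
        double[][] a = {{5, 0, 3}, {4, 2, 0}, {0, 1, 4}};
        System.out.println(Arrays.deepToString(naive(a)));
        System.out.println(Arrays.deepToString(outerProductSum(a)));
    }
}
```

The second form only ever touches nonzero entries of A, which is why it suits sparse data: the item-item similarity matrix never has to be materialized.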
  • 14. Apache Mahout Apache Mahout currently on release 0.3 http://lucene.apache.org/mahout Will be a “Top Level Project” soon (before 0.4) ( http://mahout.apache.org ) “Scalable Machine Learning with commercially friendly licensing”
  • 15. Mahout Features Recommenders: absorbed the Taste project. Classification (Naïve Bayes, C-Bayes, more). Clustering (Canopy, fuzzy-K-means, Dirichlet, etc…). Fast non-distributed linear mathematics: absorbed the classic CERN Colt project. Distributed Matrices and decomposition: absorbed the Decomposer project. mahout shell-script analogous to $HADOOP_HOME/bin/hadoop: $MAHOUT_HOME/bin/mahout kmeans -i "in" -o "out" -k 100; $MAHOUT_HOME/bin/mahout svd -i "in" -o "out" -k 300; etc… Taste web-app for real-time recommendations
  • 16. DistributedRowMatrix Wrapper around a SequenceFile<IntWritable,VectorWritable> Distributed methods like: Matrix transpose(); Matrix times(Matrix other); Vector times(Vector v); Vector timesSquared(Vector v); To get SVD: pass into DistributedLanczosSolver: LanczosSolver.solve(Matrix input, Matrix eigenVectors, List<Double> eigenValues, int rank);
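To illustrate why a row-oriented layout supports `timesSquared` well, here is a minimal in-memory sketch. It assumes `timesSquared(v)` means Aᵀ(Av) and uses plain arrays, not Mahout's `DistributedRowMatrix` or any Hadoop API; the class name is hypothetical. The point is that each row contributes independently, so the product needs only a single pass over the rows and never forms the transpose.

```java
import java.util.Arrays;

final class RowMatrixSketch {
    // Computes A^T (A v) in a single pass over the rows of A:
    // each row r contributes (r . v) * r to the result.
    static double[] timesSquared(double[][] rows, double[] v) {
        double[] out = new double[v.length];
        for (double[] row : rows) {
            double dot = 0;
            for (int j = 0; j < v.length; j++) {
                dot += row[j] * v[j];
            }
            for (int j = 0; j < v.length; j++) {
                out[j] += dot * row[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[] v = {1, 1};
        // A v = [3, 7], and A^T [3, 7] = [24, 34].
        System.out.println(Arrays.toString(timesSquared(a, v)));
    }
}
```

In a distributed setting the per-row contributions would presumably be computed by mappers and summed in a reduce step, since the rows are independent.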
  • 17. Questions? Contact: jake.mannix@gmail.com jmannix@apache.org http://twitter.com/pbrane http://www.decomposer.org/blog http://www.linkedin.com/in/jakemannix
  • 18. Appendix There are lots of ways to deal with sparse Big Data, and many (not all) need to deal with the dimensionality of the feature-space growing beyond reasonable limits, and techniques to deal with this depend heavily on your data… That having been said, there are some general techniques
  • 19. Dealing with Curse of Dimensionality Sparseness means fast, but overlap is too small Can we reduce the dimensionality (from “all possible text tokens” or “all userIds”) while keeping the nice aspects of the search problem? If possible, collapse “similar” vectors (synonymous terms, userIds with high overlap, etc…) towards each other while keeping “dissimilar” vectors far apart…
  • 20. Solution A: Matrix decomposition Singular Value Decomposition (truncated): “best” approximation to your matrix. Used in Latent Semantic Indexing (LSI). For graphs: spectral decomposition. Collaborative filtering (Netflix leaderboard). Issues: very computation intensive; no parallelized open-source packages (see Apache Mahout); makes things too dense
  • 21. SVD: continued Hadoop impl. in Mahout (Lanczos): O(N*d*k) for rank-k SVD on N docs, d elts each. Density can be dealt with by doing Canopy Clustering offline. But only extracting linear feature mixes. Also, still very computation intensive and I/O intensive (k passes over the data set); are there better dimensional reduction methods?
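The cost of one pass over the data per iteration can be seen in a much simpler relative of Lanczos. The sketch below uses plain power iteration, swapped in for brevity; it is not the Lanczos algorithm and not Mahout's solver. It approximates only the top singular value, and each iteration is one Aᵀ(Av) pass over the rows. All names are hypothetical.

```java
final class PowerIterationSketch {
    // One pass over the rows: returns A^T (A v).
    static double[] timesSquared(double[][] rows, double[] v) {
        double[] out = new double[v.length];
        for (double[] row : rows) {
            double dot = 0;
            for (int j = 0; j < v.length; j++) dot += row[j] * v[j];
            for (int j = 0; j < v.length; j++) out[j] += dot * row[j];
        }
        return out;
    }

    static double norm(double[] x) {
        double sum = 0;
        for (double xi : x) sum += xi * xi;
        return Math.sqrt(sum);
    }

    // Approximates the largest singular value of A by power iteration on A^T A.
    // Assumes the fixed start vector is not orthogonal to the top singular vector.
    static double topSingularValue(double[][] rows, int iterations) {
        int d = rows[0].length;
        double[] v = new double[d];
        java.util.Arrays.fill(v, 1.0 / Math.sqrt(d));
        for (int it = 0; it < iterations; it++) {
            double[] w = timesSquared(rows, v); // one full pass over the data
            double n = norm(w);
            if (n == 0) return 0; // A v vanished, nothing to normalize
            for (int j = 0; j < d; j++) v[j] = w[j] / n;
        }
        // sigma = ||A v|| for the converged unit vector v.
        double[] av = new double[rows.length];
        for (int i = 0; i < rows.length; i++)
            for (int j = 0; j < d; j++) av[i] += rows[i][j] * v[j];
        return norm(av);
    }

    public static void main(String[] args) {
        double[][] a = {{3, 0}, {0, 1}};
        System.out.println(topSingularValue(a, 50));
    }
}
```

Lanczos extracts several singular vectors from the same kind of repeated pass, which is where the I/O cost noted on the slide comes from.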
  • 22. Solution B: Stochastic Decomposition co-occurrence-based kernel + online Random Projection + SVD
  • 23. Co-occurrence-based kernel Extract bigram phrases / pairs of items rated by the same person (using Log-Likelihood Ratio test to pick the best) “Disney on Ice was Amazing!” -> {“disney”, “disney on ice”, “ice”, “was”, “amazing”} {item1:4, item2:5, item5:3, item9:1} -> {item1:4, (items1+2):4.5, item2:5, item5:3,…} Dim(features) goes from 10^5 to 10^8+ (yikes!)
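The Log-Likelihood Ratio test mentioned on slide 23 scores a candidate pair from a 2x2 table of co-occurrence counts. The following is a minimal sketch of the standard G² statistic in its entropy form; the class name and the example counts are hypothetical, and the threshold for "best" pairs is not given in the slides. Here k11 counts events containing both items, k12 and k21 count events with only one of them, and k22 counts events with neither.

```java
final class LlrSketch {
    static double xLogX(double x) {
        return x == 0 ? 0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts: N*ln(N) - sum of k*ln(k).
    static double unnormalizedEntropy(double... counts) {
        double sum = 0, parts = 0;
        for (double c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // G^2 log-likelihood ratio for a 2x2 contingency table.
    // Clamped at 0 to hide tiny negative values from floating-point rounding.
    static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = unnormalizedEntropy(k11 + k12, k21 + k22);
        double colEntropy = unnormalizedEntropy(k11 + k21, k12 + k22);
        double matEntropy = unnormalizedEntropy(k11, k12, k21, k22);
        return Math.max(0, 2 * (rowEntropy + colEntropy - matEntropy));
    }

    public static void main(String[] args) {
        // Independent counts score near zero; strongly associated counts score high.
        System.out.println(llr(10, 10, 10, 10));
        System.out.println(llr(100, 5, 5, 1000));
    }
}
```

Pairs whose score clears a chosen cutoff would become new features, which is how the feature space grows so quickly.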
  • 24. Online Random Projection Randomly project kernelized text vectors down to “merely” 10^3 dimensions with a Gaussian matrix. Or project each nGram down to a random (but sparse) 10^3-dim vector: V = {123876244 => 1.3} (tf-IDF of “disney”) V’ = c*{h(i) => 1, h(h(i)) => 1, h(h(h(i))) => 1} (c = 1.3 / sqrt(3))
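The sparse variant on slide 24 can be sketched as follows. The slide does not say which hash function h is used, so the murmur3-style integer mixer below is an assumption, as are the class name and the constants. Each feature id is hashed three times in a chain, and each resulting slot receives weight / sqrt(3).

```java
import java.util.Map;

final class HashedProjectionSketch {
    static final int DIM = 1000;  // target dimensionality (10^3 on the slide)
    static final int PROBES = 3;  // slots touched per input feature

    // Hypothetical hash h: integer mixing steps, reduced into [0, DIM).
    static int h(int x) {
        x ^= x >>> 16;
        x *= 0x85ebca6b;
        x ^= x >>> 13;
        x *= 0xc2b2ae35;
        x ^= x >>> 16;
        return Math.floorMod(x, DIM);
    }

    // Projects a sparse vector (feature id -> weight) into DIM dense dimensions.
    static double[] project(Map<Integer, Double> sparse) {
        double[] out = new double[DIM];
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            double c = e.getValue() / Math.sqrt(PROBES);
            int slot = e.getKey();
            for (int p = 0; p < PROBES; p++) {
                slot = h(slot);   // h(i), then h(h(i)), then h(h(h(i)))
                out[slot] += c;   // colliding slots simply accumulate
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The slide's example: feature 123876244 with tf-IDF weight 1.3.
        double[] projected = project(Map.of(123876244, 1.3));
        for (int i = 0; i < DIM; i++) {
            if (projected[i] != 0) System.out.println(i + " => " + projected[i]);
        }
    }
}
```

When the three slots are distinct, the projected vector keeps the original weight as its Euclidean norm, which is the reason for the 1 / sqrt(3) scaling.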
  • 26. SVD SVD-them quickly (they fit in memory) Over and over again (as new data comes in) Use the most recent SVD to project your (already randomly projected) text still further (now encoding “semantic” similarity). SVD-projected vectors can be assigned immediately to nearest clusters if desired
  • 27. References Randomized matrix decomposition review: http://arxiv.org/abs/0909.4061 Sparse hashing/projection: John Langford et al. “Vowpal Wabbit” http://hunch.net/~vw/

Editor's notes

  1. And the usual references for LSI and Spectral Decomposition