SlideShare una empresa de Scribd logo
1 de 45
A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene Ed Buech é , EMC edward.bueche@emc.com, May 25, 2011
Agenda ,[object Object],[object Object],[object Object],[object Object]
My Background ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Documentum search 101 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Introducing Documentum xPlore  ,[object Object],[object Object],[object Object]
[object Object],[object Object],[object Object],1996 2010 2005 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Enhancing Documentum Deployments  with Search ,[object Object],[object Object],[object Object],Content Server DCTM client DQL SQL RDBMS search
Enhancing Documentum Deployments  with Search ,[object Object],[object Object],[object Object],Content Server Documentum client DQL SQL xQuery RDBMS Metadata + content search
Some Basic Design Concepts behind Documentum xPlore ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Design concepts (con’t) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Lessons Learned… Structured Query use-cases Unstructured Query use-cases Fit to use-case
Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Scoring, Relevance, Entities Hierarchical data representations (XML) Full Text searches Constantly changing schemas Relational DB technology
Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Meta data query Transactions   Advanced data management (partitions)   JOINs   Full Text index technology
Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Relational DB technology Full Text index technology
Documentum xPlore ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],xDB Transaction, Index & Page Management xDB Query Processing& Optimization xDB API xPlore API Search  Services Node & Data  Management  Services Indexing  Services Admin  Services Content Processing  Services Analytics
EMC xDB: Native XML database ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Libraries / Collections & Indexes = xDB segment =  xDB Library  / xPlore collection =  xDB Index =  xDB xml file ( dftxml , tracking  xml, status, metrics, audit)
Lucene Integration ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Lucene Integration (con’t) ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
xPlore has lucene search engine capabilities plus…. ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Tips and Observations on  IO and Host Virtualization ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Tip #1: Don’t assume that  one-size-fits all ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Same concept applies for disk virtualization ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],50GB and 100 I/O’s per sec capacity 50GB and 200 I/O’s per sec capacity 50GB and 400 I/O’s per sec capacity
Linear mapping’s and Luns ,[object Object],[object Object]
EMC Symmetrix: Nondisruptive Mobility ,[object Object],[object Object],[object Object],[object Object],[object Object],Virtual Pools Flash 400 GB RAID 5 Fibre Channel 600 GB 15K RAID 1 SATA 2 TB RAID 6
Tip #2: Consolidation Contention ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Some Vmware statistics ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Sample %Ready for a production VM with xPlore deployment for an entire week “ official” area that Indicates pain In this case Avg resp time doubled and max resp time grew by 5x
Actual Ready samples during  several hour period
Some Subtleties with  Interactive CPU denial ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],20 sec interval Denial spike
Sharing I/O capacity ,[object Object],[object Object],[object Object],Volume for Lucene application Both volumes spread over the same set of drives and effectively sharing the I/O capacity Volume for other application
Recommendations on diagnosing disk I/O related issues ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Sample output from the Bonnie tool ¹ Bonnie is an open source disk I/O driver tool for Linux that can be useful for pretesting Linux disk environments prior to an xPlore/Lucene  install. bonnie -s 1024 -y -u -o_direct -v 10 -p 10 This will increase the size of the file to 2 Gb. Examine the output. Focus on the random I/O area: ---Sequential Output (sync)----- ---Sequential Input-- --Rnd Seek- -CharUnlk- -DIOBlock- -DRewrite- -CharUnlk- -DIOBlock- --04k (10)- Machine  MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU Mach2  10*2024 73928 97  104142  5.3 26246  2.9  8872 22.5 43794  1.9  735.7 15.2 -s 1024 means that 2 GB files will be created -o_direct means that direct I/O (by-passing buffer cache) will be done -v 10 means that 10 different 2GB files will be created. -p 10 means that 10 different threads will query those files This output means that the random read test saw 735 random I/O’s per sec at 15% CPU busy
Linux indicators compared  to bonnie output See  https://community.emc.com/docs/DOC-9179  for additional example Device:  tps   kB_read/s  kB_wrtn/s  kB_read  kB_wrtn sde  206.10  2402.40  0.80  24024  8 09:29:17  DEV  tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util 09:29:27  dev8-65  209.24   4877.97  1.62  23.32  1.62  7.75  3.80  79.59 09:29:17 PM  CPU  %user  %nice  %system  %iowait   %steal  %idle 09:29:27 PM  all  41.37  0.00  5.56  29.86   0.00  23.21 09:29:27 PM  0  62.44  0.00  10.56  25.38   0.00  1.62 09:29:27 PM  1  30.90  0.00  4.26  35.56   0.00  29.28 09:29:27 PM  2  36.35  0.00  3.96  30.76  0.00  28.93 09:29:27 PM  3  35.77  0.00  3.46  27.64  0.00  33.13 I/O stat output: SAR –d output: SAR –u output: Notice that at 200+ I/Os per sec the underlying volume is 80% busy. Although there could be multiple causes, one could be that some other VM is consuming the remaining I/O capacity (735 – 209 = 500+). High I/O wait
Tip #3: Try to ensure availability  of resources ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
IO / caching test use-case ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Some xPlore Structures for Search ¹ Dictionary of terms Posting list (doc-id’s for term) Stored fields (facets and node-ids) Security indexes (b-tree based) xDB XML store (contains text for summary) 1 st  doc N-th doc Facet decompression map ¹ Frequency and position structures ignored for simplicity
IO model for search in xPlore Dictionary Posting list (doc-id’s for term) Stored fields Xdb node-id plus facet / security info Security lookup  (b-tree based) xDB XML store (contains text for summary) Facet decompression map Search Term: ‘ term1 term2 ’ Result set
Separation of “covering values” in  stored fields and summary  Facet Calc FinalFacet  calc values over thousands of results Res-1 - sum Res-2 - sum Res-3 - sum : : Res-350-sum Xdb docs with text for summary Small number for result window Small structure Potentially thousands of results Stored fields (Random access) Potentially thousands of hits Security lookup
xPlore Memory Pool areas  at-a-glance xPlore Instance  (fixed size) memory xDB Buffer Cache Lucene Caches & working memory xPlore caches  Other vm working mem Operating System File Buffer cache ( dynamically sized ) Native code content extraction & linguistic processing  memory
Lucene data resides primarily in OS buffer cache Potential for many things to sweep  lucene from that cache
Test Env ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Some results of the query suite ,[object Object],[object Object],[object Object],[object Object],Test Avg Resp to consume all results (sec) MB pre-cached I/O per result Total MB loaded into memory (cached + test) Nothing cached 1.89 0 0.89 77  Stored fields cached 0.95 241  0.38 272  Term dict cached 1.73 537  0.79 604  Positions cached 1.58 8,789  0.74 8,800  Frequencies cached 1.65 1,406  0.63 1,436  Entire index cached 0.59 10,970  < 0.05 10,970
Other Notes ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Contact ,[object Object],[object Object],[object Object],[object Object]

Más contenido relacionado

La actualidad más candente

Design Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational DatabasesDesign Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational Databasesguestdfd1ec
 
Netezza Deep Dives
Netezza Deep DivesNetezza Deep Dives
Netezza Deep DivesRush Shah
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraData Science Milan
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developersBiju Nair
 
What's new with Azure Sql Database
What's new with Azure Sql DatabaseWhat's new with Azure Sql Database
What's new with Azure Sql DatabaseMarco Parenzan
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2GokulD
 
Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.Mohammad Asif
 
Exploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresExploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresCCG
 
Azure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionAzure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionDenys Chamberland
 
AWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentationAWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentationTATA LILIAN SHULIKA
 
Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts EMC
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload managementBiju Nair
 
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...Principled Technologies
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateCloudera, Inc.
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integrationDylan Wan
 
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....Principled Technologies
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine LearningVasu S
 
Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...Principled Technologies
 

La actualidad más candente (20)

Design Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational DatabasesDesign Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational Databases
 
Netezza Deep Dives
Netezza Deep DivesNetezza Deep Dives
Netezza Deep Dives
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del Pra
 
Unit 2.pptx
Unit 2.pptxUnit 2.pptx
Unit 2.pptx
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developers
 
What's new with Azure Sql Database
What's new with Azure Sql DatabaseWhat's new with Azure Sql Database
What's new with Azure Sql Database
 
Lucene Bootcamp - 2
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
 
Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.Modeling data and best practices for the Azure Cosmos DB.
Modeling data and best practices for the Azure Cosmos DB.
 
Exploring Microsoft Azure Infrastructures
Exploring Microsoft Azure InfrastructuresExploring Microsoft Azure Infrastructures
Exploring Microsoft Azure Infrastructures
 
Azure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in ActionAzure Cosmos DB + Gremlin API in Action
Azure Cosmos DB + Gremlin API in Action
 
AWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentationAWS Cloud SAA Relational Database presentation
AWS Cloud SAA Relational Database presentation
 
Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management
 
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integration
 
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....
Performance benchmark results: Amazon Web Services (AWS) SAN in the Cloud vs....
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine Learning
 
Explore Azure Cosmos DB
Explore Azure Cosmos DBExplore Azure Cosmos DB
Explore Azure Cosmos DB
 
Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...Get more out of your Windows 10 laptop experience with SSD storage instead of...
Get more out of your Windows 10 laptop experience with SSD storage instead of...
 

Similar a I/O & virtualization performance with a search engine based on an xml database & lucene - By Ed Bueche

Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldRob Gillen
 
GWAB 2015 - Data Plaraform
GWAB 2015 - Data PlaraformGWAB 2015 - Data Plaraform
GWAB 2015 - Data PlaraformMarcelo Paiva
 
Azure SQL - more or/and less than SQL Server
Azure SQL - more or/and less than SQL ServerAzure SQL - more or/and less than SQL Server
Azure SQL - more or/and less than SQL ServerRafał Hryniewski
 
Tech-Spark: Exploring the Cosmos DB
Tech-Spark: Exploring the Cosmos DBTech-Spark: Exploring the Cosmos DB
Tech-Spark: Exploring the Cosmos DBRalph Attard
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsDirecti Group
 
Azure Data platform
Azure Data platformAzure Data platform
Azure Data platformMostafa
 
Microsoft Database Options
Microsoft Database OptionsMicrosoft Database Options
Microsoft Database OptionsDavid Chou
 
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesWindows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesComunidade NetPonto
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBArangoDB Database
 
Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfpbonillo1
 
Modern Database Development Oow2008 Lucas Jellema
Modern Database Development Oow2008 Lucas JellemaModern Database Development Oow2008 Lucas Jellema
Modern Database Development Oow2008 Lucas JellemaLucas Jellema
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Scaling Web Apps P Falcone
Scaling Web Apps P FalconeScaling Web Apps P Falcone
Scaling Web Apps P Falconejedt
 
MOSS 2007 Deployment Fundamentals -Part2
MOSS 2007 Deployment Fundamentals -Part2MOSS 2007 Deployment Fundamentals -Part2
MOSS 2007 Deployment Fundamentals -Part2Information Technology
 
A Tour of Azure SQL Databases (NOVA SQL UG 2020)
A Tour of Azure SQL Databases  (NOVA SQL UG 2020)A Tour of Azure SQL Databases  (NOVA SQL UG 2020)
A Tour of Azure SQL Databases (NOVA SQL UG 2020)Timothy McAliley
 
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...Atmosphere 2014: Switching from monolithic approach to modular cloud computin...
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...PROIDEA
 
2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with AzureMarco Parenzan
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalDataWorks Summit
 

Similar a I/O & virtualization performance with a search engine based on an xml database & lucene - By Ed Bueche (20)

Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
GWAB 2015 - Data Plaraform
GWAB 2015 - Data PlaraformGWAB 2015 - Data Plaraform
GWAB 2015 - Data Plaraform
 
Azure SQL - more or/and less than SQL Server
Azure SQL - more or/and less than SQL ServerAzure SQL - more or/and less than SQL Server
Azure SQL - more or/and less than SQL Server
 
Tech-Spark: Exploring the Cosmos DB
Tech-Spark: Exploring the Cosmos DBTech-Spark: Exploring the Cosmos DB
Tech-Spark: Exploring the Cosmos DB
 
Azure Data Storage
Azure Data StorageAzure Data Storage
Azure Data Storage
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale Systems
 
Azure Data platform
Azure Data platformAzure Data platform
Azure Data platform
 
Microsoft Database Options
Microsoft Database OptionsMicrosoft Database Options
Microsoft Database Options
 
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesWindows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDB
 
Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdf
 
Modern Database Development Oow2008 Lucas Jellema
Modern Database Development Oow2008 Lucas JellemaModern Database Development Oow2008 Lucas Jellema
Modern Database Development Oow2008 Lucas Jellema
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Scaling Web Apps P Falcone
Scaling Web Apps P FalconeScaling Web Apps P Falcone
Scaling Web Apps P Falcone
 
MOSS 2007 Deployment Fundamentals -Part2
MOSS 2007 Deployment Fundamentals -Part2MOSS 2007 Deployment Fundamentals -Part2
MOSS 2007 Deployment Fundamentals -Part2
 
A Tour of Azure SQL Databases (NOVA SQL UG 2020)
A Tour of Azure SQL Databases  (NOVA SQL UG 2020)A Tour of Azure SQL Databases  (NOVA SQL UG 2020)
A Tour of Azure SQL Databases (NOVA SQL UG 2020)
 
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...Atmosphere 2014: Switching from monolithic approach to modular cloud computin...
Atmosphere 2014: Switching from monolithic approach to modular cloud computin...
 
2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure2014.11.14 Data Opportunities with Azure
2014.11.14 Data Opportunities with Azure
 
HPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposalHPE Hadoop Solutions - From use cases to proposal
HPE Hadoop Solutions - From use cases to proposal
 
Analytics on AWS - IP Expo 2013
Analytics on AWS - IP Expo 2013Analytics on AWS - IP Expo 2013
Analytics on AWS - IP Expo 2013
 

Más de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Más de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Último

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Último (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

I/O & virtualization performance with a search engine based on an xml database & lucene - By Ed Bueche

  • 1. A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene Ed Buech é , EMC edward.bueche@emc.com, May 25, 2011
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11. Lessons Learned… Structured Query use-cases Unstructured Query use-cases Fit to use-case
  • 12. Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Scoring, Relevance, Entities Hierarchical data representations (XML) Full Text searches Constantly changing schemas Relational DB technology
  • 13. Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Meta data query Transactions Advanced data management (partitions) JOINs Full Text index technology
  • 14. Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Relational DB technology Full Text index technology
  • 15.
  • 16.
  • 17. Libraries / Collections & Indexes = xDB segment = xDB Library / xPlore collection = xDB Index = xDB xml file ( dftxml , tracking xml, status, metrics, audit)
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28. Sample %Ready for a production VM with xPlore deployment for an entire week “ official” area that Indicates pain In this case Avg resp time doubled and max resp time grew by 5x
  • 29. Actual Ready samples during several hour period
  • 30.
  • 31.
  • 32.
  • 33. Sample output from the Bonnie tool ¹ Bonnie is an open source disk I/O driver tool for Linux that can be useful for pretesting Linux disk environments prior to an xPlore/Lucene install. bonnie -s 1024 -y -u -o_direct -v 10 -p 10 This will increase the size of the file to 2 Gb. Examine the output. Focus on the random I/O area: ---Sequential Output (sync)----- ---Sequential Input-- --Rnd Seek- -CharUnlk- -DIOBlock- -DRewrite- -CharUnlk- -DIOBlock- --04k (10)- Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU Mach2 10*2024 73928 97 104142 5.3 26246 2.9 8872 22.5 43794 1.9 735.7 15.2 -s 1024 means that 2 GB files will be created -o_direct means that direct I/O (by-passing buffer cache) will be done -v 10 means that 10 different 2GB files will be created. -p 10 means that 10 different threads will query those files This output means that the random read test saw 735 random I/O’s per sec at 15% CPU busy
  • 34. Linux indicators compared to bonnie output See https://community.emc.com/docs/DOC-9179 for additional example Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn sde 206.10 2402.40 0.80 24024 8 09:29:17 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 09:29:27 dev8-65 209.24 4877.97 1.62 23.32 1.62 7.75 3.80 79.59 09:29:17 PM CPU %user %nice %system %iowait %steal %idle 09:29:27 PM all 41.37 0.00 5.56 29.86 0.00 23.21 09:29:27 PM 0 62.44 0.00 10.56 25.38 0.00 1.62 09:29:27 PM 1 30.90 0.00 4.26 35.56 0.00 29.28 09:29:27 PM 2 36.35 0.00 3.96 30.76 0.00 28.93 09:29:27 PM 3 35.77 0.00 3.46 27.64 0.00 33.13 I/O stat output: SAR –d output: SAR –u output: Notice that at 200+ I/Os per sec the underlying volume is 80% busy. Although there could be multiple causes, one could be that some other VM is consuming the remaining I/O capacity (735 – 209 = 500+). High I/O wait
  • 35.
  • 36.
  • 37. Some xPlore Structures for Search ¹ Dictionary of terms Posting list (doc-id’s for term) Stored fields (facets and node-ids) Security indexes (b-tree based) xDB XML store (contains text for summary) 1 st doc N-th doc Facet decompression map ¹ Frequency and position structures ignored for simplicity
  • 38. IO model for search in xPlore Dictionary Posting list (doc-id’s for term) Stored fields Xdb node-id plus facet / security info Security lookup (b-tree based) xDB XML store (contains text for summary) Facet decompression map Search Term: ‘ term1 term2 ’ Result set
  • 39. Separation of “covering values” in stored fields and summary Facet Calc FinalFacet calc values over thousands of results Res-1 - sum Res-2 - sum Res-3 - sum : : Res-350-sum Xdb docs with text for summary Small number for result window Small structure Potentially thousands of results Stored fields (Random access) Potentially thousands of hits Security lookup
  • 40. xPlore Memory Pool areas at-a-glance xPlore Instance (fixed size) memory xDB Buffer Cache Lucene Caches & working memory xPlore caches Other vm working mem Operating System File Buffer cache ( dynamically sized ) Native code content extraction & linguistic processing memory
  • 41. Lucene data resides primarily in OS buffer cache Potential for many things to sweep lucene from that cache
  • 42.
  • 43.
  • 44.
  • 45.

Notas del editor

  1. Prior to VM’s the CPU/memory resources of a server were dedicated to that application resource planners didn’t worry too much about inter-system contention Resource planning more simple , but expensive in unused capacity VMware achieves significant cost reduction by allowing applications to share unused capacity On average (over the day) the Lucene CPU consumption might be low However, the challenge is that if the capacity is not available when Lucene needs it, then response delays will result