Kitenga: reinventing information

Mark Davis, Founder/CTO

Enabling Big Data Search via the Lucid ReST API

Big Data

• Enormous transactional data
• Enormous unstructured information
• Too big for databases
• New tools are needed

SI name          Decimal   Binary usage   IEC name         Value
kilobyte (kB)    10^3      2^10           kibibyte (KiB)   2^10
megabyte (MB)    10^6      2^20           mebibyte (MiB)   2^20
gigabyte (GB)    10^9      2^30           gibibyte (GiB)   2^30
terabyte (TB)    10^12     2^40           tebibyte (TiB)   2^40
petabyte (PB)    10^15     2^50           pebibyte (PiB)   2^50
exabyte (EB)     10^18     2^60           exbibyte (EiB)   2^60
zettabyte (ZB)   10^21     2^70           zebibyte (ZiB)   2^70
yottabyte (YB)   10^24     2^80           yobibyte (YiB)   2^80
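
The gap between the decimal and binary multiples listed above grows with each prefix. The following is a small illustrative sketch, not part of the original slides, that computes both values for each prefix and the percentage by which the binary unit exceeds the decimal one. The function names are made up for this example.

```python
# Decimal (SI) and binary (IEC) byte multiples, kilo through yotta.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def unit_sizes():
    """Return (prefix, decimal_bytes, binary_bytes): 10^(3n) and 2^(10n)."""
    return [(p, 10 ** (3 * n), 2 ** (10 * n))
            for n, p in enumerate(PREFIXES, start=1)]


def binary_excess_percent(decimal_bytes, binary_bytes):
    """Percentage by which the binary unit is larger than the decimal unit."""
    return (binary_bytes - decimal_bytes) / decimal_bytes * 100
```

At the kilo level the difference is 2.4%; at the yotta level it is roughly 20.9%, which is why the distinction matters more as data volumes grow.
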




Volume, Velocity, Variety

Indexing Challenges

• Complex, varied data
• Compute-intensive metadata generation
• Schema and collection management





Gather Resources
  • Crawl
  • Crack formats

Extract Metadata
  • Named entities
  • Categories
  • Machine learning
  • Semantic analysis

Index
  • Schema definition
  • Collection management
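
The three stages above (gather, extract metadata, index) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Kitenga implementation: the in-memory `sources` dict stands in for a crawler and format crackers, the name lookup stands in for a real named-entity extractor, and all function names are hypothetical.

```python
import re
from collections import defaultdict


def gather(sources):
    """Gather step: yield (doc_id, raw_text) pairs.
    A real crawler and format crackers would sit here."""
    for doc_id, text in sources.items():
        yield doc_id, text


def extract_metadata(text, known_companies):
    """Extract step: toy named-entity pass that looks up known names."""
    found = [name for name in known_companies
             if re.search(rf"\b{re.escape(name)}\b", text)]
    return {"companies": sorted(found)}


def build_index(sources, known_companies):
    """Index step: inverted index (lowercase token -> doc ids)
    plus the extracted metadata for each document."""
    inverted = defaultdict(set)
    metadata = {}
    for doc_id, text in gather(sources):
        metadata[doc_id] = extract_metadata(text, known_companies)
        for token in re.findall(r"\w+", text.lower()):
            inverted[token].add(doc_id)
    return inverted, metadata
```

Keeping the stages as separate functions mirrors the slide: each stage can be replaced (for example, by a distributed job) without touching the others.
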
  
Initial Query
  • Keyword guesses
  • Category guidance

Refine Query
  • Analytic tools
  • Faceted guidance

Evaluate Relevance
  • Read KWIC (keyword in context)
  • Read metadata
  • Read document
  




Search Experience Challenges

• Complex, varied data
• Resource discovery
• Faceted search experience management

The Solution

Enable fast metadata generation:

      Hadoop
      Mahout
      GPUs

Manage and control collections and schema:

      LucidWorks Enterprise API
SQL                  Search
RDBMS                Documents
Transactional Data   Text Classification
BI Tools             Taxonomies
                     Ontologies

[Diagram: layered stack, top to bottom]
Machine-Learning
Finite State Transducer | Finite State Transducer | Finite State Transducer
Parts-of-Speech Tagging
Lemmatization
Tokenization

[Diagram: layered stack, top to bottom]
Resource Integration
Facet Browsing | Facet Charting
Spellcheck | Autosuggest
Query Language
Indexing
Metadata Extraction
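
Facet browsing, named in the stack above, rests on two operations: counting documents per metadata value and narrowing the result set when a value is selected. The sketch below shows both on in-memory documents. It is illustrative only; a search engine such as Solr computes facets inside the index, and the document layout and function names here are assumptions.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents carry each value of a metadata field."""
    counts = Counter()
    for doc in docs:
        for value in doc.get(field, []):
            counts[value] += 1
    return counts


def drill_down(docs, field, value):
    """Keep only documents whose field contains the selected facet value."""
    return [doc for doc in docs if value in doc.get(field, [])]
```

A facet chart is then a plot of `facet_counts(...)`, and each click on a facet value applies `drill_down(...)` before recounting.
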
  
• Start to POC in a week
• Open source intelligence problems

GOAL: Be more competitive
SOURCES: Patents, PR announcements, legal documents, whitepapers, crawled websites
ANALYSIS: Extract named entities and relationships, classify and label; visually understand relationships and trends
ACTION: Change R&D priorities and improve marketing approaches

[Diagram, bottom to top: Sources; data; ZettaVox; metadata, entities, relationships; ZettaSearch (Faceted Search and Analytics)]

• Understand IP among competitors
• Assist legal team with litigation
• Custom search experience
• Custom extractors:
    • Electronic parts
    • Memory types
    • Flash memory

5/15/12

Company      Documents   Size
Dell           102,508   9 GB
EMC            303,678   14 GB
Huawei          11,912   890 MB
Kingston         2,534   134 MB
Lenovo           8,305   542 MB
NEC              3,900   252 MB
Nokia          174,681   22 GB
Panasonic        5,804   473 MB
RIM                181   8 MB
Sharp USA       31,918   4.9 GB
Total          645,421   60.2 GB

GOAL: Discover new drugs, detect side-effects, speed R&D
SOURCES: Published research reports, patents, adverse effects databases, genomics and proteomics databases
ANALYSIS: Extract named entities and relationships, classify and label; visually discover trends and relationships
ACTION: Change R&D priorities

[Diagram, bottom to top: Sources; data; ZettaVox; sequences, entities, relationships, pathways; ZettaSearch (Faceted Search and Analytics)]

• Lousy search (Google Search Appliance)
• Internal regulators can’t find documents by accession number
• Custom extractors:
    • Accession number
    • Ontology of active ingredients
    • Drug names

© 2012 Kitenga Proprietary

GOAL: Build “second screen experiences”
SOURCES: Wikipedia, IMDB, blogs
ANALYSIS: Extract named entities and relationships, preserve existing structural metadata
ACTION: Enable new media experiences

[Diagram, bottom to top: Sources; data; ZettaVox; metadata, entities, relationships; ZettaSearch (Faceted Search and Analytics)]

• Crawlers on Hadoop
• Document format crackers on Hadoop
• Extractors on Hadoop
• Filters on Hadoop
• Documents sent over HTTP to a sharded Solr cluster
• Intermediary files remain on HDFS for reprocessing
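
The step of sending documents over HTTP to a sharded Solr cluster can be sketched as follows. This is a minimal illustration under assumptions the slides do not confirm: documents are routed to a shard by hashing their id, each shard accepts a JSON array of documents on its update handler, and the shard URLs passed in are hypothetical. The code only builds the request bodies; it does not contact any server.

```python
import hashlib
import json


def shard_for(doc_id, num_shards):
    """Pick a shard by hashing the document id (stable across runs)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def build_update_requests(docs, shard_urls):
    """Group documents per shard and return (url, json_body) pairs
    that a worker could POST with Content-Type: application/json."""
    batches = {url: [] for url in shard_urls}
    for doc in docs:
        url = shard_urls[shard_for(doc["id"], len(shard_urls))]
        batches[url].append(doc)
    # Skip shards that received no documents in this batch.
    return [(url, json.dumps(batch)) for url, batch in batches.items() if batch]
```

Because the intermediary files stay on HDFS, a batch like this can be rebuilt and resent if the schema or the sharding scheme changes.
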
  
• Missing piece of the puzzle
• Addresses the impedance mismatch between Big Data technologies and Solr search

• Manage collections
• Manage schema
• Create collections
• Delete collections
• Update collection properties
• Create schema
• Modify schema
• Schema interrogation
• Schema binding to user experience
• Faceted search
• Embedded analytics
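
The collection and schema operations listed above map naturally onto ReST calls with JSON bodies. The sketch below builds such requests without sending them. Everything specific in it is an assumption for illustration: the base URL, the `/collections` and `/fields` paths, and the JSON property names are not taken from the slides and should be checked against the LucidWorks Enterprise documentation before use.

```python
import json
from urllib.request import Request

# Assumed host, port, and path prefix (hypothetical).
BASE = "http://localhost:8888/api"


def api_request(method, path, body=None):
    """Build (but do not send) a JSON request against the assumed API root."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(BASE + path, data=data, method=method,
                   headers={"Content-Type": "application/json"})


def create_collection(name):
    return api_request("POST", "/collections", {"name": name})


def delete_collection(name):
    return api_request("DELETE", f"/collections/{name}")


def list_fields(collection):
    # Schema interrogation: fetch the field definitions of a collection.
    return api_request("GET", f"/collections/{collection}/fields")


def create_field(collection, name, field_type, facet=False):
    # Property names in the body are illustrative assumptions.
    return api_request("POST", f"/collections/{collection}/fields",
                       {"name": name, "field_type": field_type, "facet": facet})


def delete_field(collection, name):
    return api_request("DELETE", f"/collections/{collection}/fields/{name}")
```

Each returned `Request` can be passed to `urllib.request.urlopen` once the endpoint details are confirmed. A user interface can call `list_fields` to decide which fields to offer as facets, which is one way to read "schema binding to user experience".
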
  
• Big Data search and analytics pose many challenges:
    • Volume of data
    • Variety of data
    • Velocity of data
    • Extracting structure from unstructured information
• Hadoop processing addresses each of these challenges
• The Lucid Imagination search API enables control of indexing and search
• We can enable complex user interactions with Big Data on a self-serve basis

[Architecture diagram]
Analyst Browser: ZettaVox Author RIA; Quantum4D
Protocols between browser and servers: ReST; XML + JSON; JSON
Enterprise servers:
  • Tomcat App Server: Tomcat Web Services; ZettaVoxServices Manager; GPU Services Manager; Hadoop Services Manager
  • GPU MR Service Manager: GPU, GPU
  • RDBMS
  • Hadoop Server (Name node)
  • Hadoop Server (Job Tracker)
  • Hadoop Task Managers: Mahout; Entity Extraction; Crawling
Cloud services: Amazon S3; Enterprise Cloud; Search Indexing

[Architecture diagram]
Analyst Browser: ZettaVox Author RIA
ReST/JSON calls to Search Indexing:
  • Get collection information
  • Create new collection
  • Create fields
  • Delete fields
  • Edit fields
Enterprise servers:
  • Hadoop Server (Name node)
  • Hadoop Server (Job Tracker)
  • Hadoop Task Managers: Mahout; Entity Extraction; Crawling; Indexing

Questions?	
  
Using the LucidWorks REST API to Support User-Configuration Big Data Search Experience

Más contenido relacionado

Similar a Using the LucidWorks REST API to Support User-Configuration Big Data Search Experience

The Next Generation SharePoint: Powered by Text Analytics
The Next Generation SharePoint: Powered by Text AnalyticsThe Next Generation SharePoint: Powered by Text Analytics
The Next Generation SharePoint: Powered by Text AnalyticsAlyona Medelyan
 
The Next-Generation SharePoint: Powered by Text Analytics
The Next-Generation SharePoint: Powered by Text Analytics The Next-Generation SharePoint: Powered by Text Analytics
The Next-Generation SharePoint: Powered by Text Analytics Peter Wren-Hilton
 
Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012Umesh Ramalingachar
 
Davis mark advanced search analytics in 20 minutes
Davis mark   advanced search analytics in 20 minutesDavis mark   advanced search analytics in 20 minutes
Davis mark advanced search analytics in 20 minutesLucidworks (Archived)
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopDataWorks Summit
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceTed Dunning
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrDataWorks Summit
 
How Search 2.0 Has Been Redefined by Enterprise 2.0
How Search 2.0 Has Been Redefined by Enterprise 2.0How Search 2.0 Has Been Redefined by Enterprise 2.0
How Search 2.0 Has Been Redefined by Enterprise 2.0Enterprise 2.0 Conference
 
Streaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise AdoptionStreaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise AdoptionDATAVERSITY
 
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011SEO CAMP
 
Analyzing Multi-Structured Data
Analyzing Multi-Structured DataAnalyzing Multi-Structured Data
Analyzing Multi-Structured DataDataWorks Summit
 
Publishing biodiversity: The interplay between Scratchpads and the new Biodiv...
Publishing biodiversity: The interplay between Scratchpads and the new Biodiv...Publishing biodiversity: The interplay between Scratchpads and the new Biodiv...
Publishing biodiversity: The interplay between Scratchpads and the new Biodiv...Dimitrios Koureas
 
Open Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationOpen Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationLucidworks (Archived)
 
Linked Open data: CNR
Linked Open data: CNRLinked Open data: CNR
Linked Open data: CNRDatiGovIT
 
Qiagram Slides 2011 05
Qiagram Slides 2011 05Qiagram Slides 2011 05
Qiagram Slides 2011 05bhughes26
 
Information Management and Analytics
Information Management and Analytics Information Management and Analytics
Information Management and Analytics AKAGroup
 
"Search, APIs,Capability Management and the Sensis Journey"
"Search, APIs,Capability Management and the Sensis Journey""Search, APIs,Capability Management and the Sensis Journey"
"Search, APIs,Capability Management and the Sensis Journey"Lucidworks (Archived)
 

Similar a Using the LucidWorks REST API to Support User-Configuration Big Data Search Experience (20)

The Next Generation SharePoint: Powered by Text Analytics
The Next Generation SharePoint: Powered by Text AnalyticsThe Next Generation SharePoint: Powered by Text Analytics
The Next Generation SharePoint: Powered by Text Analytics
 
The Next-Generation SharePoint: Powered by Text Analytics
The Next-Generation SharePoint: Powered by Text Analytics The Next-Generation SharePoint: Powered by Text Analytics
The Next-Generation SharePoint: Powered by Text Analytics
 
Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012Mesh Labs Introduction June 2012
Mesh Labs Introduction June 2012
 
Davis mark advanced search analytics in 20 minutes
Davis mark   advanced search analytics in 20 minutesDavis mark   advanced search analytics in 20 minutes
Davis mark advanced search analytics in 20 minutes
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over Hadoop
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
 
How Search 2.0 Has Been Redefined by Enterprise 2.0
How Search 2.0 Has Been Redefined by Enterprise 2.0How Search 2.0 Has Been Redefined by Enterprise 2.0
How Search 2.0 Has Been Redefined by Enterprise 2.0
 
Streaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise AdoptionStreaming Hadoop for Enterprise Adoption
Streaming Hadoop for Enterprise Adoption
 
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
Jean-Marc Lazard d'Exalead - Pioneering hypermedia - SEO Campus 2011
 
Analyzing Multi-Structured Data
Analyzing Multi-Structured DataAnalyzing Multi-Structured Data
Analyzing Multi-Structured Data
 
Publishing biodiversity: The interplay between Scratchpads and the new Biodiv...
Publishing biodiversity: The interplay between Scratchpads and the new Biodiv...Publishing biodiversity: The interplay between Scratchpads and the new Biodiv...
Publishing biodiversity: The interplay between Scratchpads and the new Biodiv...
 
Data mining
Data miningData mining
Data mining
 
Open Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to InformationOpen Source for Enterprise Search: Breaking Down the Barriers to Information
Open Source for Enterprise Search: Breaking Down the Barriers to Information
 
Linked Open data: CNR
Linked Open data: CNRLinked Open data: CNR
Linked Open data: CNR
 
Qiagram
QiagramQiagram
Qiagram
 
Qiagram Slides 2011 05
Qiagram Slides 2011 05Qiagram Slides 2011 05
Qiagram Slides 2011 05
 
Qiagram
QiagramQiagram
Qiagram
 
Information Management and Analytics
Information Management and Analytics Information Management and Analytics
Information Management and Analytics
 
"Search, APIs,Capability Management and the Sensis Journey"
"Search, APIs,Capability Management and the Sensis Journey""Search, APIs,Capability Management and the Sensis Journey"
"Search, APIs,Capability Management and the Sensis Journey"
 

Más de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Más de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications



Using the LucidWorks REST API to Support User-Configuration Big Data Search Experience

  • 1. Kitenga: reinventing information. Mark Davis, Founder/CTO
  • 3. Big Data: enormous transactional data; enormous unstructured information; too big for databases; new tools are needed.
  • 4. Units of data: kilobyte (kB) = 10^3 bytes, kibibyte (KiB) = 2^10 bytes; megabyte (MB) = 10^6, mebibyte (MiB) = 2^20; gigabyte (GB) = 10^9, gibibyte (GiB) = 2^30; terabyte (TB) = 10^12, tebibyte (TiB) = 2^40; petabyte (PB) = 10^15, pebibyte (PiB) = 2^50; exabyte (EB) = 10^18, exbibyte (EiB) = 2^60; zettabyte (ZB) = 10^21, zebibyte (ZiB) = 2^70; yottabyte (YB) = 10^24, yobibyte (YiB) = 2^80. Volume, Velocity, Variety.
  • 5. Indexing Challenges: complex, varied data; compute-intensive metadata generation; schema and collection management. Stages: Gather Resources (crawl, crack formats); Extract Metadata (named entities, categories, machine learning, semantic analysis); Index (schema definition, collection management).
  • 6. Search Experience Challenges: complex, varied data; resource discovery; facetted search experience management. Stages: Initial Query (keyword guesses, category guidance); Refine Query (analytic tools, facetted guidance); Evaluate Relevance (read KWIC, read metadata, read document).
  • 7. The Solution: enable fast metadata generation (Hadoop, Mahout, GPUs); manage and control collections and schema (LucidWorks Enterprise API).
  • 8. SQL: RDBMS, transactional data, BI tools. Search: documents, text classification, taxonomies, ontologies.
  • 9.
  • 10. Machine-Learning; Finite State Transducer (×3); Parts-of-Speech Tagging; Lemmatization; Tokenization.
  • 11. Resource Integration; Facet Browsing; Facet Charting; Spellcheck; Autosuggest; Query Language; Indexing; Metadata Extraction.
  • 12. Start to POC in a week; open source intelligence problems.
  • 13. GOAL: Be more competitive. SOURCES: Patents, PR announcements, legal documents, whitepapers, crawled websites. ANALYSIS: Extract named entities and relationships, classify and label; visually understand relationships and trends. ACTION: Change R&D priorities and improve marketing approaches. [Diagram labels: ZettaSearch (Facetted Search and Analytics); relationships, metadata, entities, data; ZettaVox; Sources]
  • 14. Understand IP among competitors; assist legal team with litigation; custom search experience; custom extractors: electronic parts, memory types, flash memory. (5/15/12)
  • 15. Documents and size by company:
        Company      Documents   Size
        Dell           102,508   9 GB
        EMC            303,678   14 GB
        Huawei          11,912   890 MB
        Kingston         2,534   134 MB
        Lenovo           8,305   542 MB
        NEC              3,900   252 MB
        Nokia          174,681   22 GB
        Panasonic        5,804   473 MB
        Rim                181   8 MB
        Sharp USA       31,918   4.9 GB
        Total          645,421   60.2 GB
  • 16. GOAL: Discover new drugs, detect side-effects, speed R&D. SOURCES: Published research reports, patents, adverse effects databases, genomics and proteomics databases. ANALYSIS: Extract named entities and relationships, classify and label; visually discover trends and relationships. ACTION: Change R&D priorities. [Diagram labels: ZettaSearch (Facetted Search and Analytics); relationships, pathways, sequences, entities, data; ZettaVox; Sources]
  • 17. Lousy search (Google Search Appliance); internal regulators can't find by accession number; custom extractors: accession number, ontology of active ingredients, drug names. © 2012 Kitenga Proprietary
  • 18. GOAL: Build "second screen experiences". SOURCES: Wikipedia, IMDB, blogs. ANALYSIS: Extract named entities and relationships, preserve existing structural metadata. ACTION: Enable new media experiences. [Diagram labels: ZettaSearch (Facetted Search and Analytics); relationships, metadata, entities, data; ZettaVox; Sources]
  • 19. Crawlers on Hadoop; document format crackers on Hadoop; extractors on Hadoop; filters on Hadoop; HTTP documents to Solr sharded cluster; intermediary files remain on HDFS for reprocessing.
  • 20. Missing piece of the puzzle; addresses the impedance mismatch between Big Data technologies and Solr search; manage collections; manage schema.
  • 21.
  • 22.
  • 23. Create collections; delete collections; update collection properties; create schema; modify schema.
  • 24. Schema interrogation; schema binding to user experience; facetted search; embedded analytics.
  • 25.
  • 26.
  • 27. Big Data search and analytics has many challenges: volume of data, variety of data, velocity of data, extracting structure from unstructured information. Hadoop processing enables each of these aspects. Controlling indexing and search is enabled by the Lucid Imagination search API. We can enable complex user interactions with Big Data on a self-serve basis.
  • 28. [Architecture diagram. Labels: Analyst Browser (ZettaVox Author RIA); Enterprise servers; Cloud services; Tomcat App Server; Amazon S3; Tomcat Web Services; ZettaVoxServices; Enterprise / Cloud Manager; XML + JSON; GPU Services Manager; Hadoop Services Manager; Search Indexing; ReST JSON; GPU MR Service Manager; Hadoop Server (Name node); Hadoop Server (Job Tracker); GPU; Hadoop Task Managers; Quantum4D; RDBMS; Entity Extraction; Mahout; Crawling]
  • 29. [Architecture diagram. Labels: Analyst Browser (ZettaVox Author RIA); Enterprise servers; ReST JSON; Hadoop Server (Name node); Hadoop Server (Job Tracker); Hadoop Task Managers; Entity Extraction; Mahout; Crawling; Indexing] Search Indexing operations: get collection information; create new collection; create fields; delete fields; edit fields.
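
The collection and field operations listed on slide 29 could be driven from a client roughly as follows. This is a minimal sketch, not the LucidWorks API itself: the base URL, the resource paths, the HTTP verbs, and the JSON property names (`name`, `field_type`, `facet`) are all assumptions made for illustration, and the example collection and field names are hypothetical. The code only builds request descriptions; nothing is sent over the network.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional
from urllib.parse import quote

# Hypothetical base URL; the slides do not give one.
BASE_URL = "http://localhost:8888/api"


@dataclass(frozen=True)
class RestCall:
    """A description of one ReST request (not yet sent)."""
    method: str
    url: str
    body: Optional[str] = None  # JSON-encoded payload, if any


def _seg(name: str) -> str:
    # Percent-encode a name so it is safe as a single URL path segment.
    return quote(name, safe="")


def _call(method: str, path: str,
          payload: Optional[Dict[str, Any]] = None) -> RestCall:
    body = json.dumps(payload, sort_keys=True) if payload is not None else None
    return RestCall(method, BASE_URL + path, body)


# The paths and payload keys below are assumed, not taken from the slides.
def get_collection_info(collection: str) -> RestCall:
    return _call("GET", f"/collections/{_seg(collection)}")


def create_collection(collection: str) -> RestCall:
    return _call("POST", "/collections", {"name": collection})


def create_field(collection: str, field: str, field_type: str,
                 facet: bool = False) -> RestCall:
    payload = {"name": field, "field_type": field_type, "facet": facet}
    return _call("POST", f"/collections/{_seg(collection)}/fields", payload)


def edit_field(collection: str, field: str, **changes: Any) -> RestCall:
    return _call("PUT",
                 f"/collections/{_seg(collection)}/fields/{_seg(field)}",
                 changes)


def delete_field(collection: str, field: str) -> RestCall:
    return _call("DELETE",
                 f"/collections/{_seg(collection)}/fields/{_seg(field)}")


if __name__ == "__main__":
    calls = [
        create_collection("patents"),
        create_field("patents", "part_number", "string", facet=True),
        edit_field("patents", "part_number", facet=False),
        get_collection_info("patents"),
        delete_field("patents", "part_number"),
    ]
    for call in calls:
        print(call.method, call.url, call.body or "")
```

Keeping request construction separate from transport means the same descriptions can be logged, replayed, or handed to any HTTP client once the real endpoint layout is confirmed against the product documentation.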