Mark Tabladillo Ph.D.
Data Mining Scientist
MarkTab Inc.

Applied Enterprise Semantic Mining
Text Mining with SQL Server 2012
Presented at Atlanta Microsoft Business Intelligence Group
January 28, 2013

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
About MarkTab
http://marktab.com
http://marktab.net
   @MarkTabNet




Introduction
SQL Server 2012 has new Programmability Enhancements
 ◦ Statistical Semantic Search
 ◦ FileTables
 ◦ Full-Text Search Improvements

These combined technologies make SQL Server 2012 a strong contender in text mining




Challenges
Building and Maintaining Applications with relational and non-relational data is hard
 ◦ Complex integration
 ◦ Duplicated functionality
 ◦ Compensation for unavailable services

80% of all data is not stored in databases!
Most of it is “unstructured”


(2012, Michael Rys, Microsoft)




Microsoft and Google

History
July 2008
 ◦ Microsoft purchases Powerset for US$100 Million
 ◦ Google Dismisses Semantic Search
 ◦ http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/
 ◦ http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html




History
March 2009
◦ Google announces “snippets” as relevant to search
◦ The media picks this story up as “semantic search”
◦ http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html




History
February 2012
◦ Google announces Knowledge Graph, an explicit application of semantic search
◦ http://mashable.com/2012/02/13/google-knowledge-graph-change-search/




History
April 2012
 ◦ Microsoft purchases 800+ patents from AOL for US$1 Billion
 ◦ Among the patents are semantic search and metadata querying – older than Google
 ◦ http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/




New in SQL Server 2012
http://msdn.microsoft.com/en-us/library/cc645577.aspx




Goals of Semantic Search
Reduce the cost of managing all data
Simplify the development of applications over all data
Provide management and programming services for all data
Make SQL Server the preferred choice for managing unstructured data and for building rich application experiences on top of it
(2012, Michael Rys, Microsoft)




Statistical Semantic Search
Identifies statistically relevant key phrases
Based on these phrases, it can identify similar documents, ranked by score
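
As a rough illustration of how key phrases and similarity scores might be computed, the Python sketch below weights terms with a simple TF-IDF scheme and compares documents by cosine similarity. This is not SQL Server's algorithm, which these slides do not describe; the sample documents, the function names, and the weighting scheme are all assumptions made for the example.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (plain TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def key_terms(vector, top_n=3):
    """Highest-weighted terms of one document, best first."""
    return sorted(vector, key=vector.get, reverse=True)[:top_n]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical three-document corpus, for illustration only.
docs = [
    "full-text search indexes keywords in documents",
    "semantic search finds key phrases in documents",
    "bicycle repair requires a wrench and patience",
]
vectors = tfidf_vectors(docs)

print(key_terms(vectors[0]))
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy corpus the first two documents share vocabulary and so receive a positive similarity score, while the third shares none and scores zero.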




FileTables
Built on existing SQL Server FILESTREAM technology
Files and documents
 ◦ Stored in special tables in SQL Server
 ◦ Accessed as if they were stored in the file system




Full-Text Search Enhancements
Property search: search on tagged properties (such as author or title)
Customizable NEAR: find words or phrases close to one another
New Word Breakers and Stemmers (for many languages)
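
To make the proximity idea concrete, here is a toy Python sketch of a NEAR-style check. It reports whether two terms occur with at most a given number of words between them, optionally requiring the first term to appear before the second. It only mimics the concept: the function name, its parameters, and the whitespace tokenization are assumptions for illustration and do not reproduce SQL Server's word breakers or query syntax.

```python
def near(text, first, second, max_distance, match_order=False):
    """Return True if `first` and `second` occur with at most
    `max_distance` other words between them.

    With match_order=True, `first` must appear before `second`.
    """
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    for i in first_positions:
        for j in second_positions:
            if i == j:
                continue  # same occurrence (only possible if the terms are equal)
            if match_order and j < i:
                continue  # wrong order for an ordered match
            if abs(j - i) - 1 <= max_distance:
                return True
    return False


# Hypothetical sentence, for illustration only.
sentence = "Semantic search builds a key phrase index on top of full-text search"

print(near(sentence, "semantic", "index", 5))                    # True: five words in between
print(near(sentence, "semantic", "index", 4))                    # False: the gap is too wide
print(near(sentence, "index", "semantic", 5, match_order=True))  # False: wrong order
```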




From Documents to Output
[Diagram: inputs (Office documents, PDF, varchar and nvarchar columns) lead to rowset output with scores]



“Beyond Relational” vs. “Adoption”
Start with unstructured (meaning non-relational) data
Use Windows technology
 ◦ Reading and Writing Files (Win32 API)
 ◦ iFilters for reading proprietary formats

Develop indexed structure from unstructured data




(iFilter Required)

[Diagram components: Documents, iFilters, Full-Text Keyword Index “FTI”, Semantic Database, Semantic Key Phrase Index – Tag Index “TI”, Semantic Document Similarity Index “DSI”]



“iFilter”?
IFilters are components that allow search services to index content of specific file types, letting
you search for content in those files.
They are intended for use with Microsoft Search Services (SharePoint, SQL, Exchange, Windows
Search).
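
One way to picture the role of an iFilter is as a per-file-type text extractor that an indexer looks up by file extension. The Python sketch below is only an analogy: real iFilters are COM components that implement the IFilter interface, and the registry, extractor functions, and file names used here are hypothetical.

```python
import os


def extract_plain_text(data):
    """Hypothetical extractor for .txt files: decode the bytes as UTF-8."""
    return data.decode("utf-8", errors="replace")


def extract_csv_text(data):
    """Hypothetical extractor for .csv files: treat commas as word breaks."""
    return data.decode("utf-8", errors="replace").replace(",", " ")


# Hypothetical registry: one extractor per supported file extension.
EXTRACTORS = {
    ".txt": extract_plain_text,
    ".csv": extract_csv_text,
}


def extract_text(filename, data):
    """Return indexable text, or None when no extractor is registered."""
    extension = os.path.splitext(filename)[1].lower()
    extractor = EXTRACTORS.get(extension)
    if extractor is None:
        return None  # no filter for this type, so its content cannot be indexed
    return extractor(data)


print(extract_text("notes.TXT", b"hello world"))
print(extract_text("report.pdf", b"%PDF-1.4"))
```

The second call returns None because nothing is registered for `.pdf`, which mirrors why a separate PDF filter has to be installed before PDF content becomes searchable.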




Microsoft Office 2010 Filters Pack
Legacy Office Filter (97-2003; .doc, .ppt, .xls)
Metro Office Filter (2007; .docx, .pptx, .xlsx)
Zip Filter
OneNote filter
Visio Filter
Publisher Filter
Open Document Format Filter




Adobe PDF iFilter 9 for 64-bit platforms
Allows PDF search
Not currently supported for Windows 7 or 8
 ◦ But I used it anyway 

Add the Bin directory to your path
 ◦ Computer (right click), Properties, Advanced System Settings, Environment Variables
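
After editing the environment variable, it can help to confirm that the directory really is on the search path. The Python sketch below is one minimal way to check; the helper name and the example directories are hypothetical, and the actual location of the iFilter's Bin directory depends on where it was installed.

```python
import os


def is_on_path(directory, path_value=None, sep=os.pathsep):
    """Return True if `directory` is one of the entries in a PATH-style string.

    When `path_value` is omitted, the current process's PATH is inspected.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normcase(os.path.normpath(directory))
    # Entries may be quoted or carry a trailing separator; normalize both sides.
    entries = (entry.strip().strip('"') for entry in path_value.split(sep))
    return any(
        os.path.normcase(os.path.normpath(entry)) == wanted
        for entry in entries
        if entry
    )


# Hypothetical directories, for illustration only.
example_path = "/usr/bin:/opt/pdf-ifilter/bin/"
print(is_on_path("/opt/pdf-ifilter/bin", example_path, sep=":"))
```

Note that a process started before the variable was changed still sees the old value, so run the check from a newly opened session.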




“Semantic Language Statistics Database”?
This database contains the statistical language models required by semantic search.
A single semantic language statistics database contains the language models for all the
languages that are supported for semantic indexing.




Languages Currently Supported
Traditional Chinese
German
English
French
Italian
Brazilian
Russian
Swedish
Simplified Chinese
British English
Portuguese
Chinese (Hong Kong SAR, PRC)
Spanish
Chinese (Singapore)
Chinese (Macau SAR)




Phases of Semantic Indexing
[Diagram: Full-Text Keyword Index “FTI” and Semantic Key Phrase Index – Tag Index “TI” (left) feed the Semantic Document Similarity Index “DSI” (right)]




    http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing



Performance

Integrated Full Text Search (iFTS)
Improved Performance and Scale:
 ◦ Scale-up to 350M documents for storage and search
 ◦ iFTS query performance 7-10 times faster than in SQL Server 2008
 ◦ Worst-case iFTS query response times under 3 seconds for the corpus
 ◦ Similar to or better than the main database search competitors

(2012, Michael Rys, Microsoft)




Linear Scale of FTI/TI/DSI
First known linearly scaling end-to-end Search and Semantic product in the industry




[Chart: Time in Seconds vs. Number of Documents]
(2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)

Conclusion
SQL Server 2012 adds new text processing capabilities
This technology scales linearly
Microsoft positions it for enterprise-level applications with millions of documents




Network
MarkTab Consulting
 ◦ http://marktab.com

Blog
 ◦ http://marktab.net

Twitter
 ◦ @marktabnet




Appendix

References
Video
 ◦ http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search
 ◦ http://www.microsoftpdc.com/2009/SVR32

Semantic Search (Books Online) – explains the demo
 ◦ http://msdn.microsoft.com/en-us/library/gg492075.aspx

Paper
 ◦ http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf




Demo: My Semantic Search Sample
http://mysemanticsearch.codeplex.com/
Requires:
 ◦ iFilters
 ◦ Semantic Language Statistics Database
 ◦ IIS7, IIS6, with Windows Authentication
 ◦ .NET 4.0
 ◦ Silverlight 4.0
 ◦ FILESTREAM (complete)




Demo: T-SQL and Documents
Naveen Garg
Requires Adventure Works (from Codeplex)
http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx




Abstract
SQL Server 2012 debuts a new Semantic Platform (commonly known as the Semantic Search
applied task). This text mining technology leverages the already established Full Text Index and
builds semantic indexes in a two-phase process. This session's detailed description and demo
give you important information for the enterprise implementation of Tag Index and Document
Similarity Index. The demo is a web-based Silverlight application showing how to interactively
use semantic search. Currently, the indexes work for 15 languages. We'll also look at strategy
tips for how to best leverage the new semantic technology with existing Microsoft text and data
mining functionality.




                                                              ©2013 MARK TABLADILLO, ALL RIGHTS RESERVED
                                                                            WORLDWIDE

Más contenido relacionado

Similar a Applied enterprise semantic mining

The New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudThe New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudInside Analysis
 
Applied Semantic Search with Microsoft SQL Server
Applied Semantic Search with Microsoft SQL ServerApplied Semantic Search with Microsoft SQL Server
Applied Semantic Search with Microsoft SQL ServerMark Tabladillo
 
Document Classification using DMX in SQL Server Analysis Services
Document Classification using DMX in SQL Server Analysis ServicesDocument Classification using DMX in SQL Server Analysis Services
Document Classification using DMX in SQL Server Analysis ServicesMark Tabladillo
 
Sql Saturday 111 Atlanta applied enterprise semantic mining
Sql Saturday 111 Atlanta applied enterprise semantic miningSql Saturday 111 Atlanta applied enterprise semantic mining
Sql Saturday 111 Atlanta applied enterprise semantic miningMark Tabladillo
 
Chug building a data lake in azure with spark and databricks
Chug   building a data lake in azure with spark and databricksChug   building a data lake in azure with spark and databricks
Chug building a data lake in azure with spark and databricksBrandon Berlinrut
 
Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008Mark Tabladillo
 
Solutions Linux 2013: SpagoBI and Talend jointly support Big Data scenarios
Solutions Linux 2013: SpagoBI and Talend jointly support Big Data scenarios Solutions Linux 2013: SpagoBI and Talend jointly support Big Data scenarios
Solutions Linux 2013: SpagoBI and Talend jointly support Big Data scenarios SpagoWorld
 
[db tech showcase Tokyo 2018] #dbts2018 #B38 『Big Data and the Multi-model Da...
[db tech showcase Tokyo 2018] #dbts2018 #B38 『Big Data and the Multi-model Da...[db tech showcase Tokyo 2018] #dbts2018 #B38 『Big Data and the Multi-model Da...
[db tech showcase Tokyo 2018] #dbts2018 #B38 『Big Data and the Multi-model Da...Insight Technology, Inc.
 
Data mining with excel 2010 and power pivot
Data mining with excel 2010 and power pivotData mining with excel 2010 and power pivot
Data mining with excel 2010 and power pivotigsc
 
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio...
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio...Self Service Analytics and a Modern Data Architecture with Data Virtualizatio...
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio...Denodo
 
The Big Picture: Big Data for the New Wave of Analytics
The Big Picture: Big Data for the New Wave of AnalyticsThe Big Picture: Big Data for the New Wave of Analytics
The Big Picture: Big Data for the New Wave of AnalyticsInside Analysis
 
SQL Server Data Mining for SQL Server Professionals
SQL Server Data Mining for SQL Server Professionals SQL Server Data Mining for SQL Server Professionals
SQL Server Data Mining for SQL Server Professionals Mark Tabladillo
 
[db tech showcase Tokyo 2018] #dbts2018 #B36 『Design Your Databases straight ...
[db tech showcase Tokyo 2018] #dbts2018 #B36 『Design Your Databases straight ...[db tech showcase Tokyo 2018] #dbts2018 #B36 『Design Your Databases straight ...
[db tech showcase Tokyo 2018] #dbts2018 #B36 『Design Your Databases straight ...Insight Technology, Inc.
 
Big data oracle_introduccion
Big data oracle_introduccionBig data oracle_introduccion
Big data oracle_introduccionFran Navarro
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldDataWorks Summit/Hadoop Summit
 
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdfMAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdfGary Mazzaferro
 
DITA and S1000D Two Paths to Structured Documentation
DITA and S1000D   Two Paths to Structured DocumentationDITA and S1000D   Two Paths to Structured Documentation
DITA and S1000D Two Paths to Structured DocumentationJoseph Storbeck
 
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...Denodo
 
Enterprise Data Intelligence Foundation for Success
Enterprise Data IntelligenceFoundation for SuccessEnterprise Data IntelligenceFoundation for Success
Enterprise Data Intelligence Foundation for SuccessDougSchoemaker
 

Similar a Applied enterprise semantic mining (20)

The New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the CloudThe New Database Frontier: Harnessing the Cloud
The New Database Frontier: Harnessing the Cloud
 
Applied Semantic Search with Microsoft SQL Server
Applied Semantic Search with Microsoft SQL ServerApplied Semantic Search with Microsoft SQL Server
Applied Semantic Search with Microsoft SQL Server
 
Document Classification using DMX in SQL Server Analysis Services
Document Classification using DMX in SQL Server Analysis ServicesDocument Classification using DMX in SQL Server Analysis Services
Document Classification using DMX in SQL Server Analysis Services
 
Sql Saturday 111 Atlanta applied enterprise semantic mining
Sql Saturday 111 Atlanta applied enterprise semantic miningSql Saturday 111 Atlanta applied enterprise semantic mining
Sql Saturday 111 Atlanta applied enterprise semantic mining
 
Chug building a data lake in azure with spark and databricks
Chug   building a data lake in azure with spark and databricksChug   building a data lake in azure with spark and databricks
Chug building a data lake in azure with spark and databricks
 
Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008Data Mining With Excel 2007 And SQL Server 2008
Data Mining With Excel 2007 And SQL Server 2008
 
Solutions Linux 2013: SpagoBI and Talend jointly support Big Data scenarios
Solutions Linux 2013: SpagoBI and Talend jointly support Big Data scenarios Solutions Linux 2013: SpagoBI and Talend jointly support Big Data scenarios
Solutions Linux 2013: SpagoBI and Talend jointly support Big Data scenarios
 
[db tech showcase Tokyo 2018] #dbts2018 #B38 『Big Data and the Multi-model Da...
[db tech showcase Tokyo 2018] #dbts2018 #B38 『Big Data and the Multi-model Da...[db tech showcase Tokyo 2018] #dbts2018 #B38 『Big Data and the Multi-model Da...
[db tech showcase Tokyo 2018] #dbts2018 #B38 『Big Data and the Multi-model Da...
 
Data mining with excel 2010 and power pivot
Data mining with excel 2010 and power pivotData mining with excel 2010 and power pivot
Data mining with excel 2010 and power pivot
 
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio...
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio...Self Service Analytics and a Modern Data Architecture with Data Virtualizatio...
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio...
 
The Big Picture: Big Data for the New Wave of Analytics
The Big Picture: Big Data for the New Wave of AnalyticsThe Big Picture: Big Data for the New Wave of Analytics
The Big Picture: Big Data for the New Wave of Analytics
 
SQL Server Data Mining for SQL Server Professionals
SQL Server Data Mining for SQL Server Professionals SQL Server Data Mining for SQL Server Professionals
SQL Server Data Mining for SQL Server Professionals
 
[db tech showcase Tokyo 2018] #dbts2018 #B36 『Design Your Databases straight ...
[db tech showcase Tokyo 2018] #dbts2018 #B36 『Design Your Databases straight ...[db tech showcase Tokyo 2018] #dbts2018 #B36 『Design Your Databases straight ...
[db tech showcase Tokyo 2018] #dbts2018 #B36 『Design Your Databases straight ...
 
Big data oracle_introduccion
Big data oracle_introduccionBig data oracle_introduccion
Big data oracle_introduccion
 
Organising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data WorldOrganising the Data Lake - Information Management in a Big Data World
Organising the Data Lake - Information Management in a Big Data World
 
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdfMAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf
 
DITA and S1000D Two Paths to Structured Documentation
DITA and S1000D   Two Paths to Structured DocumentationDITA and S1000D   Two Paths to Structured Documentation
DITA and S1000D Two Paths to Structured Documentation
 
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo...
 
Enterprise Data Intelligence Foundation for Success
Enterprise Data IntelligenceFoundation for SuccessEnterprise Data IntelligenceFoundation for Success
Enterprise Data Intelligence Foundation for Success
 
Big Data
Big DataBig Data
Big Data
 

Más de Mark Tabladillo

How to find low-cost or free data science resources 202006
How to find low-cost or free data science resources 202006How to find low-cost or free data science resources 202006
How to find low-cost or free data science resources 202006Mark Tabladillo
 
Microsoft Build 2020: Data Science Recap
Microsoft Build 2020: Data Science RecapMicrosoft Build 2020: Data Science Recap
Microsoft Build 2020: Data Science RecapMark Tabladillo
 
201909 Automated ML for Developers
201909 Automated ML for Developers201909 Automated ML for Developers
201909 Automated ML for DevelopersMark Tabladillo
 
201908 Overview of Automated ML
201908 Overview of Automated ML201908 Overview of Automated ML
201908 Overview of Automated MLMark Tabladillo
 
201906 01 Introduction to ML.NET 1.0
201906 01 Introduction to ML.NET 1.0201906 01 Introduction to ML.NET 1.0
201906 01 Introduction to ML.NET 1.0Mark Tabladillo
 
201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019Mark Tabladillo
 
201906 03 Introduction to NimbusML
201906 03 Introduction to NimbusML201906 03 Introduction to NimbusML
201906 03 Introduction to NimbusMLMark Tabladillo
 
201906 02 Introduction to AutoML with ML.NET 1.0
201906 02 Introduction to AutoML with ML.NET 1.0201906 02 Introduction to AutoML with ML.NET 1.0
201906 02 Introduction to AutoML with ML.NET 1.0Mark Tabladillo
 
201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine LearningMark Tabladillo
 
201905 Azure Certification DP-100: Designing and Implementing a Data Science ...
201905 Azure Certification DP-100: Designing and Implementing a Data Science ...201905 Azure Certification DP-100: Designing and Implementing a Data Science ...
201905 Azure Certification DP-100: Designing and Implementing a Data Science ...Mark Tabladillo
 
Big Data Advanced Analytics on Microsoft Azure 201904
Big Data Advanced Analytics on Microsoft Azure 201904Big Data Advanced Analytics on Microsoft Azure 201904
Big Data Advanced Analytics on Microsoft Azure 201904Mark Tabladillo
 
Managing Enterprise Data Science 201904
Managing Enterprise Data Science 201904Managing Enterprise Data Science 201904
Managing Enterprise Data Science 201904Mark Tabladillo
 
Training of Python scikit-learn models on Azure
Training of Python scikit-learn models on AzureTraining of Python scikit-learn models on Azure
Training of Python scikit-learn models on AzureMark Tabladillo
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureMark Tabladillo
 
Advanced Analytics with Power BI 201808
Advanced Analytics with Power BI 201808Advanced Analytics with Power BI 201808
Advanced Analytics with Power BI 201808Mark Tabladillo
 
Microsoft Cognitive Toolkit (Atlanta Code Camp 2017)
Microsoft Cognitive Toolkit (Atlanta Code Camp 2017)Microsoft Cognitive Toolkit (Atlanta Code Camp 2017)
Microsoft Cognitive Toolkit (Atlanta Code Camp 2017)Mark Tabladillo
 
Machine learning services with SQL Server 2017
Machine learning services with SQL Server 2017Machine learning services with SQL Server 2017
Machine learning services with SQL Server 2017Mark Tabladillo
 
Microsoft Technologies for Data Science 201612
Microsoft Technologies for Data Science 201612Microsoft Technologies for Data Science 201612
Microsoft Technologies for Data Science 201612Mark Tabladillo
 
How Big Companies plan to use Our Big Data 201610
How Big Companies plan to use Our Big Data 201610How Big Companies plan to use Our Big Data 201610
How Big Companies plan to use Our Big Data 201610Mark Tabladillo
 
Georgia Tech Data Science Hackathon September 2016
Georgia Tech Data Science Hackathon September 2016Georgia Tech Data Science Hackathon September 2016
Georgia Tech Data Science Hackathon September 2016Mark Tabladillo
 

Más de Mark Tabladillo (20)

How to find low-cost or free data science resources 202006
How to find low-cost or free data science resources 202006How to find low-cost or free data science resources 202006
How to find low-cost or free data science resources 202006
 
Microsoft Build 2020: Data Science Recap
Microsoft Build 2020: Data Science RecapMicrosoft Build 2020: Data Science Recap
Microsoft Build 2020: Data Science Recap
 
201909 Automated ML for Developers
201909 Automated ML for Developers201909 Automated ML for Developers
201909 Automated ML for Developers
 
201908 Overview of Automated ML
201908 Overview of Automated ML201908 Overview of Automated ML
201908 Overview of Automated ML
 
201906 01 Introduction to ML.NET 1.0
201906 01 Introduction to ML.NET 1.0201906 01 Introduction to ML.NET 1.0
201906 01 Introduction to ML.NET 1.0
 
201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019201906 04 Overview of Automated ML June 2019
201906 04 Overview of Automated ML June 2019
 
201906 03 Introduction to NimbusML
201906 03 Introduction to NimbusML201906 03 Introduction to NimbusML
201906 03 Introduction to NimbusML
 
201906 02 Introduction to AutoML with ML.NET 1.0
201906 02 Introduction to AutoML with ML.NET 1.0201906 02 Introduction to AutoML with ML.NET 1.0
201906 02 Introduction to AutoML with ML.NET 1.0
 
201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
 
201905 Azure Certification DP-100: Designing and Implementing a Data Science ...
201905 Azure Certification DP-100: Designing and Implementing a Data Science ...201905 Azure Certification DP-100: Designing and Implementing a Data Science ...
201905 Azure Certification DP-100: Designing and Implementing a Data Science ...
 
Big Data Advanced Analytics on Microsoft Azure 201904
Big Data Advanced Analytics on Microsoft Azure 201904Big Data Advanced Analytics on Microsoft Azure 201904
Big Data Advanced Analytics on Microsoft Azure 201904
 
Managing Enterprise Data Science 201904
Managing Enterprise Data Science 201904Managing Enterprise Data Science 201904
Managing Enterprise Data Science 201904
 
Training of Python scikit-learn models on Azure
Training of Python scikit-learn models on AzureTraining of Python scikit-learn models on Azure
Training of Python scikit-learn models on Azure
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft Azure
 
Advanced Analytics with Power BI 201808
Advanced Analytics with Power BI 201808Advanced Analytics with Power BI 201808
Advanced Analytics with Power BI 201808
 
Microsoft Cognitive Toolkit (Atlanta Code Camp 2017)
Microsoft Cognitive Toolkit (Atlanta Code Camp 2017)Microsoft Cognitive Toolkit (Atlanta Code Camp 2017)
Microsoft Cognitive Toolkit (Atlanta Code Camp 2017)
 
Machine learning services with SQL Server 2017
Machine learning services with SQL Server 2017Machine learning services with SQL Server 2017
Machine learning services with SQL Server 2017
 
Microsoft Technologies for Data Science 201612
Microsoft Technologies for Data Science 201612Microsoft Technologies for Data Science 201612
Microsoft Technologies for Data Science 201612
 
How Big Companies plan to use Our Big Data 201610
How Big Companies plan to use Our Big Data 201610How Big Companies plan to use Our Big Data 201610
How Big Companies plan to use Our Big Data 201610
 
Georgia Tech Data Science Hackathon September 2016
Georgia Tech Data Science Hackathon September 2016Georgia Tech Data Science Hackathon September 2016
Georgia Tech Data Science Hackathon September 2016
 

Applied enterprise semantic mining

  • 1. Mark Tabladillo Ph.D. Data Mining Scientist MarkTab Inc. Applied Enterprise Semantic Mining T E X T M I N I NG W I T H S Q L S E RVER 2 0 1 2 P R ESENTED AT AT L A NTA M I CROS OFT BU S I N ESS I N T EL LIGENCE G ROU P JA N UA RY 2 8 , 2 0 1 3 ©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
  • 2. About MarkTab: http://marktab.com, http://marktab.net, @MarkTabNet
  • 3. Introduction: SQL Server 2012 has new Programmability Enhancements ◦ Statistical Semantic Search ◦ File Tables ◦ Full-Text Search Improvements. These combined technologies make SQL Server 2012 a strong contender in text mining.
  • 4. Challenges: Building and maintaining applications with relational and non-relational data is hard ◦ Complex integration ◦ Duplicated functionality ◦ Compensation for unavailable services. 80% of all data is not stored in databases! Most of it is “unstructured” (2012, Michael Rys, Microsoft)
  • 5. Microsoft and Google
  • 6. History, July 2008 ◦ Microsoft purchases Powerset for US$100 Million ◦ Google dismisses semantic search ◦ http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/ ◦ http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html
  • 7. History, March 2009 ◦ Google announces “snippets” as relevant to search ◦ The media picks this story up as “semantic search” ◦ http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html
  • 8. History, February 2012 ◦ Google announces Knowledge Graph, an explicit application of semantic search ◦ http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
  • 9. History, April 2012 ◦ Microsoft purchases 800+ patents from AOL for US$1 Billion ◦ Among the patents are semantic search and metadata querying – older than Google ◦ http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
  • 10. New in SQL Server 2012: HTTP://MSDN.MICROSOFT.COM/EN-US/LIBRARY/CC645577.ASPX
  • 11. Goals of Semantic Search: Reduce the cost of managing all data. Simplify the development of applications over all data. Provide management and programming services for all data. Make SQL Server the preferred choice for managing Unstructured Data and allow building Rich Application Experience on top. (2012, Michael Rys, Microsoft)
  • 12. Statistical Semantic Search: Identifies statistically relevant key phrases. Based on these phrases, can identify (by score) similar documents.
  • 13. FileTables: Built on existing SQL Server FILESTREAM technology. Files and documents ◦ Stored in special tables in SQL Server ◦ Accessed as if they were stored in the file system
  • 14. Full-Text Search Enhancements: Property search: search on tagged properties (such as author or title). Customizable NEAR: find words or phrases close to one another. New Word Breakers and Stemmers (for many languages).
  • 15. From Documents to Output [diagram labels: Office, PDF, Varchar, NVarchar; Rowset Output with Scores]
  • 16. “Beyond Relational” vs. “Adoption”: Start with unstructured (meaning non-relational) data. Use Windows technology ◦ Reading and Writing Files (Win32 API) ◦ iFilters for reading proprietary formats. Develop indexed structure from unstructured data.
  • 17. [diagram labels: Documents; iFilters (iFilter Required); Full-Text Keyword Index “FTI”; Semantic Database; Semantic Key Phrase Index – Tag Index “TI”; Semantic Document Similarity Index “DSI”]
  • 18. “iFilter”? IFilters are components that allow search services to index content of specific file types, letting you search for content in those files. They are intended for use with Microsoft Search Services (SharePoint, SQL, Exchange, Windows Search).
  • 19. Microsoft Office 2010 Filters Pack: Legacy Office Filter (97-2003; .doc, .ppt, .xls); Metro Office Filter (2007; .docx, .pptx, .xlsx); Zip Filter; OneNote Filter; Visio Filter; Publisher Filter; Open Document Format Filter
  • 20. Adobe PDF iFilter 9 for 64-bit platforms: Allows PDF search. Not currently supported for Windows 7 or 8 ◦ But I used it anyway. Add the Bin directory to your path ◦ Computer (right click), Properties, Advanced System Settings, Environment Variables
  • 21. “Semantic Language Statistics Database”? This database contains the statistical language models required by semantic search. A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.
  • 22. Languages Currently Supported: Traditional Chinese; German; English; French; Italian; Brazilian; Russian; Swedish; Simplified Chinese; British English; Portuguese; Chinese (Hong Kong SAR, PRC); Spanish; Chinese (Singapore); Chinese (Macau SAR)
  • 23. Phases of Semantic Indexing: Full Text Keyword Index “FTI”; Semantic Document Similarity Index “DSI”; Semantic Key Phrase Index – Tag Index “TI”. http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
  • 24. Performance
  • 25. Integrated Full Text Search (iFTS): Improved Performance and Scale ◦ Scale-up to 350M documents for storage and search ◦ iFTS query performance 7-10 times faster than in SQL Server 2008 ◦ Worst-case iFTS query response times less than 3 sec for corpus ◦ Similar or better than main database search competitors (2012, Michael Rys, Microsoft)
  • 26. Linear Scale of FTI/TI/DSI: First known linearly scaling end-to-end Search and Semantic product in the industry. [chart: Time in Seconds vs. Number of Documents] (2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
  • 27. Conclusion: SQL Server 2012 adds new text processing capabilities. This technology scales linearly. Microsoft invites millions of documents for enterprise-level applications.
  • 28. Network: MarkTab Consulting ◦ http://marktab.com Blog ◦ http://marktab.net Twitter ◦ @marktabnet
  • 29. Appendix
  • 30. References: Video ◦ http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search ◦ http://www.microsoftpdc.com/2009/SVR32 Semantic Search (Books Online) – explains the demo ◦ http://msdn.microsoft.com/en-us/library/gg492075.aspx Paper ◦ http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf
  • 31. Demo: My Semantic Search Sample, http://mysemanticsearch.codeplex.com/ Requires: ◦ iFilters ◦ Semantic Language Statistics Database ◦ IIS7, IIS6, with Windows Authentication ◦ .NET 4.0 ◦ Silverlight 4.0 ◦ FILESTREAM (complete)
  • 32. Demo: T-SQL and Documents (Naveen Garg). Requires Adventure Works (from Codeplex). http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx
  • 33. Abstract: SQL Server 2012 debuts a new Semantic Platform (commonly known as the Semantic Search applied task). This text mining technology leverages the already established Full Text Index and builds semantic indexes in a two-phase process. This session's detailed description and demo give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demo is a web-based Silverlight application showing how to interactively use semantic search. Currently, the indexes work for 15 languages. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft text and data mining functionality.
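
The Tag Index and Document Similarity Index described in the slides are queried from T-SQL through the SQL Server 2012 rowset functions SEMANTICKEYPHRASETABLE and SEMANTICSIMILARITYTABLE, which return rows with scores. The following is a minimal sketch, not taken from the presentation or its demos, of a Python helper that composes such queries as text. The table name dbo.Documents, the column file_stream, and the variable @doc_key are hypothetical placeholders; the sketch assumes the table already has a full-text index with statistical semantics enabled on that column, and that @doc_key is bound to the key of the source document when the query is executed.

```python
import re

# Plain one- or two-part names only (for example "dbo.Documents"). Anything else
# is rejected, because the names are spliced directly into the SQL text.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def _checked(name: str) -> str:
    """Return the name unchanged if it is a plain identifier, else raise."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def key_phrase_query(table: str, column: str, top: int = 10) -> str:
    """T-SQL for the highest-scoring key phrases (Tag Index) of one document."""
    return (
        f"SELECT TOP ({int(top)}) keyphrase, score\n"
        f"FROM SEMANTICKEYPHRASETABLE({_checked(table)}, {_checked(column)}, @doc_key)\n"
        "ORDER BY score DESC;"
    )


def similar_documents_query(table: str, column: str, top: int = 10) -> str:
    """T-SQL for the documents most similar to one document (Document Similarity Index)."""
    return (
        f"SELECT TOP ({int(top)}) matched_document_key, score\n"
        f"FROM SEMANTICSIMILARITYTABLE({_checked(table)}, {_checked(column)}, @doc_key)\n"
        "ORDER BY score DESC;"
    )


if __name__ == "__main__":
    # Hypothetical table and column names, used only for illustration.
    print(key_phrase_query("dbo.Documents", "file_stream"))
    print(similar_documents_query("dbo.Documents", "file_stream", top=5))
```

The helper only builds query text; running it against a server, and supplying @doc_key through whatever database driver is in use, is left out on purpose. Table and column names are validated rather than parameterized because T-SQL does not accept them as parameters.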