SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
The Analytics Challenges Posed 
by Big Data 
Roger Bradford 
Agilex Technologies 
15 April 2013
2 
Velocity 
Standard Big Data View 
Big Data 
Volume 
Traditional BI 
Source: Forrester Group
3 
Big Data - Volume Examples 
Activity Rate 
E-mail >300 Billion*/Day 
Text Messages > 24 Billion/Day 
Cell Phones > 10 Billion Calls/Day 
YouTube > 1 Million New Videos/day 
Twitter > 500 Million Tweets/Day 
Facebook > 1 Billion Posts/Day 
*Short Scale Billion = 1,000 Million = 109
4 
Big Data - Velocity Example
By Website Content By User Native Language 
English 
5 
Big Data Variety Example – 
Internet Language Usage 
Spanish 
German 
English 
Other 
French 
Chinese 
Japanese 
Russian 
Russian 
Other 
Portuguese 
Japanese 
Spanish 
Chinese 
French 
Arabic 
German
Big Data - Variability Example 
Functions of 17,209 Genes 6
7 
Structured and Unstructured Data 
Structured Unstructured 
Sales Data E-mail 
Financial Data Instant messaging 
Climate Data Tweets 
Census Data Audio 
Movie Ratings Images 
Sensor Measurements Video 
Unstructured Information Accounts for more than 
80% of all Data in Organizations and is Growing 
15X Faster than Structured Data
8 
Challenges: Big Data vs. Hard Problems 
Big Data 
Volume 
Velocity 
Variety 
Variability 
Hard Problems 
Ambiguity 
Nth-order Relations 
Cardinality 
Non-locality
9 
•Synonomy: 
Ambiguity in Text 
Common English Nouns have 6-8 Close Synonyms 
Common English Verbs have 9-11 
•Polysemy: 
The Word Strike has 30 Common Meanings 
•Entity Ambiguity: 
 There are more than 45,000 People Named John Smith in 
the United States 
 There are more than 300,000 People Named Zhang Wei 
in China 
•Entity Variability: 
Some Person Names in Collections of Interest Occur in over 100 
Variants
Name Variant Example 
Vladimir Putin Vladimir Poutine Vladimir V. Putin 
Vladmir Putin Valdimir Putin Vladimir Vladimirovich 
10 
Putin 
Vladamir Putin Vladimr Putin Vladimir Vladimirovitch 
Putin 
Vlaidimir Putin Vladimir Puttin Vladimir Vladimirovic 
Putin 
Vladimir Poutin Putin, Vladimir Putin, Vladimir 
Vladimirovitch 
Vladimir Puttin Vladamir Putin Putin, Vladimir 
Vladimirovich 
Vlademir Putin Vladimier Putin V.V. Putin
# of Relations in 
5,998 Documents: 
11 
John  Bob Relationship: 
First Order: 
Second Order: 
Third Order: 
JOHN 
BOB 
JOHN 
TOM 
TOM 
BOB 
JOHN 
TOM 
TOM 
DAVE 
51,474 
DAVE 
BOB 
11,026,553 
68,070,600 
Nth-order Relationships
12 
Cardinality Example – Alias Detection 
Arthur 
Bishop 
Raul 
Sanchez 
Joel 
Rifkin 
Jose 
Haddock 
William 
Bonin 
Arthur 
Bishop 
Raul 
Sanchez 
.0366 
Joel Rifkin -.0464 .0616 
Jose 
Haddock 
.0366 .9675 .0616 
William 
Bonin 
.1526 .0125 .0016 .0125 
Challenge: Many by Many Comparisons- 
Processing 10 Million Names Requires 50 Trillion 
Comparisons
Non-locality Example– Clustering Documents 
13
14 
Twitter Example
15 
The Tweet Analysis Problem 
• Volume – 500 Million Tweets per Day Worldwide 
• Challenges: 
Very Low Signal to Noise Ratio (31 Million People 
Follow Lady Gaga) 
Implicit Context (“Let’s all Meet at Bob’s House”) 
Incomplete, Conflicting, and Erroneous Information 
Deliberate Deception (50% of all Tweets are Machine-generated)
16 
Applicable Analytic Techniques 
• Statistical Analysis 
• Categorization 
• Clustering 
• NLP Techniques 
• Semantic Analysis 
In General, Application of such Techniques to 
Big Data Problems is Computationally Intensive
17 
Cloud Enabling 
Millions of Documents 
Semantic Indexing Time (in Hours) 
Datacenter 
Server 
Map – Reduce 
with 64 Nodes
18 
GPU Enabling 
CPU 
GPU 
CPU: Intel Xeon X5660 
GPU: Nvidia Quadro 2000 
Seconds (in Thousands) 
Elements (in Billions) 
kNN Calculation
Representation 
19 
Semantic Enabling 
Data 
Semantic 
Analysis 
Semantic 
Space 
• Accommodates Nth-order Relationships 
• Automatically Coalesces Term Variants 
• Supports Automated Entity Disambiguation 
• Identifies Subtle Relationships 
• Can Combine Structured and Unstructured Data 
But Not as Well Understood as Structured Data 
Analysis Techniques
20 
IBM WATSON Winning “Jeopardy” 
• Volume: “Only” 1TB of Data (Mostly Text) 
• Velocity: Meeting the 3-second Response 
__Requirement of Jeopardy Required 80 
__Teraflops of Processing Power 
Challenge: 
•Question Decomposition
21 
Music Genome 
Objective: Match Liked Songs to Recommended Ones 
•  400 Attributes per 
_Song 
• 10 Million Songs 
• Each Song 
_Represented by a 
_Vector of Elements 
• 140 Trillion Elements 
• Distance Function is 
_Calculated between All 
_Songs
22 
Literature-based Discovery 
• PubMed Abstracts 
• Gene – Function Relationships 
__Derived Semantically 
• 98,074,359 Potential Gene-function 
__Associations. 
Zukas, A., GO-Driven Literature-Based Discovery using Semantic Analysis, MS Thesis, George Mason University, 
2007.
23 
Literature-based Discovery (Cont’d) 
Latent Gene and Function Relationships from 
the June 2006 Gene Ontology Later 
Documented in the January 2007 Gene 
Ontology 
•Nth-order Relationships 
• Complexity of Relations 
Challenges:
24 
Patent 
Databases 
Online 
Technical 
Literature 
Internal 
Publications 
Semantic Representation 
Space 
 
 
 
 
  
Prior Art 
Analysis 
White 
Space 
Analysis 
Patent Analysis 
• Need for Conceptual Comparisons 
•Technical Terminology / Obfuscation 
• Convoluted Structure (Claims) 
Challenges:
25 
Concept-driven Discovery 
Incoming 
Reporting Stream 
Fraud 
Exemplars 
Semantic 
Representation 
Space 
Xxxxxxxxx 
Xxxxxxxxx 
defraud 
Xxxxxxxxx 
scheme 
Continuous Cycling 
through ALL Names 
Generate 
Alerts 
Issue: N a me Disambiguation
26 
Rapid Data Overview 
Clustering 
Political 
Economic 
Incoming 
Data 
Admin 
Technical 
Regulatory 
•Technical Information 
• Multilingual Data 
Challenges:
Docs in 13 Languages 
 English Examples 
Range of 
Human 
Performance 
27 
Crosslingual Document Categorization 
– Big Data Solution Accuracy + Completeness 
of Categorization 
English Docs  
English Examples 
Number of Simultaneous Languages
28 
Where is Big Data Analytics Going? 
• Real-time Analysis 
• Multimedia Collections 
 Text 
 Structured Data 
 Audio 
 Video 
 Sensor Data 
• Temporal and Spatial Data Integration 
• Interactive Visualization 
• Continuous Retrospective Analysis 
• Advanced Analytics (Especially Semantic Analysis)
29 
Integration of Multimedia Data 
Integrated 
Analytics 
Structured Data 
Images 
Multi-lingual 
Text 
Audio 
Sensor Data 
Video 
Buyer Seller Material Amount Date 
John 
Smith 
Ace 
Jewelers 
Diamond 
Ring 
3 Carat 8/18/06
30 
Spatiotemporal Data Integration 
•Fully Automatic Integration of Spatial, 
_Temporal, and _Semantic Information 
•Location Disambiguation 
Challenges:
31 
Questions or Comments 
Roger Bradford 
Agilex Technologies Inc 
1-703-889-3916 
r.bradford@agilex.com

Más contenido relacionado

Similar a The Analytics Challenges Posed by Big Data and Hard Problems

Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataAndre Freitas
 
Fairness, Transparency, and Privacy in AI @ LinkedIn
Fairness, Transparency, and Privacy in AI @ LinkedInFairness, Transparency, and Privacy in AI @ LinkedIn
Fairness, Transparency, and Privacy in AI @ LinkedInKrishnaram Kenthapadi
 
HPCC Systems Presentation to TDWI Chicago Chapter
HPCC Systems Presentation to TDWI Chicago ChapterHPCC Systems Presentation to TDWI Chicago Chapter
HPCC Systems Presentation to TDWI Chicago ChapterHPCC Systems
 
DataEd Online: Demystifying Big Data
DataEd Online: Demystifying Big DataDataEd Online: Demystifying Big Data
DataEd Online: Demystifying Big DataDATAVERSITY
 
Data-Ed: Demystifying Big Data
Data-Ed: Demystifying Big DataData-Ed: Demystifying Big Data
Data-Ed: Demystifying Big DataData Blueprint
 
chương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfchương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfphongnguyen312110237
 
Meet 1 - Introduction Data Mining - Dedi Darwis.pdf
Meet 1 - Introduction Data Mining - Dedi Darwis.pdfMeet 1 - Introduction Data Mining - Dedi Darwis.pdf
Meet 1 - Introduction Data Mining - Dedi Darwis.pdf09372002dedi
 
Big Data for One Big Family
Big Data for One Big FamilyBig Data for One Big Family
Big Data for One Big FamilyMatt Asay
 
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Sudhir Tonse
 
Big Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics PlatformBig Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics PlatformSudhir Tonse
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handoutYi-Shin Chen
 
Course Tech 2013, Mark Frydenberg, Drinking from the Fire Hose: Tools for Int...
Course Tech 2013, Mark Frydenberg, Drinking from the Fire Hose: Tools for Int...Course Tech 2013, Mark Frydenberg, Drinking from the Fire Hose: Tools for Int...
Course Tech 2013, Mark Frydenberg, Drinking from the Fire Hose: Tools for Int...Cengage Learning
 
HUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation SlidesHUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation SlidesJohn Mulhall
 
Seminaire bigdata23102014
Seminaire bigdata23102014Seminaire bigdata23102014
Seminaire bigdata23102014Raja Chiky
 
Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Peter Mika
 
巨量與開放資料之創新機會與關鍵挑戰-曾新穆
巨量與開放資料之創新機會與關鍵挑戰-曾新穆巨量與開放資料之創新機會與關鍵挑戰-曾新穆
巨量與開放資料之創新機會與關鍵挑戰-曾新穆台灣資料科學年會
 
Big Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar SemwalBig Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar SemwalIIIT Allahabad
 

Similar a The Analytics Challenges Posed by Big Data and Hard Problems (20)

Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
 
Fairness, Transparency, and Privacy in AI @ LinkedIn
Fairness, Transparency, and Privacy in AI @ LinkedInFairness, Transparency, and Privacy in AI @ LinkedIn
Fairness, Transparency, and Privacy in AI @ LinkedIn
 
HPCC Systems Presentation to TDWI Chicago Chapter
HPCC Systems Presentation to TDWI Chicago ChapterHPCC Systems Presentation to TDWI Chicago Chapter
HPCC Systems Presentation to TDWI Chicago Chapter
 
DataEd Online: Demystifying Big Data
DataEd Online: Demystifying Big DataDataEd Online: Demystifying Big Data
DataEd Online: Demystifying Big Data
 
Data-Ed: Demystifying Big Data
Data-Ed: Demystifying Big DataData-Ed: Demystifying Big Data
Data-Ed: Demystifying Big Data
 
chương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfchương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdf
 
datamining-lect1.pptx
datamining-lect1.pptxdatamining-lect1.pptx
datamining-lect1.pptx
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
Meet 1 - Introduction Data Mining - Dedi Darwis.pdf
Meet 1 - Introduction Data Mining - Dedi Darwis.pdfMeet 1 - Introduction Data Mining - Dedi Darwis.pdf
Meet 1 - Introduction Data Mining - Dedi Darwis.pdf
 
Big Data for One Big Family
Big Data for One Big FamilyBig Data for One Big Family
Big Data for One Big Family
 
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
 
Big Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics PlatformBig Data Pipeline and Analytics Platform
Big Data Pipeline and Analytics Platform
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handout
 
Course Tech 2013, Mark Frydenberg, Drinking from the Fire Hose: Tools for Int...
Course Tech 2013, Mark Frydenberg, Drinking from the Fire Hose: Tools for Int...Course Tech 2013, Mark Frydenberg, Drinking from the Fire Hose: Tools for Int...
Course Tech 2013, Mark Frydenberg, Drinking from the Fire Hose: Tools for Int...
 
HUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation SlidesHUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation Slides
 
Seminaire bigdata23102014
Seminaire bigdata23102014Seminaire bigdata23102014
Seminaire bigdata23102014
 
[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊
 
Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015Semantic Search keynote at CORIA 2015
Semantic Search keynote at CORIA 2015
 
巨量與開放資料之創新機會與關鍵挑戰-曾新穆
巨量與開放資料之創新機會與關鍵挑戰-曾新穆巨量與開放資料之創新機會與關鍵挑戰-曾新穆
巨量與開放資料之創新機會與關鍵挑戰-曾新穆
 
Big Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar SemwalBig Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar Semwal
 

Más de Dr. Haxel Consult

AI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering ManagementAI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering ManagementDr. Haxel Consult
 
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...Dr. Haxel Consult
 
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...Dr. Haxel Consult
 
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...Dr. Haxel Consult
 
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...Dr. Haxel Consult
 
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...Dr. Haxel Consult
 
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...Dr. Haxel Consult
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...Dr. Haxel Consult
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...Dr. Haxel Consult
 
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...Dr. Haxel Consult
 
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...Dr. Haxel Consult
 
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...Dr. Haxel Consult
 
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...Dr. Haxel Consult
 
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...Dr. Haxel Consult
 
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...Dr. Haxel Consult
 
AI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance CenterAI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance CenterDr. Haxel Consult
 
AI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOCAI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOCDr. Haxel Consult
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...Dr. Haxel Consult
 
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...Dr. Haxel Consult
 

Más de Dr. Haxel Consult (20)

AI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering ManagementAI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
AI-SDV 2022: Henry Chang Patent Intelligence and Engineering Management
 
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
AI-SDV 2022: Creation and updating of large Knowledge Graphs through NLP Anal...
 
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
AI-SDV 2022: The race to net zero: Tracking the green industrial revolution t...
 
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
 
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
AI-SDV 2022: Domain Knowledge makes Artificial Intelligence Smart Linda Ander...
 
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new w...
 
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
AI-SDV 2022: Rolling out web crawling at Boehringer Ingelheim - 10 years of e...
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...
 
AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...AI-SDV 2022: Machine learning based patent categorization: A success story in...
AI-SDV 2022: Machine learning based patent categorization: A success story in...
 
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
AI-SDV 2022: Finding the WHAT – Will AI help? Nils Newman (Search Technology,...
 
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
 
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
 
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
 
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
 
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
 
AI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance CenterAI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance Center
 
AI-SDV 2022: Lighthouse IP
AI-SDV 2022: Lighthouse IPAI-SDV 2022: Lighthouse IP
AI-SDV 2022: Lighthouse IP
 
AI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOCAI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOC
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
 
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
 

Último

Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Dana Luther
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Lucknow
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 

Último (20)

Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Uttam Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 

The Analytics Challenges Posed by Big Data and Hard Problems

  • 1. The Analytics Challenges Posed by Big Data Roger Bradford Agilex Technologies 15 April 2013
  • 2. 2 Velocity Standard Big Data View Big Data Volume Traditional BI Source: Forrester Group
  • 3. 3 Big Data - Volume Examples Activity Rate E-mail >300 Billion*/Day Text Messages > 24 Billion/Day Cell Phones > 10 Billion Calls/Day YouTube > 1 Million New Videos/day Twitter > 500 Million Tweets/Day Facebook > 1 Billion Posts/Day *Short Scale Billion = 1,000 Million = 109
  • 4. 4 Big Data - Velocity Example
  • 5. By Website Content By User Native Language English 5 Big Data Variety Example – Internet Language Usage Spanish German English Other French Chinese Japanese Russian Russian Other Portuguese Japanese Spanish Chinese French Arabic German
  • 6. Big Data - Variability Example Functions of 17,209 Genes 6
  • 7. 7 Structured and Unstructured Data Structured Unstructured Sales Data E-mail Financial Data Instant messaging Climate Data Tweets Census Data Audio Movie Ratings Images Sensor Measurements Video Unstructured Information Accounts for more than 80% of all Data in Organizations and is Growing 15X Faster than Structured Data
  • 8. 8 Challenges: Big Data vs. Hard Problems Big Data Volume Velocity Variety Variability Hard Problems Ambiguity Nth-order Relations Cardinality Non-locality
  • 9. 9 •Synonomy: Ambiguity in Text Common English Nouns have 6-8 Close Synonyms Common English Verbs have 9-11 •Polysemy: The Word Strike has 30 Common Meanings •Entity Ambiguity: There are more than 45,000 People Named John Smith in the United States There are more than 300,000 People Named Zhang Wei in China •Entity Variability: Some Person Names in Collections of Interest Occur in over 100 Variants
  • 10. Name Variant Example Vladimir Putin Vladimir Poutine Vladimir V. Putin Vladmir Putin Valdimir Putin Vladimir Vladimirovich 10 Putin Vladamir Putin Vladimr Putin Vladimir Vladimirovitch Putin Vlaidimir Putin Vladimir Puttin Vladimir Vladimirovic Putin Vladimir Poutin Putin, Vladimir Putin, Vladimir Vladimirovitch Vladimir Puttin Vladamir Putin Putin, Vladimir Vladimirovich Vlademir Putin Vladimier Putin V.V. Putin
  • 11. # of Relations in 5,998 Documents: 11 John Bob Relationship: First Order: Second Order: Third Order: JOHN BOB JOHN TOM TOM BOB JOHN TOM TOM DAVE 51,474 DAVE BOB 11,026,553 68,070,600 Nth-order Relationships
  • 12. 12 Cardinality Example – Alias Detection Arthur Bishop Raul Sanchez Joel Rifkin Jose Haddock William Bonin Arthur Bishop Raul Sanchez .0366 Joel Rifkin -.0464 .0616 Jose Haddock .0366 .9675 .0616 William Bonin .1526 .0125 .0016 .0125 Challenge: Many by Many Comparisons- Processing 10 Million Names Requires 50 Trillion Comparisons
  • 15. 15 The Tweet Analysis Problem • Volume – 500 Million Tweets per Day Worldwide • Challenges: Very Low Signal to Noise Ratio (31 Million People Follow Lady Gaga) Implicit Context (“Let’s all Meet at Bob’s House”) Incomplete, Conflicting, and Erroneous Information Deliberate Deception (50% of all Tweets are Machine-generated)
  • 16. 16 Applicable Analytic Techniques • Statistical Analysis • Categorization • Clustering • NLP Techniques • Semantic Analysis In General, Application of such Techniques to Big Data Problems is Computationally Intensive
  • 17. 17 Cloud Enabling Millions of Documents Semantic Indexing Time (in Hours) Datacenter Server Map – Reduce with 64 Nodes
  • 18. 18 GPU Enabling CPU GPU CPU: Intel Xeon X5660 GPU: Nvidia Quadro 2000 Seconds (in Thousands) Elements (in Billions) kNN Calculation
  • 19. Representation 19 Semantic Enabling Data Semantic Analysis Semantic Space • Accommodates Nth-order Relationships • Automatically Coalesces Term Variants • Supports Automated Entity Disambiguation • Identifies Subtle Relationships • Can Combine Structured and Unstructured Data But Not as Well Understood as Structured Data Analysis Techniques
  • 20. 20 IBM WATSON Winning “Jeopardy” • Volume: “Only” 1TB of Data (Mostly Text) • Velocity: Meeting the 3-second Response __Requirement of Jeopardy Required 80 __Teraflops of Processing Power Challenge: •Question Decomposition
  • 21. 21 Music Genome Objective: Match Liked Songs to Recommended Ones • 400 Attributes per _Song • 10 Million Songs • Each Song _Represented by a _Vector of Elements • 140 Trillion Elements • Distance Function is _Calculated between All _Songs
  • 22. 22 Literature-based Discovery • PubMed Abstracts • Gene – Function Relationships __Derived Semantically • 98,074,359 Potential Gene-function __Associations. Zukas, A., GO-Driven Literature-Based Discovery using Semantic Analysis, MS Thesis, George Mason University, 2007.
  • 23. 23 Literature-based Discovery (Cont’d) Latent Gene and Function Relationships from the June 2006 Gene Ontology Later Documented in the January 2007 Gene Ontology •Nth-order Relationships • Complexity of Relations Challenges:
  • 24. 24 Patent Databases Online Technical Literature Internal Publications Semantic Representation Space Prior Art Analysis White Space Analysis Patent Analysis • Need for Conceptual Comparisons •Technical Terminology / Obfuscation • Convoluted Structure (Claims) Challenges:
  • 25. 25 Concept-driven Discovery Incoming Reporting Stream Fraud Exemplars Semantic Representation Space Xxxxxxxxx Xxxxxxxxx defraud Xxxxxxxxx scheme Continuous Cycling through ALL Names Generate Alerts Issue: N a me Disambiguation
  • 26. 26 Rapid Data Overview Clustering Political Economic Incoming Data Admin Technical Regulatory •Technical Information • Multilingual Data Challenges:
  • 27. Docs in 13 Languages English Examples Range of Human Performance 27 Crosslingual Document Categorization – Big Data Solution Accuracy + Completeness of Categorization English Docs English Examples Number of Simultaneous Languages
  • 28. 28 Where is Big Data Analytics Going? • Real-time Analysis • Multimedia Collections Text Structured Data Audio Video Sensor Data • Temporal and Spatial Data Integration • Interactive Visualization • Continuous Retrospective Analysis • Advanced Analytics (Especially Semantic Analysis)
  • 29. 29 Integration of Multimedia Data Integrated Analytics Structured Data Images Multi-lingual Text Audio Sensor Data Video Buyer Seller Material Amount Date John Smith Ace Jewelers Diamond Ring 3 Carat 8/18/06
  • 30. 30 Spatiotemporal Data Integration •Fully Automatic Integration of Spatial, _Temporal, and _Semantic Information •Location Disambiguation Challenges:
  • 31. 31 Questions or Comments Roger Bradford Agilex Technologies Inc 1-703-889-3916 r.bradford@agilex.com