According to Gartner, “through 2018, 80 percent of data lakes will not include effective metadata management capabilities, making them inefficient.” Tools within the Apache Spark ecosystem, such as Spark SQL, MLlib, and GraphX, are making it easier to ingest, transform, and query data, but a missing link remains.
In order to harness the power of pattern-finding to discover unknown insights from the relationships between your historical Big Data and streaming Fast Data, a mature, proven metadata repository must be at the center of your organization’s data architecture.
In this webinar, Leon Guzenda, Chief Technical Marketing Officer at Objectivity, will discuss how to use metadata to follow not just paths but also patterns, with examples including finding more efficient shipping routes, supporting recommendation engines, and catching money laundering and other types of financial fraud. He will also discuss the pros and cons of open source tools and how to leverage them with a metadata repository to reach the true potential of real-time relationship discovery.
Metadata and the Power of Pattern-Finding
1. METADATA AND THE POWER OF PATTERN-FINDING
May 24, 2016, for DATAVERSITY
LEON GUZENDA
Chief Technology Marketing Officer
2. AGENDA
• Who We Are
• Open Source Big & Fast Data Analytics
• Our Core Technology & New Product
• Pattern Finding Examples
• Q & A
4. OBJECTIVITY INC. OVERVIEW
• Private company, headquartered in Silicon Valley since 1988
• Verticals:
• Government: Intelligence, defense, crime detection & prevention
• Financial Services
• Industrial Internet of Things (IIoT)
• Energy
• Healthcare
• Horizontals:
• Graph analytics
• Complex, distributed, scalable database applications
8. ...OPEN SOURCE ANALYTICS
PROS:
• Large community
• Lots of algorithms
• Model works at scale
• Low startup costs
• Cost effective
CONS:
• Most algorithms are based on statistical correlation, clustering or filtering
• Graph algorithms mainly tackle theoretical problems
• Hadoop mostly targets files, not metadata.
• Metadata tools focus on technical parameters, not semantic content.
13. OUR FOCUS
• Complex Objects at scale:
• Relationships are first class citizens
• Ultra-fast navigation and pathfinding
• Not restricted by available RAM
• Scalability, performance, reliability and flexibility:
• Distributed database and distributed processing
• Light, small database kernel - from embedded to cluster to cloud
14. DISTRIBUTED DATA - SINGLE LOGICAL VIEW
Put the data and processing where it’s needed
• 1,000s of trillions of unique objects
• 1,000s of petabytes of storage
• Resolving an ID is fast, regardless of the number of objects
15. DISTRIBUTED PROCESSING
Put the data and processing where it’s needed
[Diagram labels: ThingSpan, Cache, Client Processes]
18. THINGSPAN FEATURES
• Uses the Apache Spark open source processing engine
• In partnership with Cloudera, Databricks, Hortonworks and MapR
• Powerful object and relationship modeling
• Can store data in HDFS and/or POSIX
• Ultra-fast graph navigation, pathfinding and pattern finding
• REST Server and API for loading data and performing graph analytics
• Spark DataFrame support to leverage MLlib, GraphX, SQL etc.
19. DISTRIBUTED PROCESSING & DATABASE
Hadoop Distributed File System
Distributed from top to bottom
24. PATTERN FINDING TECHNIQUES
• Conventional Business Intelligence Analytics: Uses filters and statistical correlation to find relationships between parameters.
• Graph Pattern Finding Analytics: Uses a combination of outlier, navigational and pathfinding queries.
• Find outliers with SQL or MLlib
• A navigational query can specify the Vertex and Edge types to be included or excluded and can invoke methods during the traversal, e.g. to compute transit time to a node.
• A pathfinding query can find the shortest path or all paths between two or more Vertices.
• Query type order depends upon the problem
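The navigational query described above can be illustrated with a minimal sketch in plain Python. This is not ThingSpan's query API; the edge list, the edge type names and the function name navigate are all assumptions made up for the illustration. It shows the two ideas from the slide: filtering by edge type, and running a computation (accumulating transit time) during the traversal.

```python
from collections import deque

# Hypothetical typed edges: (source, edge_type, target, hours).
EDGES = [
    ("Depot", "ROAD", "Hub", 4.0),
    ("Hub", "RAIL", "Port", 6.0),
    ("Hub", "ROAD", "Port", 9.0),
    ("Port", "WATER", "Island", 12.0),
]

def navigate(start, edges, excluded_types=frozenset()):
    """Breadth-first traversal that skips excluded edge types and
    accumulates transit time (hours) to every vertex it reaches.

    The time recorded is for the first route discovered (fewest hops),
    which is not necessarily the fastest route.
    """
    adjacency = {}
    for src, etype, dst, hours in edges:
        if etype not in excluded_types:
            adjacency.setdefault(src, []).append((dst, hours))

    transit = {start: 0.0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt, hours in adjacency.get(node, []):
            if nxt not in transit:
                # The "method invoked during traversal": add up transit time.
                transit[nxt] = transit[node] + hours
                queue.append(nxt)
    return transit
```

Calling navigate("Depot", EDGES, {"RAIL"}) restricts the traversal to road and water edges, so the rail shortcut between Hub and Port is never followed.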
25. PATH-FINDING QUERY
[Diagram: CITY vertices connected by LINK edges with attributes Mode, Duration and Cost]
• Problem: Find the least expensive route between San Francisco and New York for a 60 ton, very wide load that must arrive by Saturday, while minimizing mode transitions (road/rail/water etc.)
• Implied: We can avoid Rail connections.
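As a rough sketch of the kind of query posed in this problem, the plain-Python example below enumerates routes over a small CITY/LINK graph, drops any that use a banned mode or miss the deadline, and then picks the cheapest, breaking ties on the number of mode transitions. The link data, the intermediate cities and the function names are assumptions for illustration only, and the brute-force enumeration is only practical for small graphs; it is not how ThingSpan evaluates path queries.

```python
# Hypothetical links: (city_a, city_b, mode, duration_hours, cost).
LINKS = [
    ("San Francisco", "Denver", "road", 30, 5000),
    ("San Francisco", "Denver", "rail", 40, 3000),
    ("Denver", "Chicago", "road", 20, 4000),
    ("Denver", "Chicago", "rail", 25, 2500),
    ("Chicago", "New York", "road", 15, 3500),
    ("Chicago", "New York", "water", 60, 2000),
]

def find_routes(links, start, goal, max_hours, banned_modes=frozenset()):
    """Enumerate simple paths from start to goal that avoid banned modes
    and finish within max_hours. Each result is
    (cost, mode_transitions, hours, [(city, mode), ...])."""
    adjacency = {}
    for a, b, mode, hours, cost in links:
        if mode in banned_modes:
            continue
        adjacency.setdefault(a, []).append((b, mode, hours, cost))
        adjacency.setdefault(b, []).append((a, mode, hours, cost))  # two-way links

    results = []

    def walk(city, visited, hops, hours, cost):
        if hours > max_hours:  # deadline already missed on this branch
            return
        if city == goal:
            modes = [m for _, m in hops]
            transitions = sum(1 for x, y in zip(modes, modes[1:]) if x != y)
            results.append((cost, transitions, hours, list(hops)))
            return
        for nxt, mode, h, c in adjacency.get(city, []):
            if nxt not in visited:
                visited.add(nxt)
                hops.append((nxt, mode))
                walk(nxt, visited, hops, hours + h, cost + c)
                hops.pop()
                visited.discard(nxt)

    walk(start, {start}, [], 0, 0)
    return results

def best_route(links, start, goal, max_hours, banned_modes=frozenset()):
    """Cheapest qualifying route; ties are broken by fewest mode transitions."""
    routes = find_routes(links, start, goal, max_hours, banned_modes)
    return min(routes, key=lambda r: (r[0], r[1])) if routes else None
```

For example, best_route(LINKS, "San Francisco", "New York", 100, {"rail"}) considers only road and water links and rejects the water leg because it would miss the 100-hour deadline.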
26. PATTERN FINDING EXAMPLES
• Financial: Money Laundering Detection
• Intelligence Analysis: Threat Detection
• AdTech: Recommendation Engine Support
• Industrial Internet of Things (IIoT): Network Congestion Analysis
27. FINANCIAL: MONEY LAUNDERING DETECTION
1. Load Person, Account and Transaction data into ThingSpan
[Diagram: people P1, P2 and P3; accounts Acc 1, Acc 2, Acc 20 to Acc 24, Acc 31 to Acc 33 and Acc 35; transactions marked with $]
28. FINANCIAL: APPLY SPARK GRAPHX
2. Identify people with more than 5 accounts (centrality)
[Diagram: the people, accounts and transactions from slide 27]
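The centrality step above amounts to counting the account edges attached to each person. A minimal plain-Python sketch follows; it stands in for the GraphX job on the slide rather than reproducing it, and the ownership pairs are assumptions, since the slide text does not say which person owns which account.

```python
from collections import Counter

# Hypothetical ownership edges (person, account); the real mapping is
# not recoverable from the slide text.
OWNS = [
    ("P1", "Acc 1"), ("P1", "Acc 2"),
    ("P2", "Acc 20"), ("P2", "Acc 21"), ("P2", "Acc 22"),
    ("P2", "Acc 23"), ("P2", "Acc 24"), ("P2", "Acc 35"),
    ("P3", "Acc 31"), ("P3", "Acc 32"), ("P3", "Acc 33"),
]

def people_with_many_accounts(owns_edges, threshold=5):
    """Return people who own more than `threshold` distinct accounts."""
    counts = Counter(person for person, _ in set(owns_edges))
    return sorted(p for p, n in counts.items() if n > threshold)
```

With this sample data only P2, who holds six accounts, exceeds the threshold of five.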
29. FINANCIAL: APPLY A NAVIGATIONAL QUERY
3. Look at all of that person's transactions to see if they terminate in just 1 or 2 offshore accounts
4. INVESTIGATE
[Diagram: the people, accounts and transactions from slide 27]
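Step 3 can be sketched as a traversal that follows transfers outward from one person's accounts and collects the accounts where the money stops. The example below is plain Python, not a ThingSpan query; the transfer edges, the offshore set and the function names are all assumptions for illustration.

```python
# Hypothetical transfers (from_account, to_account) and offshore set.
TRANSFERS = [
    ("Acc 20", "Acc 31"), ("Acc 21", "Acc 31"), ("Acc 24", "Acc 31"),
    ("Acc 22", "Acc 32"), ("Acc 23", "Acc 32"),
    ("Acc 31", "Acc 33"), ("Acc 32", "Acc 33"), ("Acc 35", "Acc 33"),
    ("Acc 1", "Acc 2"),
]
OFFSHORE = {"Acc 33"}

def terminal_accounts(start_accounts, transfers):
    """Follow outgoing transfers from the start accounts and return the
    accounts reached that have no onward transfers."""
    outgoing = {}
    for src, dst in transfers:
        outgoing.setdefault(src, set()).add(dst)

    seen = set(start_accounts)
    stack = list(start_accounts)
    terminals = set()
    while stack:
        account = stack.pop()
        onward = outgoing.get(account, set())
        if not onward:
            terminals.add(account)
        for nxt in onward:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return terminals

def is_suspicious(start_accounts, transfers, offshore, max_sinks=2):
    """True when all the money ends up in at most `max_sinks` accounts
    and every one of them is offshore."""
    sinks = terminal_accounts(start_accounts, transfers)
    return 0 < len(sinks) <= max_sinks and sinks <= offshore
```

A person flagged in step 2 would be passed in as the list of their accounts; a True result is a prompt to investigate, not proof of laundering.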
30. HUMINT: THREAT DETECTION
1. Load People, Calls, Places and Sightings into the Graph
[Diagram: people P1 to P3 and P5 to P18; call detail records CDR1 to CDR17; places PlaceX, PlaceY and PlaceZ; sightings Seen1 to Seen4]
31. HUMINT: APPLY SPARK GRAPHX
2. Use Spark GraphX to find "islands" of callers/callees.
[Diagram: the people and call detail records from slide 30, grouped into islands]
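Finding "islands" of callers and callees is a connected-components computation over the call graph. The sketch below uses a small union-find in plain Python as a stand-in for the Spark GraphX job; the call pairs are assumptions, since the slide text does not say who called whom.

```python
# Hypothetical (caller, callee) pairs derived from call detail records.
CALLS = [
    ("P1", "P2"), ("P2", "P3"),
    ("P7", "P8"), ("P8", "P11"), ("P11", "P14"), ("P14", "P15"),
    ("P5", "P6"),
]

def find_islands(calls):
    """Group people into connected components of the call graph.
    Returns a list of sorted member lists, largest island first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for caller, callee in calls:
        root_a, root_b = find(caller), find(callee)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for person in list(parent):
        groups.setdefault(find(person), set()).add(person)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))
```

Each returned list is one island: a set of people linked to each other, directly or indirectly, by calls.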
32. HUMINT: APPLY A NAVIGATIONAL QUERY
3. Use a navigational query to see if any of those People have been seen near Places that need to be protected.
[Diagram: the people, call detail records, places and sightings from slide 30]
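Step 3 joins the islands of callers with the sighting data: an island is of interest when any of its members has been seen near a protected place. A minimal plain-Python sketch follows, with the islands, sightings and protected set all assumed for illustration; it is not a ThingSpan query.

```python
# Hypothetical inputs: islands of callers, (person, place) sightings and
# the set of places that need protecting.
ISLANDS = [
    ["P11", "P14", "P15", "P7", "P8"],
    ["P1", "P2", "P3"],
    ["P5", "P6"],
]
SIGHTINGS = [("P14", "PlaceX"), ("P15", "PlaceX"), ("P2", "PlaceY"), ("P5", "PlaceZ")]
PROTECTED = {"PlaceX"}

def islands_near_protected(islands, sightings, protected_places):
    """For each island with a member seen near a protected place, report
    who was seen and the full island to watch."""
    flagged = []
    for island in islands:
        members = set(island)
        seen = sorted({person for person, place in sightings
                       if person in members and place in protected_places})
        if seen:
            flagged.append({"seen": seen, "watch": sorted(members)})
    return flagged
```

The "watch" list is wider than the "seen" list because everyone in the same island is connected to the sighted people through calls.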
33. HUMINT: PLAN ACTION
4. P14 and P15 have been seen near potential target PlaceX, so they plus P11, P7 and P8 should be put under surveillance.
[Diagram: the people, call detail records, places and sightings from slide 30]
34. ADTECH: PRE-PLANNED ADS
1. Load Products, Orders, People and Social_Links into ThingSpan.
[Diagram: people Joe, Fred, Mary, Jane and Bill; products Pr1 to Pr6; orders Sale1 to Sale5; three Follows links]
35. ADTECH: PRE-PLANNED ADS
2. We want to place ads for Product Pr2
[Diagram: the people, products, orders and Follows links from slide 34]
36. ADTECH: WHO FOLLOWS BUYERS OF THE PRODUCT?
3. Use ThingSpan to find bloggers who bought Pr2 and who also have followers.
Result: Fred bought Pr2. Mary follows Fred's blogs. Jane & Bill follow Mary's.
[Diagram: the people, products, orders and Follows links from slide 34]
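The query in step 3 can be sketched as two hops over the graph: from the product to its buyers, then outward along Follows links. The plain-Python example below uses the one purchase and the three Follows links stated in the Result line; the other purchases and the function name are assumptions, and this is not ThingSpan's API.

```python
# (buyer, product). Only Fred's purchase of Pr2 comes from the slide;
# the other rows are hypothetical.
PURCHASES = [
    ("Joe", "Pr1"), ("Fred", "Pr2"), ("Mary", "Pr4"),
    ("Jane", "Pr5"), ("Bill", "Pr6"),
]
# (follower, followed), as stated in the slide's Result line.
FOLLOWS = [("Mary", "Fred"), ("Jane", "Mary"), ("Bill", "Mary")]

def ad_targets(product, purchases, follows):
    """People who follow a buyer of `product`, directly or through a
    chain of Follows links, excluding the buyers themselves."""
    buyers = {person for person, prod in purchases if prod == product}
    followers_of = {}
    for follower, followed in follows:
        followers_of.setdefault(followed, set()).add(follower)

    targets = set()
    frontier = [b for b in buyers if b in followers_of]
    while frontier:
        person = frontier.pop()
        for follower in followers_of.get(person, ()):
            if follower not in targets and follower not in buyers:
                targets.add(follower)
                frontier.append(follower)
    return sorted(targets)
```

For Pr2 this yields Mary, who follows Fred, plus Jane and Bill, who follow Mary.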
37. ADTECH: DISPLAY THE AD
4. Next time you spot Mary, Jane or Bill, display a personalized Ad for Pr2.
[Diagram: the people, products, orders and Follows links from slide 34, with an ad graphic]
38. IIOT: TELCO NETWORK CONGESTION
1. Load Location, Equipment, Link (+Load) into the graph
[Diagram: locations L1 to L4, labeled San Jose, Salt Lake City, Chicago and New York; equipment E1 to E3, E20 to E22, E30 to E33 and E40; Link 1 to Link 9 with loads from 20% to 95% and one link marked Off]
39. IIOT: APPLY SPARK SQL
2. Use Spark SQL to find links that are over 90% loaded.
[Diagram: the locations, equipment and links from slide 38]
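The filter in step 2 is a one-line SQL predicate. The sketch below runs it with Python's built-in sqlite3 module as a stand-in for Spark SQL, so that it is runnable without a cluster. The table and column names are assumptions, and so is the assignment of loads to links, apart from Link 6 being the overloaded link and Link 5 being switched off, which the later slides state.

```python
import sqlite3

# Hypothetical (link, load %) rows; None marks the link that is off.
LINK_LOADS = [
    ("Link 1", 20), ("Link 2", 20), ("Link 3", 65),
    ("Link 4", 20), ("Link 5", None), ("Link 6", 95),
    ("Link 7", 50), ("Link 8", 30), ("Link 9", 25),
]

def overloaded_links(rows, threshold=90):
    """Return the names of links whose load exceeds `threshold` percent."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE link (name TEXT PRIMARY KEY, load_pct REAL)")
        conn.executemany("INSERT INTO link VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT name FROM link WHERE load_pct > ? ORDER BY name",
            (threshold,),
        )
        # NULL loads (links that are off) never satisfy the comparison.
        return [name for (name,) in cursor]
    finally:
        conn.close()
```

The same SELECT statement could be issued against a DataFrame registered as a table in Spark SQL.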
40. IIOT: APPLY A THINGSPAN NAVIGATIONAL QUERY
3. Use a graph query to find the leaf nodes (branch ends)... Then Investigate...
[Diagram: the locations, equipment and links from slide 38]
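Finding the leaf nodes in step 3 means walking down from each end of the overloaded link until no further equipment hangs off a node. The plain-Python sketch below assumes a simple tree of locations and equipment; the topology and the function name are made up for illustration, because the slide text does not preserve how the equipment is wired.

```python
# Hypothetical hierarchy: each node maps to the equipment attached below
# it. Assumed to be a tree (no cycles).
CHILDREN = {
    "L1": ["E1"],
    "E1": ["E2", "E3"],
    "L4": ["E40"],
}
# Hypothetical endpoints of the overloaded link.
LINK_ENDPOINTS = {"Link 6": ("L1", "L4")}

def leaf_nodes(root, children):
    """Depth-first walk from `root`, returning the nodes that have
    nothing attached below them (the branch ends)."""
    leaves = []
    stack = [root]
    while stack:
        node = stack.pop()
        attached = children.get(node, [])
        if attached:
            stack.extend(reversed(attached))  # keep left-to-right order
        else:
            leaves.append(node)
    return leaves

def suspects_for(link, endpoints, children):
    """Leaf equipment at both ends of a link, keyed by endpoint."""
    return {end: leaf_nodes(end, children) for end in endpoints[link]}
```

The leaves returned for each endpoint are the candidates to investigate as sources or sinks of the traffic on that link.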
41. IIOT: DIAGNOSE
4. Aha! E2 and E3 in San Jose are streaming 8K UHDTV video movies from MovieFlix in New York, overloading Link 6.
[Diagram: the locations, equipment and links from slide 38]
42. IIOT: FIX
5. Solved - by switching on Link 5.
[Diagram: the network from slide 38 with Link 5 switched on; the 95% load reading has dropped to 50% and a new 45% reading appears]
43. SUMMARY
• Open Source Big & Fast Data analytics tools are great at what they're designed for.
• ThingSpan adds a Metadata Store and scalable graph analytics
• Ultra-fast navigation and pathfinding queries.
• It can interoperate with streaming systems and Big Data platforms
• ThingSpan is extensible to other open source systems