
Big Data Modeling and Analytic Patterns – Beyond Schema on Read


In this Strata+Hadoop World 2015 presentation, Ron Bodkin, President of Think Big, a Teradata company, explains how data modeling changes on big data systems and describes five important new analytic patterns that are becoming more commonplace as companies grow their data-driven capabilities.


  1. Big Data Patterns
     Ron Bodkin, Founder and President, Think Big
  2. About Me
     Ron Bodkin, Founder and President, Think Big
     I have 9 years’ experience working with Big Data and Hadoop. In 2010, I founded Think Big to help companies realize measurable value from Big Data. Our expertise spans all facets of data science and data engineering and helps our customers drive maximum value from their Big Data initiatives. The patterns in this talk come from large-scale deployments in high-tech manufacturing and digital marketing. Follow me at @ronbodkin
  3. Agenda
     •  Context
     •  Patterns
     •  Conclusions
     10/4/15 © 2015 Think Big, a Teradata Company
  4. Big Data: The Key is Variety
     Definition: Datasets so complex and large that they are awkward to work with using standard tools and techniques
     Data types shown: Location, Social, Images, Weblogs, Videos, Text, Audio, Sensor
     Size is not what is most important: it’s variety
  5. How is Information Management Changing?
     •  Schema on Read?
        –  Yes… as step one
        –  But data still has underlying structure
        –  It’s more like agile modeling: reflect as much structure as needed
     •  Loosely coupled schemas come without platform guarantees but enable more application flexibility
     •  Data Modeling isn’t dead!
     •  Metadata is more important than ever
     •  Data warehouses are embracing Big Data principles (e.g., elasticity, JSON…)
  6. Changes in the Platform
     •  Entry Level Hadoop cluster circa 2015 (20 nodes)
        –  240 cores
        –  1 PB spinning disk
        –  10 TB RAM
        –  10-40 GbE
        –  Low software cost
     •  Disk transfer times increasing => many disks => DAS (2005-2020)
     •  Distributed RAM is increasingly important to expedite computation, although data volumes are increasing faster
     •  The network will be the computer (really!) => you can distribute disks separately across high-bandwidth fabrics (2020+)
     •  Changes many assumptions in traditional physical modeling
  7. Changes in Logical Modeling
     •  JSON-like structures
        –  Complex collections of relations, arrays, maps of items
     •  Graphs
        –  Storing complex relationships that change dynamically rather than staying static
     •  Binary/CLOB/specialized data
        –  Ability to execute specialized programs to interpret and process it
  8. Changes in Physical Modeling
     •  Big Data “unpacks” the database metaphor
        –  Data distribution: key design, sharding/distribution, file formats
        –  Multiple computational algorithms, e.g., MapReduce, Computational Graph (Spark, Tez), data flow, streaming, graph engines
        –  Integrity is an application concern
     •  Storage is cheap
        –  Denormalization and materialized views common
     •  Yet compression is popular, often for I/O savings
     •  Summarization is orders of magnitude more powerful
     •  Index lookups are increasingly costly
     •  Distributed systems impose eventual consistency, reconciliation demands
  9. Leading Financial Asset Manager (Financial Services)
     Challenge
     •  Siloed consumer analytics
     •  Lack of agility in analysis
     •  Slow ETL
     Solution
     •  Scalable ETL
     •  Discovery analytics tech & process
     •  Cross-channel data science models
     •  Cloudera Enterprise, HBase, Greenplum
     Results
     •  Scalable processing
     •  Extracted customer behavior signals from raw data for existing and new behavior models
     •  Faster time to insight
  10. Leading Enterprise Tech Component Vendor (High Tech Manufacturing)
     Challenge
     •  Data search parties waste engineers’ time
     •  Excess scrap waste, slow time to market
     •  Reactive analytics model
     Solution
     •  Scalable data lake
     •  Search and deep analytic queries
     •  Integrated assembly insights for data science models
     •  Hive, Impala, Redshift, Elasticsearch
     •  Big data training and “hackathons”
     Results
     •  Supply chain “line of sight” from R&D, manufacturing, to servicing at customer sites
     •  “End-to-end” proactive analytics: reduced development time, improved manufacturing yield, increased customer satisfaction
     •  Proactive analytics at scale led to better engineering theory
  11. Patterns
  12. Important New Patterns
     •  Denormalized Fact
     •  Profile
     •  Event History
     •  Timeline
     •  Assembly
     •  Distributed Sources
     •  Late Data
     •  Deep Aggregates
     •  Recovery
     •  Multiple Active Clusters
  13. Event History

     Event id   Actor id   Time               Event cols   Dim ids   Dim cols   Ext. Data
     123        uid1       1/1/15 13:16:11    …            …         …          { “TstA” : 1 …}
     456        uid2       1/1/15 13:16:14    …            …         …          { “TstB” : 1 …}

     •  Fact table about common events to allow, e.g., cross-channel analytics in context
        –  E.g., clickstream, posts, purchases, content consumption, device activity
     •  Stored in columnar format (e.g., Parquet, ORCFile)
     •  Joined with the “as was” values of slowly changing dimensions (the values in effect at event time)
     •  Often has an “extension” column of unparsed/not modeled JSON-like data
     •  Partitioned by event time buckets, perhaps also by other dimension(s)
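The last two bullets of the Event History slide (an unparsed extension column, and partitioning by event-time bucket) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the ISO timestamp format, the hourly bucket size, and the helper names `hour_bucket`, `partition_events` and `ext_field` are all hypothetical and are not taken from the talk.

```python
import json
from collections import defaultdict
from datetime import datetime

# Hypothetical rows mirroring the slide's table. The slide's "1/1/15" dates
# are written here as ISO timestamps, assuming they mean 1 January 2015.
events = [
    {"event_id": "123", "actor_id": "uid1", "time": "2015-01-01T13:16:11", "ext": '{"TstA": 1}'},
    {"event_id": "456", "actor_id": "uid2", "time": "2015-01-01T13:16:14", "ext": '{"TstB": 1}'},
]


def hour_bucket(timestamp: str) -> str:
    """Map an event time to an hourly partition key (bucket size is an assumption)."""
    return datetime.fromisoformat(timestamp).strftime("%Y-%m-%d-%H")


def partition_events(rows):
    """Group rows by event-time bucket, as a stand-in for partitioned storage."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hour_bucket(row["time"])].append(row)
    return dict(partitions)


def ext_field(row, key, default=None):
    """Read one field from the unparsed JSON extension column at query time."""
    return json.loads(row["ext"]).get(key, default)
```

Parsing the extension column only when it is read keeps the modeled columns stable while still letting new, unmodeled attributes flow through.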
  14. Timeline

     Actor id   Segments    Ev1:id   Ev1:fact           Ev1:dims   Ev1:ext           Ev2:id   …
     uid1       [1, 3, 7]   123      1/1/15 13:16:11    …          { “TstA” : 1 …}   789      …
     uid2       [2, 3]      456      1/1/15 13:16:14    …          { “TstB” : 1 …}   0ab      …

     •  Pivot on event history: table of actors with events over time
        –  Customer journey, device history
        –  Enable support/analysis on specific items, long-lived analysis
     •  May have hierarchy of actors (e.g., household, individual, device)
     •  May be array of events, many columns or subsorted (cluster key)
     •  Also stored in columnar format, may be partitioned
     •  May be updated in near real-time AND batch
     •  Often holds cached algorithm values (combined Profile)
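The “pivot on event history” idea behind the Timeline can be sketched as a group-by on actor with events ordered by time. This is a minimal in-memory illustration rather than the storage layout from the talk. The field names and the function name `build_timelines` are hypothetical, and the timestamp for event 789 is invented for the example because the slide elides it.

```python
from collections import defaultdict

# Hypothetical event-history rows; ISO timestamps sort correctly as strings.
event_history = [
    {"event_id": "789", "actor_id": "uid1", "time": "2015-01-02T09:00:00"},
    {"event_id": "123", "actor_id": "uid1", "time": "2015-01-01T13:16:11"},
    {"event_id": "456", "actor_id": "uid2", "time": "2015-01-01T13:16:14"},
]


def build_timelines(events):
    """Pivot event rows into one time-ordered list of events per actor."""
    timelines = defaultdict(list)
    for event in sorted(events, key=lambda e: e["time"]):
        timelines[event["actor_id"]].append(event)
    return dict(timelines)
```

Here the timeline is an array of events per actor, which is one of the layouts the slide mentions; a wide table with many columns or a cluster-key sort would serve the same purpose.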
  15. Event Analytics
     •  Propensity/segmentation
        –  May be scored in real-time using Timeline/Profile
        –  May be hybrid scored batch using Event History
        –  Trained from timeline
     •  Attribution
        –  Score impact of past events on new event (e.g., purchase, churn)
        –  Algorithms range from simple rules to Shapley value
        –  Natural in timeline
     •  Reporting, exploration
        –  Often via Deep Aggregates, using HyperLogLog
     •  Discovery
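As an illustration of the Shapley-value end of the attribution spectrum, the sketch below computes exact Shapley values for a small set of channels. The channel names and the conversion rates in `conversion_rate` are invented for the example, and exact enumeration like this is only practical for a handful of channels.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    n = len(players)
    result = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for size in range(n):  # coalition sizes 0 .. n-1, drawn from the others
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of `player` when joining this coalition.
                total += weight * (value(coalition | {player}) - value(coalition))
        result[player] = total
    return result


# Hypothetical conversion rates observed for each combination of channels.
conversion_rate = {
    frozenset(): 0.0,
    frozenset({"email"}): 0.1,
    frozenset({"ad"}): 0.2,
    frozenset({"email", "ad"}): 0.4,
}

credit = shapley_values(["email", "ad"], conversion_rate.__getitem__)
```

The credits sum to the value of the full coalition, which is the property that makes Shapley values attractive for splitting credit across touchpoints.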
  16. Event Data Management
     •  Identity merge
        –  Discovery of new identities (e.g., cookie logs in, Facebook connect)
        –  Indirection or rewrites
        –  Requires rescoring
     •  Expiration/archival
        –  Efficiency, policy requirements
     •  Governance
        –  Lineage & security
  17. Network
     •  Ongoing status of configuration
        –  Parts in assembly
        –  Related items (versions)
        –  Social groups
     •  Can be people, devices etc.
     •  Maintain links in graph structure
        –  May be current or historical
     •  Use links to pull full context from Event History or Timeline
     •  Search -> simple query -> complex analytics
        –  E.g., transitive closure, impact analysis
     •  Technologies
        –  Giraph, GraphX
        –  TitanDB, Neo4j
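The transitive-closure style of impact analysis mentioned on the Network slide can be sketched as a breadth-first walk over the link structure. This is a plain-Python illustration and not how Giraph, GraphX, TitanDB or Neo4j expose it. The part and assembly names and the function name `impacted` are hypothetical.

```python
from collections import deque

# Hypothetical "is used in" links: component -> assemblies that contain it.
used_in = {
    "chip_A": ["board_1", "board_2"],
    "board_1": ["server_X"],
    "board_2": ["server_X", "server_Y"],
}


def impacted(links, start):
    """Return every item reachable from `start` by following links transitively."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor != start and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

The resulting set of ids is what one would then use to pull full context from Event History or Timeline.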
  18. Distributed Sources
     •  Unlike simple “all or nothing” feeds…
     •  May have many distributed sources feeding data
     •  It’s critical to know whether all (or enough) data has arrived
     •  Goals
        –  Only produce analytic results when sufficient
        –  Provide provenance: timeliness & completeness statistics
     •  Need
        –  SLAs about timeliness and required fraction of data
        –  Control totals
        –  Metadata about process (expected lineage)
        –  Heartbeats/configuration
     •  Root cause of complexity of ingestion
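A completeness check built on control totals and a required fraction of data might look like the sketch below. The source names, counts, threshold values and the function name `completeness` are all assumptions made for illustration; a real pipeline would also track timeliness and heartbeats.

```python
def completeness(expected, received, required_fraction):
    """Compare received record counts against control totals per source.

    expected: source -> expected record count (the control total)
    received: source -> record count that has actually arrived
    """
    total_expected = sum(expected.values())
    # Cap each source at its control total so over-delivery cannot mask gaps.
    total_received = sum(min(received.get(src, 0), count) for src, count in expected.items())
    fraction = total_received / total_expected if total_expected else 0.0
    incomplete = sorted(src for src, count in expected.items() if received.get(src, 0) < count)
    return {
        "fraction": fraction,
        "incomplete_sources": incomplete,
        "sufficient": fraction >= required_fraction,
    }


# Hypothetical control totals and arrivals for three sources.
expected_counts = {"plant_a": 100, "plant_b": 100, "plant_c": 50}
received_counts = {"plant_a": 100, "plant_b": 90}
status = completeness(expected_counts, received_counts, required_fraction=0.75)
```

Returning the fraction and the list of incomplete sources alongside the pass/fail flag gives downstream consumers the completeness statistics the slide asks for.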
  19. Late Data
     •  Data may be delayed due to
        –  Upstream system failures (server down, esp. with unreliable delivery; network outage)
        –  Offline/disconnected devices (endemic with mobile & IoT)
     •  Metadata to track lineage is critical
     •  Define a delay time after which, with high confidence, sufficient data has arrived
     •  Process “authoritative” derived data after that time
        –  May process incremental/incomplete data earlier (a la economic statistics)
        –  May re-process in emergency (restatement)
        –  May include changed data in later period
     •  Report on how much data has arrived late
     •  Implementation: bucket on event time, secondary on delay epoch (partitions for late data)
     Figure: Zipfian Distribution
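The implementation hint on the Late Data slide (bucket on event time, with a secondary partition on delay epoch) can be sketched as follows. The daily bucket, the six-hour epoch length, and the function names `late_partition` and `late_fraction` are assumptions chosen for the example; the talk does not specify them.

```python
from datetime import datetime, timedelta


def late_partition(event_time, arrival_time, period=timedelta(hours=6)):
    """Return (event-time bucket, delay epoch) for one record.

    Epoch 0 means the record arrived within one `period` of its event time;
    higher epochs are later arrivals. Negative delays (clock skew) map to 0.
    """
    bucket = event_time.strftime("%Y-%m-%d")
    delay = arrival_time - event_time
    epoch = max(0, delay // period)  # timedelta // timedelta yields an int
    return bucket, epoch


def late_fraction(records, period=timedelta(hours=6)):
    """Fraction of (event_time, arrival_time) pairs that landed in a late epoch."""
    if not records:
        return 0.0
    late = sum(1 for ev, arr in records if late_partition(ev, arr, period)[1] > 0)
    return late / len(records)


# Hypothetical arrivals: one on time, one thirteen hours late.
arrivals = [
    (datetime(2015, 1, 1, 13, 0), datetime(2015, 1, 1, 14, 0)),
    (datetime(2015, 1, 1, 13, 0), datetime(2015, 1, 2, 2, 0)),
]
```

Keeping late records in their own epoch partitions lets the authoritative run read epoch 0 only, while a restatement can pick up the later epochs without rewriting the original data.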
  20. Conclusions
  21. Probabilistic Data Structures
     •  Increasingly valuable as an optimization technique, e.g.,
     •  Bloom filters
        –  Key values are hashed into an array
        –  Check a key to see if it may be present
        –  Indexing/filtering sparse reads
     •  HyperLogLog, sketch sets
        –  Multiple hashes used to estimate count of unique items
        –  Far more compact (KBs to count billions of items to within +/- 2%)
        –  Can be composed (unlike exact unique counts), e.g., across time, categories
     •  MinHash
        –  Least hashed value in common between two sets
        –  Used to identify duplicates, estimate overlap in arbitrary sets
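Two of the structures above, the Bloom filter and MinHash, are small enough to sketch directly. This is a teaching sketch under stated assumptions, not a production implementation: the sizes (1024 bits, 3 hashes, 64 MinHash functions), the use of salted SHA-256 as the hash family, and all names are choices made for the example. HyperLogLog is not shown.

```python
import hashlib


def _hash(value: str, seed: int) -> int:
    """Deterministic 64-bit hash of `value`, salted by `seed`."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        return [_hash(key, seed) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))


def minhash_signature(items, num_hashes=64):
    """For each hash function, keep the smallest hash value over the set."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching minimums estimates the Jaccard overlap."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

A typical use is to consult the Bloom filter before an expensive read and skip the read when `might_contain` returns False, and to compare MinHash signatures instead of full sets when looking for near-duplicates.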
  22. Anti-Patterns
     •  3rd Normal Form, Star Schema, Snowflake Schema
     •  Index lookups slow in general
        –  Focus on partitioned reads, not disk seeks
     •  Poor results in practice
     •  Not natural representations for repeating events, nested structure
     •  Use of SSD, maturing optimizers, platform updates (Kudu?) are slowly improving this… the industry would love this to happen
     •  Expect data marts to work in Big Data before data warehouses do
  23. Conclusions
     •  Much of Big Data today is trade-craft
        –  Learned lore & derived from first principles
     •  As we scale data lakes & analytics, critical to have common vocabulary, shared understandings
     •  I’d love your input on common patterns & practices
     •  Look for blogs with more depth on each pattern at http://thinkbig.teradata.com/author/rbodkin/
     •  Reach me at @ronbodkin, ron.bodkin@thinkbiganalytics.com