Unraveling the Mystery of Big Data
Lionel Silberman
May 22, 2014
Copyleft – Share Alike
http://www.bigstockphoto.com/image-12115340/stock-photo-binary-stream
Who Am I?
Lionel Silberman, currently the Senior Data Architect at Compuware
• 30 years in Software Development
• Statistical modeling, DBMS, Big Data, Data Architecture, tech/product/management.
• Diverse, deep data management:
– All of the major RDBMS vendors and internals.
– Data modeling and data parallelism techniques.
– OLAP, OLTP, MPP, and NoSQL systems such as Hadoop and Cassandra.
– Scaling and Performance tuning of distributed and federated applications.
• Current interest is integrating products in the enterprise at the data level
that deliver more value than their individual pieces.
• Active interest in big data metadata privacy issues.
• Who are you? What’s your interest in this talk tonight?
Unraveling the Mystery of Big Data Agenda
• What is Big Data?
 Business Value
 Technical Definitions
 Sizes and Applications
• What Big Data is Not (or why isn’t everything just “data”)?
• Architectural Underpinnings
• Some Useful Architectural Distinctions
• Technology stacks and ecosystems
• Data Modeling Example
• Gotchas - 12 things to watch out for
• References and more info
• Questions?
Lionel Silberman
What is Big Data?
Business Value
Enabling new products
• Sensors everywhere!
• Nowcasting.
• Ever-narrower segmentation of customers
Analytics - taking data from input through to decision
• Correlation in real time
• New insights from previously hidden data:
• Social
• Geographical data
• Recommendations.
• Finding needles in haystacks.
• In 2010, the industry was worth more than $100 billion,
• growing at almost 10 percent a year,
• or about twice as fast as the software business as a whole.
What is Big Data?
A Technical Definition
• Data that exceeds the processing capacity of conventional database
systems in volume, velocity or variety*
or the 3Vs!
Volume - Sheer size and growth.
Velocity - how fast it moves.
Variety - structure that cannot easily be derived, or that changes frequently.
* META Group (now Gartner) analyst Doug Laney
What is Big Volume?
• 1970s: Megabytes
• Now: Many organizations are approaching an Exabyte
• Examples:
• Google – capacity for 15 Exabytes
• NSA – capacity for a Yottabyte in Utah
• AWS – 1 Trillion objects in 2012
• Facebook – 500 Terabytes/day
• Scientific Pursuits:
• Large Hadron Collider at CERN - last year 30 Petabytes
• The NASA Center for Climate Simulation - 32 petabytes of climate observations
and simulations on the Discover supercomputing cluster.
• Sloan Digital Sky Survey (SDSS) - 140 terabytes, at about 200 GB per night.
The Large Synoptic Survey Telescope is anticipated to acquire 140 TB every five days.
• eBay.com - 90 Petabytes in data warehouses
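The figures on this slide span many orders of magnitude, so it can help to put them on a common scale. A minimal sketch, assuming decimal (SI) prefixes, which the slide does not specify; the helper names are illustrative only:

```python
# Decimal (SI) byte units. This is an assumption: the slide does not say
# whether decimal or binary prefixes are meant.
UNITS = {
    "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "PB": 10**15, "EB": 10**18, "ZB": 10**21, "YB": 10**24,
}

def to_bytes(value, unit):
    """Convert a value with a unit suffix to a byte count."""
    return value * UNITS[unit]

def days_to_accumulate(total, total_unit, rate, rate_unit):
    """Days needed to collect `total` at a steady `rate` per day."""
    return to_bytes(total, total_unit) / to_bytes(rate, rate_unit)

# SDSS: 140 TB collected at about 200 GB per night.
sdss_nights = days_to_accumulate(140, "TB", 200, "GB")

# Facebook: days to reach one exabyte at 500 TB per day.
facebook_days_to_exabyte = days_to_accumulate(1, "EB", 500, "TB")

print(sdss_nights)               # 700.0
print(facebook_days_to_exabyte)  # 2000.0
```

At these rates the whole SDSS archive corresponds to roughly 700 nights of observing, while a single exabyte takes about 2000 days at 500 TB per day.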
What is Big Velocity?
• Financial Trading volume
• Retail – Cyber Monday
(every click and interaction: not just the final sales).
• Government – Affordable Care Act
• Smartphone - geolocated imagery and audio data
• Fraud (complex event processing)
o Credit Card Traffic patterns
o Phone Slam/Crams
• Streaming – Netflix, Snapfish
• Retail - Walmart handles more than 1 million customer
transactions per hour.
• Compuware APM (my firm) 25K transactions per second.
• MMOG – Massively Multiplayer Online Games
http://www.livestream.com/ibmpartnerworld/video?clipId=flv_052b14ea-7d5a-40f5-9e61-e287b0ce5d9c
What is Big Variety?
• Diverse sources and destinations:
o Document Backup or Archival – HP, EMC, AWS
o Pictures and Video – Facebook 50 Billion Photos
o Sensor sources – GE, NetApp
o Multi-device - Dropbox and Sugarsync
• Big Data is messy:
o structure aids meaning, but can change frequently.
o multiple sources (e.g. financial feeds, browser incompatibilities)
o Application integration issues (e.g. Fitbit)
o Entity resolution issues (e.g. Portland, dog)
o Visualization increasingly important.
Or a 4th or 5th V?
Big Data and Visualization - Wikipedia
http://infodisiac.com/Wikimedia/Visualizations/
What Big Data is Not (or why isn’t everything just “data”)?
• Traditional systems may not need it:
o Payroll
o Human Resources
o Shop machine sampling?
• Some tradeoffs required for the technology of Big Data:
o Bleeding edge vs. established technology
o Subtle definition of consistency
o Complexity
o New and hard-to-find skills
• Make sure the business case warrants and can tolerate the
tradeoffs…
Big Data is Everywhere?
Architectural Underpinnings: CAP Theorem
• Consistency (C): a single up-to-date copy of the data.
• High Availability (A): of data for writes.
• Partition Tolerance (P): the system continues to operate despite arbitrary
message loss or failure of parts of the system.
[Figure: CAP diagram with the labels NoSQL, DBMS, and X]
NoSQL and Eventual Consistency
• Relaxed or weaker consistency to achieve High Availability and
Partition Tolerance.
• Eventual Consistency:
– An unbounded delay in propagating changes across partitions.
– No ordering guarantees at all, and thus a weaker level of transaction
atomicity.
From the system's perspective, this means an operational system is
ALWAYS in an inconsistent state.
• Many NoSQL systems (e.g. Cassandra, Hadoop):
– support self-healing or restartability.
– allow ease of scale and Disaster Recovery
– are schema-free.
– No standard way of retrieving data (e.g. SQL).
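The eventual-consistency behavior described above can be sketched with two in-memory replicas. This is a toy model, not any real system's API: the `Replica` class, the `anti_entropy` function, and the last-write-wins rule based on caller-supplied timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    """A hypothetical replica that accepts writes without coordination."""
    name: str
    store: dict = field(default_factory=dict)  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Accept the write locally; keep it only if it is newer than
        # what this replica already holds (last-write-wins).
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

def anti_entropy(replicas):
    """Merge all replicas so each ends up with the newest version of every key."""
    merged = {}
    for replica in replicas:
        for key, (ts, value) in replica.store.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    for replica in replicas:
        replica.store = dict(merged)

a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], timestamp=1)         # lands on replica a
b.write("cart", ["book", "pen"], timestamp=2)  # lands on replica b

before = (a.read("cart"), b.read("cart"))  # replicas disagree
anti_entropy([a, b])                       # changes propagate later
after = (a.read("cart"), b.read("cart"))   # replicas agree again
print(before, after)
```

Between the two writes and the merge, a reader sees different answers depending on which replica it reaches; the application layer has to tolerate that window.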
Business and Infrastructure Architecture Decisions
Infrastructure:
• Amazon Web Services (AWS)
• Storage
• Elasticity
• Availability
• Data division:
• Parallelism (sharding)
• Redundancy
• Application Servers
• Data affinity
• Stateless protocols
Business Issues:
• Research to Production Pipeline?
• 3rd party integration needs?
• Flexibility?
• Radical Transparency?
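The "Parallelism (sharding)" and "Redundancy" items under Data division can be illustrated with hash-based placement. A minimal sketch, assuming a fixed number of shards and a simple "next N shards" replica rule; the function names and the key format are hypothetical, and production systems typically use more elaborate schemes such as consistent hashing.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def replicas_for(key: str, num_shards: int, replication_factor: int) -> list:
    """Place copies on consecutive shards, starting at the key's home shard.

    Assumes replication_factor <= num_shards so that all copies land on
    distinct shards.
    """
    first = shard_for(key, num_shards)
    return [(first + i) % num_shards for i in range(replication_factor)]

home = shard_for("user:42", 4)
copies = replicas_for("user:42", 4, 3)
print(home, copies)
```

Because the placement depends only on the key, any stateless application server can compute where a record lives without consulting a central directory.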
Data Architecture Decisions
Data In-transit:
• high write vs. high reads
• Encryption and security
• Distributed vs. centralized
• Visualization tools
Data Stores:
• Documents
• key/value pairs
• Graphs
• In-memory
Technology Stacks (Data Stores)
• Hadoop: Distributed File System, Job Scheduler, MapReduce programming model
– Pros: fault-tolerant, disaster protection, data parallelism fits many applications, ecosystem.
– Best used: long-term data storage, research, basis of other flexible data stores.
• Cassandra: Big Table, key-value
– Pros: Fast writes, no single-point of failure, fault-tolerant, disaster protection, columns and column families.
– Best used: when you write more than you read, Financial industry, real-time data analysis.
• Riak: key-value
– Pros: Cassandra-like, less complex, single-site scalability, availability & fault-tolerance
– Best used: Point-of-sales, factory control systems, high writes.
• Redis: in-memory, key-value
– Pros: fast, transactional, expiring values.
– Best used: Rapidly changing data in memory (stock prices, analytics, real-time data collection).
• Dynamo: key-value
– Pros: Fast reads and writes, no single-point of failure, fault-tolerant, disaster protection, eventually consistent
– Best used: Always available (e.g. Amazon).
• CouchDB: Documents
– Pros: bi-directional replication, conflict detection, previous versions
– Best used: Accumulating occasionally changing data, pre-defined queries, versioning
• MongoDB: document store
– Pros: update-in-place, defined index, built-in sharding, geospatial indexing.
– Best used: Dynamic queries, schema-less, data that changes a lot.
• HBase: Big Table
– Pros: huge datasets, map-reduce Hadoop/HDFS stack.
– Best used: Analyzing log data
• Memcached/Membase: in-memory and multi-node.
– Pros: low-latency, high-concurrency and availability.
– Best used: online gaming (e.g. Zynga).
• Neo4j: GraphDB
– Pros: highly scalable, robust, ACID
– Best used: social, routing, recommendation questions (e.g. How do I get to Linz?)
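The MapReduce programming model listed under Hadoop can be sketched in a single process. This is an illustration of the model only, not Hadoop's API: the function names are made up for this example, and a real cluster would run the map and reduce steps in parallel across partitions of the data.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data is messy"])
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'messy': 1}
```

Each document can be mapped independently and each key reduced independently, which is why the model fits data-parallel workloads.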
Related Data Management Ecosystems
• Apache Hadoop stack (Cloudera)
– MapReduce on HDFS
– Pig, Hive, HBase – SQL-like DB interfaces on top of HDFS.
– Flume, RabbitMQ – Data conduits and message queues.
– Splunk – Operational Analytics and Log Processing.
– Sqoop – Bulk Data transfer to DBs
– Puppet, Chef – Configuration Management and DevOps Orchestration.
– Visualization, BI and ETL - Informatica, Talend, Pentaho, Tableau
• Cloud computing infrastructure (Amazon Web Services)
e.g. EC2, Elastic MapReduce, RDS
• Cassandra (Datastax)
• High-scale, distributed and hybrid RDBMS:
- Teradata
- Netezza
- EMC/Greenplum
- Aster Data
- Vertica
- VoltDB
- RDF Triple Stores
- Hadapt
Data Modeling Example: Twitter Publishers and Subscribers
• Relational DB: One table that has people relationships and tags
whether a publisher or subscriber.
Pros: No duplicated data, ACID transaction
Cons: Does not scale, SPOF
• NoSQL: Separate Indices for Subscribers and Publishers.
Pros:
– Partition Independence enables scale-out and no single point of failure for
both reads and writes.
– No Schema allows quick development.
Cons:
– Eventual consistency requires care in application layer and presentation of
user experience, and future evolution.
– Redundant storage.
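The two modeling options above can be contrasted in a few lines. A minimal sketch using plain Python structures in place of real databases; the user names and helper functions are hypothetical.

```python
from collections import defaultdict

# Relational style: one table of relationships, each fact stored once.
follows = [  # rows of (subscriber, publisher)
    ("alice", "bob"),
    ("carol", "bob"),
    ("alice", "dave"),
]

def subscribers_relational(publisher):
    # Answered by scanning (or indexing) the single shared table.
    return sorted(s for s, p in follows if p == publisher)

def publishers_relational(subscriber):
    return sorted(p for s, p in follows if s == subscriber)

# NoSQL style: two separate indices, each keyed for one access pattern.
subscribers_of = defaultdict(set)  # publisher -> subscribers
publishers_of = defaultdict(set)   # subscriber -> publishers

def follow(subscriber, publisher):
    # Two independent writes, possibly to different partitions. The same
    # fact is stored twice, and the copies can briefly disagree.
    subscribers_of[publisher].add(subscriber)
    publishers_of[subscriber].add(publisher)

for s, p in follows:
    follow(s, p)

print(subscribers_relational("bob"), sorted(subscribers_of["bob"]))
print(publishers_relational("alice"), sorted(publishers_of["alice"]))
```

Each index lookup touches a single key, which is what lets the two indices be partitioned and scaled independently, at the cost of the redundant storage and consistency care listed under Cons.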
Gotchas - 12 Things to Watch Out For
1. Privacy
2. Abuse – e.g. HFT/Front running.
3. Immature technologies and companies.
4. Effect of business and product changes on architecture.
5. Data in-transit vs. at-rest - replication, mirroring, streaming,
reprocessing.
6. Data security in-transit and at-rest.
7. Blurring of high availability, performance and disaster recovery.
8. Replacing sampling and aggregation with ALL of the data!
9. Correlation is not Causation – e.g. Google Flu
10. Data Snooping (or Confirmation Bias) http://tylervigen.com
11. Irrelevance
12. Veracity - how do you check and reproduce results?
“The process of making is iterative” - Cesar A. Hidalgo
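Gotcha 10 (data snooping) can be demonstrated with pure noise: search enough unrelated series and some pair will correlate strongly by chance. A minimal sketch, assuming 200 random series of 10 points each; the sizes and the seed are arbitrary choices for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the run is repeatable

# 200 independent noise series: no series is related to any other.
series = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(200)]

# Snoop through every pair and keep the strongest correlation found.
best = max(
    abs(pearson(series[i], series[j]))
    for i in range(len(series))
    for j in range(i + 1, len(series))
)
print(round(best, 2))
```

With 19,900 pairs to choose from, the strongest correlation is high even though every series is random, which is why a result found by searching needs to be checked on fresh data (gotcha 12).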
References and More Info
• http://en.wikipedia.org/wiki/Big_data
• http://highscalability.com/blog/2012/9/11/how-big-is-a-petabyte-exabyte-zettabyte-or-a-yottabyte.html
• http://strata.oreilly.com/2012/01/what-is-big-data.html
• http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html
• http://www.cnet.com/news/nsa-to-store-yottabytes-in-utah-data-centre/
• http://aws.typepad.com/aws/2012/06/amazon-s3-the-first-trillion-objects.html
• http://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/
• http://iveybusinessjournal.com/topics/strategy/why-big-data-is-the-new-competitive-advantage#.U12f8aPD9jo
• http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
• Animation: Large Hadron Collider at CERN: http://home.web.cern.ch/about/updates/2013/04/animation-shows-lhc-data-processing
• http://ivoroshilin.com/2012/12/13/brewers-cap-theorem-explained-base-versus-acid/
• http://www.scientificamerican.com/article/saving-big-data-from-big-mouths/?utm_content=buffer53ae6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
• http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz31AZrOtI6
• Copyleft – Share Alike - http://creativecommons.org/licenses/by-sa/3.0/
Questions? Use Cases? Technology Adoption?
Feedback or follow-up: Lionel.Silberman@Compuware.com