Curious about the Big Data hype? Want to find out just how big BIG is? Who's using Big Data for what, and what can you use it for? How about the architectural underpinnings and technology stacks? Where might you fit in the stack? Maybe some gotchas to avoid? Lionel Silberman, a seasoned Data Architect, sheds some light on it all. A good and wholesome refresher on Big Data and what it can do.
Our guest speaker:
Lionel Silberman,
Senior Data Architect, Compuware
Lionel Silberman has over thirty years of experience in big data product development. He has expert knowledge of relational databases, both internals and applications, performance tuning, modeling, and programming. His product and development experience encompasses the major RDBMS vendors; object-oriented, time-series, OLAP, transaction-driven, MPP, distributed, and federated database applications; data appliances; NoSQL systems such as Hadoop and Cassandra; as well as data parallel and mathematical algorithm development. He is currently employed at Compuware, integrating enterprise products at the data level. All are welcome to join us.
1. Unraveling the Mystery of Big Data
Lionel Silberman
May 22, 2014
Copyleft – Share Alike
http://www.bigstockphoto.com/image-12115340/stock-photo-binary-stream
2. Who Am I?
Lionel Silberman, currently the Senior Data Architect at Compuware
• 30 years in Software Development
• Statistical modeling, DBMS, Big Data, Data Architecture, tech/product/management.
• Diverse, deep data management:
– All of the major RDBMS vendors and internals.
– Data modeling and data parallelism techniques.
– OLAP, OLTP, MPP, NoSQL systems like Hadoop and Cassandra.
– Scaling and Performance tuning of distributed and federated applications.
• Current interest is integrating products in the enterprise at the data level
that deliver more value than their individual pieces.
• Active interest in big data metadata privacy issues.
• Who are you? What’s your interest in this talk tonight?
4. Unraveling the Mystery of Big Data Agenda
• What is Big Data?
Business Value
Technical Definitions
Sizes and Applications
• What Big Data is Not (or why isn’t everything just “data”)?
• Architectural Underpinnings
• Some Useful Architectural Distinctions
• Technology stacks and ecosystems
• Data Modeling Example
• Gotchas - What 12 things to watch out for?
• References and more info
• Questions?
5. What is Big Data?
Business Value
Enabling new products
• Sensors everywhere!
• Nowcasting.
• Ever-narrower segmentation of customers
Analytics - taking data from input through to decision
• Correlation in real time
• New insights from previously hidden data:
• Social
• Geographical data
• Recommendations.
• Finding needles in haystacks.
• In 2010, the industry was worth more than $100 billion.
• Growing at almost 10 percent a year;
• or about twice as fast as the software business as a whole.
6. What is Big Data?
A Technical Definition
• Data that exceeds the processing capacity of conventional database
systems in volume, velocity or variety*
or the 3Vs!
Volume - sheer size and growth.
Velocity - how fast it moves.
Variety - lack of fixed structure, or structure that changes frequently.
* META Group (now Gartner) analyst Doug Laney
7. What is Big Volume?
• 1970s: Megabytes
• Now: Many organizations are approaching an Exabyte
• Examples:
• Google – capacity for 15 Exabytes
• NSA – capacity for a Yottabyte in Utah
• AWS – 1 Trillion objects in 2012
• Facebook – 500 Terabytes/day
• Scientific Pursuits:
• Large Hadron Collider at CERN - last year 30 Petabytes
• The NASA Center for Climate Simulation - 32 petabytes of climate observations
and simulations on the Discover supercomputing cluster.
• Sloan Digital Sky Survey (SDSS) - 140 terabytes total; 200 GB per night.
The Large Synoptic Survey Telescope is anticipated to acquire 140 TB every five days.
• eBay.com - 90 Petabytes in data warehouses
8. What is Big Velocity?
• Financial Trading volume
• Retail – Cyber Monday
(every click and interaction, not just the final sale).
• Government – Affordable Care Act
• Smartphone - geolocated imagery and audio data
• Fraud (complex event processing)
o Credit Card Traffic patterns
o Phone slamming/cramming
• Streaming – Netflix, Snapfish
• Retail - Walmart handles more than 1 million customer
transactions per hour.
• Compuware APM (my firm) - 25K transactions per second.
• MMOG – Massive Multiplayer Online Games
http://www.livestream.com/ibmpartnerworld/video?clipId=flv_052b14ea-7d5a-40f5-9e61-e287b0ce5d9c
9. What is Big Variety?
• Diverse source and destinations:
o Document Backup or Archival – HP, EMC, AWS
o Pictures and Video – Facebook 50 Billion Photos
o Sensor sources – GE, NetApp
o Multi-device - Dropbox and Sugarsync
• Big Data is messy:
o structure aids meaning, but can change frequently.
o multiple sources (e.g. financial feeds, browser incompatibilities)
o Application integration issues (e.g. Fitbit)
o Entity resolution issues (e.g. Portland, dog)
o Visualization increasingly important.
11. Big Data and Visualization - Wikipedia
http://infodisiac.com/Wikimedia/Visualizations/
12. What Big Data is Not (or why isn’t everything just “data”)?
• Traditional systems may not need it:
o Payroll
o Human Resources
o Shop machine sampling?
• Some tradeoffs required for the technology of Big Data:
o Bleeding edge vs. established technology
o Subtle definition of consistency
o Complexity
o New and hard-to-find skills
• Make sure the business case warrants and can tolerate the
tradeoffs…
14. Architectural Underpinnings: CAP Theorem
• Consistency (C) - a single up-to-date copy of the data.
• High Availability (A) - of that data, for writes.
• Partition Tolerance (P) - the system continues to operate despite
arbitrary message loss or failure of parts of the system.
(Diagram: a system can fully provide only two of the three. NoSQL systems
relax Consistency to keep A and P; a traditional DBMS keeps Consistency
and gives up Partition Tolerance.)
15. NoSQL and Eventual Consistency
• Relaxed or weaker consistency to achieve High Availability and
Partition Tolerance.
• Eventual Consistency:
– An unbounded delay in propagating changes across partitions.
– No ordering guarantees, so transaction atomicity exists only at a
lower level.
• From a system perspective, this means an operational system is
ALWAYS in a (temporarily) inconsistent state.
• Many NoSQL systems (e.g. Cassandra, Hadoop):
– support self-healing or restartability.
– allow ease of scaling and disaster recovery.
– are schema-free.
– lack a standard data-retrieval language (like SQL).
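Eventual consistency as described above can be sketched in a few lines of plain Python. This is a toy model, not any particular NoSQL system's API; it assumes asynchronous propagation of writes and a last-write-wins timestamp rule, which is one common reconciliation policy. All names (`Replica`, `write`, `pending`) are illustrative.

```python
import time

class Replica:
    """A toy replica: last-write-wins by timestamp (one common rule)."""
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def apply(self, key, ts, value):
        # Accept a write only if it is newer than what we already hold.
        if key not in self.store or ts > self.store[key][0]:
            self.store[key] = (ts, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

replicas = [Replica(), Replica(), Replica()]

def write(key, value, pending):
    """Apply locally, then queue asynchronous propagation to the others."""
    ts = time.time()
    replicas[0].apply(key, ts, value)
    pending.extend((r, key, ts, value) for r in replicas[1:])

pending = []
write("user:42", "alice", pending)

# Before propagation the replicas disagree: the inconsistent window.
assert replicas[0].read("user:42") == "alice"
assert replicas[1].read("user:42") is None

# Drain the propagation queue: the system becomes consistent "eventually".
for r, key, ts, value in pending:
    r.apply(key, ts, value)
assert all(r.read("user:42") == "alice" for r in replicas)
```

The window between the local write and the drained queue is exactly the "always in an inconsistent state" point above; the application layer has to tolerate reads that land inside it.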
16. Business and Infrastructure Architecture Decisions
Infrastructure:
• Amazon Web Services (AWS)
• Storage
• Elasticity
• Availability
• Data division:
• Parallelism (sharding)
• Redundancy
• Application Servers
• Data affinity
• Stateless protocols
Business Issues:
• Research to Production Pipeline?
• 3rd party integration needs?
• Flexibility?
• Radical Transparency?
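The "data division" bullets above (sharding for parallelism) can be sketched with a hash-based partitioner. A minimal sketch, assuming a fixed shard count and using plain dicts to stand in for nodes; all names here are illustrative:

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one node

def shard_for(key: str) -> int:
    # Stable hash, so the same key always routes to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

for user_id in ("u1", "u2", "u3", "u4", "u5"):
    put(user_id, {"name": user_id.upper()})

assert get("u3") == {"name": "U3"}
# Keys hash across shards, so reads and writes for different keys
# can proceed in parallel on separate nodes.
```

Note the tradeoff hidden in the modulo: changing `NUM_SHARDS` remaps most keys, which is why production systems typically use consistent hashing instead.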
17. Data Architecture Decisions
Data In-transit:
• high write vs. high reads
• Encryption and security
• Distributed vs. centralized
• Visualization tools
Data Stores:
• Documents
• key/value pairs
• Graphs
• In-memory
18. Technology Stacks (Data Stores)
• Hadoop: Distributed File System, Job Scheduler, MapReduce programming model
– Pros: fault-tolerant, disaster protection, data parallelism fits many applications, ecosystem.
– Best used: long-term data storage, research, basis of other flexible data stores.
• Cassandra: Big Table, key-value
– Pros: Fast writes, no single point of failure, fault-tolerant, disaster protection, columns and column families.
– Best used: when you write more than you read, Financial industry, real-time data analysis.
• Riak: key-value
– Pros: Cassandra-like, less complex, single-site scalability, availability & fault-tolerance
– Best used: Point-of-sales, factory control systems, high writes.
• Redis: in-memory, key-value
– Pros: fast, transactional, expiring values.
– Best used: Rapidly changing data in memory (stock prices, analytics, real-time data collection).
• Dynamo: Big Table, key-value
– Pros: Fast reads and writes, no single point of failure, fault-tolerant, disaster protection, eventually consistent
– Best used: Always available (e.g. Amazon).
• CouchDB: Documents
– Pros: bi-directional replication, conflict detection, previous versions
– Best used: Accumulating occasionally changing data, pre-defined queries, versioning
• MongoDB: document store.
– Pros: update-in-place, defined indexes, built-in sharding, geospatial indexing.
– Best used: Dynamic queries, schema-less data, data that changes a lot.
• HBase: Big Table
– Pros: huge datasets, map-reduce Hadoop/HDFS stack.
– Best used: Analyzing log data
• Memcached/Membase: in-memory and multi-node.
– Pros: low-latency, high-concurrency and availability.
– Best used: online gaming (e.g. Zynga).
• Neo4j: GraphDB
– Pros: highly scalable, robust, ACID
– Best used: social, routing, recommendation questions (e.g. How do I get to Linz?)
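The MapReduce programming model named in the Hadoop entry above can be sketched in pure Python. This shows the model only, not Hadoop's actual Java API; `mapper`, `shuffle`, and `reducer` are illustrative names for the three phases, applied to the classic word-count problem:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map: emit a (word, 1) pair for each word in one input record.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key (the framework's job).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce: combine the grouped values for one key.
    return (key, sum(values))

lines = ["big data is big", "data moves fast"]
mapped = chain.from_iterable(mapper(line) for line in lines)
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
assert result["big"] == 2 and result["data"] == 2
```

The data parallelism comes from the fact that `mapper` calls are independent per record and `reducer` calls are independent per key, so a framework can scatter both across a cluster.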
19. Related Data Management Ecosystems
• Apache Hadoop stack (Cloudera)
– MapReduce on HDFS
– Pig, Hive, HBase – SQL-like DB interfaces on top of HDFS.
– Flume, RabbitMQ – Data conduits and message queues.
– Splunk – Operational Analytics and Log Processing.
– Sqoop – Bulk Data transfer to DBs
– Puppet, Chef – Configuration Management and DevOps Orchestration.
– Visualization, BI and ETL - Informatica, Talend, Pentaho, Tableau
• Cloud computing infrastructure (Amazon Web Services)
e.g. EC2, Elastic MapReduce, RDS
• Cassandra (Datastax)
• High-scale, distributed and hybrid RDBMS:
- Teradata
- Netezza
- EMC/Greenplum
- Aster Data
- Vertica
- VoltDB
- RDF Triple Stores
- Hadapt
20. Data Modeling Example: Twitter Publishers and Subscribers
• Relational DB: one table holding the people relationships, with a tag
marking each row as publisher or subscriber.
Pros: No duplicated data, ACID transactions.
Cons: Does not scale out; single point of failure (SPOF).
• NoSQL: Separate Indices for Subscribers and Publishers.
Pros:
– Partition Independence enables scale-out and no single point of failure for
both reads and writes.
– No Schema allows quick development.
Cons:
– Eventual consistency requires care in the application layer, in the
presentation of the user experience, and in future evolution.
– Redundant storage.
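The two models above can be put side by side in a toy illustration; the data and names here are hypothetical, and plain dicts/lists stand in for tables and indices:

```python
# Relational model: one relationship table, one row per (subscriber, publisher).
relationships = [
    ("alice", "bob"),    # alice subscribes to bob
    ("alice", "carol"),
    ("dave",  "bob"),
]

# NoSQL model: two denormalized indices, each shardable by its own key.
# The same facts are stored twice (the "redundant storage" con above).
subscribers_of = {"bob": ["alice", "dave"], "carol": ["alice"]}
publishers_of  = {"alice": ["bob", "carol"], "dave": ["bob"]}

# Relational: either question means scanning the one shared table.
assert [s for s, p in relationships if p == "bob"] == ["alice", "dave"]

# NoSQL: each question is a single key lookup on its own index, so the
# two indices can be partitioned and scaled out independently.
assert subscribers_of["bob"] == ["alice", "dave"]
assert publishers_of["alice"] == ["bob", "carol"]
```

The cost shows up on write: following someone must update both indices, and until both updates land the two views can briefly disagree, which is the eventual-consistency care noted above.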
21. Gotchas - What 12 Things to Watch Out for?
1. Privacy
2. Abuse – e.g. HFT/Front running.
3. Immature technologies and companies.
4. Effect of business and product changes on the architecture.
5. Data in-transit vs. at-rest - replication, mirroring, streaming,
reprocessing.
6. Data security in-transit and at-rest.
7. Blurring of high availability, performance and disaster recovery.
8. Replacing sampling and aggregation with ALL of the data!
9. Correlation is not Causation – e.g. Google Flu
10. Data Snooping (or Confirmation Bias) http://tylervigen.com
11. Irrelevance
12. Veracity - how do you check and reproduce results?
“The process of making is iterative” - Cesar A. Hidalgo
22. References and More Info
• http://en.wikipedia.org/wiki/Big_data
• http://highscalability.com/blog/2012/9/11/how-big-is-a-petabyte-exabyte-zettabyte-or-a-yottabyte.html
• http://strata.oreilly.com/2012/01/what-is-big-data.html
• http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html
• http://www.cnet.com/news/nsa-to-store-yottabytes-in-utah-data-centre/
• http://aws.typepad.com/aws/2012/06/amazon-s3-the-first-trillion-objects.html
• http://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/
• http://iveybusinessjournal.com/topics/strategy/why-big-data-is-the-new-competitive-
advantage#.U12f8aPD9jo
• http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
• Animation: Large Hadron Collider at CERN: http://home.web.cern.ch/about/updates/2013/04/animation-
shows-lhc-data-processing
• http://ivoroshilin.com/2012/12/13/brewers-cap-theorem-explained-base-versus-acid/
• http://www.scientificamerican.com/article/saving-big-data-from-big-
mouths/?utm_content=buffer53ae6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
• http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz31AZrOtI6
• Copyleft – Share Alike - http://creativecommons.org/licenses/by-sa/3.0/
23. Questions? Use Cases? Technology Adoption?
Feedback or follow-up: Lionel.Silberman@Compuware.com