2. "Without big data analytics, companies are blind and deaf, wandering out onto the web like a deer on a freeway." - Geoffrey Moore
3. Business Agility Through Data
Traditional: Requirements Based; Top-Down Design; Integration and Reuse; Competence Centers; Better Decisions; Enterprise Focus
Big Data: Opportunity-Oriented; Experimentation; Throwaway; Hackathons; Business Innovation; Functional Focus
5. To manage risk and create agility, embrace all data
…the uncertainty of new information is growing alongside its complexity.
Volume – Data at Scale: terabytes to petabytes of data
Variety – Data in Many Forms: structured, unstructured, text, multimedia
Velocity – Data in Motion: analysis of streaming data to enable decisions within fractions of a second
Veracity – Data Uncertainty: managing the reliability and predictability of inherently imprecise data types
7. But most of the data you might need… you do not own
Exogenous data (behavior, socio-economic, environmental, ...): 60% of determinants of health; ~1,100 terabytes generated per lifetime; Volume, Variety, Velocity, Veracity
Genomics data: 30% of determinants of health; ~6 TB per lifetime; Volume
Clinical data: 10% of determinants of health; ~0.4 TB per lifetime; Variety
Source: "The Relative Contribution of Multiple Determinants to Health Outcomes," McGovern et al., Health Affairs, 33, no. 2 (2014)
8. Big Data Fuels Insights That Enable Outcomes for the Enterprise
Questions it answers:
Who are my brand advocates, fence sitters, and adversaries?
Are my employees effective at engaging with customers?
Which customers are likely to defect to my competitor?
How are my customers and prospects engaging with my products and services?
What is the customer sentiment regarding my brand?
What new products and features does my customer desire?
How do my customers feel about my competitors' products?
Functional areas: Sales, Marketing, Customer Service, Product Development, Workforce Optimization
Enterprise outcomes: Higher Sales Conversion Rates, Improved Customer Service, Higher Loyalty, Enhanced Online Accuracy, New Product Innovation, New Demand Generation, Risk Mitigation
The Business Agility Process … align big data to business outcomes
9. But how?
Barriers, traditional and big data alike:
Is information accurate?
Takes too long
Can't find the right information
Data quality problems
Ease of use
Integration of different systems
10. The 3 R's of Success with Big Data Analytics
• Revolution
• Responsiveness
• Resiliency
11. "If you are not moving at the speed of the marketplace you're already dead – you just haven't stopped breathing yet" - Jack Welch
12. Revolution – Managing Disruption
Data Products Need to Be Built Differently
Give Data Back in Powerful Ways
We Don't Have Time to Do It Right, But We Have Time to Do It Over
Decide on Where to Start Building Your Application
Create and Die By Your Product Pre-Flight Checklist
- DJ Patil, US Chief Data Scientist
- Ruslan Belkin, VP Engineering, Salesforce
13. Responsiveness – Data Sensitivity
In-House Data: typically structured
External Data: unstructured, but converted to structured
Unfamiliar External Data: leveraged as-is
Homemade Data: solution augmentation
14. Big Data – Examples
Data categories: In-House Data; External Data; Unfamiliar External Data; Homemade Data
Example use cases:
Leads Most Likely to Generate New Sales
Analysis of Customer Transactions Over Time
Understanding Customer Loyalty Patterns
Market Basket Analysis on Short- and Long-Term Behavior
Targeted Advertising Using Browsing History
Targeted Discounts via Phone Recognition of Possible Attrition
Social Media Sentiment / Buzz on Your Reputation
Pharmaceutical Drug Analytics Through Refill Patterns
"Personalized" Credit Offers per Customer
Hospital and Physician Quality Ratings
Experimentation for Customized Landing Pages
Patient Claim Analysis Based on Proximity to "Poor" Locales
15. Resiliency – Leveraging Cloud Elasticity
Elastic Provisioning
Pay-as-You-Go
Manage High-Volume External Data Sources
Self-Service Through a Browser
SQL / NoSQL – Unstructured Data
Access Data Anywhere, Anytime
Leverage Current Cloud Apps
16. Big Data Analytics – Reference Architecture
Old and new sources: Sensors; Internet; Social Media; Services; Customer Conversations; Public and Internal Sources; Back-Office Applications
Information Ingestion: Data Connection / Movement; Data Shaping / Cleansing; Real-Time Data Streaming; Distributed Messaging System
Products shown: DataWorks, Streams, Apache Kafka
17. Big Data Analytics – Reference Architecture
Old and new sources: Sensors; Internet; Social Media; Services; Customer Conversations; Public and Internal Sources; Back-Office Applications
Information Ingestion: Data Connection / Movement; Data Shaping / Cleansing; Real-Time Data Streaming; Distributed Messaging System
Analytic Sources: Logical Data Warehouse; Interactive Queries and Iterative Data Processing; Batch Processing Framework
Products shown: dashDB, Cloudant, Postgres, DB2, MongoDB, RethinkDB, Redis
18. Big Data Analytics – Reference Architecture
Old and new sources: Sensors; Internet; Social Media; Services; Customer Conversations; Public and Internal Sources; Back-Office Applications
Information Ingestion: Data Connection / Movement; Data Shaping / Cleansing; Real-Time Data Streaming; Distributed Messaging System (DataWorks, Streams, Apache Kafka)
Analytic Sources: Logical Data Warehouse; Interactive Queries and Iterative Data Processing; Batch Processing Framework (dashDB, Cloudant, Postgres, DB2, MongoDB, RethinkDB, Redis)
Hubs: Metadata Catalog; Insight Hub; Activity Hub; Content Hub; Master & Reference Data Hubs
Information Interaction: Interactive Analytics; Alerting, Reporting and Planning; Visualization & Collaboration; Real-Time Decision Management; Systems of Engagement; Accelerators (Predictive Analytics, D3, Embeddable Reporting, Apache Zeppelin)
19. Example – Ford: Integrated Health Management Platform
Old and new sources: Vehicle Device Sensors; Dongle Information from Parking Spots; Vehicle & User Information; Maintenance History
Information Ingestion: Streams
Analytic Sources: Cloudant
Hubs: Metadata Catalog; Insight Hub; Activity Hub; Content Hub; Master & Reference Data Hubs
Information Interaction: Predictive Analytics
20. Example – Integrated Health Management Platform
Old and new sources: Clinical and Wearable Device Sensors; Fitbit and Jawbone Device Data; Lab Results and Patient Conversations; Health Records from RDBMS
Information Ingestion: DataWorks, Streams
Analytic Sources: dashDB, Cloudant
Hubs: Metadata Catalog; Insight Hub; Activity Hub; Content Hub; Master & Reference Data Hubs
Information Interaction: D3
From the viewpoint of health outcome determinants, almost 60% of the data is exogenous and is never captured by today's EMR systems.
Inserting IBM into the dataflow, and enabling the generation and capture of this exogenous data, is crucial for any emerging health ecosystem. Two important aspects of this data play directly to IBM's strengths:
traditional "big data" characteristics – volume, velocity, variety
all of the data is generated in uncontrolled environments (that is, with no hospital or supply-side control) – a highly fragmented value chain that needs a neutral entity that can collect, store, manage, curate, and analyze data for insights.
Revolution is about managing disruption when warranted and pushing your organization toward maximized throughput when it isn't.
1. 400 variations of the same jobs at IBM – LinkedIn. If you are not thinking about how to keep your data clean, you are screwed, I guarantee it.
2. Give data back to the user. If you give them too much information, they will be in paralysis. LinkedIn – "who viewed you."
3. Never try to launch a complicated data product on a fixed schedule.
4. "Every single company I've worked at and talked to has the same problem without a single exception so far: poor data quality, especially tracking data," he says. "Either there's incomplete data, missing tracking data, or duplicative tracking data."
5. A. The product has to work.
B. It has to work for the user and make sense to them.
C. It has to feel safe, not creepy.
D. The user needs to feel in control.
Responsiveness means being sensitive to which pieces of data are best suited for analysis, and conscious of the level of data quality and speed required to remain agile.
Resiliency means leveraging the elasticity of the cloud to best augment your internal capabilities, giving your company the ultimate in staying power.
Enterprise applications already hosted in the cloud: If, like many organizations -- especially small and midmarket businesses -- you use cloud-based applications from an external service provider, much of your source transactional data is already in a public cloud. If you have deep historical data on that cloud platform, it might already have accumulated in big data magnitudes. To the extent the service provider or one of its partners offers a value-added analytics service -- such as churn analysis, marketing optimization, or off-site backup and archiving of customer data -- it might make sense to leverage that rather than host it all in-house.
High-volume external data sources that require considerable preprocessing: If, for example, you're doing customer sentiment monitoring on aggregated feeds of social media data, you probably don't have the server, storage, or bandwidth capacity in-house to do it justice. That's a clear example of an application where you'd want to leverage the social media filtering service provided by a public-cloud-based, big-data-powered service.
Tactical applications beyond your on-premises, big data capabilities: If you already have an on-premises big data platform dedicated for one application (such as a dedicated Hadoop cluster for high-volume ETL on unstructured data sources), it might make sense to use a public cloud to address new applications (say, multichannel marketing, social media analytics, geospatial analytics, query-able archiving, elastic data-science sandboxing) for which the current platform is unsuited or for which an as-needed, on-demand service is more robust or cost effective. In fact, a public cloud offering might be the only feasible option if you need petabyte-scale, streaming, multistructured, big data capability ASAP.
Elastic provisioning of very large but short-lived analytic sandboxes: If you have a short-turnaround, short-term data science project that requires an exploratory data mart (aka sandbox) that's an order of magnitude larger than the norm, the cloud may be your only feasible or affordable option. You can quickly spin up cloud-based storage and processing power for the duration of the project, then just as rapidly deprovision it all when the project is over. I call this the "bubble mart" deployment model.
3. You can access data anywhere
With business trips, outsourcing, and branching out to new markets, your company needs big data available at any location. You could spend the time, money, and manpower to set up an elaborate VPN, but the cloud already comes with built-in secure global access. Even if you don't need it now, as your company expands, some of the work may need to be outsourced, and new offices could open on the other side of the globe. The cloud is perfect for these global needs.
4. Pay as you go
It's simple: you only pay for the resources you actually use. On quiet days, things will be on the cheaper side, on crazy days with sharp usage spikes, you'll pay the right price. No more, no less. Consider the alternative -- buying super-expensive hardware to handle a sudden rise in demand. Most of the time the hardware will gather dust, its capacity barely used, putting the big bucks spent on it to waste. Add maintenance prices, and you'll soon realize that using big data in the cloud saves money on system resources and directs it where your company actually needs it.
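To make the pay-as-you-go point concrete, here is a minimal back-of-the-envelope sketch in Python; the hourly rate, cluster sizes, and usage pattern are illustrative assumptions, not quotes from any provider.

```python
# Illustrative cost comparison: fixed on-premises capacity vs. pay-as-you-go cloud.
# All figures below are assumptions for the sake of the arithmetic, not vendor pricing.

HOURS_PER_MONTH = 730

# Assumed on-premises cluster sized for peak load, amortized over 36 months.
onprem_capex = 250_000          # purchase price, USD
onprem_monthly_opex = 3_000     # power, space, admin, USD
onprem_monthly = onprem_capex / 36 + onprem_monthly_opex

# Assumed cloud usage: a small always-on footprint plus burst capacity
# for month-end reporting spikes.
baseline_nodes, burst_nodes = 4, 40
burst_hours = 60                # hours of burst per month
node_hour_rate = 0.80           # USD per node-hour (assumed)

cloud_monthly = (baseline_nodes * HOURS_PER_MONTH + burst_nodes * burst_hours) * node_hour_rate

print(f"On-premises (peak-sized): ${onprem_monthly:,.0f}/month")
print(f"Cloud (pay-as-you-go):    ${cloud_monthly:,.0f}/month")
```

The point is not the specific numbers but the shape: the fixed option is priced for the spike, while the elastic option is priced for the actual usage curve.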
Resource pooling: Cloud architectures enable the efficient creation of groups of shared resources that make the cloud economically viable.
Self-service: With self-service, the user of a cloud resource is able to use a browser or a portal interface to acquire the resources needed, say, to run a huge predictive model. This is dramatically different than how you might gain resources from a data center, where you would have to request the resources from IT operations.
Pay as you go: A typical billing option for a cloud provider is Pay as You Go, which means that you are billed for resources used based on instance pricing. This can be useful if you’re not sure what resources you need for your big data project.
Fault tolerance: Cloud service providers should have fault tolerance built into their architecture, providing uninterrupted services despite the failure of one or more of the system’s components.
The technical stack an enterprise chooses is dictated by the type of data they need to store, and the type of data is dictated by business requirements.
The RDBMS is good for managing structured, highly relational data and will continue to be the software of choice for many requirements.
For the growing amount of unstructured data produced by social media, sensor networks, and federated analytics, and for constantly changing data that needs to be replicated to other operating sites or mobile workers, NoSQL technologies are a better fit for those use cases. Unstructured data can be terabytes or even petabytes in size.
The IBM DataWorks™ data refinery transforms raw data into relevant information. It includes IBM DataWorks Forge, an app primarily for knowledge workers, as well as APIs for application developers. IBM DataWorks leverages a highly performant and scalable engine to discover, profile, enrich, mask and deliver data to applications.
Forge (Beta)
A data-rich app that empowers knowledge workers - including business analysts, data scientists, and non-technical users - to find data, visualize it, and prepare it for use. By automatically profiling, classifying, and scoring data, Forge guides you through the process of enriching and improving the quality of data using actions such as removing duplicates, filtering, and joining. After you prepare and enrich your data, Forge makes it easy for you to deliver data to applications and systems.
APIs
Flexible, REST-based APIs enable developers to quickly access data and ensure it is fit for purpose. Using the IBM DataWorks APIs, you can quickly create higher-quality applications that load data between data sources (such as SQL Database, Object Storage, dashDB, IBM Analytics for Hadoop, DB2 and Oracle); mask data while loading; securely load on-premises data to cloud environments; cleanse US postal addresses; and classify and profile data.
Streaming Analytics is powered by IBM InfoSphere® Streams, an advanced analytic platform that you can use to ingest, analyze, and correlate information as it arrives from data sources in real time. When you create an instance of the Streaming Analytics service, you get your own instance of InfoSphere Streams running in the Bluemix cloud, ready to run your InfoSphere Streams applications.
You can use the Streaming Analytics service in two ways:
Interactively by using the Streaming Analytics console.
Programmatically in the context of a Bluemix application by using the Streaming Analytics service instance REST API.
You can also combine these two methods. Your Bluemix application can use the service programmatically, while you use the console to monitor the status of your applications.
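As a rough illustration of the programmatic path, the sketch below shows a Bluemix (Cloud Foundry) application reading its bound Streaming Analytics credentials from the standard VCAP_SERVICES environment variable and calling the service over REST; the service label, credential field names, and "/status" path are placeholders standing in for whatever the bound instance actually exposes, not the documented API.

```python
# Minimal sketch: read bound-service credentials from VCAP_SERVICES and call
# the Streaming Analytics REST API. Field names and the endpoint path are
# placeholders; inspect your own VCAP_SERVICES to confirm the real shape.
import json
import os

import requests

vcap = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
creds = vcap.get("streaming-analytics", [{}])[0].get("credentials", {})

status_url = creds.get("rest_url", "") + "/status"   # placeholder path
resp = requests.get(status_url,
                    auth=(creds.get("userid", ""), creds.get("password", "")))
print(resp.status_code, resp.text)
```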
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Fast
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Scalable
Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow streams larger than the capability of any single machine and to allow clusters of coordinated consumers.
Durable
Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
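As a concrete sketch of the publish-subscribe pattern described above, here is a minimal producer and consumer using the kafka-python client; the broker address and the "clickstream" topic name are assumptions for a local test setup.

```python
# Minimal Kafka publish/subscribe sketch using the kafka-python client.
# Assumes a broker reachable at localhost:9092 and a topic named "clickstream".
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", key=b"user-42", value=b'{"page": "/pricing", "ms": 1834}')
producer.flush()  # force delivery before we start reading

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # read from the start of the partition
    consumer_timeout_ms=5000,       # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.partition, message.offset, message.key, message.value)
```

Because the log is persisted and partitioned, many such consumers can read the same stream independently, which is what lets Kafka act as the shared backbone between ingestion and the downstream analytic stores.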
Data storage – RDBMS: Stored in a relational model, with rows and columns. Rows contain all of the information about one specific entry/entity, and columns are all the separate data points; for example, you might have a row about a specific car, in which the columns are 'Make', 'Model', 'Colour', and so on. NoSQL: The term "NoSQL" encompasses a host of databases, each with a different data storage model. The main ones are document, graph, key-value, and columnar.
Schemas and flexibility – RDBMS: Each record conforms to a fixed schema, meaning the columns must be decided and locked before data entry and each row must contain data for each column. This can be amended, but it involves altering the whole database and going offline. NoSQL: Schemas are dynamic. Information can be added on the fly, and each 'row' (or equivalent) doesn't have to contain data for each 'column'.
Scalability – RDBMS: Scaling is vertical. In essence, more data means a bigger server, which can get very expensive. It is possible to scale an RDBMS across multiple servers, but this is a difficult and time-consuming process. NoSQL: Scaling is horizontal, meaning across servers. These multiple servers can be cheap commodity hardware or cloud instances, making it a lot more cost-effective than vertical scaling. Many NoSQL technologies also distribute data across servers automatically.
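The schema contrast is easiest to see side by side. The sketch below inserts the same two car records into a fixed-schema SQLite table and into a schema-less document collection via pymongo; the database names and the assumption of a MongoDB server on localhost are illustrative only.

```python
# Fixed schema (RDBMS) vs. dynamic schema (document store), side by side.
import sqlite3

import pymongo  # assumes a MongoDB server listening on localhost:27017

cars = [
    {"make": "Ford", "model": "Focus", "colour": "blue"},
    {"make": "Ford", "model": "Fusion Energi", "colour": "white",
     "sensors": ["sonar", "radar", "rain"]},   # extra field, no schema change needed
]

# Relational: columns are fixed up front; the second record's "sensors" field
# has nowhere to go without an ALTER TABLE.
rdb = sqlite3.connect(":memory:")
rdb.execute("CREATE TABLE cars (make TEXT, model TEXT, colour TEXT)")
rdb.executemany("INSERT INTO cars VALUES (:make, :model, :colour)",
                [{k: c.get(k) for k in ("make", "model", "colour")} for c in cars])

# Document store: each document carries its own shape.
collection = pymongo.MongoClient("localhost", 27017)["demo"]["cars"]
collection.insert_many(cars)
```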
Apache Spark is the shiny new toy on the Big Data playground, but there are still use cases for using Hadoop MapReduce.
Spark has excellent performance and is highly cost-effective thanks to in-memory data processing. It’s compatible with all of Hadoop’s data sources and file formats, and thanks to friendly APIs that are available in several languages, it also has a faster learning curve. Spark even includes graph processing and machine-learning capabilities.
Hadoop MapReduce is a more mature platform and it was built for batch processing. It can be more cost-effective than Spark for truly Big Data that doesn’t fit in memory and also due to the greater availability of experienced staff. Furthermore, the Hadoop MapReduce ecosystem is currently bigger thanks to many supporting projects, tools and cloud services.
But even if Spark looks like the big winner, the chances are that you won’t use it on its own—you still need HDFS to store the data and you may want to use HBase, Hive, Pig, Impala or other Hadoop projects. This means you’ll still need to run Hadoop and MapReduce alongside Spark for a full Big Data package.
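For a feel of the Spark side of that trade-off, here is a minimal PySpark job that does the kind of aggregation a MapReduce job would, but in a few lines with the DataFrame API; the HDFS input path and the "customer_id" and "amount" column names are assumptions.

```python
# Minimal PySpark sketch: aggregate customer transactions by customer id.
# The input path and column names ("customer_id", "amount") are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("txn-rollup").getOrCreate()

txns = spark.read.csv("hdfs:///data/transactions/*.csv", header=True, inferSchema=True)

rollup = (txns.groupBy("customer_id")
              .agg(F.count("*").alias("n_txns"),
                   F.sum("amount").alias("total_spend"))
              .orderBy(F.desc("total_spend")))

rollup.show(20)
spark.stop()
```

Note that the data still lives on HDFS in this sketch, which is exactly the "Spark alongside Hadoop" point made above.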
The notion of running a data warehouse in the cloud was a pretty novel thing when Amazon Web Services launched its Redshift service in November of 2012. Most on-premises data warehouse (DW) platforms are appliance-based, which makes them difficult to expand, and the resulting need to leave room for growth also makes them expensive to acquire. In the cloud though, economics are better, elasticity is realistic and logistics are streamlined. Combine that with the ability to handle "big data" volumes with the familiar SQL/relational model that Redshift uses and it's hardly surprising that the service has been one of Amazon's fastest growing since its launch.
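In the spirit of the "bubble mart" pattern described earlier, the sketch below spins up a small Redshift cluster with boto3 and tears it down when the project is over; the cluster identifier, node type, and credentials are placeholders, and real usage would also wait for the cluster to become available and configure networking and security groups.

```python
# Sketch of elastic provisioning: create a short-lived Redshift "sandbox"
# cluster, then deprovision it when the data-science project ends.
# Cluster identifier, node type, and credentials are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="bubble-mart-sandbox",
    NodeType="dc2.large",
    ClusterType="multi-node",
    NumberOfNodes=4,
    MasterUsername="analyst",
    MasterUserPassword="REPLACE_ME_Str0ngPassw0rd",
    DBName="sandbox",
)

# ... run the exploratory workload, export anything worth keeping ...

redshift.delete_cluster(
    ClusterIdentifier="bubble-mart-sandbox",
    SkipFinalClusterSnapshot=True,   # nothing to preserve for a throwaway mart
)
```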
Company Background
Ford Motor Company is an American automaker based in Dearborn, Michigan, a suburb of Detroit. The company was founded by Henry Ford on June 16, 1903.
As the fifth-largest automobile company in the world, Ford Motor Company represents a $164 billion multinational business empire. Known primarily as a manufacturer of automobiles, Ford also operates Ford Credit, which generates more than $3 billion in income.
In London, Ford is working to make parking easier for drivers and the city. Drivers voluntarily use plug-in devices that create live data on traffic and parking. The City Dash app tells users whether they are legally parked. If not, the app recommends the nearest open spot. It allows drivers to pay for parking meters by mobile phone, and identifies the closest available parking spots to the driver’s final destination.
Success Criteria
Online/offline sync & replication critical
Data needs to reside on the device and in the cloud and be available regardless of connectivity.
Advanced Geo Spatial capability
Ford needs to be able to help drivers find the nearest open, relevant parking space.
Solution & Results
With Cloudant's advanced geospatial capabilities, Ford can help drivers find open parking spots in London, where parking is at a premium, and show only relevant spots instead of simply the nearest ones, which may not be easy to get to.
With Cloudant's replication and sync capabilities, Ford can be certain the app will function even if there is a loss of cellular signal.
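A rough sketch of those two Cloudant capabilities against its HTTP API is below; the account, database, design-document, and index names are placeholders, and the geospatial index is assumed to already exist in the "spots" design document.

```python
# Sketch of the two Cloudant features the Ford app relies on: continuous
# replication for offline/online sync, and a geospatial radius query for
# nearby parking spots. Account, database, and design-doc names are placeholders.
import requests

ACCOUNT = "https://example-account.cloudant.com"
AUTH = ("apikey-placeholder", "password-placeholder")

# 1. Continuous replication between the cloud database and a device-side replica.
requests.post(
    f"{ACCOUNT}/_replicate",
    json={"source": "parking_spots", "target": "parking_spots_device", "continuous": True},
    auth=AUTH,
)

# 2. Geospatial query: open spots within 500 metres of the driver's destination,
#    served by a pre-built geo index in the "spots" design document.
resp = requests.get(
    f"{ACCOUNT}/parking_spots/_design/spots/_geo/by_location",
    params={"lat": 51.5074, "lon": -0.1278, "radius": 500, "include_docs": "true"},
    auth=AUTH,
)
print(resp.json())
```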
Ford is also installing numerous sensors in its cars to monitor behaviour. It installs over 74 sensors per car, including sonar, cameras, radar, accelerometers, temperature sensors, and rain sensors. As a result, its Energi line of plug-in hybrid cars generates over 25 gigabytes of data every hour. This data is sent back to the factory for real-time analysis and returned to the driver via a mobile app. The cars in its testing facility generate up to 250 gigabytes of data per hour from smart cameras and sensors.
SHARED OPERATIONAL INFORMATION
Vehicle registration
Vehicle Management
User Registration
Usage record
Maintenance history
Utilizations
ANALYTICS
Driving behavior
Trajectory
Origin / Destination
Pattern Analytics
A Boston-based holding company, to be capitalized with $25MM and established for the purpose of investing in a portfolio of North American healthcare companies, to create the capabilities necessary for an Integrative Disease Management Platform focused primarily on metabolic and chronic diseases.
A patient, Sarah, is being monitored for cardiovascular disease (CVD), has been diagnosed with hyperlipidemia (high cholesterol), and has authorized her provider to access the data gathered by her Fitbit. Her provider has decided not to prescribe a statin and instead to treat her with increased monitoring, a nutritionist, and a plan to increase her physical activity level via step goals on the Fitbit.
The provider has prescribed a lipid panel every 3 months to monitor progress.
By the sixth month of increased physical activity and visits with the nutritionist, the application has been collecting data about adherence to scheduled visits, and the Fitbit data has been aggregated. When the lab results for her lipid panel come back normal for the first time, the system alerts the physician that his patient is doing well.
Logging into the application, the physician then triggers the app to send a notification with a personal message to his patient, interpreting the numbers for her and reinforcing that her behavior change is having an impact.
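As a toy illustration of the kind of rule that could drive that alert, the sketch below combines visit adherence, aggregated Fitbit steps, and the latest lipid panel into a single notification decision; the thresholds and field names are invented for the example and are not clinical guidance.

```python
# Toy alerting rule for the CVD-monitoring scenario: notify the provider when
# visit adherence, activity, and the latest lipid panel all look good.
# Thresholds and field names are illustrative only, not clinical guidance.
from dataclasses import dataclass

@dataclass
class PatientSnapshot:
    visits_attended: int
    visits_scheduled: int
    avg_daily_steps: float        # aggregated from Fitbit data
    ldl_mg_dl: float              # latest lipid panel result

def should_notify_provider(p: PatientSnapshot) -> bool:
    adherent = p.visits_scheduled > 0 and p.visits_attended / p.visits_scheduled >= 0.8
    active = p.avg_daily_steps >= 7000
    lipids_normal = p.ldl_mg_dl < 130          # assumed "normal" cutoff for the demo
    return adherent and active and lipids_normal

sarah = PatientSnapshot(visits_attended=6, visits_scheduled=6,
                        avg_daily_steps=8200, ldl_mg_dl=118)
if should_notify_provider(sarah):
    print("Alert provider: patient trending well; consider sending encouragement.")
```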
Provider Console
The application offers an integrated experience for the provider to see EHR, lab, and patient social-device data.
Example functionality to be spec'd out: risk notifications, adherence algorithms, provider/patient communication (electronic/direct mailing).
The provider is also given historical reporting to follow patient progress (fed by the EDW).
Patient Console
The application offers the ability for patients to authorize the integration of personal devices to providers.
Integration of EHR/labs/device data.
Alerts / communication with the provider.
Mobile App
Experience fitted to the smaller screen.
• Well, today the gap has been bridged…
• You can meet your goals
• Capture the opportunity...