This deck discusses operationalizing the data lake by integrating MongoDB with Hadoop to enable both real-time and batch processing. MongoDB powers operational applications with low-latency access to the analytics models generated from raw data stored in Hadoop, while Hadoop continues to provide batch processing and analytics across large datasets. By combining the two technologies, companies can unlock insights from their data lakes and avoid joining the 70% of Hadoop projects that fail to meet objectives due to skills and integration challenges.
The World is Changing
Digital Natives & Digital Transformation
• Volume, Velocity, Variety
• Iterative, Agile, Short Cycles
• Always On, Secure, Global
• Open-Source, Cloud, Commodity
• Data, Time, Risk, Cost
“Big Data” is More than Just Hadoop
• 24% CAGR: Hadoop, Spark & Streaming
• 18% CAGR: Databases
• Databases are key components within the big data landscape
How to Avoid Being in the 70%?
1. Unify data lake analytics with the operational applications
2. Create smart, contextually aware, data-driven apps & insights
3. Integrate a database layer with the data lake
Why a Database + Hadoop?
Distributed Processing & Analytics

MongoDB:
• Random access to subsets of data
• Millisecond latency
• Expressive querying, rich aggregations & flexible indexing
• Update fast-changing data in place, avoiding re-writing / re-computing the entire data set

HDFS:
• Data stored as large files (64MB–128MB blocks); no indexes
• Write-once-read-many, append-only
• Designed for high-throughput scans across TB/PB of data
• Multi-minute latency

Common attributes:
• Schema-on-read
• Multiple replicas
• Horizontal scale
• High throughput
• Low TCO
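To make the access-pattern difference concrete, here is a minimal sketch of the indexed random access MongoDB offers (database, collection and field names are illustrative assumptions, not from the deck); the equivalent lookup in HDFS has no index to use:

```python
# Minimal sketch, assuming a local mongod and the pymongo package;
# "retail", "customers" and "customer_id" are hypothetical names.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
customers = client["retail"]["customers"]

# One-time: a secondary index turns this lookup into a B-tree walk.
customers.create_index("customer_id")

# Millisecond point read of a single record -- no file scan involved.
profile = customers.find_one({"customer_id": "C-1042"})

# The HDFS equivalent has no index: a job must scan every 128MB block
# of the file holding the record, which is why latency is multi-minute.
```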
MongoDB & Hadoop: What’s Common
Distributed Processing & Analytics
Common attributes: schema-on-read, multiple replicas, horizontal scale, high throughput, low TCO.
Bringing it Together
Online services powered by MongoDB:
• User account & personalization
• Product catalog
• Session management & shopping cart
• Recommendations
Back-end machine learning powered by Hadoop:
• Customer classification & clustering
• Basket analysis
• Brand sentiment
• Price optimization
The two sides are linked by the MongoDB Connector for Hadoop.
Design Pattern: Operationalized Data Lake
• Sources (sensors, user data, clickstreams, logs) feed a message queue.
• Raw data lands in HDFS; processed events land in MongoDB.
• Distributed processing frameworks generate analytics models: churn analysis, enriched customer profiles, risk modeling, predictive analytics.
• MongoDB provides real-time access: millisecond latency, expressive querying & flexible indexing against subsets of data, updates-in-place, in-database aggregations & transformations.
• HDFS provides batch processing and batch views: multi-minute latency with scans across TB/PB of data, no indexes, data stored in 128MB blocks, write-once-read-many & append-only storage.
• Consuming applications: customer data management, mobile app, IoT app, live dashboards.
The pattern is built up in four steps:
1. Configure where to land incoming data.
2. Raw data is processed to generate analytics models.
3. MongoDB exposes the analytics models to operational apps and handles real-time updates.
4. New models are computed against MongoDB & HDFS.
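As a rough illustration of the first step, the sketch below consumes the queue and routes "interesting" events to MongoDB, while raw data is assumed to reach HDFS via a separate sink (e.g. a Kafka-to-HDFS connector). The topic, database/collection names and the is_actionable() rule are all hypothetical:

```python
# Sketch of the ingestion step, assuming a Kafka topic "raw-events"
# and the kafka-python client; names and routing rule are hypothetical.
import json

from kafka import KafkaConsumer
from pymongo import MongoClient

consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
events = MongoClient("mongodb://localhost:27017")["ops"]["processed_events"]

def is_actionable(event):
    # Hypothetical rule: offer triggers and telemetry alarms must be
    # consumed by operational apps immediately.
    return event.get("type") in {"offer_trigger", "telemetry_alarm"}

for message in consumer:
    event = message.value
    if is_actionable(event):
        events.insert_one(event)  # served to operational apps in real time
```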
Operational Database Requirements
1. “Smart” integration with the data lake
2. Powerful real-time analytics
3. Flexible, governed data model
4. Scale with the data lake
5. Sophisticated management & security
UK’s Leading Price Comparison Site
Out-pacing internet search giants with a continuous delivery pipeline powered by microservices & Docker, running MongoDB, Kafka and Hadoop in the cloud.

Problem:
• Existing EDW with nightly batch loads
• No real-time analytics to personalize the user experience
• Application changes broke the ETL pipeline
• Unable to scale as services expanded

Solution:
• Microservices architecture running on AWS
• All application events written to a Kafka queue, routed to MongoDB and Hadoop
• Events that personalize the real-time experience (e.g. triggering an email send, additional questions, offers) written to MongoDB
• All event data aggregated with other data sources and analyzed in Hadoop; updated customer profiles written back to MongoDB

Results:
• 2x faster delivery of new services after migrating to the new architecture
• Enabled continuous delivery: pushing new features every day
• Personalized user experience, plus higher uptime and scalability
Leading Global Airline: Customer Data Management
Single view and real-time analytics with MongoDB, Spark & Hadoop.

Problem:
• Customer data scattered across 100+ different systems
• Poor customer experience: no personalization, no consistent experience across brands or devices
• No way to analyze customer behavior to deliver targeted offers

Solution:
• Selected MongoDB over HBase for schema flexibility and rich query support
• MongoDB stores all customer profiles, served to web, mobile & call-center apps
• Distributed across multiple regions for DR and data locality
• All customer interactions stored in MongoDB, loaded into Hadoop for customer segmentation
• Unified processing pipeline with Spark running across MongoDB and Hadoop

Results:
• A single profile created for each customer, personalizing the experience in real time
• Revenue optimization by calculating the best ticket prices
• Reduced competitive pressure by identifying gaps in product offerings
World’s Most Sophisticated Traveler Safety Platform
Analyzing PBs of data with MongoDB, Hadoop, Apache NiFi & SAP HANA.

Problem:
• Commercialize a national security platform
• Massive volumes of multi-structured data: news, RSS & social feeds, geospatial, geological, health & crime stats
• Requires complex analysis, delivered in real time, always on

Solution:
• Apache NiFi for data ingestion, routing & metadata management
• Hadoop for text analytics
• SAP HANA for geospatial analytics
• MongoDB correlates analytics with user profiles & location data to deliver real-time alerts to corporate security teams & individual travelers

Results:
• Enables Prescient to uniquely blend big data technology with the security IP it developed in government
• Dynamic data model supports indexing 38k data sources, growing at 200 per day
• 24x7 continuous availability
• Scalability to PBs of data
Powering Global Threat Intelligence
Cloud-based real-time analytics with MongoDB & Hadoop.

Problem:
• Requirement to analyze data over many different dimensions to detect real-time threat profiles
• HBase unable to query data beyond primary-key lookups
• Lucene search unable to scale with growth in data

Solution:
• MongoDB + Hadoop to collect and analyze data from internet sensors in real time
• MongoDB’s dynamic schema enables sensor data to be enriched with geospatial tags
• Auto-sharding to scale as data volumes grow

Results:
• Runs complex, real-time analytics on live data
• Improved query performance by over 3x
• Scales to support a doubling of data volume every 24 months
• Deployed across global data centers for a low-latency user experience
• Engineering teams have more time to develop new features
Conclusion
1. Data lakes enable enterprises to affordably capture & analyze more data.
2. Operational and analytical workloads are converging.
3. MongoDB is the key technology to operationalize the data lake.
MongoDB Enterprise Advanced
• MongoDB Enterprise Server: authentication, authorization, auditing, encryption (in flight & at rest)
• MongoDB Ops Manager: monitoring & alerting, query optimization, backup & recovery, automation & configuration, REST API
• MongoDB Compass: schema visualization, data exploration, ad-hoc queries
• MongoDB Connector for BI: visualization, analysis, reporting
• Commercial terms: 24x7 support (1-hour SLA), commercial license (no AGPL copyleft restrictions), platform certifications, emergency patches, customer success program, on-demand online training, warranty, limitation of liability, indemnification
Resources to Learn More
• Guide: Operational Data Lake
• Whitepaper: Real-Time Analytics with Apache Spark & MongoDB
For More Information
• Case Studies: mongodb.com/customers
• Presentations: mongodb.com/presentations
• Free Online Training: education.mongodb.com
• Webinars and Events: mongodb.com/events
• Documentation: docs.mongodb.org
• MongoDB Downloads: mongodb.com/download
• Additional Info: info@mongodb.com
One of the World’s Largest Banks
Creating new customer insights with MongoDB & Spark.

Problem:
• System failures in online banking systems creating customer-satisfaction issues
• No personalized experience across channels
• No enrichment of user data with social media chatter

Solution:
• Apache Flume to ingest log data & social media streams; Apache Spark to process log events
• MongoDB to persist log data and KPIs, and to immediately rebuild user sessions when a service fails
• Integration with the MongoDB query language and secondary indexes to selectively filter and query data in real time

Results:
• Improved user experience, with more customers using online, self-service channels
• Improved services following a deeper understanding of how users interact with systems
• Greater user insight by adding social media insights
Fare Calculation Engine
One of the world’s largest airlines migrates from Oracle to MongoDB and Apache Spark to support a 100x performance improvement.

Problem:
• China Eastern targets 130,000 seats sold every day across its web and mobile channels
• The new fare calculation engine needed to support 20,000 search queries per second, but the existing Oracle platform supported only 200 per second

Solution:
• Apache Spark used for fare calculations, using business rules stored in MongoDB
• Fare calculations written to MongoDB for access by the search application
• MongoDB Connector for Apache Spark allows seamless integration with data-locality awareness across the cluster

Results:
• A cluster of fewer than 20 API, Spark & MongoDB nodes supports 180m fare calculations & 1.6 billion searches per day
• Each node delivers 15x higher performance and 10x lower latency than the existing Oracle servers
• MongoDB Enterprise Advanced provided Ops Manager for operational automation and access to expert technical support
MongoDB Connector for Apache Spark
• Native Scala connector, certified by Databricks
• Exposes all Spark APIs & libraries
• Efficient data filtering with predicate pushdown, secondary indexes & in-database aggregations
• Locality awareness to reduce data movement

“We reduced 100+ lines of integration code to just a single line after moving to the MongoDB Spark connector.”
— Early Access Tester, Multi-National Banking Group
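A hedged sketch of what that integration looks like in practice. The format and option names below follow the 2.x connector and may differ in other versions; the URI, collection and field names are placeholders:

```python
# Sketch: load a MongoDB collection as a Spark DataFrame and filter it,
# letting the connector push the predicate down into MongoDB.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-pushdown")
         .config("spark.mongodb.input.uri",
                 "mongodb://localhost:27017/retail.customers")
         .getOrCreate())

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# The filter is pushed down to MongoDB as a query, so only matching
# documents -- selected via secondary indexes -- cross the network.
uk_customers = df.filter(df["country"] == "UK")
uk_customers.show()
```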
Query and Data Model
• Rich query language & secondary indexes — MongoDB: yes | Relational: yes | Column family (e.g. HBase): requires integration with a separate Spark/Hadoop cluster
• In-database aggregations & search — MongoDB: yes | Relational: yes | Column family: requires integration with a separate Spark/Hadoop cluster
• Dynamic schema — MongoDB: yes | Relational: no | Column family: partial
• Data validation — MongoDB: yes | Relational: yes | Column family: app-side code

Why it matters:
• Query & aggregations: rich, real-time analytics against operational data
• Dynamic schema: manage multi-structured data
• Data validation: enforce data governance between the data lake & operational apps
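As an illustration of data validation enforcing governance on documents flowing from the data lake, here is a minimal sketch using MongoDB's document validation. Collection and field names are assumptions, and the $jsonSchema form shown requires MongoDB 3.6+:

```python
# Sketch: reject documents missing mandatory single-view fields.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["customer360"]

db.create_collection("profiles", validator={
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["customer_id", "email"],
        "properties": {
            "customer_id": {"bsonType": "int"},   # ID is always an integer
            "email": {"bsonType": "string",
                      "pattern": "@"},            # must contain an @
        },
    }
})
```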
Data Lake Integration
• Hadoop + secondary indexes — MongoDB: yes | Relational: yes, but expensive | Column family (e.g. HBase): no secondary indexes
• Spark + secondary indexes — MongoDB: yes | Relational: yes, but expensive | Column family: no secondary indexes
• Native BI connectivity — MongoDB: yes | Relational: yes | Column family: 3rd-party connectors
• Workload isolation — MongoDB: yes | Relational: yes, but expensive | Column family: must load data into a separate Spark/Hadoop cluster

Why it matters:
• Hadoop + Spark: efficient data movement between the data lake, processing layer & database
• Native BI connectivity: visualizing operational data
• Workload isolation: separation between operational and analytical workloads
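Workload isolation in MongoDB can be sketched with replica set tags: analytics queries are directed at secondaries tagged for that purpose, keeping them off the members that serve the operational application. The hosts, replica set name and tag are assumptions, and assume the replica set has been configured with matching member tags:

```python
# Sketch: route an analytics aggregation to tagged secondaries only.
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient("mongodb://host1:27017,host2:27017/?replicaSet=rs0")
orders = client["retail"]["orders"]

analytics = orders.with_options(
    read_preference=Secondary(tag_sets=[{"workload": "analytics"}])
)
revenue = analytics.aggregate([
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}}
])
```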
Operationalizing for Scale & Security
• Robust security controls — MongoDB: yes | Relational: yes | Column family (e.g. HBase): yes
• Scale-out on commodity hardware — MongoDB: yes | Relational: no | Column family: yes
• Sophisticated management platform — MongoDB: yes | Relational: yes | Column family: monitoring only

Why it matters:
• Security: data protection for regulatory compliance
• Scale-out: grow with the data lake
• Management: reduce TCO with platform automation, monitoring, disaster recovery
We've seen rapid growth in adoption of the data lake – a centralized repository for the many new data sources organizations are now collecting.
But it has not come without challenges – the primary one is how to make the analytics generated by the data lake available to our real-time, operational apps.
So we are going to cover:
• The rise of the data lake
• The challenges in getting the most business value out of the data lake
• The role that databases play, and their requirements
• Case studies of organizations that are unlocking insight from the data lake
As enterprises bring more products and services online as part of digital transformation initiatives, one thing they don't lack today is data – from streams of sensor readings, to social sentiment, to machine logs, mobile apps, and more.
Analysts estimate volumes growing at 40% per annum, with 80% of all data unstructured.
At the same time, we see more pressure on time to market, on exposing apps to global audiences, and on reducing the cost of delivering new services.
These trends fundamentally change how enterprises build and run modern apps.
With all of this new data available, we are creating an insight economy.
Uncovering new insights by collecting and analyzing this data carries the promise of competitive advantage and efficiency savings: better understanding customers by predicting what they might buy based on behavior and demographics, optimizing the supply chain with better or faster routes, reducing the risk of fraud by identifying suspicious behavior – it's all about the data.
Those that don't harness data are at a major disadvantage.
Data lets you understand the past, monitor the present, and predict the future.
MIT: data-driven decision environments have 5% higher productivity, 6% higher profit and up to 50% higher market value than other businesses.
The traditional source of data for operational apps has been the data warehouse: take all of this data in, then create analytics from it.
However, the traditional Enterprise Data Warehouse (EDW) is straining under the load, overwhelmed by the sheer volume and variety of data pouring into the business. Costs run from hundreds to thousands of dollars per TB, versus tens to hundreds in commodity systems.
Because of these challenges, many organizations have turned to Hadoop as a centralized repository for this new data, creating what many call a data lake. It is not a replacement for the EDW but an adjunct: it stores all the new data and applies new analytics, which combine with the traditional reporting coming from the DW.
Gartner estimates around 50% of enterprises have rolled out, or are in the process of rolling out, data lakes.
When we think about data lakes, we think about big data, and big data is often associated with Hadoop – but the reality is it's more than just Hadoop.
Market growth forecast by Wikibon: “big data revenues” growing from $19bn in 2016 to $92bn in 2026, with software outpacing hardware and professional services. IDC forecasts just under $50bn by 2019, a 23% CAGR, with software growing fastest.
Hadoop and Spark lead the charge, closely followed by databases – a key part of the big data landscape, because they operationalize the data lake: the link between the back-end data lake and the front-end apps that consume analytics to make those apps smarter.
Hadoop is well established, celebrating its 10th anniversary this year.
It has grown from HDFS and MapReduce into dozens of projects – Gartner identifies 19 common projects supported by the 4 leading distros, and the average distro has many more: processing frameworks, search, provisioning and management, security, file formats, integration.
Each project is developed independently, with its own roadmap and its own dependencies – incredible complexity.
HDFS is the common storage layer, against which processing frameworks run to produce the outputs you see on the slide.
While something like 50% of enterprises either have or are evaluating Hadoop to create new classes of app, it is not without its challenges.
This appears in a number of Gartner analyses, and in the press.
One of the fundamental integration challenges is how to integrate the data lake with your operational systems.
Operational apps run the business – how do you expose the analytics created in the data lake to better serve customers with more relevant products and offers, or to drive efficiency savings from an IoT-enabled smart factory?
The answer: unify data lake analytics with the operational applications. That enables you to create smart, contextually aware, data-driven apps. An integrated database layer operationalizes the data lake.
The differences come in how data is stored, accessed and updated. Hadoop is a file system – it stores data in files, in blocks – and it has no knowledge of the underlying data: it has no indexes. If you want to access a specific record, you scan all the data stored in the file where the record is located – which could be tens of MBs.
HDFS characteristics: write-once-read-many (WORM) – i.e. to update customer data you rewrite all of that customer data, not just the individual customer's record.
Hadoop excels at generating analytics models by scanning and processing large datasets; it is not designed to provide real-time, random access to operational applications. In its design, the time to read the whole dataset is more important than the latency in reading the first record.
http://stackoverflow.com/questions/15675312/why-hdfs-is-write-once-and-read-multiple-times/37300268#37300268
But MongoDB is more than just a filesystem. It is a full database, so it gives you a whole set of things HDFS doesn't:
• Millisecond-latency query responsiveness.
• Random access to indexed subsets of data.
• Expressive querying & flexible indexing: supporting complex queries and aggregations against the data in real time, making online applications smarter and contextual.
• Updating fast-changing data in real time as users interact with online applications, without having to rewrite the entire data set (see the sketch below).
• Fine-grained access with complex filtering logic.
• Use distributed processing libraries against it – a MongoDB collection or document looks like an input or output in HDFS. Rather than loading a file, you load a DataFrame; Hive sees MongoDB as a table.
HDFS, by contrast, suits longer jobs and batch analytics, with append-only files – great for scanning all the data, or large subsets, in files.
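A minimal sketch of such an update-in-place (collection and field names are assumed): one indexed document is modified, rather than an append-only file being rewritten as HDFS requires:

```python
# Sketch: mutate a single session document in place.
from pymongo import MongoClient

sessions = MongoClient("mongodb://localhost:27017")["retail"]["sessions"]

sessions.update_one(
    {"session_id": "S-98127"},                 # indexed lookup
    {"$set": {"cart.last_item": "sku-4431"},   # change one field...
     "$inc": {"cart.item_count": 1}},          # ...and increment another
    upsert=True,
)
```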
The obvious question is why we need a database when we have Hadoop. It comes down to how each platform persists and accesses data. HDFS is a file system – it accesses data in batches of 128MB blocks. MongoDB is a database that provides fine-grained access to data at the level of individual records, which gives each system very different properties.
Despite those differences, there are lots of similarities in how we process data – MapReduce, Spark. These frameworks are unopinionated about the underlying persistence layer: it could be HDFS, it could be MongoDB. That means you can unify analytics across the data lake and your database.
Both MongoDB and HDFS provide common attributes: schema-on-read, multiple replicas for fault tolerance, horizontal scale, low TCO.
But they have different characteristics in how they store and access data, which means they are suited to different parts of the data lake deployment.
When you bring the database and the data lake together, you can build powerful, data-driven apps.
Take a real-life example – the data lake of a large retailer.
The online storefront and e-commerce engine are powered by MongoDB – handling customer profiles, sessions, baskets and product catalogs, and presenting recommendations and offers.
As customers browse the site, all of their activity is written back to Hadoop, where it is blended with other data sources – social feeds, demographics, market data, credit scores, currency feeds – to segment and cluster customers (regression and classification for customer clustering).
These results can then be exposed to MongoDB, so when customers come back they are presented with a personalized experience – based on what they have browsed before and what they are likely to want to purchase next.
You could not serve that operational app, which deals with individual customers, from HDFS: it is not real time, there are no indexes to access just the customer details you need, and there is no way of updating a customer record – everything is rewritten and recomputed.
Let's go deeper and wider.
This is a design pattern for the data lake – multiple components that collectively handle ingest, storage, processing and analysis of data, then serve it to the consuming operational apps. Stepping through it:
• Data ingestion: data streams are ingested into a pub/sub message queue, which routes all raw data into HDFS.
• Often there is also event processing running against the queue to find interesting events that need to be consumed by the operational apps immediately – displaying an offer to a user browsing a product page, or alarms generated against vehicle telemetry from an IoT app – which are routed to MongoDB for immediate consumption by operational applications.
• Raw data is loaded into the data lake, where Hadoop jobs – MapReduce or Spark – generate analytics models from the raw data; see the examples in the layer above HDFS.
• MongoDB exposes these models to the operational processes, serving indexed queries and updates against them with real-time latency.
• The distributed processing frameworks can re-compute analytics models against data stored in either HDFS or MongoDB, continuously flowing updates from the operational database into the analytics models (a sketch of such a recompute job follows).
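A rough sketch of that recompute step: a Spark job reads raw clickstream data from HDFS and live profiles from MongoDB, derives a model, and writes it back to MongoDB for the operational apps. Paths, names and the toy scoring logic are placeholders; the connector format/options follow the 2.x connector:

```python
# Sketch: recompute a model across HDFS and MongoDB, write back to MongoDB.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .config("spark.mongodb.input.uri",
                 "mongodb://localhost:27017/retail.profiles")
         .config("spark.mongodb.output.uri",
                 "mongodb://localhost:27017/retail.churn_scores")
         .getOrCreate())

clicks = spark.read.json("hdfs:///data/raw/clickstream/")
profiles = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# Toy "model": event counts per customer joined onto profile attributes.
scores = (clicks.groupBy("customer_id")
          .agg(F.count("*").alias("events"))
          .join(profiles, "customer_id"))

scores.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
```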
We will look at some examples of users who have deployed this type of design pattern a little later.
Beyond low-latency performance, there are specific requirements. You need much more than just a datastore: a fully-featured database serving as a system of record for online applications.
Tight integration between MongoDB and the data lake – minimize data movement between them and fully exploit the native capabilities of each part of the system.
You need to be able to serve operational workloads while running analytics against live operational data – e.g. the top trending articles right now so I know where to place my ads, or how many widgets coming off my production line are failing QA, and whether that is up or down against previous trends. Gartner calls this HTAP (Hybrid Transactional and Analytical Processing); Forrester calls it translytical. To do that you need a powerful query language, secondary indexes, and aggregations & transformations all within the database – not ETL into a warehouse.
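As a sketch of that kind of in-database, real-time analytics, the aggregation below computes the top trending articles over the last hour (collection and field names are assumptions):

```python
# Sketch: HTAP-style query executed natively in the database, no ETL.
from datetime import datetime, timedelta
from pymongo import MongoClient

views = MongoClient("mongodb://localhost:27017")["cms"]["page_views"]

trending = views.aggregate([
    {"$match": {"ts": {"$gte": datetime.utcnow() - timedelta(hours=1)}}},
    {"$group": {"_id": "$article_id", "views": {"$sum": 1}}},
    {"$sort": {"views": -1}},
    {"$limit": 10},
])
```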
Workload isolation between operational & analytics workloads, so they don't contend for the same resources.
A flexible schema to handle multi-structured data, with the ability to enforce governance over that data.
Secure access to the data: the operational DB is typically accessed by a much broader audience than Hadoop, so security controls are critical – robust access controls (LDAP, Kerberos, RBAC), auditing of all events for regulatory compliance, and encryption of data in motion and at rest, all built into the database.
The database needs to scale as the data lake scales – which means scaling out on commodity hardware, often across geographic regions.
To simplify the environment, you need sophisticated management tools: to automate database deployment, scaling, monitoring and alerting, and disaster recovery.
Tight integration: it is not enough just to move data between the analytics and operational layers – you need to move it efficiently. Connectors should allow selective filtering, using secondary indexes to extract and process only the range of data needed – for example, retrieving all customers located in a specific geography. This is very different from other databases that do not support secondary indexes: there, Spark and Hadoop jobs are limited to extracting all data based on a simple primary key, even if only a subset of that data is required for the query. That means more processing overhead, more hardware, and longer time-to-insight for the user.
Workload isolation: provision database clusters with dedicated analytics nodes, allowing users to simultaneously run real-time analytics and reporting queries against live data without impacting the nodes servicing the operational application.
A flexible data model stores data of any structure and lets you easily evolve the model to capture new attributes – e.g. enriching user profiles with geospatial data (sketched below). You also need to ensure data quality by enforcing validation rules against the data – to ensure it is appropriately typed and contains all the attributes the app needs.
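A minimal sketch of that geospatial enrichment (field names and coordinates are assumptions): add a location to an existing profile, index it, and query by proximity, with no schema migration required:

```python
# Sketch: enrich a profile with a location, then query it geospatially.
from pymongo import MongoClient

profiles = MongoClient("mongodb://localhost:27017")["customer360"]["profiles"]

profiles.update_one(
    {"customer_id": 1042},
    {"$set": {"location": {"type": "Point",
                           "coordinates": [-0.1276, 51.5072]}}},
)
profiles.create_index([("location", "2dsphere")])

nearby = profiles.find({"location": {
    "$near": {
        "$geometry": {"type": "Point", "coordinates": [-0.1276, 51.5072]},
        "$maxDistance": 5000,  # metres
    }
}})
```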
Expressive queries let developers build applications that can query and analyze the data in multiple ways – by single keys, ranges, text search and geospatial queries, through to complex aggregations and MapReduce jobs – returning responses in milliseconds. Complex queries execute natively in the database without additional analytics frameworks or tools, avoiding the latency that comes from moving data between operational and analytical engines. Secondary indexes give you the opportunity to filter data any way you need – key for low-latency operational queries.
Robust security controls: govern access, provide audit trails, and encrypt data in flight and at rest.
Scale-out: match the scale-out of the data lake – as it grows, add new nodes to service higher data volumes or user load.
Advanced management platform: to reduce data lake TCO and the risk of application downtime, powerful tooling to automate database deployment, scaling, monitoring and alerting, and disaster recovery.
Let's look at these examples in action.
CTM – the UK's leading price comparison site – moved from an on-prem, RDBMS-based monolithic app to a microservices architecture powered by MongoDB, with Hadoop at the back end providing analytics – enabling them to better personalize the customer experience and deepen relationships.
(Read through the bullets.)
The second example is a leading global airline. Through M&A it has multiple brands serving different countries and market sectors, but customer data was spread across 100+ different systems.
By using Hadoop and Spark, it brought that data together to create a single view, which is loaded into MongoDB to power the online apps – web and mobile, as well as the call center – so users get a consistent experience however they interact. All user data and ticket data is stored in MongoDB, then written back into Hadoop to run advanced analytics that enable ticket price optimization and identify offers and gaps in the product portfolio.
(Read the bullets.)
Prescient provides a traveler safety platform for corporate customers – if a natural disaster or security incident occurs while a traveler is away on business, it can send real-time alerts and advise on how to get to safety.
The platform was built for national governments and has now launched for commercial use – analyzing PBs of data with MongoDB, Hadoop, Apache NiFi & SAP HANA.
(Read the bullets.)
McAfee built its cloud-based threat intelligence platform on MongoDB. The platform monitors threat activity for clients in real time – identifying attacks as they take place and spotting when users may be interacting with insecure or suspicious sites.
All real-time activity is captured in MongoDB, which provides alerting to security teams; the data is sent to Hadoop for further back-end analytics, with updated threat profiles written back to MongoDB.
MongoDB is open source – we also provide Enterprise Advanced: a collection of software and support for running in production at scale.
The Stratio Apache Spark-certified Big Data (BD) platform is used by an impressive client list including BBVA, Just Eat, Santander, SAP, Sony, and Telefonica. The company has implemented a unified real-time monitoring platform for a multinational banking group operating in 31 countries with 51 million clients all over the world. The bank wanted to ensure a high quality of service and personalized experience across its online channels, and needed to continuously monitor client activity to check service response times and identify potential issues. The application was built on a modern technology foundation including:
• Apache Flume to aggregate log data
• Apache Spark to process log events in real time
• MongoDB to persist log data, processed events and Key Performance Indicators (KPIs)
The aggregated KPIs, stored by MongoDB, enable the bank to analyze client and systems behavior in real time in order to improve the customer experience. Collecting raw log data allows the bank to immediately rebuild user sessions if a service fails, with analysis generated by MongoDB and Spark providing complete traceability to quickly identify the root cause of any issue.
The project required a database that provided always-on availability, high performance, and linear scalability. In addition, a fully dynamic schema was needed to support high volumes of rapidly changing semi-structured and unstructured JSON data being ingested from a variety of logs, clickstreams, and social networks. After evaluating the project's requirements, Stratio concluded MongoDB was the best fit. With MongoDB's query projections and secondary indexes, analytic processes run by the Stratio BD platform avoid the need to scan the entire data set, which is not the case with other databases.
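A small sketch of what those query projections and secondary indexes look like (collection and field names are assumptions): the index serves the filter and the projection trims each returned document, so the full data set is never scanned or shipped:

```python
# Sketch: indexed filter plus projection, avoiding a full collection scan.
from pymongo import MongoClient

logs = MongoClient("mongodb://localhost:27017")["monitoring"]["sessions"]

logs.create_index("status")
failed = logs.find(
    {"status": "failed"},                      # served by the index
    {"_id": 0, "session_id": 1, "kpis": 1},    # projection: trim each doc
)
```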
China Eastern – Industry: Travel and Hospitality, Airline – Use Case: Search
While it's important to provide low-latency access to data, it is not enough to just support simple key-value lookups – the demand is to get insights from data faster. This is the role of real-time analytics: track in real time where the vehicles in your fleet are, gauge the social sentiment around an announcement you've just made, correlate patterns of real-time fraud attempts against specific domains. This is where an expressive query language, secondary indexes and in-database aggregations are valuable.
MongoDB and RDBMSs both have strong features here – the RDBMS is further ahead – while column family stores are little more than key-value: you need to move data out to other query frameworks or analytics nodes to get any intelligence, which adds latency and complexity – more moving parts.
The RDBMS is good in many areas, but where it falls down is the data model flexibility needed to handle rapidly changing, multi-structured data.
Column family stores offer more schema flexibility than relational, but you still need to pre-define column families, which restricts the speed at which you can evolve apps.
Data validation applies rules to the data structures the operational database stores. Say an app creates a single view of your customer: data may be spread across many repositories, is loaded into the data lake to create the single view, then loaded into MongoDB to serve operational apps. You need to ensure documents contain mandatory fields such as unique customer identifiers, typed and formed in a specific way – e.g. the ID is always an integer, the email address always contains an @. Document validation in MongoDB enables you to do this. The RDBMS has full schema validation, so it is a little ahead; in a column family database you have to enforce governance in application code.
Looking at the aggregated scores: relational and MongoDB are evenly matched, while column family – a much simpler datastore – is a long way behind.
Hadoop and Spark integration: you need to do more than just move vast amounts of data between each layer of the stack – you need intelligent connectors that can push down predicates, filter data with secondary indexes (e.g. access all customers in a specific geography) and pre-aggregate data. Without access to the database's secondary indexes, you end up moving a ton of data backward and forward – more processing cycles, longer latency.
The MongoDB Connector for Hadoop, and the Connector for Spark, both support these capabilities. Column family stores don't offer secondary indexes or in-database aggregations, so there is nothing to filter the data.
The RDBMS offers these capabilities in its connectors, but they are generally only available as expensive add-ons, hence the downgraded score.
Workload isolation is the ability to perform real-time analytics on live operational data without interfering with the operational apps – you don't want an aggregation counting how many deliveries your fleet of trucks has made to contend with how quickly you can detect from sensor data that a vehicle has developed a fault. The key is to distribute queries to dedicated nodes in the database cluster – some provisioned for operational work, replicating to nodes dedicated to analytics. MongoDB supports up to 50 members in a single replica set; configure the analytics nodes as hidden so they are never hit by operational queries. Column family stores are restricted to just 3 data replicas – there for HA, not for separating different workloads. In the RDBMS it is an expensive add-on.
Native BI connectivity may not be relevant in all cases, but many organizations want to create live dashboards reporting the current state of operational systems. MongoDB has a native BI connector that exposes the database as an ODBC data source – visualize it in anything from Tableau to BusinessObjects to Excel. There is rich tooling in the relational world. For column family stores, connectors exist but are 3rd-party and don't push queries down to the database; instead they extract all the data – more computationally and network intensive for powering dashboards.
Security: data from operational databases is exposed to apps and potentially millions of users – you need robust access controls, which may include integration with LDAP, Kerberos and PKI environments, plus RBAC to tightly segregate who can do what in the DB; encryption of data in flight and at rest; and a log of activity in the DB for forensic analysis.
All solutions do well here – there has been big investment in the Hadoop ecosystem, rapidly gaining ground on the RDBMS, and doing it at much lower cost.
Scale-out: the database needs to scale as the data lake scales and as more digital services are opened up to users – a core strength of non-relational databases. The fundamental challenge with an RDBMS is that it requires scale-up: limited headroom, and very expensive proprietary hardware.
Management: Hadoop is complex and its management tools are still primitive. For the operational database, you need a platform with powerful tooling to automate database deployment, scaling, fine-grained monitoring and alerting, and disaster recovery with point-in-time backups and automated restores. There is rich tooling in the relational world – and big investment from MongoDB to close that gap.
On the left-hand side are the maintained attributes of relational databases, blended with innovation from NoSQL – this uniquely differentiates MongoDB from its peers in the non-relational DB market.
Invest in technology that has production-proven deployments and broad skills availability.
With the availability of Hadoop skills cited by Gartner analysts as a top challenge, it is essential you choose an operational database with a large available talent pool. This enables you to find staff who can rapidly build differentiated big data applications. Across multiple measures – including the DB-Engines rankings, the 451 Group NoSQL Skills Index and the Gartner Magic Quadrant for Operational Databases – MongoDB is the leading non-relational database.