SlideShare a Scribd company logo
1 of 54
Download to read offline
ADDRESSING DATA MANAGEMENT ON THE CLOUD:
TACKLINGTHE BIG DATA CHALLENGES
GENOVEVAVARGAS SOLAR
FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
Genoveva.Vargas@imag.fr
http://www.vargas-solar.com
DIGITAL INFORMATION SCALE
Unit Size Meaning
Bit (b) 1 or 0 Short for binary digit, after the binary code (1 or 0) computers use to store and process data
Byte (B) 8 bits Enough information to create an English letter or number in computer code. It is the basic unit of
computing
Kilobyte (KB) 210 bytes From “thousand’ in Greek. One page of typed text is 2KB
Megabyte (MB) 220 bytes From “large” in Greek.The complete works of Shakespeare total 5 MB.A typical pop song is 4 MB.
Gigabyte (GB) 230 bytes From “giant” in Greek.A two-hour film ca be compressed into 1-2 GB.
Terabyte (TB) 240 bytes From “monster” in Greek.All the catalogued books in America’s Library of Congress total 15TB.
Petabyte (PB) 250 bytes All letters delivered in America’s postal service this year will amount ca. 5PB. Google processes
1PB per hour.
Exabyte (EB) 260 bytes Equivalent to 10 billion copies of The Economist
Zettabyte (ZB) 270 bytes The total amount of information in existence this year is forecast to be around 1,27ZB
Yottabyte (YB) 280 bytes Currently too big to imagine
Megabyte(106) Gigabyte (109)Terabytes (1012), petabytes (1015), exabytes (1018) and zettabytes (1021)
2
MASSIVE DATA
Data sources
¡  Information-sensing mobile devices, aerial sensory
technologies (remote sensing)
¡  Software logs, posts to social media sites
¡  Telescopes, cameras, microphones, digital pictures
and videos posted online
¡  Transaction records of online purchases
¡  RFID readers, wireless sensor networks (number
of sensors increasing by 30% a year)
¡  Cell phone GPS signals (increasing 20% a year)
Massive data
¡  Sloan Digital Sky Survey (2000-) its archive contains 140
terabytes. Its successor: Large Synoptic Survey Telescope
(2016-), will acquire that quantity of data every five days !
¡  Facebook hosts 140 billion photos, and will add 70 billion
this year (ca. 1 petabyte). Every 2 minutes today we snap as
many photos as the whole of humanity took in the 1800s !
¡  Wal-Mart, handles more than 1 million customer
transactions every hour, feeding databases estimated at
more than 2.5 petabytes (1015)
¡  The Large Hadron Collider (LHC): nearly 15 million billion
bytes per year -15 petabytes (1015).These data require
70,000 processors to be processed!
http://blog.websourcing.fr/infographie-­‐la-­‐vrai-­‐taille-­‐dinternet/	
  
3
WHO PRODUCES DATA ? DIGITAL SHADOW
¡  3,146 billion mail addresses, of which 360 million Hotmail
¡  95.5 million domains.com
¡  2.1 billion Internet users, including 922 million in Asia and 485 million in China
¡  2.4 billion accounts on social networks
¡  1000 billion videos viewed onYouTube, about 140 average per capita on the planet!
¡  60 images posted on Instagram every second
¡  250 million tweets per day in October
75% of the information is generated by individuals — writing documents, taking pictures,
downloading music, etc. — but is far less than the amount of information being created about
them in the digital universe
4
BIG DATA
¡  Collection of data sets so large and complex that it becomes difficult to process using on-hand database management
tools or traditional data processing applications.
¡  Challenges include capture, curation, storage, search, sharing, analysis, and visualization within a tolerable elapsed
time
¡  Data growth challenges and opportunities are three-dimensional (3Vs model)
¡  increasing volume (amount of data)
¡  velocity (speed of data in and out)
¡  variety (range of data types and sources)
"Big data are high volume, high velocity, and/or high variety information assets that require new forms of
processing to enable enhanced decision making, insight discovery and process optimization.”
5
AGENDA
¡  Big data all around
¡  Building the boat
¡  Cloud architectures
¡  Storage vs.Access
¡  Data management trends
¡  Big data economic and technological phenomenon
¡  Outlook
6
DATA MANAGEMENT WITH RESOURCES CONSTRAINTS
STORAGE	
  
SUPPORT	
  
Systems	
  
ARCHITECTURE	
  &	
  
RESOURCES	
  AWARE	
  
RAM	
  
Algorithms	
  
Efficiently manage and exploit data sets according to given specific storage, memory and computation
resources
7
DATA MANAGEMENT WITHOUT RESOURCES CONSTRAINTS
8
Costly manage and exploit data sets according to unlimited storage, memory and computation resources
Systems	
  Algorithms	
  
COSTAWARE	
  
ELASTIC	
  
CLOUD IN A NUTSHELL
9
DEPLOYMENT
MODELS
PIVOT
ISSUES
DELIVERY
MODELS
1.  Cloud Software as a Service (SaaS)
2.  Cloud Platform as a Service (PaaS)
3.  Cloud Infrastructure as a Service (IaaS)
1.  Private cloud
2.  Public cloud
3.  Hybrid cloud
4.  Community Cloud
1.  Virtualization
2.  Autonomics (automation)
3.  Grid Computing (job scheduling)
10
CLOUD COMPUTINGPackaged
Software
Storage
Servers
Networking
O/S
Middleware
Virtualization
Data
Applications
Runtime
Youmanage
Infrastructure
(as a Service)
Storage
Servers
Networking
O/S
Middleware
Virtualization
Data
Applications
Runtime
Managedbyvendor
Youmanage
Platform
(as a Service)
Managedbyvendor
Youmanage
Storage
Servers
Networking
O/S
Middleware
Virtualization
Applications
Runtime
Data
Software
(as a Service)
Managedbyvendor
Storage
Servers
Networking
O/S
Middleware
Virtualization
Applications
Runtime
Data
CLOUD COMPUTING CHARACTERISTICS
¡  On-demand self-service
¡  Available computing capabilities without human interaction
¡  Broad network access
¡  Accessibility from heterogeneous clients (mobile phones, laptops, etc.)
¡  Rapid elasticity
¡  Possibility to automatically and rapidly increase or decrease capabilities
¡  Measured service
¡  Automatic control and optimization of resources
¡  Resource pooling (virtualization, multi-location)
¡  Multi-tenant resources (CPU, disk, memory, network) with a sense of location independence
D	
  
C	
  
Time	
  
R	
  
11
AGENDA
¡  Big data all around
¡  Building the boat
¡  Cloud architectures
¡  Storage vs.Access
¡  Data management trends
¡  Big data economic and technological phenomenon
¡  Outlook
12
13
¡  Build a MyNet app based on a distributed database for building an integrated contact
directory and posts from social networks
MyNet	
  DB	
  MyNet	
  App	
   Social	
  network	
  
CLASSIC CLOUD AWARE SOLUTION
A step on the cloud (http://gemynetproject.cloudapp.net:8080/)
¡  creating a database on a relational DBMS deployed on the cloud
¡  building a simple database application exported as a service
¡  deploying the service on the cloud and implement an interface
14
MyNet	
  DB	
  
MyNet	
  App	
  
Contact'
idContact:"Integer"
firstName:"String"
lastName:"String"
Service	
  
App	
  
15
DEALING WITH HUGE AMOUNTS OF DATA
Peta	
  1015	
  
Exa	
  1018	
  
Zetta	
  1021	
  
Yota	
  1024	
  
RAID	
  
Disk	
  
Cloud	
  
16
DEALING WITH HUGE AMOUNTS OF DATA
17
Peta	
  1015	
  
Exa	
  1018	
  
Zetta	
  1021	
  
Yota	
  1024	
  
RAID	
  
Disk	
  
Cloud	
  
Concurrency	
  
Consistency	
  
Atomicity	
  
Relational	
  
Graph	
  
Key	
  value	
  
Columns	
  
18
19
POLYGLOT PERSISTENCE
¡  Polyglot Programming: applications should be written in a mix of languages to take
advantage of different languages are suitable for tackling different problems
¡  Polyglot persistence: any decent sized enterprise will have a variety of different data
storage technologies for different kinds of data
¡  a new strategic enterprise application should no longer be built assuming a relational
persistence support
¡  the relational option might be the right one - but you should seriously look at other
alternatives
20
M.	
  Fowler	
  and	
  P.	
  Sadalage.	
  NoSQL	
  Distilled:	
  A	
  Brief	
  Guide	
  to	
  the	
  Emerging	
  World	
  of	
  Polyglot	
  Persistence.	
  Pearson	
  Education,	
  Limited,	
  2012	
  
WHEN IS POLYGLOT PERSISTENCE PERTINENT?
¡  Application essentially composing and serving web pages
¡  They only looked up page elements by ID, they had no need for transactions, and no need to
share their database
¡  A problem like this is much better suited to a key-value store than the corporate relational
hammer they had to use
¡  Scaling to lots of traffic gets harder and harder to do with vertical scaling
¡  Many NoSQL databases are designed to operate over clusters
¡  They can tackle larger volumes of traffic and data than is realistic with a single server
21
22
(Katsov-2012)
Use the right tool for the right job…
How do I know which is the
right tool for the right job?
23
GENERATING NOSQL PROGRAMS FROM HIGH LEVEL
ABSTRACTIONS
24
UML class diagram
Spring Roo
Java	
  
web	
  
App	
   Spring Data
Graph database
Relational database
High-level abstractions
Low-level abstractions
http://code.google.com/p/model2roo/	
  	
  
DATA DISTRIBUTION
¡  sharding a relational DBMS
¡  dealing with shards synchronization
¡  deploying the service on the cloud and implement an interface
25
MyNet	
  DB	
  
MyNet	
  
Service	
  
webSite:(URI(
socialNetworkID:(URI((
(
Basic(Info(
Contact'
idContact:(Integer(
lastName:(String(
givenName:(String(
society:(String( <<'hasBasicInfo'>>'
1' *'
postID:(Integer(
timeStamp:(Date(
geoStamp:(String(
Post'
contentID:(Integer(
text:(String(
Content'
<<'hasContent'>>'
<<publishesPost>>'
*'
1'
1' 1'
street:(String,((
number:(Integer,(
City:(Sting,((
Zipcode:(Integer(
Address'
*'
email:(String(
type:({pers,(prof}(
Email'
<<'hasAddress'>>'
*'
1'
<<'hasEmail'>>'
ID:(Integer,(
Lang:(String(
Language'
*'
<<'speaksLanguage'>>'
Français	
  
Español	
  
English	
  
Others	
  
REST	
  
MyNetContacts	
  
Internet services
REST	
  
JSON	
  documents	
  
Contact	
  directory	
  
MyNet	
  
ANALYSIS LAYER
(SOFTWARE AS A SERVICE)
INTEGRATION LAYER
(DATABASE AS A SERVICE)
MyPosts	
  
POLYGLOT APPROACH TO PERSISTENCE
26
EXTRACTING SCHEMATA FROM NOSQL PROGRAMS CODE
http://code.google.com/p/exschema/	
  	
  
27
Metalayer(
Struct( Relationship(
Attribute(Set(
*( *(
*(
*(
*( *(
*(
*(*(
*(
*(
*(
*(
*(
Declarations	
  
analyzer	
  
Updates	
  
analyzer	
  
Repositories	
  
analyzer	
  
Schema1 Schema2
Schema3
Set
Attribute
implementation : Spring Repository
Set
Attribute
name : fr.imag.twitter.domain.UserInfo
Struct
Attribute
name : userId
Attribute
name : firstName
Attribute
name : lastName
Set
Attribute
implementation : Spring Repository
Set
Attribute
name : fr.imag.twitter.domain.Tweet
Struct
Attribute
name : text
Attribute
name : userId
Set
Attribute
implementation : Neo4j
Struct
Attribute
name : fr.imag.twitter.domain.User
Str
Attribute
name : nodeId
Attribute
name : userName
A
na
Relatio
start end
Schema1 Schema2
Set
Attribute
implementation : Spring Repository
Set
Attribute
name : fr.imag.twitter.domain.UserInfo
Struct
Attribute
name : userId
Attribute
name : firstName
Attribute
name : lastName
Set
Attribute
implementation : Spring Repository
Attribute
name : fr.imag.twitter.domain.Tweet
Attribute
name : text
Attri
name :
Schema1 Schema2
Schema3
Set
Attribute
implementation : Spring Repository
Set
Attribute
name : fr.imag.twitter.domain.UserInfo
Struct
Attribute
name : userId
Attribute
name : firstName
Attribute
name : lastName
Set
Attribute
implementation : Spring Repository
Set
Attribute
name : fr.imag.twitter.domain.Tweet
Struct
Attribute
name : text
Attribute
name : userId
Set
Attribute
implementation : Neo4j
Struct
Attribute
name : fr.imag.twitter.domain.User
Struct
Attribute
name : nodeId
Attribute
name : userName
Attribute
name : userId
Attribute
name : password
Relationship
start end
Attribute
name : followers
Spring Data
AGENDA
¡  Big data all around
¡  Building the boat
¡  Cloud architectures
¡  Storage vs.Access
¡  Data management trends
¡  Big data economic and technological phenomenon
¡  Outlook
28
HOW ARE THIS DATA EXPLOITED?
Domain : Meteorology, genomics, complex physics simulations, biological and environmental research,
Internet search,Web mining, Social networks, finance and business informatics,Archaeological, Culture
¡  Tesco collects 1.5 billion pieces of data every month and uses them to adjust prices and promotions
¡  Williams-Sonoma uses its knowledge of its 60 million customers (which includes such details as their income
and the value of their houses) to produce different iterations of its catalogue
¡  30% of Amazon's sales are generated by its recommendation engine (“you may also like”)
¡  Placecast is developing technologies that allow them to track potential consumers and send them enticing
offers when they get within a few yards of a Starbucks
29
DATA ACCESS REQUIREMENTS
¡  Capture, search, discovery, and analysis tools for unstructured data (90% of the digital universe)
¡  Metadata management
¡  Most of tools can create data about data automatically.
¡  Metadata, is growing twice as fast as the digital universe as a whole
¡  Business intelligence tools (also with real-time data)
¡  Storage management tools (duplication, auto-tiering, and virtualization, what exactly to store)
¡  Security practices and tools
¡  identify the information that needs to be secured and at what level of security
¡  specific threat protection devices and software, fraud management systems, or reputation protection services
30
BIG LINKED DATA
¡  Is about using the Web to
¡  connect related data that wasn't previously linked
¡  lower the barriers to linking data currently linked using
other methods
¡  Exposing, sharing, and connecting pieces of data,
information, and knowledge on the Semantic Web using
URIs and RDF
¡  A typical case of a large Linked Dataset is DBPedia,
which, essentially, makes the content of Wikipedia
available in RDF
31
… WITH RESOURCES CONSTRAINTS
32
Q1:Which are the most popular products at Starbucks ?
Q2:Which are the consumption rules of
Starbucks clients ?
Distribution and organization of
data on disk
Query and data processing
on server
Swap memory– disk
Data transfer
•  Efficiency	
  =>	
  time	
  cost	
  
•  Optimizing	
  memory	
  and	
  computing	
  
cost	
  
Efficiently manage and exploit data sets according to given specific storage, memory and computation
resources
WITHOUT RESOURCES CONSTRAINTS …
33
Costly	
  =>	
  minimizing	
  cost,	
  energy	
  
consumption	
  
¡  Query evaluationà How and under which limits ?
¡  Is not longer completely constraint by resources availability: computing, RAM, storage, network services
¡  Decision making process determined by resources consumption and consumer requirements
¡  Data involved in the query, particularly in the result can have different costs: top 5 gratis and the rest available
in return to a credit card number
¡  Results storage and exploitation demands more resources
AGENDA
¡  Big data all around
¡  Building the boat
¡  Cloud architectures
¡  Storage vs.Access
¡  Data management trends
¡  Big data economic and technological phenomenon
¡  Outlook
34
DATA MANAGEMENT SYSTEMS ARCHITECTURES
Physical	
  model	
  
Logic	
  model	
  
External	
  model	
  
ANSI/SPARC	
  
Storage	
  
Manager	
  
Schema	
  
Manager	
  
Query	
  
Engine	
  
Transaction	
  
Manager	
  
DBMS	
  
Customisable	
  points	
  
Custom	
  components	
  
Glue	
  
code	
  
Data	
  	
  
services	
  
Access	
  
services	
  
Storage	
  
services	
  
Additional	
  
Extension	
  
services	
  
Other	
  
services	
  
Extension	
  services	
  
Streaming, XML, procedures,
queries, replication
Physical	
  model	
  
Logic	
  model	
  
External	
  model	
  
Physical	
  model	
  
Logic	
  model	
  
External	
  model	
  
35
B-TREE: PHYSICAL LEVEL
36
«MEMCACHED»
¡  «memcached» is a memory management protocol based on a cache:
¡  Uses the key-value notion
¡  Information is completly stored in RAM
¡  «memcached» protocol for:
¡  Creating, retrieving, updating, and deleting information from the database
¡  Applications with their own «memcached» manager (Google, Facebook,
YouTube, FarmVille,Twitter,Wikipedia)
37
STORAGE ON DISC (1)
¡  For efficiency reasons, information is stored using the RAM:
¡  Work information is in RAM in order to answer to low latency requests
¡  Yet, this is not always possible and desirable
Ø  The process of moving data from RAM to disc is called "eviction”; this
process is configured automatically for every bucket
38
STORAGE ON DISC (2)
¡  NoSQL servers support the storage of key-value pairs on disc:
¡  Persistency–can be executed by loading data, closing and reinitializing it
without having to load data from another source
¡  Hot backups– loaded data are sotred on disc so that it can be
reinitialized in case of failures
¡  Storage on disc– the disc is used when the quantity of data is higher
thant the physical size of the RAM, frequently used information is
maintained in RAM and the rest es stored on disc
39
STORAGE ON DISC (3)
¡  Strategies for ensuring:
¡  Each node maintains in RAM information on the key-value pairs it stores. Keys:
¡  may not be found, or
¡  they can be stored in memory or on disc
¡  The process of moving information from RAM to disc is asynchronous:
¡  The server can continue processing new requests
¡  A queue manages requests to disc
Ø  In periods with a lot of writing requests, clients can be notified that the server is
termporaly out of memory until information is evicted
40
REPLICATION
41
EXECUTION MODEL: MAP-REDUCE
¡  Programming model for expressing distributed computations on massive amounts of data and an execution
framework for large-scale data processing on clusters of commodity servers
¡  Open-source implementation called Hadoop, whose development was led byYahoo (now an Apache project)
¡  Divide and conquer principle: partition a large problem into smaller subproblems
¡  To the extent that the sub-problems are independent, they can be tackled in parallel by different workers
¡  intermediate results from each individual worker are then combined to yield the final output
¡  Large-data processing requires bringing data and code together for computation to occur:
à no small feat for datasets that are terabytes and perhaps petabytes in size!
42
MAP-REDUCE ELEMENTS
¡  Stage 1:Apply a user-specified computation over all input records in a dataset.
¡  These operations occur in parallel and yield intermediate output (key-value couples)
¡  Stage 2:Aggregate intermediate output by another user-specified computation
¡  Recursively applies a function on every pair of the list
¡  Execution framework coordinates the actual processing
¡  Implementation of the programming model and the execution framework
43
MAP REDUCE EXAMPLE
44
Count the number of occurrences of every word in a text collection
EXECUTION FRAMEWORK
¡  Important idea behind MapReduce is separating the what of distributed processing from the how
¡  A MapReduce program (job) consists of
¡  code for mappers and reducers packaged together with
¡  configuration parameters (such as where the input lies and where the output should be stored)
¡  Execution framework responsibilities: scheduling
¡  Each MapReduce job is divided into smaller units called tasks
¡  In large jobs, the total number of tasks may exceed the number of tasks that can be run on the cluster concurrently à
manage tasks queues
¡  Coordination among tasks belonging to different jobs
45
MAP-REDUCE ADDITIONAL ELEMENTS
¡  Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to
reducers
¡  the partitioner species the task to which an intermediate key-value pair must be copied
¡  Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle and sort phase
46
AGENDA
¡  Big data all around
¡  Building the boat
¡  Cloud architectures
¡  Storage vs.Access
¡  Data management trends
¡  Big data economic and technological phenomenon
¡  Outlook
47
BIGVALUE FROM BIG DATA
¡  Big data is not only about the original content stored or being consumed but also about the
information around its consumption.
¡  Gigabytes of stored content can generate a petabyte or more of transient data that we typically don't
store digital TV signals we watch but don't record voice calls that are made digital in the network
backbone during the duration of a call
¡  Big data technologies describe a new generation of technologies and architectures, designed to
economically extract value from very large volumes of a wide variety of data, by enabling
high-velocity capture, discovery, and/or analysis
http://www.emc.com/collateral/analyst-­‐reports/idc-­‐extracting-­‐value-­‐from-­‐chaos-­‐ar.pdf	
  
Big data is not a "thing” but instead a dynamic/activity that crosses many IT borders	

48
WHAT ARE THE FORCES BEHIND THE EXPLOSIVE GROWTH OF
THE DIGITAL UNIVERSE?
¡  Technology has helped by driving the cost of creating, capturing, managing, and storing
information
¡  … Prime mover is financial
¡  Since 2005, the investment by enterprises in the digital universe has increased 50% — to $4
trillion
¡  spent on hardware, software, services, and staff to create, manage, and store — and derive revenues from
the digital universe
¡  the trick is to generate value by extracting the right information from the digital universe
49
BIG DATA TECH ECONOMY
¡  Data-driven world guided by a rapid ROD (Return on Data)
à reducing cost, complexity, risk and increasing the value of your holdings thanks to a mastery of
technologies
¡  The key is how quickly data can be turned in to currency by:
¡  Analysing patterns and spotting relationships/trends that enable decisions to be made faster with more
precision and confidence
¡  Identifying actions and bits of information that are out of compliance with company policies can avoid
millions in fines
¡  Proactively reducing the amount of data you pay ($18,750/gigabyte to review in eDiscovery) by identifying
only the relevant pieces of information
¡  Optimizing storage by deleting or offloading non-critical assets to cheaper cloud storage thus saving
millions in archive solutions
50
AGENDA
¡  Big data all around
¡  Building the boat
¡  Cloud architectures
¡  Storage vs.Access
¡  Data management trends
¡  Big data economic and technological phenomenon
¡  Outlook
51
OUTLOOK
¡  Good news is Big Data is here
¡  Bad news is we are struggling to capture, storage, search, share, analyze, and visualize it
¡  Current technology is not adequate to cope with such large amounts of data (requiring massively parallel
software running on tens, hundreds, or even thousands of servers)
¡  Expertise in a wide range of topics including
¡  Machine architectures (HPC), Service-oriented-architecture
¡  Distributed/cloud computing, operating and database systems
¡  Software Engineering, algorithmic techniques and networking
“With hardware, networks and software having been commoditized to the point that all are essentially free, it was inevitable that the
trajectory toward maximum entropy would bring us the current age of Big Data.”
-Shomit Ghose, Big Data,The Only Business ModelTech has Left; CIO Network
http://www.bigdatabytes.com/big-data-is-the-new-tech-economy/
52
Contact: GenovevaVargas-Solar, CNRS, LIG-LAFMIA
Genoveva.Vargas@imag.fr	
  
http://vargas-­‐solar.imag.fr	
  	
  
http://code.google.com/p/exschema/	
  	
  
Want to put cloud data management in practice ?
cf. http://vargas-­‐solar.imag.fr/academika/cloud-­‐data-­‐management/	
  	
  
53
http://code.google.com/p/model2roo/	
  	
  
Open source polyglot persistence tools
54

More Related Content

What's hot

Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache HadoopSuman Saurabh
 
introduction to big data frameworks
introduction to big data frameworksintroduction to big data frameworks
introduction to big data frameworksAmal Targhi
 
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageGeospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageSteven Ramage
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyNishant Gandhi
 
Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation17aroumougamh
 
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012Gigaom
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big DataMatthew Dennis
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)eXascale Infolab
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Simplilearn
 
The evolution of data analytics
The evolution of data analyticsThe evolution of data analytics
The evolution of data analyticsNatalino Busa
 
Aginity Big Data Research Lab V3
Aginity Big Data Research Lab V3Aginity Big Data Research Lab V3
Aginity Big Data Research Lab V3mcacicio
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop IntroductionJayant Mukherjee
 

What's hot (20)

Big data abstract
Big data abstractBig data abstract
Big data abstract
 
Bigdata " new level"
Bigdata " new level"Bigdata " new level"
Bigdata " new level"
 
Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache Hadoop
 
introduction to big data frameworks
introduction to big data frameworksintroduction to big data frameworks
introduction to big data frameworks
 
A Brief History Of Data
A Brief History Of DataA Brief History Of Data
A Brief History Of Data
 
Big data frameworks
Big data frameworksBig data frameworks
Big data frameworks
 
Big Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning GuruBig Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning Guru
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageGeospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Big Data Final Presentation
Big Data Final PresentationBig Data Final Presentation
Big Data Final Presentation
 
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
 
A Big Data Concept
A Big Data ConceptA Big Data Concept
A Big Data Concept
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 
The evolution of data analytics
The evolution of data analyticsThe evolution of data analytics
The evolution of data analytics
 
Aginity Big Data Research Lab V3
Aginity Big Data Research Lab V3Aginity Big Data Research Lab V3
Aginity Big Data Research Lab V3
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 

Similar to Addressing dm-cloud

Big Data to SMART Data : Process Scenario
Big Data to SMART Data : Process ScenarioBig Data to SMART Data : Process Scenario
Big Data to SMART Data : Process ScenarioCHAKER ALLAOUI
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyRohit Dubey
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
Research issues in the big data and its Challenges
Research issues in the big data and its ChallengesResearch issues in the big data and its Challenges
Research issues in the big data and its ChallengesKathirvel Ayyaswamy
 
Mongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner AltinMongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner Altinmustafa sarac
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-HadoopNagarjuna D.N
 
Introduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big DataIntroduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big Datawaheed751
 
History of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setHistory of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setSoner Altin
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big dataSitaram Kotnis
 
Big Data Story - From An Engineer's Perspective
Big Data Story - From An Engineer's PerspectiveBig Data Story - From An Engineer's Perspective
Big Data Story - From An Engineer's PerspectiveHien Luu
 
Big data and Internet
Big data and InternetBig data and Internet
Big data and InternetSanoj Kumar
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptalmaraniabwmalk
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 

Similar to Addressing dm-cloud (20)

1 mapreduce-fest
1 mapreduce-fest1 mapreduce-fest
1 mapreduce-fest
 
Big Data to SMART Data : Process Scenario
Big Data to SMART Data : Process ScenarioBig Data to SMART Data : Process Scenario
Big Data to SMART Data : Process Scenario
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Research issues in the big data and its Challenges
Research issues in the big data and its ChallengesResearch issues in the big data and its Challenges
Research issues in the big data and its Challenges
 
Mongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner AltinMongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner Altin
 
Introduction Big data
Introduction Big data  Introduction Big data
Introduction Big data
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Big Data and OSS at IBM
Big Data and OSS at IBMBig Data and OSS at IBM
Big Data and OSS at IBM
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
 
Introduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big DataIntroduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big Data
 
History of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setHistory of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature set
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
Big Data Story - From An Engineer's Perspective
Big Data Story - From An Engineer's PerspectiveBig Data Story - From An Engineer's Perspective
Big Data Story - From An Engineer's Perspective
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Big data and Internet
Big data and InternetBig data and Internet
Big data and Internet
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
NoSQL Basics - a quick tour
NoSQL Basics - a quick tourNoSQL Basics - a quick tour
NoSQL Basics - a quick tour
 

More from Genoveva Vargas-Solar

Talk straps: Interactivity between Human and Artificial Intelligence
Talk straps: Interactivity between Human and Artificial IntelligenceTalk straps: Interactivity between Human and Artificial Intelligence
Talk straps: Interactivity between Human and Artificial IntelligenceGenoveva Vargas-Solar
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYGenoveva Vargas-Solar
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYGenoveva Vargas-Solar
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPCGenoveva Vargas-Solar
 
Vargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtVargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtGenoveva Vargas-Solar
 

More from Genoveva Vargas-Solar (9)

Aiccsa 2021-w-stem
Aiccsa 2021-w-stemAiccsa 2021-w-stem
Aiccsa 2021-w-stem
 
Talk straps: Interactivity between Human and Artificial Intelligence
Talk straps: Interactivity between Human and Artificial IntelligenceTalk straps: Interactivity between Human and Artificial Intelligence
Talk straps: Interactivity between Human and Artificial Intelligence
 
Data w-steamm
Data w-steammData w-steamm
Data w-steamm
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPC
 
3 map reduce perspectives
3 map reduce perspectives3 map reduce perspectives
3 map reduce perspectives
 
2 mapreduce-model-principles
2 mapreduce-model-principles2 mapreduce-model-principles
2 mapreduce-model-principles
 
Vargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtVargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbt
 

Recently uploaded

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Addressing dm-cloud

  • 1. ADDRESSING DATA MANAGEMENT ON THE CLOUD: TACKLINGTHE BIG DATA CHALLENGES GENOVEVAVARGAS SOLAR FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE Genoveva.Vargas@imag.fr http://www.vargas-solar.com
  • 2. DIGITAL INFORMATION SCALE Unit Size Meaning Bit (b) 1 or 0 Short for binary digit, after the binary code (1 or 0) computers use to store and process data Byte (B) 8 bits Enough information to create an English letter or number in computer code. It is the basic unit of computing Kilobyte (KB) 210 bytes From “thousand’ in Greek. One page of typed text is 2KB Megabyte (MB) 220 bytes From “large” in Greek.The complete works of Shakespeare total 5 MB.A typical pop song is 4 MB. Gigabyte (GB) 230 bytes From “giant” in Greek.A two-hour film ca be compressed into 1-2 GB. Terabyte (TB) 240 bytes From “monster” in Greek.All the catalogued books in America’s Library of Congress total 15TB. Petabyte (PB) 250 bytes All letters delivered in America’s postal service this year will amount ca. 5PB. Google processes 1PB per hour. Exabyte (EB) 260 bytes Equivalent to 10 billion copies of The Economist Zettabyte (ZB) 270 bytes The total amount of information in existence this year is forecast to be around 1,27ZB Yottabyte (YB) 280 bytes Currently too big to imagine Megabyte(106) Gigabyte (109)Terabytes (1012), petabytes (1015), exabytes (1018) and zettabytes (1021) 2
  • 3. MASSIVE DATA Data sources ¡  Information-sensing mobile devices, aerial sensory technologies (remote sensing) ¡  Software logs, posts to social media sites ¡  Telescopes, cameras, microphones, digital pictures and videos posted online ¡  Transaction records of online purchases ¡  RFID readers, wireless sensor networks (number of sensors increasing by 30% a year) ¡  Cell phone GPS signals (increasing 20% a year) Massive data ¡  Sloan Digital Sky Survey (2000-) its archive contains 140 terabytes. Its successor: Large Synoptic Survey Telescope (2016-), will acquire that quantity of data every five days ! ¡  Facebook hosts 140 billion photos, and will add 70 billion this year (ca. 1 petabyte). Every 2 minutes today we snap as many photos as the whole of humanity took in the 1800s ! ¡  Wal-Mart, handles more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes (1015) ¡  The Large Hadron Collider (LHC): nearly 15 million billion bytes per year -15 petabytes (1015).These data require 70,000 processors to be processed! http://blog.websourcing.fr/infographie-­‐la-­‐vrai-­‐taille-­‐dinternet/   3
  • 4. WHO PRODUCES DATA ? DIGITAL SHADOW ¡  3,146 billion mail addresses, of which 360 million Hotmail ¡  95.5 million domains.com ¡  2.1 billion Internet users, including 922 million in Asia and 485 million in China ¡  2.4 billion accounts on social networks ¡  1000 billion videos viewed onYouTube, about 140 average per capita on the planet! ¡  60 images posted on Instagram every second ¡  250 million tweets per day in October 75% of the information is generated by individuals — writing documents, taking pictures, downloading music, etc. — but is far less than the amount of information being created about them in the digital universe 4
  • 5. BIG DATA ¡  Collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. ¡  Challenges include capture, curation, storage, search, sharing, analysis, and visualization within a tolerable elapsed time ¡  Data growth challenges and opportunities are three-dimensional (3Vs model) ¡  increasing volume (amount of data) ¡  velocity (speed of data in and out) ¡  variety (range of data types and sources) "Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” 5
  • 6. AGENDA ¡  Big data all around ¡  Building the boat ¡  Cloud architectures ¡  Storage vs.Access ¡  Data management trends ¡  Big data economic and technological phenomenon ¡  Outlook 6
  • 7. DATA MANAGEMENT WITH RESOURCES CONSTRAINTS STORAGE   SUPPORT   Systems   ARCHITECTURE  &   RESOURCES  AWARE   RAM   Algorithms   Efficiently manage and exploit data sets according to given specific storage, memory and computation resources 7
  • 8. DATA MANAGEMENT WITHOUT RESOURCES CONSTRAINTS 8 Costly manage and exploit data sets according to unlimited storage, memory and computation resources Systems  Algorithms   COSTAWARE   ELASTIC  
  • 9. CLOUD IN A NUTSHELL 9 DEPLOYMENT MODELS PIVOT ISSUES DELIVERY MODELS 1.  Cloud Software as a Service (SaaS) 2.  Cloud Platform as a Service (PaaS) 3.  Cloud Infrastructure as a Service (IaaS) 1.  Private cloud 2.  Public cloud 3.  Hybrid cloud 4.  Community Cloud 1.  Virtualization 2.  Autonomics (automation) 3.  Grid Computing (job scheduling)
  • 10. 10 CLOUD COMPUTINGPackaged Software Storage Servers Networking O/S Middleware Virtualization Data Applications Runtime Youmanage Infrastructure (as a Service) Storage Servers Networking O/S Middleware Virtualization Data Applications Runtime Managedbyvendor Youmanage Platform (as a Service) Managedbyvendor Youmanage Storage Servers Networking O/S Middleware Virtualization Applications Runtime Data Software (as a Service) Managedbyvendor Storage Servers Networking O/S Middleware Virtualization Applications Runtime Data
  • 11. CLOUD COMPUTING CHARACTERISTICS ¡  On-demand self-service ¡  Available computing capabilities without human interaction ¡  Broad network access ¡  Accessibility from heterogeneous clients (mobile phones, laptops, etc.) ¡  Rapid elasticity ¡  Possibility to automatically and rapidly increase or decrease capabilities ¡  Measured service ¡  Automatic control and optimization of resources ¡  Resource pooling (virtualization, multi-location) ¡  Multi-tenant resources (CPU, disk, memory, network) with a sense of location independence D   C   Time   R   11
  • 12. AGENDA ¡  Big data all around ¡  Building the boat ¡  Cloud architectures ¡  Storage vs.Access ¡  Data management trends ¡  Big data economic and technological phenomenon ¡  Outlook 12
  • 13. 13 ¡  Build a MyNet app based on a distributed database for building an integrated contact directory and posts from social networks MyNet  DB  MyNet  App   Social  network  
  • 14. CLASSIC CLOUD AWARE SOLUTION A step on the cloud (http://gemynetproject.cloudapp.net:8080/) ¡  creating a database on a relational DBMS deployed on the cloud ¡  building a simple database application exported as a service ¡  deploying the service on the cloud and implement an interface 14 MyNet  DB   MyNet  App   Contact' idContact:"Integer" firstName:"String" lastName:"String" Service   App  
  • 15. 15
  • 16. DEALING WITH HUGE AMOUNTS OF DATA Peta  1015   Exa  1018   Zetta  1021   Yota  1024   RAID   Disk   Cloud   16
  • 17. DEALING WITH HUGE AMOUNTS OF DATA 17 Peta  1015   Exa  1018   Zetta  1021   Yota  1024   RAID   Disk   Cloud   Concurrency   Consistency   Atomicity   Relational   Graph   Key  value   Columns  
  • 18. 18
  • 19. 19
  • 20. POLYGLOT PERSISTENCE ¡  Polyglot Programming: applications should be written in a mix of languages to take advantage of different languages are suitable for tackling different problems ¡  Polyglot persistence: any decent sized enterprise will have a variety of different data storage technologies for different kinds of data ¡  a new strategic enterprise application should no longer be built assuming a relational persistence support ¡  the relational option might be the right one - but you should seriously look at other alternatives 20 M.  Fowler  and  P.  Sadalage.  NoSQL  Distilled:  A  Brief  Guide  to  the  Emerging  World  of  Polyglot  Persistence.  Pearson  Education,  Limited,  2012  
  • 21. WHEN IS POLYGLOT PERSISTENCE PERTINENT? ¡  Application essentially composing and serving web pages ¡  They only looked up page elements by ID, they had no need for transactions, and no need to share their database ¡  A problem like this is much better suited to a key-value store than the corporate relational hammer they had to use ¡  Scaling to lots of traffic gets harder and harder to do with vertical scaling ¡  Many NoSQL databases are designed to operate over clusters ¡  They can tackle larger volumes of traffic and data than is realistic with a single server 21
  • 22. 22
  • 23. (Katsov-2012) Use the right tool for the right job… How do I know which is the right tool for the right job? 23
  • 24. GENERATING NOSQL PROGRAMS FROM HIGH LEVEL ABSTRACTIONS 24 UML class diagram Spring Roo Java   web   App   Spring Data Graph database Relational database High-level abstractions Low-level abstractions http://code.google.com/p/model2roo/    
  • 25. DATA DISTRIBUTION ¡  sharding a relational DBMS ¡  dealing with shards synchronization ¡  deploying the service on the cloud and implement an interface 25 MyNet  DB   MyNet   Service   webSite:(URI( socialNetworkID:(URI(( ( Basic(Info( Contact' idContact:(Integer( lastName:(String( givenName:(String( society:(String( <<'hasBasicInfo'>>' 1' *' postID:(Integer( timeStamp:(Date( geoStamp:(String( Post' contentID:(Integer( text:(String( Content' <<'hasContent'>>' <<publishesPost>>' *' 1' 1' 1' street:(String,(( number:(Integer,( City:(Sting,(( Zipcode:(Integer( Address' *' email:(String( type:({pers,(prof}( Email' <<'hasAddress'>>' *' 1' <<'hasEmail'>>' ID:(Integer,( Lang:(String( Language' *' <<'speaksLanguage'>>' Français   Español   English   Others  
  • 26. REST   MyNetContacts   Internet services REST   JSON  documents   Contact  directory   MyNet   ANALYSIS LAYER (SOFTWARE AS A SERVICE) INTEGRATION LAYER (DATABASE AS A SERVICE) MyPosts   POLYGLOT APPROACH TO PERSISTENCE 26
  • 27. EXTRACTING SCHEMATA FROM NOSQL PROGRAMS CODE http://code.google.com/p/exschema/     27 Metalayer( Struct( Relationship( Attribute(Set( *( *( *( *( *( *( *( *(*( *( *( *( *( *( Declarations   analyzer   Updates   analyzer   Repositories   analyzer   Schema1 Schema2 Schema3 Set Attribute implementation : Spring Repository Set Attribute name : fr.imag.twitter.domain.UserInfo Struct Attribute name : userId Attribute name : firstName Attribute name : lastName Set Attribute implementation : Spring Repository Set Attribute name : fr.imag.twitter.domain.Tweet Struct Attribute name : text Attribute name : userId Set Attribute implementation : Neo4j Struct Attribute name : fr.imag.twitter.domain.User Str Attribute name : nodeId Attribute name : userName A na Relatio start end Schema1 Schema2 Set Attribute implementation : Spring Repository Set Attribute name : fr.imag.twitter.domain.UserInfo Struct Attribute name : userId Attribute name : firstName Attribute name : lastName Set Attribute implementation : Spring Repository Attribute name : fr.imag.twitter.domain.Tweet Attribute name : text Attri name : Schema1 Schema2 Schema3 Set Attribute implementation : Spring Repository Set Attribute name : fr.imag.twitter.domain.UserInfo Struct Attribute name : userId Attribute name : firstName Attribute name : lastName Set Attribute implementation : Spring Repository Set Attribute name : fr.imag.twitter.domain.Tweet Struct Attribute name : text Attribute name : userId Set Attribute implementation : Neo4j Struct Attribute name : fr.imag.twitter.domain.User Struct Attribute name : nodeId Attribute name : userName Attribute name : userId Attribute name : password Relationship start end Attribute name : followers Spring Data
  • 28. AGENDA ¡  Big data all around ¡  Building the boat ¡  Cloud architectures ¡  Storage vs.Access ¡  Data management trends ¡  Big data economic and technological phenomenon ¡  Outlook 28
  • 29. HOW ARE THIS DATA EXPLOITED? Domain : Meteorology, genomics, complex physics simulations, biological and environmental research, Internet search,Web mining, Social networks, finance and business informatics,Archaeological, Culture ¡  Tesco collects 1.5 billion pieces of data every month and uses them to adjust prices and promotions ¡  Williams-Sonoma uses its knowledge of its 60 million customers (which includes such details as their income and the value of their houses) to produce different iterations of its catalogue ¡  30% of Amazon's sales are generated by its recommendation engine (“you may also like”) ¡  Placecast is developing technologies that allow them to track potential consumers and send them enticing offers when they get within a few yards of a Starbucks 29
  • 30. DATA ACCESS REQUIREMENTS ¡  Capture, search, discovery, and analysis tools for unstructured data (90% of the digital universe) ¡  Metadata management ¡  Most of tools can create data about data automatically. ¡  Metadata, is growing twice as fast as the digital universe as a whole ¡  Business intelligence tools (also with real-time data) ¡  Storage management tools (duplication, auto-tiering, and virtualization, what exactly to store) ¡  Security practices and tools ¡  identify the information that needs to be secured and at what level of security ¡  specific threat protection devices and software, fraud management systems, or reputation protection services 30
  • 31. BIG LINKED DATA ¡  Is about using the Web to ¡  connect related data that wasn't previously linked ¡  lower the barriers to linking data currently linked using other methods ¡  Exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF ¡  A typical case of a large Linked Dataset is DBPedia, which, essentially, makes the content of Wikipedia available in RDF 31
  • 32. … WITH RESOURCES CONSTRAINTS 32 Q1:Which are the most popular products at Starbucks ? Q2:Which are the consumption rules of Starbucks clients ? Distribution and organization of data on disk Query and data processing on server Swap memory– disk Data transfer •  Efficiency  =>  time  cost   •  Optimizing  memory  and  computing   cost   Efficiently manage and exploit data sets according to given specific storage, memory and computation resources
  • 33. WITHOUT RESOURCES CONSTRAINTS … 33 Costly  =>  minimizing  cost,  energy   consumption   ¡  Query evaluationà How and under which limits ? ¡  Is not longer completely constraint by resources availability: computing, RAM, storage, network services ¡  Decision making process determined by resources consumption and consumer requirements ¡  Data involved in the query, particularly in the result can have different costs: top 5 gratis and the rest available in return to a credit card number ¡  Results storage and exploitation demands more resources
  • 34. AGENDA ¡  Big data all around ¡  Building the boat ¡  Cloud architectures ¡  Storage vs.Access ¡  Data management trends ¡  Big data economic and technological phenomenon ¡  Outlook 34
  • 35. DATA MANAGEMENT SYSTEMS ARCHITECTURES Physical  model   Logic  model   External  model   ANSI/SPARC   Storage   Manager   Schema   Manager   Query   Engine   Transaction   Manager   DBMS   Customisable  points   Custom  components   Glue   code   Data     services   Access   services   Storage   services   Additional   Extension   services   Other   services   Extension  services   Streaming, XML, procedures, queries, replication Physical  model   Logic  model   External  model   Physical  model   Logic  model   External  model   35
  • 37. «MEMCACHED» ¡  «memcached» is a memory management protocol based on a cache: ¡  Uses the key-value notion ¡  Information is completly stored in RAM ¡  «memcached» protocol for: ¡  Creating, retrieving, updating, and deleting information from the database ¡  Applications with their own «memcached» manager (Google, Facebook, YouTube, FarmVille,Twitter,Wikipedia) 37
  • 38. STORAGE ON DISC (1) ¡  For efficiency reasons, information is stored using the RAM: ¡  Work information is in RAM in order to answer to low latency requests ¡  Yet, this is not always possible and desirable Ø  The process of moving data from RAM to disc is called "eviction”; this process is configured automatically for every bucket 38
  • 39. STORAGE ON DISC (2) ¡  NoSQL servers support the storage of key-value pairs on disc: ¡  Persistency–can be executed by loading data, closing and reinitializing it without having to load data from another source ¡  Hot backups– loaded data are sotred on disc so that it can be reinitialized in case of failures ¡  Storage on disc– the disc is used when the quantity of data is higher thant the physical size of the RAM, frequently used information is maintained in RAM and the rest es stored on disc 39
  • 40. STORAGE ON DISC (3) ¡  Strategies for ensuring: ¡  Each node maintains in RAM information on the key-value pairs it stores. Keys: ¡  may not be found, or ¡  they can be stored in memory or on disc ¡  The process of moving information from RAM to disc is asynchronous: ¡  The server can continue processing new requests ¡  A queue manages requests to disc Ø  In periods with a lot of writing requests, clients can be notified that the server is termporaly out of memory until information is evicted 40
  • 42. EXECUTION MODEL: MAP-REDUCE ¡  Programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers ¡  Open-source implementation called Hadoop, whose development was led byYahoo (now an Apache project) ¡  Divide and conquer principle: partition a large problem into smaller subproblems ¡  To the extent that the sub-problems are independent, they can be tackled in parallel by different workers ¡  intermediate results from each individual worker are then combined to yield the final output ¡  Large-data processing requires bringing data and code together for computation to occur: à no small feat for datasets that are terabytes and perhaps petabytes in size! 42
  • 43. MAP-REDUCE ELEMENTS ¡  Stage 1:Apply a user-specified computation over all input records in a dataset. ¡  These operations occur in parallel and yield intermediate output (key-value couples) ¡  Stage 2:Aggregate intermediate output by another user-specified computation ¡  Recursively applies a function on every pair of the list ¡  Execution framework coordinates the actual processing ¡  Implementation of the programming model and the execution framework 43
  • 44. MAP REDUCE EXAMPLE 44 Count the number of occurrences of every word in a text collection
  • 45. EXECUTION FRAMEWORK ¡  Important idea behind MapReduce is separating the what of distributed processing from the how ¡  A MapReduce program (job) consists of ¡  code for mappers and reducers packaged together with ¡  configuration parameters (such as where the input lies and where the output should be stored) ¡  Execution framework responsibilities: scheduling ¡  Each MapReduce job is divided into smaller units called tasks ¡  In large jobs, the total number of tasks may exceed the number of tasks that can be run on the cluster concurrently à manage tasks queues ¡  Coordination among tasks belonging to different jobs 45
  • 46. MAP-REDUCE ADDITIONAL ELEMENTS ¡  Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers ¡  the partitioner species the task to which an intermediate key-value pair must be copied ¡  Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle and sort phase 46
  • 47. AGENDA ¡  Big data all around ¡  Building the boat ¡  Cloud architectures ¡  Storage vs.Access ¡  Data management trends ¡  Big data economic and technological phenomenon ¡  Outlook 47
  • 48. BIGVALUE FROM BIG DATA ¡  Big data is not only about the original content stored or being consumed but also about the information around its consumption. ¡  Gigabytes of stored content can generate a petabyte or more of transient data that we typically don't store digital TV signals we watch but don't record voice calls that are made digital in the network backbone during the duration of a call ¡  Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis http://www.emc.com/collateral/analyst-­‐reports/idc-­‐extracting-­‐value-­‐from-­‐chaos-­‐ar.pdf   Big data is not a "thing” but instead a dynamic/activity that crosses many IT borders 48
  • 49. WHAT ARE THE FORCES BEHIND THE EXPLOSIVE GROWTH OF THE DIGITAL UNIVERSE? ¡  Technology has helped by driving the cost of creating, capturing, managing, and storing information ¡  … Prime mover is financial ¡  Since 2005, the investment by enterprises in the digital universe has increased 50% — to $4 trillion ¡  spent on hardware, software, services, and staff to create, manage, and store — and derive revenues from the digital universe ¡  the trick is to generate value by extracting the right information from the digital universe 49
  • 50. BIG DATA TECH ECONOMY ¡  Data-driven world guided by a rapid ROD (Return on Data) à reducing cost, complexity, risk and increasing the value of your holdings thanks to a mastery of technologies ¡  The key is how quickly data can be turned in to currency by: ¡  Analysing patterns and spotting relationships/trends that enable decisions to be made faster with more precision and confidence ¡  Identifying actions and bits of information that are out of compliance with company policies can avoid millions in fines ¡  Proactively reducing the amount of data you pay ($18,750/gigabyte to review in eDiscovery) by identifying only the relevant pieces of information ¡  Optimizing storage by deleting or offloading non-critical assets to cheaper cloud storage thus saving millions in archive solutions 50
  • 51. AGENDA ¡  Big data all around ¡  Building the boat ¡  Cloud architectures ¡  Storage vs.Access ¡  Data management trends ¡  Big data economic and technological phenomenon ¡  Outlook 51
  • 52. OUTLOOK ¡  Good news is Big Data is here ¡  Bad news is we are struggling to capture, storage, search, share, analyze, and visualize it ¡  Current technology is not adequate to cope with such large amounts of data (requiring massively parallel software running on tens, hundreds, or even thousands of servers) ¡  Expertise in a wide range of topics including ¡  Machine architectures (HPC), Service-oriented-architecture ¡  Distributed/cloud computing, operating and database systems ¡  Software Engineering, algorithmic techniques and networking “With hardware, networks and software having been commoditized to the point that all are essentially free, it was inevitable that the trajectory toward maximum entropy would bring us the current age of Big Data.” -Shomit Ghose, Big Data,The Only Business ModelTech has Left; CIO Network http://www.bigdatabytes.com/big-data-is-the-new-tech-economy/ 52
  • 53. Contact: GenovevaVargas-Solar, CNRS, LIG-LAFMIA Genoveva.Vargas@imag.fr   http://vargas-­‐solar.imag.fr     http://code.google.com/p/exschema/     Want to put cloud data management in practice ? cf. http://vargas-­‐solar.imag.fr/academika/cloud-­‐data-­‐management/     53 http://code.google.com/p/model2roo/     Open source polyglot persistence tools
  • 54. 54