How to get started in Big Data for Master’s Students
Mohamed Nadjib Mami
mami@cs.uni-bonn.de
24 March 2018
1. Big Data is a “way of thinking” not a “Domain”
- It is a Situation
- It is a Way of thinking
- It is an Adaptation
- It is not a Domain
- It is not a Specialty
- It is not only Big in size
Limitation of traditional systems
- Size of computational data
- Speed of flowing data
- Formats of data
… Quality/trustworthiness of data
… Importance of data
Dimensions
- Volume
- Velocity
- Variety
- Veracity
- Value
2. Big Data is Data Management in the back
Source: DAMA-DMBOK2 Framework 2014
● It is all about interacting with data
○ Collect
○ Store
○ Maintain & control
○ Retrieve
○ Analyse
2. Big Data is Data Management in the back
● Take a Data Management class, most importantly:
○ Relational algebra and databases, ACID properties
○ SQL query language (focus on join and aggregation queries)
○ NOSQL, CAP theorem, BASE properties
○ Batch vs. stream vs. interactive processing
○ Lambda vs. Kappa architectures
○ Data Lake vs. Data Warehouse concepts
2. Big Data is Data Management in the back
● Relational model
○ The basics of basics ... the past, present (& future?)
○ Data modeled in the form of relations
■ Algebra: project, select, join, aggregate, union, intersect...
○ Data stored in an RDBMS as tables, tuples, attributes...
● ACID Properties => guarantee DB integrity
○ Atomicity … apply all ops or nothing
○ Consistency … changes respect constraints
○ Isolation … parallel changes do not interfere
○ Durability … no committed change is lost
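Atomicity and Consistency can be observed on a small scale with Python's built-in sqlite3 module. The following is only a sketch: the `account` table, the names and the amounts are made up for illustration. A transfer that would violate the constraint is rolled back as a whole, including the credit that was already applied.

```python
import sqlite3

# In-memory database; table, names and amounts are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (all or nothing)."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed: the credit above is rolled back too.
        return False

ok = transfer(conn, "alice", "bob", 30)       # fits the constraint
failed = transfer(conn, "alice", "bob", 500)  # would make alice negative
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

After both calls, only the first transfer is visible in `balances`.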
2. Big Data is Data Management in the back
● SQL: Structured Query Language
○ Declarative Query Language for Structured data (tables)
○ Aka. relational query language
■ Implements the relational algebra functions
○ (You should) Focus on JOIN and AGGREGATION
■ JOIN is the basis of querying
■ AGGREGATE is the basis of data analytics
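As a small illustration of JOIN plus AGGREGATION, here is a sketch using Python's built-in sqlite3 module. The `customer` and `purchase` tables and their rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (customer_id INTEGER, amount INTEGER);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Ben');
    INSERT INTO purchase VALUES (1, 20), (1, 30), (2, 5);
""")

# JOIN links each purchase to its customer;
# GROUP BY + COUNT/SUM aggregate the joined rows per customer.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS n, SUM(p.amount) AS total
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()
```

Each result row holds a customer name, the number of purchases and the total amount.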
2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ New application needs => new DB management systems
■ Scalable and scale-out solutions (distributed)
■ Representations other than relational/SQL
■ Flexible schema
○ Not only SQL?
■ Similar syntaxes to SQL are used
● CQL (Cassandra Query Language)
2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ Quick lookups (hash, dictionary)
○ Query semi-structured data
○ Query flexible-schema tables
○ Query highly interconnected data
○ A mix of the above (multi-model)
● SQL & NOSQL = friends not foes (complementary)
Corresponding NOSQL models for the first four needs above: Key-value, Document, Columnar, Graph
2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ Key-value (Simplest NOSQL model)
■ Encode all data in the form of (key : value) pairs
■ Long distributed dictionaries/hashes
■ Access: HTTP requests, API, etc.
■ Examples:
● Riak, Redis, Voldemort, Dynamo
Illustration of (key : value) pairs: (105 : abd), (106 : azb), (107 : tvu), (108 : lol)
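The "distributed dictionary" idea can be sketched in a few lines of Python. This is a toy, not how Riak or Redis work internally: the `ShardedKV` class is hypothetical, the "nodes" are plain in-process dicts, and a key is assigned to a node by hashing.

```python
import hashlib

class ShardedKV:
    """Toy key-value store: each key lives on exactly one of n in-process 'nodes'."""

    def __init__(self, n):
        self.nodes = [{} for _ in range(n)]

    def _node_for(self, key):
        # Stable hash, so a key always maps to the same node across runs.
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

store = ShardedKV(3)
for k, v in [(105, "abd"), (106, "azb"), (107, "tvu"), (108, "lol")]:
    store.put(k, v)
```

A lookup such as `store.get(106)` touches only the node that owns the key, which is what makes this model easy to scale out.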
2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ Document-oriented
■ Encode data in the form of semi-structured “documents”
● Commonly in a JSON-like format
■ Access: HTTP requests, API, etc.
■ Examples:
● MongoDB, CouchDB, Couchbase
{
"FirstName": "AAA",
"LastName": "BBB",
"Hobbies": ["painting", "swimming"]
}
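To show what "semi-structured" means in practice, here is a minimal sketch using Python's json module. The first document is the one from the slide; the other two are made up and deliberately omit fields. The query has to tolerate missing fields because no schema is enforced.

```python
import json

raw = [
    '{"FirstName": "AAA", "LastName": "BBB", "Hobbies": ["painting", "swimming"]}',
    '{"FirstName": "CCC", "Hobbies": ["chess"]}',   # hypothetical: no LastName
    '{"FirstName": "DDD", "LastName": "EEE"}',      # hypothetical: no Hobbies
]
docs = [json.loads(r) for r in raw]

def with_hobby(docs, hobby):
    """First names of people listing `hobby`; a missing field counts as no hobbies."""
    return [d["FirstName"] for d in docs if hobby in d.get("Hobbies", [])]

swimmers = with_hobby(docs, "swimming")
```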
2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ Columnar
■ Store data in columns (vs. rows in RDBMS)
● Optimized for analytical (OLAP) queries
■ Based on column families
● Like RDBMS tables but with unfixed schema
■ Examples:
● Cassandra, HBase, Accumulo, Bigtable
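The difference between row and column layout can be sketched with plain Python lists. This only illustrates the storage layout; it does not model column families as used by Cassandra or HBase, and the sample records are invented.

```python
# Row layout: one record per entry (as in a typical RDBMS).
rows = [
    {"id": 1, "city": "Bonn", "temp": 12.0},
    {"id": 2, "city": "Cologne", "temp": 14.0},
    {"id": 3, "city": "Bonn", "temp": 10.0},
]

# Column layout: one list per attribute, aligned by position.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Row layout: every record is visited in full although only "temp" is needed.
avg_row = sum(r["temp"] for r in rows) / len(rows)

# Column layout: only the "temp" column is scanned.
avg_col = sum(columns["temp"]) / len(columns["temp"])
```

Both computations give the same average; the column layout simply reads less data for this kind of analytical query.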
2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ Graph-oriented
■ Model data in the form of graphs (edges and vertices)
■ Optimal for storing highly interconnected, graph-shaped data
● Query data by traversal
■ Examples:
● Neo4j, InfiniteGraph, Neptune
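"Query by traversal" can be sketched as a breadth-first search over an adjacency list. The `follows` graph and the `within_hops` helper are hypothetical; real graph databases expose this through their own query languages.

```python
from collections import deque

# Hypothetical "follows" graph as an adjacency list.
follows = {
    "ann": ["ben", "cat"],
    "ben": ["dan"],
    "cat": ["dan"],
    "dan": ["eve"],
    "eve": [],
}

def within_hops(graph, start, max_hops):
    """Breadth-first traversal: vertices reachable from `start` in at most max_hops edges."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if dist[v] == max_hops:
            continue  # do not expand beyond the hop limit
        for w in graph.get(v, []):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    del dist[start]
    return dist

two_hops = within_hops(follows, "ann", 2)
```

The result maps each reachable vertex to its distance from the start vertex.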
2. Big Data is Data Management in the back
● NOSQL and distributed systems (network, shared-data)
○ CAP theorem for designing distributed systems
■ Consistency: every read returns the latest written result
■ Availability: every request gets a response, even a stale one
■ Partition tolerance: the system keeps working when messages between nodes are lost or delayed
○ In the presence of P, choose between C and A (tradeoff)
■ C: query errors or times out as the requested data is n/a
■ A: query returns out-of-date results
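The C-versus-A choice under a partition can be sketched with a toy replica. This is a simplification under stated assumptions: the `Replica` class, its `mode` flag and the single replicated value are all hypothetical, and a "partition" is just a boolean.

```python
class Replica:
    """Toy replica of a single value; `partitioned` means it cannot reach the primary."""

    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP"
        self.value = "v1"
        self.partitioned = False

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency chosen: refuse rather than risk a stale answer.
            raise TimeoutError("cannot confirm latest value during partition")
        return self.value           # AP: answer anyway, possibly stale

def write_on_primary(replicas, value):
    """Propagate a write only to replicas that are still reachable."""
    for r in replicas:
        if not r.partitioned:
            r.value = value

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True
write_on_primary([cp, ap], "v2")    # neither replica sees the update

ap_answer = ap.read()               # stale but available
try:
    cp_answer = cp.read()
except TimeoutError:
    cp_answer = None                # consistent but unavailable
```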
2. Big Data is Data Management in the back
● NOSQL and distributed systems (network, shared-data)
○ CAP theorem for designing distributed systems
■ too simplistic | good to learn the basics
○ PACELC extends CAP
■ P(A|C)E(L|C) = if P, choose A or C; Else, choose L or C
[Diagram: Partition? If yes (then): Availability vs. Consistency. Else: Latency vs. Consistency]
2. Big Data is Data Management in the back
● NOSQL and distributed systems (network, shared-data)
○ BASE of NOSQL (contrasting ACID of RDBMS)
○ Suggested by the same person as the CAP theorem (Eric Brewer)
○ Basically available … guarantees CAP Availability
○ Soft state … system state may change over time
○ Eventual consistency … system will become consistent over time
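Eventual consistency can be sketched as two replicas that diverge and later reconcile. The sketch below assumes a last-write-wins rule with unique timestamps, which is only one of several possible reconciliation strategies; the `merge` function and the data are hypothetical.

```python
def merge(a, b):
    """Last-write-wins merge of two replicas mapping key -> (timestamp, value)."""
    merged = dict(a)
    for key, (ts, val) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, val)
    return merged

# Two replicas accepted different writes while out of contact (soft state).
r1 = {"x": (1, "old"), "y": (5, "y-from-r1")}
r2 = {"x": (3, "new")}
diverged = r1 != r2

# One synchronisation round: each replica merges in the other's state.
r1, r2 = merge(r1, r2), merge(r2, r1)
```

After the round both replicas hold the same state, with the newer write for `x` winning.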
2. Big Data is Data Management in the back
● Batch vs. stream vs. interactive processing
○ Batch: actions applied periodically to data collected in bulk
■ Example: Extract-Transform-Load (ETL)
○ Stream (real-time): computation applied to data as soon as it arrives
■ Example: analysing weather sensor data
○ Interactive/iterative:
■ Example: Machine Learning algorithms
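The batch/stream distinction can be sketched with a mean computed two ways. The sensor readings are invented; the point is that the batch version needs all the data at once, while the stream version keeps constant state and updates it per arriving value.

```python
readings = [12.0, 14.0, 10.0, 16.0]  # hypothetical sensor values

def batch_mean(data):
    """Batch: all data is available; compute once over the whole set."""
    return sum(data) / len(data)

def stream_means(stream):
    """Stream: update a running mean as each value arrives, keeping O(1) state."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count
        yield mean

running = list(stream_means(iter(readings)))
```

The last running mean equals the batch mean, but the stream version had an answer available after every single reading.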
2. Big Data is Data Management in the back
● Lambda vs. Kappa architectures
○ Lambda architecture
■ Three layers:
● Batch
● Speed
● Serving
■ Fault-tolerant
■ Scalable
Source: MapR - Lambda Architecture
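The three Lambda layers can be sketched with counters. This is a deliberately tiny model: the event lists and the `serve` function are hypothetical, and a real deployment would run the batch and speed layers as separate systems.

```python
from collections import Counter

master_log = ["a", "b", "a"]      # events already processed by the batch layer
recent = ["b", "c"]               # events that arrived since the last batch run

batch_view = Counter(master_log)  # batch layer: recomputed from all historical data
speed_view = Counter(recent)      # speed layer: incremental view of recent data only

def serve(key):
    """Serving layer: answer a query by merging the batch and speed views."""
    return batch_view[key] + speed_view[key]
```

A query such as `serve("b")` sees both the historical and the most recent events.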
2. Big Data is Data Management in the back
● Lambda vs. Kappa architectures
○ Kappa architecture
■ Batch layer omitted => batch is treated as a special case of streaming
Source: O’Reilly: Applying the Kappa architecture in the telco industry
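In the same toy style, Kappa can be sketched as a single processing function over an append-only log: live processing consumes new events, and "batch" is just a replay of the log from the start with the same code. The `log` contents and the `process` function are hypothetical.

```python
from collections import Counter

log = ["a", "b", "a", "b", "c"]   # append-only event log (hypothetical)

def process(events, view=None):
    """The single stream job: fold events into a view of counts."""
    view = Counter() if view is None else view
    for e in events:
        view[e] += 1
    return view

live = process(log[:3])           # stream job running since the start
live = process(log[3:], live)     # ...keeps consuming new events

replayed = process(log)           # reprocessing = replaying the whole log
```

Both paths end in the same view, so no separate batch code path is needed.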
2. Big Data is Data Management in the back
● A Data Warehouse can be implemented on top of a Data Lake
Data Lake | Data Warehouse
Repository of raw data in its original form | A well-structured data repository
Append-only, read-only | Read and write
Schema-on-read (no predefined schema) | Schema-on-write (well-predefined schema)
ELT (Extract, Load, Transform) | ETL (Extract, Transform, Load)
Open to any access tools incl. DWH tools | BI and OLAP tools and standards
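The schema difference between the two can be sketched as follows. In the lake, anything is appended and structure is imposed only by the reader; in the warehouse, a record is checked against a fixed schema before it is stored. All function names, fields and sample lines are hypothetical.

```python
import json

lake = []  # raw lines, appended as-is

def lake_append(raw_line):
    lake.append(raw_line)  # no validation at write time

def lake_read_temps():
    """Schema-on-read: structure is imposed only when the data is queried."""
    temps = []
    for line in lake:
        try:
            temps.append(float(json.loads(line)["temp"]))
        except (ValueError, KeyError, TypeError):
            continue  # skip records that do not fit this reader's schema
    return temps

warehouse = []

def warehouse_insert(record):
    """Schema-on-write: reject records that do not match the predefined schema."""
    if set(record) != {"city", "temp"} or not isinstance(record["temp"], float):
        raise ValueError("record does not match schema")
    warehouse.append(record)

for line in ['{"city": "Bonn", "temp": 12.5}', "not json", '{"city": "Cologne"}']:
    lake_append(line)
temps = lake_read_temps()
```

The lake accepted all three lines and the reader kept only the one it could interpret; `warehouse_insert({"city": "Cologne"})` would instead fail at write time.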
3. Think big, think distributed
● Adaptation: now we deal with cluster-wide large scale data
● New essential factors come into play
○ Movement (aka shuffling) of large data
○ Reading and writing of large data
● MUST-know: fault-tolerance, replication, high-availability, distributed file system ... in addition to previous concepts
○ Advice: learn them from Hadoop (HDFS) and Apache Spark
4. Adopt an “Optimizer” way of thinking
● History: my code works!
● Now: my code works fast
⇒ slow code ~= non-working code
○ How fast does my app get the job done? (performance)
○ How much output does my app generate? (throughput)
● Tuning and optimization are your new concerns e.g.
○ Reduce shuffled data (moved)
○ Reduce data written to/read from disk
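One common way to reduce shuffled data is to pre-aggregate on each worker before sending anything over the network (a "combiner" in MapReduce terms). The sketch below simulates two workers with plain lists; the partitions and function names are made up, and the "shuffle" is simply the list of pairs that would be sent.

```python
from collections import Counter

partitions = [["a", "b", "a", "a"], ["b", "b", "c"]]  # hypothetical data on two workers

def shuffle_naive(parts):
    """Every (word, 1) pair is sent over the network."""
    return [(w, 1) for p in parts for w in p]

def shuffle_combined(parts):
    """Each worker pre-aggregates locally and sends one pair per distinct word."""
    return [(w, n) for p in parts for w, n in Counter(p).items()]

def reduce_counts(pairs):
    """Final aggregation on the receiving side."""
    total = Counter()
    for w, n in pairs:
        total[w] += n
    return total

naive, combined = shuffle_naive(partitions), shuffle_combined(partitions)
```

Both variants reduce to the same word counts, but the combined variant moves fewer pairs between workers.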
General advice and comments
● Don’t move to big data settings if you don’t have to
● Don’t hesitate to start it if you feel like … it’s a lot of fun! :)
● For people who intend to do research in relation to big data
○ “I have an idea, I just need to implement it” becomes
○ “I just have an idea, I need to implement it”
○ Two phases instead of one:
■ 1. Make it work on your single machine
■ 2. Make it work on your cluster >> and optimize
○ But it’s a lot of fun … still!
● Can all that fade off? Yes, as anything can, but it is unlikely any time soon
Wrap-up
1. Big Data is a Way of thinking not a Domain
2. Big Data is Data Management in the back
3. Think big, think distributed
4. Adopt an “Optimizer” way of thinking
questions

How to get started in Big Data for master's students

  • 1. How to get started in Big Data for Master’s Students Mohamed Nadjib Mami mami@cs.uni-bonn.de 24 March 2018
  • 2. 1. Big Data is a “way of thinking” not a “Domain” - It is a Situation - It is a Way of thinking - It is an Adaptation - It is not a Domain - It is not a Specialty - It is not not only Big in size Limitation of traditional systems - Size of computational data - Speed of flowing data - Formats of data … Quality/trustworthiness of data … Importance of data Dimensions - Volume - Velocity - Variety - Veracity - Value 2
  • 3. 2. Big Data is Data Management in the back Source: DAMA-DMBOK2 Framework 2014 ● It is all about interacting with data ○ Collect ○ Store ○ Maintain & control ○ Retrieve ○ Analyse 3
  • 4. 2. Big Data is Data Management in the back ● Take Data Management class, most importantly: ○ Relational algebra and database, ACID properties ○ SQL query language (focus on join and aggregation queries) ○ NOSQL, CAP theorem, BASE properties ○ Batch vs. stream vs. interactive processing ○ Lambda vs. Kappa architectures ○ Data Lake vs. Data Warehouse concepts 4
  • 5. 2. Big Data is Data Management in the back ● Relational model ○ The basics of basics ... the past, present (& future?) ○ Data modeled in form of relations ■ Algebra: project, select, join, aggregate, union, intersect... ○ Data stored RDBMS in tables, tuples, attributes... ● ACID Properties => guarantees DB integrity ○ Atomicity … apply all ops or nothing ○ Consistency … changes respect constraint ○ Isolation … parallel changes do not interfere ○ Durability … no committed change is lost 5
  • 6. 2. Big Data is Data Management in the back ● SQL: Structured Query Language ○ Declarative Query Language for Structured data (tables) ○ Aka. relational query language ■ Implements the relational algebra functions ○ (You should) Focus on JOIN and AGGREGATION ■ JOIN is the bases of querying ■ AGGREGATE is the bases of data analytics 6
  • 7. 2. Big Data is Data Management in the back ● NOSQL (aka. non-relational) = Not Only SQL ○ New application needs => new DB management systems ■ Scalable and scale-out solutions (distributed) ■ Representations other than relational/SQL ■ Flexible schema ○ Not only SQL? ■ Similar syntaxes to SQL are used ● CQL (Cassandra Query Language) 7
  • 8. 2. Big Data is Data Management in the back ● NOSQL (aka. non-relational) = Not Only SQL ○ Quick lookups (hash, dictionary) ○ Query semi-structured data ○ Query flexible-schema tables ○ Query highly interconnected data ○ A mix of the above (multi-model) ● SQL & NOSQL = friends not foes (complementary) 8 Key-value Document Columnar Graph
  • 9. 2. Big Data is Data Management in the back ● NOSQL (aka. non-relational) = Not Only SQL ○ Key-value (Simplest NOSQL model) ■ Encode all data in form of (key : value) pairs ■ Long distributed dictionaries/hash ■ Access: HTTP requests, API, etc. ■ Examples: ● Riak, Redis, Voldemort, Dynamo 9 105 abd 106 azb 107 tvu 108 lol
  • 10. 2. Big Data is Data Management in the back ● NOSQL (aka. non-relational) = Not Only SQL ○ Document-oriented ■ Encode data in form of semi-structured “documents” ● Commonly in JSON-like ■ Access: HTTP requests, API, etc. ■ Examples: ● MongoDB, CouchDB, Couchbase 10 { "FirstName": "AAA", "LastName": "BBB", "Hobbies": ["painting",”swimming”] }
  • 11. 2. Big Data is Data Management in the back ● NOSQL (aka. non-relational) = Not Only SQL ○ Columnar ■ Store data in columns (vs. rows in RDBMS) ● Optimized for analytical queries OLAP ■ Based on Columns families ● Like RDBMS tables but with unfixed schema ■ Examples: ● Cassandra, HBase, Accumulo, Bigtable 11
  • 12. 2. Big Data is Data Management in the back
    ● NOSQL (aka. non-relational) = Not Only SQL
      ○ Graph-oriented
        ■ Model data in the form of graphs (vertices and edges)
        ■ Optimal for storing highly interconnected, graph-shaped data
          ● Query data by traversal
        ■ Examples:
          ● Neo4j, InfiniteGraph, Neptune
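
"Query data by traversal" can be sketched with a breadth-first search over adjacency lists. The graph below and the within_hops helper are invented for illustration; graph databases offer their own query languages for this kind of question.

```python
from collections import deque

# Hypothetical "follows" graph, stored as adjacency lists.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: vertices reachable from `start` in at most `max_hops` edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(vertex, []):
            if neighbour not in distance:
                distance[neighbour] = distance[vertex] + 1
                queue.append(neighbour)
    distance.pop(start)  # report only the other vertices
    return distance


print(within_hops(graph, "alice", 2))
```

In a relational database the same question would need one self-join per hop, which is why traversal-heavy workloads favour the graph model.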
  • 13. 2. Big Data is Data Management in the back
    ● NOSQL and distributed systems (network, shared-data)
      ○ CAP theorem for designing distributed systems
        ■ Consistency: every read returns the latest written result
        ■ Availability: every request gets a result, even a stale one
        ■ Partition tolerance: the system keeps working when messages between nodes are lost
      ○ In the presence of P, choose between C and A (tradeoff)
        ■ C: query errors or times out, as the requested data is n/a
        ■ A: query returns out-of-date results
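
The C-versus-A choice under a partition can be sketched with a toy replica. The Replica class, the "CP"/"AP" mode labels and the run helper are assumptions made for this example and do not model any specific database.

```python
class Replica:
    """Toy replica that may be cut off from the primary by a network partition."""

    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP" (labels used in this sketch only)
        self.value = None
        self.partitioned = False

    def replicate(self, value):
        # Updates from the primary only arrive while the network link is up.
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse to answer rather than risk a stale value.
            raise TimeoutError("cannot confirm the latest value during a partition")
        # Availability first (or no partition): answer with what we have.
        return self.value


def run(mode):
    """Write v1, cut the link, write v2, then read from the isolated replica."""
    replica = Replica(mode)
    replica.replicate("v1")
    replica.partitioned = True
    replica.replicate("v2")  # lost: the partition blocks replication
    try:
        return replica.read()
    except TimeoutError:
        return "error"


print(run("CP"), run("AP"))
```

The "CP" replica fails the read, while the "AP" replica answers with the older value, matching the two outcomes listed on the slide.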
  • 14. 2. Big Data is Data Management in the back
    ● NOSQL and distributed systems (network, shared-data)
      ○ CAP theorem for designing distributed systems
        ■ Too simplistic, but good for learning the basics
      ○ PACELC extends CAP
        ■ P(A|C)E(L|C) = if P, choose A or C; Else, choose L or C
        ■ Diagram: Partition? then: Availability vs. Consistency; else: Latency vs. Consistency
  • 15. 2. Big Data is Data Management in the back
    ● NOSQL and distributed systems (network, shared-data)
      ○ BASE of NOSQL (contrasting ACID of RDBMS)
      ○ Suggested by the same person as the CAP theorem (Eric Brewer)
      ○ Basically available: guarantees CAP Availability
      ○ Soft state: system state may change over time
      ○ Eventual consistency: system will become consistent over time
  • 16. 2. Big Data is Data Management in the back
    ● Batch vs. stream vs. interactive processing
      ○ Batch: actions applied periodically to data in bulk
        ■ Example: Extract-Transform-Load (ETL)
      ○ Stream (real-time): computation applied to data as soon as it arrives
        ■ Example: analysing weather sensor data
      ○ Interactive/iterative:
        ■ Example: Machine Learning algorithms
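
The batch/stream difference can be sketched on the slide's weather-sensor example: a batch job waits for all readings and computes once, while a streaming job updates its answer per reading. The temperature values and the names batch_average and RunningAverage are invented for illustration.

```python
def batch_average(readings):
    """Batch: wait for the whole dataset, then compute once."""
    return sum(readings) / len(readings)


class RunningAverage:
    """Stream: update the result incrementally as each reading arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, reading):
        self.count += 1
        # Incremental mean: no need to keep past readings in memory.
        self.mean += (reading - self.mean) / self.count
        return self.mean


temperatures = [20.0, 22.0, 21.0, 25.0]  # hypothetical sensor readings

stream = RunningAverage()
running = [stream.update(t) for t in temperatures]

print(batch_average(temperatures), running)
```

Both approaches agree once all data has been seen; the streaming version simply has an answer available after every reading.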
  • 17. 2. Big Data is Data Management in the back
    ● Lambda vs. Kappa architectures
      ○ Lambda architecture
        ■ Three layers:
          ● Batch
          ● Speed
          ● Serving
        ■ Fault-tolerant
        ■ Scalable
    Source: MapR - Lambda Architecture
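
A minimal sketch of how the three Lambda layers fit together, using page-view counting as a stand-in workload. The function names and the events are assumptions for this example; in a real deployment each layer would be a separate distributed system.

```python
from collections import Counter


def batch_layer(master_dataset):
    """Recompute a complete view from all data ingested before the last batch run."""
    return Counter(event["page"] for event in master_dataset)


def speed_layer(recent_events):
    """Maintain an incremental view over events that arrived since the last batch run."""
    view = Counter()
    for event in recent_events:
        view[event["page"]] += 1
    return view


def serving_layer(batch_view, realtime_view, page):
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view[page] + realtime_view[page]


# Hypothetical events: the first list is already covered by the batch run.
master_dataset = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
recent_events = [{"page": "/home"}, {"page": "/blog"}]

batch_view = batch_layer(master_dataset)
realtime_view = speed_layer(recent_events)

print(serving_layer(batch_view, realtime_view, "/home"))
```

The cost of this design is that the same logic (here, counting) lives in two code paths that must be kept in agreement.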
  • 18. 2. Big Data is Data Management in the back
    ● Lambda vs. Kappa architectures
      ○ Kappa architecture
        ■ Batch layer omitted => batch is a special case of stream
    Source: O’Reilly: Applying the Kappa architecture in the telco industry
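
"Batch is a special case of stream" can be sketched as follows: there is a single streaming code path, and reprocessing means replaying the retained event log from the start through that same path. The log contents and the names process and run_stream are invented for this example.

```python
def process(event, view):
    """The single processing step, shared by live processing and reprocessing."""
    view[event["page"]] = view.get(event["page"], 0) + 1


def run_stream(log, from_offset=0, view=None):
    """Consume the log from `from_offset` onwards, updating `view` in place."""
    view = {} if view is None else view
    for event in log[from_offset:]:
        process(event, view)
    return view


# Hypothetical append-only event log.
log = [
    {"page": "/home"},
    {"page": "/docs"},
    {"page": "/home"},
    {"page": "/blog"},
    {"page": "/home"},
]

# Live processing: the first three events, then the two that arrived later.
live_view = run_stream(log[:3])
run_stream(log, from_offset=3, view=live_view)

# "Batch" reprocessing: replay the whole log through the same code.
rebuilt_view = run_stream(log)

print(live_view == rebuilt_view)
```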
  • 19. 2. Big Data is Data Management in the back
    ● Data Warehouse can be implemented on top of Data Lake
    ● Data Lake
      ○ Repository of raw data in its original form
      ○ Append-only, read-only
      ○ Schema-on-read (no predefined schema)
      ○ ELT (Extract, Load, Transform)
      ○ Open to any access tools incl. DWH tools
    ● Data Warehouse
      ○ A well-structured data repository
      ○ Read and write
      ○ Schema-on-write (well-predefined schema)
      ○ ETL (Extract, Transform, Load)
      ○ BI and OLAP tools and standards
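
The schema-on-read versus schema-on-write distinction can be sketched in a few lines: the warehouse validates records before storing them, while the lake stores raw lines and applies a schema only when they are read. SCHEMA, the helper names and the sample records are assumptions made for this example.

```python
import json

SCHEMA = {"id": int, "temp_c": float}  # hypothetical schema


def validate(record, schema=SCHEMA):
    """Raise ValueError unless the record has exactly the schema's fields and types."""
    if not isinstance(record, dict) or set(record) != set(schema):
        raise ValueError("record does not match the schema's fields")
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"field {field!r} has the wrong type")
    return record


warehouse = []
lake = []


def warehouse_write(record):
    """Schema-on-write: reject anything that does not fit before it is stored."""
    warehouse.append(validate(record))


def lake_write(raw_line):
    """Data lake: append the raw data as-is, whatever its shape."""
    lake.append(raw_line)


def lake_read(schema=SCHEMA):
    """Schema-on-read: parse and validate at query time, skipping what does not fit."""
    result = []
    for raw_line in lake:
        try:
            result.append(validate(json.loads(raw_line), schema))
        except ValueError:  # also covers json.JSONDecodeError
            continue
    return result


lake_write('{"id": 1, "temp_c": 21.5}')
lake_write('{"id": "two", "note": "free text"}')
lake_write("not json at all")
warehouse_write({"id": 1, "temp_c": 21.5})

print(len(lake), lake_read(), warehouse)
```

The lake keeps all three raw lines, so a different schema could be applied to them later; the warehouse only ever holds records that passed validation.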
  • 20. 3. Think big, think distributed
    ● Adaptation: now we deal with cluster-wide, large-scale data
    ● New essential factors come into play
      ○ Movement (aka shuffling)... of large data
      ○ Reading and writing… of large data
    ● MUST-know: fault-tolerance, replication, high-availability, distributed file system ...in addition to previous concepts
      ○ Advice: learn them from Hadoop (HDFS), Apache Spark
  • 21. 4. Adopt an “Optimizer” way of thinking
    ● History: my code works!
    ● Now: my code works fast ⇒ slow code ~= non-working code
      ○ How fast does my app get the job done? (performance)
      ○ How much output does my app generate? (throughput)
    ● Tuning and optimization are your new concerns, e.g.
      ○ Reduce shuffled (moved) data
      ○ Reduce data written to/read from disk
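
One common way to reduce shuffled data is to aggregate locally on each node before anything is moved. The sketch below simulates this for a word count over two partitions and counts how many (word, count) pairs would cross the network in each variant. The partitions and function names are invented; no cluster framework is involved.

```python
from collections import Counter

# Hypothetical input, already split across two worker nodes.
partitions = [
    ["big", "data", "big", "big"],
    ["data", "big", "data"],
]


def shuffle_naive(partitions):
    """Every (word, 1) pair crosses the network before being summed."""
    shuffled = [(word, 1) for part in partitions for word in part]
    totals = Counter()
    for word, count in shuffled:
        totals[word] += count
    return totals, len(shuffled)


def shuffle_combined(partitions):
    """Each node sums locally first, so it sends one pair per distinct word."""
    shuffled = [pair for part in partitions for pair in Counter(part).items()]
    totals = Counter()
    for word, count in shuffled:
        totals[word] += count
    return totals, len(shuffled)


naive_totals, naive_moved = shuffle_naive(partitions)
combined_totals, combined_moved = shuffle_combined(partitions)

print(naive_moved, combined_moved)
```

Both variants produce the same totals; only the amount of data moved differs, and the gap widens as the input grows while the number of distinct keys stays small.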
  • 22. General advice and comments
    ● Don’t move to big data settings if you don’t have to
    ● Don’t hesitate to start if you feel like it … it’s a lot of fun! :)
    ● For people who intend to do research related to big data
      ○ “I have an idea, I just need to implement it” becomes
      ○ “I just have an idea, I need to implement it”
      ○ Two phases instead of one:
        ■ 1. Make it work on your single machine
        ■ 2. Make it work on your cluster >> and optimize
      ○ But it’s a lot of fun … still!
    ● Can all that fade away? Yes, as anything can, but it is unlikely any time soon
  • 23. Wrap-up
    1. Big Data is a Way of thinking not a Domain
    2. Big Data is Data Management in the back
    3. Think big, think distributed
    4. Adopt an “Optimizer” way of thinking
    Questions