How to get started in Big Data for master's students
1. How to get started in Big Data for Master's Students
Mohamed Nadjib Mami
mami@cs.uni-bonn.de
24 March 2018
2. 1. Big Data is a “way of thinking” not a “Domain”
- It is a Situation
- It is a Way of thinking
- It is an Adaptation
- It is not a Domain
- It is not a Specialty
- It is not only Big in size
Limitations of traditional systems => Dimensions
- Size of computational data => Volume
- Speed of flowing data => Velocity
- Formats of data => Variety
- Quality/trustworthiness of data => Veracity
- Importance of data => Value
3. 2. Big Data is Data Management in the back
Source: DAMA-DMBOK2 Framework 2014
● It is all about interacting with data
○ Collect
○ Store
○ Maintain & control
○ Retrieve
○ Analyse
4. 2. Big Data is Data Management in the back
● Take Data Management class, most importantly:
○ Relational algebra and database, ACID properties
○ SQL query language (focus on join and aggregation queries)
○ NOSQL, CAP theorem, BASE properties
○ Batch vs. stream vs. interactive processing
○ Lambda vs. Kappa architectures
○ Data Lake vs. Data Warehouse concepts
5. 2. Big Data is Data Management in the back
● Relational model
○ The basics of basics ... the past, present (& future?)
○ Data modeled in the form of relations
■ Algebra: project, select, join, aggregate, union, intersect...
○ Data stored in an RDBMS as tables, tuples, attributes...
● ACID Properties => guarantees DB integrity
○ Atomicity … apply all ops or nothing
○ Consistency … changes respect constraints
○ Isolation … parallel changes do not interfere
○ Durability … no committed change is lost
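A minimal sketch of atomicity, assuming Python's built-in sqlite3 module and a hypothetical `account` table (neither appears in the slides). A transfer consists of two updates; when the second one violates a constraint, the first one is rolled back too.

```python
import sqlite3

# In-memory database with a hypothetical account table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit to dst was undone as well.
        return False


ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would make the balance negative
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

After both calls only the first transfer is visible in `balances`; the failed one left no partial change behind.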
6. 2. Big Data is Data Management in the back
● SQL: Structured Query Language
○ Declarative Query Language for Structured data (tables)
○ Aka. relational query language
■ Implements the relational algebra functions
○ (You should) Focus on JOIN and AGGREGATION
■ JOIN is the basis of querying
■ AGGREGATE is the basis of data analytics
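To make the JOIN and AGGREGATION advice concrete, here is a small sketch run through Python's built-in sqlite3 module. The `customer` and `orders` tables and their contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: customers and the orders they placed.
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Ada'), (2, 'Bob');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 7.5);
""")

# JOIN links each order to its customer; GROUP BY with COUNT/SUM aggregates
# the joined rows into one summary row per customer.
rows = conn.execute("""
SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
FROM customer AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
""").fetchall()
```

The query is declarative: it states which rows to combine and how to summarise them, and leaves the execution strategy to the database engine.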
7. 2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ New application needs => new DB management systems
■ Scalable and scale-out solutions (distributed)
■ Representations other than relational/SQL
■ Flexible schema
○ Not only SQL?
■ Similar syntaxes to SQL are used
● CQL (Cassandra Query Language)
8. 2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ Quick lookups (hash, dictionary) => Key-value
○ Query semi-structured data => Document
○ Query flexible-schema tables => Columnar
○ Query highly interconnected data => Graph
○ A mix of the above (multi-model)
● SQL & NOSQL = friends not foes (complementary)
9. 2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ Key-value (Simplest NOSQL model)
■ Encode all data in the form of (key : value) pairs
■ Large distributed dictionaries/hash tables
■ Access: HTTP requests, API, etc.
■ Examples:
● Riak, Redis, Voldemort, Dynamo
Illustration (key : value):
105 : abd
106 : azb
107 : tvu
108 : lol
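The "distributed dictionary" idea can be sketched as follows. This is a toy, single-process illustration, not how Riak, Redis, Voldemort or Dynamo are implemented: the class `TinyKeyValueStore` and its modulo-based placement are assumptions made for the example, and the "nodes" are plain in-memory dicts.

```python
import hashlib


class TinyKeyValueStore:
    """Toy key-value store that spreads keys over several 'nodes' by hashing."""

    def __init__(self, n_nodes=3):
        # Each node is just a local dict standing in for a remote server.
        self.nodes = [dict() for _ in range(n_nodes)]

    def _node_for(self, key):
        # A stable hash of the key decides which node owns it.
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


store = TinyKeyValueStore()
for key, value in [(105, "abd"), (106, "azb"), (107, "tvu"), (108, "lol")]:
    store.put(key, value)
```

Because a lookup needs only the key, `store.get(106)` goes straight to the owning node without scanning the others, which is why this model suits quick lookups.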
10. 2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ Document-oriented
■ Encode data in the form of semi-structured “documents”
● Commonly in a JSON-like format
■ Access: HTTP requests, API, etc.
■ Examples:
● MongoDB, CouchDB, Couchbase
{
"FirstName": "AAA",
"LastName": "BBB",
"Hobbies": ["painting", "swimming"]
}
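A rough sketch of what "semi-structured" means in practice, using only Python's json module rather than a real document database. The second document and the helper `find_by_hobby` are hypothetical; note that the second document simply lacks the "Hobbies" field.

```python
import json

# Two documents in one collection; they do not share the same set of fields.
raw = """[
  {"FirstName": "AAA", "LastName": "BBB", "Hobbies": ["painting", "swimming"]},
  {"FirstName": "CCC", "LastName": "DDD"}
]"""
documents = json.loads(raw)


def find_by_hobby(docs, hobby):
    """Return the documents whose optional 'Hobbies' list contains `hobby`."""
    return [doc for doc in docs if hobby in doc.get("Hobbies", [])]


swimmers = find_by_hobby(documents, "swimming")
```

The query has to tolerate missing fields (`doc.get(...)`), because no schema forces every document to carry them.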
11. 2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ Columnar
■ Store data in columns (vs. rows in RDBMS)
● Optimized for analytical queries (OLAP)
■ Based on column families
● Like RDBMS tables but without a fixed schema
■ Examples:
● Cassandra, HBase, Accumulo, Bigtable
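The difference between row and column layouts can be sketched in plain Python. This is only an illustration of the storage idea with made-up sales data, not the on-disk format of any of the systems listed.

```python
# Row layout: each record is kept together (as in a classic RDBMS table).
rows = [
    {"city": "Bonn", "year": 2017, "sales": 120},
    {"city": "Bonn", "year": 2018, "sales": 150},
    {"city": "Cologne", "year": 2018, "sales": 200},
]

# Column layout: all values of one attribute are kept together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytical query such as "total sales" touches a single attribute.
total_from_rows = sum(row["sales"] for row in rows)  # visits every full record
total_from_columns = sum(columns["sales"])           # reads one column only
```

Both totals are equal; the column layout just avoids reading the `city` and `year` values that the query does not need.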
12. 2. Big Data is Data Management in the back
● NOSQL (aka. non-relational) = Not Only SQL
○ Graph-oriented
■ Model data in the form of graphs (edges and vertices)
■ Optimal for storing highly interconnected, graph-shaped data
● Query data by traversal
■ Examples:
● Neo4j, InfiniteGraph, Neptune
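"Query data by traversal" can be illustrated with a breadth-first search over an adjacency list. The small social graph and the function `reachable_within` are hypothetical; graph databases expose this through their own query languages rather than hand-written loops.

```python
from collections import deque

# Hypothetical "knows" graph stored as an adjacency list.
graph = {
    "ann": ["bob", "cid"],
    "bob": ["dan"],
    "cid": ["dan"],
    "dan": ["eve"],
    "eve": [],
}


def reachable_within(graph, start, max_hops):
    """Breadth-first traversal: vertices reachable from start in <= max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if hops[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(vertex, []):
            if neighbour not in hops:
                hops[neighbour] = hops[vertex] + 1
                queue.append(neighbour)
    return {v: h for v, h in hops.items() if v != start}


friends_of_friends = reachable_within(graph, "ann", 2)
```

The same question in a relational model would need one self-join per hop, which is where graph-shaped storage pays off.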
13. 2. Big Data is Data Management in the back
● NOSQL and distributed systems (network, shared-data)
○ CAP theorem for designing distributed systems
■ Consistency: every read returns the latest written result
■ Availability: every request gets a result, even a stale one
■ Partition tolerance: the system keeps working despite lost messages between nodes
○ In the presence of P, choose between C and A (tradeoff)
■ C: query errors or times out as requested data is n/a
■ A: query returns out-of-date results
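The C-versus-A choice under a partition can be simulated in a few lines. This is a deliberately simplified sketch: the `Replica` class, the "CP"/"AP" mode labels and the replication function are assumptions made for illustration, and no real networking is involved.

```python
class Replica:
    """Toy replica that either favours consistency (CP) or availability (AP)."""

    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.partitioned = False  # True when cut off from the primary

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot confirm it holds the latest value, so it refuses to answer.
            raise TimeoutError("latest value unavailable during partition")
        return self.value         # AP: answer anyway, possibly with stale data


def write(primary, follower, value):
    """Write to the primary and replicate only if the follower is reachable."""
    primary.value = value
    if not follower.partitioned:
        follower.value = value


def read_after_partition(mode):
    primary, follower = Replica(mode), Replica(mode)
    write(primary, follower, "v1")
    follower.partitioned = True     # the network partition starts here
    write(primary, follower, "v2")  # this update never reaches the follower
    try:
        return follower.read()
    except TimeoutError:
        return "error"


ap_result = read_after_partition("AP")
cp_result = read_after_partition("CP")
```

The AP follower answers with the stale "v1", while the CP follower errors out instead of returning data it cannot vouch for.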
14. 2. Big Data is Data Management in the back
● NOSQL and distributed systems (network, shared-data)
○ CAP theorem for designing distributed systems
■ too simplistic | good to learn the basics
○ PACELC extends CAP
■ P(A|C)E(L|C) = if P choose A or C, Else choose L or C
[Figure: PACELC decision tree. Partition? If yes: Availability vs. Consistency; else: Latency vs. Consistency]
15. 2. Big Data is Data Management in the back
● NOSQL and distributed systems (network, shared-data)
○ BASE of NOSQL (contrasting ACID of RDBMS)
○ Suggested by the same person as the CAP theorem (Eric Brewer)
○ Basically available … guarantees CAP Availability
○ Soft state … system state may change over time
○ Eventual consistency … system will become consistent over time
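Eventual consistency can be sketched with two replicas that accept writes independently and later reconcile. The last-write-wins rule used here is one possible reconciliation strategy chosen for the example, and the logical timestamps are made up; real systems differ in how they resolve conflicts.

```python
def merge(local, remote):
    """Last-write-wins merge: per key, keep the entry with the newest timestamp."""
    merged = dict(local)
    for key, (timestamp, value) in remote.items():
        if key not in merged or timestamp > merged[key][0]:
            merged[key] = (timestamp, value)
    return merged


# Each replica maps key -> (logical timestamp, value) and has seen
# different writes, so their states temporarily disagree (soft state).
replica_1 = {"x": (1, "old"), "y": (5, "y-new")}
replica_2 = {"x": (3, "new"), "y": (2, "y-old")}

# Synchronisation round: the replicas exchange their states and merge them.
replica_1, replica_2 = merge(replica_1, replica_2), merge(replica_2, replica_1)
```

Before the exchange a read of "x" could return "old" or "new" depending on the replica; afterwards both replicas hold the same state.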
16. 2. Big Data is Data Management in the back
● Batch vs. stream vs. interactive processing
○ Batch: computation applied periodically to accumulated (bulk) data
■ Example: Extract-Transform-Load (ETL)
○ Stream (real-time): computation applied to data as soon as it arrives
■ Example: analyse weather sensor data
○ Interactive/iterative: repeated queries or passes over the same data
■ Example: Machine Learning algorithms
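The batch and stream styles can be contrasted on the same computation, here the mean of some made-up temperature readings. The names `batch_mean` and `StreamingMean` are hypothetical; real pipelines would use a processing framework rather than plain Python.

```python
readings = [21.0, 22.5, 19.5, 23.0]  # hypothetical sensor temperatures


def batch_mean(values):
    """Batch: wait until all values are collected, then compute once."""
    return sum(values) / len(values)


class StreamingMean:
    """Stream: update the result incrementally as each value arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


stream = StreamingMean()
running = [stream.update(r) for r in readings]  # a result after every event
final_batch = batch_mean(readings)              # one result at the end
```

The streaming version keeps only a count and the current mean, so it never needs the full dataset in memory, and its last value agrees with the batch result.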
17. 2. Big Data is Data Management in the back
● Lambda vs. Kappa architectures
○ Lambda architecture
■ Three layers:
● Batch
● Speed
● Serving
■ Fault-tolerant
■ Scalable
Source: MapR - Lambda Architecture
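How the three Lambda layers fit together can be sketched with a simple event counter. This is a toy model under stated assumptions: the event lists are invented, and each "layer" is a few lines of Python rather than a real batch or streaming system.

```python
from collections import Counter

# Events already covered by the last batch run, and events that arrived since.
master_dataset = ["a", "b", "a", "c"]
recent_events = ["a", "c", "c"]

# Batch layer: periodically recomputes a complete view from all stored data.
batch_view = Counter(master_dataset)

# Speed layer: incrementally updates a small view covering only recent data.
speed_view = Counter()
for event in recent_events:
    speed_view[event] += 1


def serve(key):
    """Serving layer: answer queries by combining the batch and speed views."""
    return batch_view[key] + speed_view[key]
```

For example, `serve("a")` combines the two occurrences in the batch view with the one seen by the speed layer. After the next batch run the recent events move into the batch view and the speed view is reset.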
18. 2. Big Data is Data Management in the back
● Lambda vs. Kappa architectures
○ Kappa architecture
■ Batch layer omitted => batch is treated as a special case of stream
Source: O'Reilly: Applying the Kappa architecture in the telco industry
19. 2. Big Data is Data Management in the back
● Data Warehouse can be implemented on top of Data Lake
Data Lake | Data Warehouse
Repository of raw data in its original form | A well-structured data repository
Append-only, read-only | Read and write
Schema-on-read (no predefined schema) | Schema-on-write (well predefined schema)
ELT (Extract, Load, Transform) | ETL (Extract, Transform, Load)
Open to any access tools incl. DWH tools | BI and OLAP tools and standards
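The schema-on-read versus schema-on-write distinction can be sketched as follows. The raw JSON lines, the two-field schema (`id`, `temp`) and both helper functions are hypothetical and only illustrate where the schema is enforced.

```python
import json

raw_lines = [
    '{"id": 1, "temp": "21.5"}',
    '{"id": 2}',
    '{"id": 3, "temp": "19.0", "extra": true}',
]

# Data lake: raw records are stored as they are; nothing is rejected.
lake = list(raw_lines)


def read_with_schema(lines):
    """Schema-on-read: interpret the raw data only when it is queried."""
    result = []
    for line in lines:
        record = json.loads(line)
        temp = record.get("temp")
        result.append({"id": int(record["id"]),
                       "temp": float(temp) if temp is not None else None})
    return result


def write_to_warehouse(table, line):
    """Schema-on-write: validate and convert before the record is stored."""
    record = json.loads(line)
    if "id" not in record or "temp" not in record:
        raise ValueError("record does not match the schema")
    table.append((int(record["id"]), float(record["temp"])))


warehouse, rejected = [], []
for line in raw_lines:
    try:
        write_to_warehouse(warehouse, line)
    except ValueError:
        rejected.append(line)

lake_view = read_with_schema(lake)
```

The lake keeps all three records and handles the missing field at read time, whereas the warehouse refuses the incomplete record up front and stores only clean, typed rows.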
20. 3. Think big, think distributed
● Adaptation: now we deal with cluster-wide large scale data
● New essential factors come into play
○ Movement (aka shuffling) of large data
○ Reading and writing of large data
● MUST-know: fault-tolerance, replication, high-availability, distributed file system, in addition to previous concepts
○ Advice: learn them from Hadoop (HDFS), Apache Spark
21. 4. Adopt an “Optimizer” way of thinking
● History: my code works!
● Now: my code works fast
⇒ slow code ~= non-working code
○ How fast does my app get the job done? (performance)
○ How much output does my app generate per unit of time? (throughput)
● Tuning and optimization are your new concerns, e.g.
○ Reduce shuffled (moved) data
○ Reduce data written to/read from disk
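One common way to reduce shuffled data is to pre-aggregate on each partition before moving anything, often called a combiner or map-side aggregation. The sketch below simulates this for a word count in plain Python; the partitions and function names are made up, and no cluster framework is involved.

```python
from collections import Counter

# Hypothetical input, already split over two partitions (machines).
partitions = [["a", "b", "a", "a"], ["b", "b", "c"]]


def shuffle_without_combiner(parts):
    """Every word is sent over the network as its own (word, 1) record."""
    return [(word, 1) for part in parts for word in part]


def shuffle_with_combiner(parts):
    """Each partition pre-aggregates locally and sends one record per word."""
    return [(word, count)
            for part in parts
            for word, count in Counter(part).items()]


def reduce_counts(pairs):
    """Final aggregation on the receiving side."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


naive = shuffle_without_combiner(partitions)
combined = shuffle_with_combiner(partitions)
```

Both variants reduce to the same totals, but the combined version moves fewer records between machines, and the gap widens as keys repeat more often within a partition.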
22. General advice and comments
● Don't move to big data settings if you don't have to
● Don't hesitate to start if you feel like it … it's a lot of fun! :)
● For people who intend to do research in relation to big data
○ "I have an idea, I just need to implement it" becomes
○ "I just have an idea, I need to implement it"
○ Two phases instead of one:
■ 1. Make it work on your single machine
■ 2. Make it work on your cluster, then optimize
○ But it's a lot of fun … still!
● Can all that fade away? Yes, as anything can, but not any time soon
23. Wrap-up
1. Big Data is a Way of thinking not a Domain
2. Big Data is Data Management in the back
3. Think big, think distributed
4. Adopt an “Optimizer” way of thinking
Questions?