1. CS 542 Database Management Systems Parallel and Distributed Databases Introduction to NoSQL databases and MapReduce J Singh February 14, 2011
2. Today’s meeting: Parallel and Distributed Databases, Chapters 20.1 – 20.4 (but only a part of 20.4; the rest comes with Query Optimization); NoSQL Databases; MapReduce. References: selected papers; MapReduce (textbook), Lin & Dyer, 2010.
3. Parallel Databases. Motivation: more transactions per second, or less time per query; throughput vs. response time; speedup vs. scaleup. Database operations are highly parallelizable, e.g., consider a join between R and S on R.b = S.b (see the sketch below).
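To make the parallelism concrete, here is a minimal sketch of a hash-partitioned parallel join, using toy tables of dicts and Python's multiprocessing as a stand-in for parallel workers; the table contents and partition count are invented for illustration, not how a real parallel DBMS is built.

from multiprocessing import Pool

NUM_PARTITIONS = 4

def partition(table, attr, n=NUM_PARTITIONS):
    """Route each tuple to a partition by hashing the join attribute."""
    parts = [[] for _ in range(n)]
    for row in table:
        parts[hash(row[attr]) % n].append(row)
    return parts

def local_join(args):
    """Classic hash join within one partition: build on R, probe with S."""
    r_part, s_part = args
    built = {}
    for r in r_part:
        built.setdefault(r["b"], []).append(r)
    return [dict(r, **s) for s in s_part for r in built.get(s["b"], [])]

def parallel_join(R, S):
    # Matching tuples always hash to the same partition, so each
    # partition pair can be joined independently, in parallel.
    pairs = list(zip(partition(R, "b"), partition(S, "b")))
    with Pool(NUM_PARTITIONS) as pool:
        results = pool.map(local_join, pairs)
    return [row for part in results for row in part]

if __name__ == "__main__":
    R = [{"a": i, "b": i % 3} for i in range(6)]
    S = [{"b": i % 3, "c": i * 10} for i in range(6)]
    print(len(parallel_join(R, S)))  # 12 matching pairs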
6. Date’s Rules for Distributed DBMS. Rule 0: to the user, a distributed system should look exactly like a non-distributed system. Other rules: local autonomy; no reliance on a central site; continuous operation; location independence; fragmentation independence; replication independence; distributed query processing; distributed transaction management; hardware independence; operating system independence; network independence; DBMS independence.
7. Distributed Systems Headaches. Especially when trying to execute transactions that involve data from multiple sites: keeping the databases in sync (two-phase commit for transactions is uniformly hated; we’ll cover it on 4/11); autonomy issues (even within an organization, people tend to be protective of their unit/department); locks and deadlock management. Distribution works better for query processing, since we are only reading the data.
8. Distributed Query Optimization Cost-based approach; consider all plans, pick cheapest; similar to centralized optimization. Communication costs must be considered. Local site autonomy must be respected. New distributed join methods. Query site constructs global plan, with suggested local plans describing processing at each site. If a site can improve suggested local plan, free to do so.
9. Distributed Query Example: find all cities in Africa. SELECT City.Name FROM City, Country WHERE City.CountryCode = Country.Code AND Country.Continent = 'Africa'; Location assumptions: the Country table is on Server A; the City table is on Server B; the request arrives at Server A. Think it through on the whiteboard; one plausible plan is sketched below.
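One plausible plan (an assumption for illustration, not the book's answer): Server A evaluates the selection locally and ships only the qualifying country codes to Server B, which then joins against City without the Country table ever crossing the network. In miniature, with invented data:

# Server A's data
country = [{"Code": "DZA", "Continent": "Africa"},
           {"Code": "FRA", "Continent": "Europe"}]

# Server B's data
city = [{"Name": "Algiers", "CountryCode": "DZA"},
        {"Name": "Paris", "CountryCode": "FRA"}]

# Step 1 (at A): evaluate the selection locally.
african_codes = {c["Code"] for c in country if c["Continent"] == "Africa"}

# Step 2 (ship A -> B): only the small set of codes crosses the network.

# Step 3 (at B): semi-join City against the shipped codes.
result = [c["Name"] for c in city if c["CountryCode"] in african_codes]
print(result)  # ['Algiers']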
10. Detour: Paxos Algorithm. Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants; the problem becomes difficult when the participants or their communication medium may experience failures. The family includes a spectrum of trade-offs among the number of processors, the number of message delays before learning the agreed value, the activity level of individual participants, the number of messages sent, and the types of failures tolerated. Widely used, in Google’s lock-management service among other places. The definitions above are from Wikipedia; feel free to pursue, but it is not on the exam. Contributions from Leslie Lamport, Nancy Lynch, and Barbara Liskov.
11. Updating Distributed Data. Synchronous Replication: all copies of a modified relation (fragment) must be updated before the modifying transaction commits; data distribution is made transparent to users. Asynchronous Replication: copies of a modified relation are only periodically updated, so different copies may get out of sync in the meantime; users must be aware of data distribution. Many current products follow this approach, also referred to as Master/Slave.
12. Synchronous Replication. Voting: a transaction must write a majority of copies to modify an object, and must read enough copies to be sure of seeing at least one most-recent copy. E.g., 10 copies: 7 written for an update, 4 copies read; each copy has a version number (see the sketch below). Usually not attractive because reads are common. Read-any Write-all: writes are slower and reads are faster, relative to Voting; the most common approach to synchronous replication. But what if one of the 10 nodes is down? The choice of technique determines which locks to set.
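A minimal sketch of the voting arithmetic, using the N = 10, W = 7, R = 4 numbers from the slide; the list-based "replicas" are purely illustrative.

# With N copies, writing W and reading R overlap whenever W + R > N,
# so at least one of the R copies read carries the latest version number.
N, W, R = 10, 7, 4
assert W + R > N  # 7 + 4 = 11 > 10: read and write quorums must intersect

# Ten replicas, each tagged with a version; a write updated W of them.
replicas = [("v2", "new") if i < W else ("v1", "old") for i in range(N)]

def quorum_read(copies, r=R):
    """Read any r copies and keep the value with the highest version."""
    sample = copies[-r:]   # even the worst-case choice of r copies ...
    return max(sample)     # ... must include one copy at version v2

print(quorum_read(replicas))  # ('v2', 'new')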
13. Peer-to-Peer Replication More than one of the copies of an object can be a master in this approach. Changes to a master copy must be propagated to other copies somehow. If two master copies are changed in a conflicting manner, this must be resolved. (e.g., Site 1: Joe’s age changed to 35; Site 2: to 36) Best used when conflicts do not arise: E.g., Each master site owns a disjoint fragment. E.g., Updating rights owned by one master at a time.
14. Summary. Parallel DBMSs are designed for scalable performance; relational operators are very well-suited for parallel execution (pipeline and partitioned parallelism). Distributed DBMSs offer site autonomy and distributed administration, but must revisit storage and catalog techniques, concurrency control, and recovery issues. Thus far, we have taken an ad hoc approach to database parallelism and distribution; time to formalize it.
15. Brewer’s Conjecture (p1). Source: Eric Brewer’s July 2000 PODC keynote. Main points: classic “distributed systems” don’t work because they focus on computation, not data; distributing computation is easy, distributing data is hard. DBMS research is (mostly) about ACID, but we forfeit “C” and “I” for availability, graceful degradation, and performance, and this tradeoff is fundamental. BASE: Basically Available, Soft-state, Eventual consistency.
16. Brewer’s Conjecture (p2). BASE: weak consistency, stale data OK; availability first; best effort; approximate answers OK; aggressive (optimistic); simpler, faster, easier evolution. ACID: strong consistency; isolation; focus on “commit”; nested transactions; availability?; conservative (pessimistic); difficult evolution (e.g., schema). “But I think it’s a spectrum.” (Eric Brewer)
17. CAP Theorem (p1). Brewer’s take-home messages: you can have consistency and availability within a cluster, but it is still hard in practice; OS/networking people are good at BASE/availability but terrible at consistency; databases are better at C than at availability; wide-area databases can’t have both; all systems are probabilistic. Since then, Brewer’s conjecture has been formally proved (Gilbert & Lynch, 2002). Thus Brewer’s conjecture became the CAP theorem… and contributed to the birth of the NoSQL movement.
18. CAP Theorem (p2). But the theory is not settled: aren’t availability and partition tolerance the same thing? And shouldn’t we be thinking about latency? Meanwhile, http://nosql-database.org/ lists 122 NoSQL databases. References: Availability and Partition Tolerance, Jeff Darcy, 2010; Problems with CAP…, Dan Abadi, 2010; What does the Proof of the CAP Theorem mean?, Dan Weinreb, 2010.
20. What is NoSQL? Stands for “Not Only SQL”: a class of non-relational data storage systems that usually do not require a fixed table schema and do not use the concept of joins. All NoSQL offerings relax one or more of the ACID properties.
21. Forces at Work. Three major papers were the seeds of the NoSQL movement: the CAP theorem (discussed above); BigTable (Google); and Dynamo (Amazon), with its gossip protocol (discovery and error detection), distributed key-value data store, and eventual consistency. Also, some types of data could not be modeled well in an RDBMS: document storage and indexing; recursive data and graphs; time series data; genomics data.
22. The Perfect Storm. Large datasets, acceptance of alternatives, and dynamically-typed data have come together in a perfect storm. This is not a backlash/rebellion against RDBMS; SQL is a rich query language that cannot be rivaled by the current list of NoSQL offerings.
23. What kinds of NoSQL? NoSQL solutions fall into two major areas. Key/Value, or “the big hash table”: Amazon S3 (Dynamo), Voldemort, Scalaris. Schema-less, which comes in multiple flavors (column-based, document-based, or graph-based): Cassandra (column-based), CouchDB (document-based), Neo4J (graph-based), HBase (column-based).
24. Amazon SimpleDB. Key-value store, written in Erlang (as is CouchDB). Data is modeled in terms of a Domain (a container of entities), an Item (an entity), and Attribute/Value pairs (the properties of an Item). Eventually consistent, except when the ConsistentRead flag is specified. Impressive performance numbers, e.g., 0.7 sec to store 1 million records. SQL-like SELECT: select output_list from domain_name [where expression] [sort_instructions] [limit limit]. A mock sketch of the data model follows.
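A minimal in-memory stand-in for the Domain/Item/Attribute model; the class and method names here are invented for illustration and are not the real AWS API.

class Domain:
    """A Domain holds Items; each Item holds Attribute/Value pairs."""
    def __init__(self, name):
        self.name, self.items = name, {}

    def put_attributes(self, item_name, attrs):
        self.items.setdefault(item_name, {}).update(attrs)

    def get_attributes(self, item_name):
        # A real client could also request a consistent read here.
        return self.items.get(item_name, {})

users = Domain("users")
users.put_attributes("item1", {"name": "Alice", "age": "31"})
print(users.get_attributes("item1"))

# The slide's SELECT grammar in action (query string only; a real
# client would send this to the service):
query = "select name from users where age > '30' limit 10"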
25. Google Datastore. Part of App Engine; also used for internal applications. Used for all storage. Incorporates a transaction model to ensure high consistency: optimistic locking, so transactions can fail. CAP implications: Datastore isn’t just “eventually consistent.” They offer two commercial options (with different prices): Master/Slave, with low latency but also lower availability (asynchronous replication); and High Replication, with strong availability at the cost of higher latency.
26. App Engine Architecture (diagram): req/resp handled by a Python VM process (R/O FS, stdlib, app); stateless APIs: urlfetch, mail, images; stateful APIs: datastore, memcache.
27. Datastore Programming Model (p1). Entities have a Kind, a Key, and Properties. Entity ~~ record ~~ Python dict ~~ Python class instance; Key ~~ structured foreign key, and includes the Kind; Kind ~~ table ~~ Python class; Property ~~ column or field, and has a type. Dynamically typed: property types are recorded per entity. A Key has either an id or a name: the id is auto-assigned; alternatively, the name is set by the app. A key can be a path that includes the parent key, and so on; paths define entity groups, which limit transactions. A transaction locks the root entity (the parentless ancestor key). A sketch follows.
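A sketch using App Engine's Python db API of that era (it needs the App Engine SDK to actually run); the Org/Employee kinds and values are invented for illustration.

from google.appengine.ext import db

class Org(db.Model):            # Kind "Org"
    name = db.StringProperty()

class Employee(db.Model):       # Kind "Employee"
    name = db.StringProperty()
    age = db.IntegerProperty()

org = Org(key_name="acme", name="Acme")    # key name set by the app
org.put()

# parent=org puts joe in org's entity group:
# its key path is ("Org", "acme") / ("Employee", <auto id>)
joe = Employee(parent=org, name="Joe", age=35)
joe.put()

def birthday(emp_key):
    emp = db.get(emp_key)
    emp.age += 1
    emp.put()

# The transaction locks the root entity (org), per the slide.
db.run_in_transaction(birthday, joe.key())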
28. Datastore Programming Model (p2). GQL offers SELECT but no INSERT, UPDATE, or JOIN; use the language bindings for INSERT, etc., and for transaction primitives.
SELECT [* | __key__] FROM <kind> [WHERE <condition> [AND <condition> ...]] [ORDER BY <property> [ASC | DESC] [, <property> [ASC | DESC] …]] [LIMIT [<offset>,]<count>] [OFFSET <offset>]
<condition> := <property> {< | <= | > | >= | = | != } <value>
<condition> := <property> IN <list>
<condition> := ANCESTOR IS <entity or key>
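Continuing the sketch above, a brief example of issuing GQL from Python (it assumes the same Employee kind and org entity, and the App Engine SDK).

# GQL is SELECT-only; writes go through the bindings (put(), transactions).
q = db.GqlQuery("SELECT * FROM Employee WHERE age >= :1 "
                "ORDER BY age DESC LIMIT 10", 30)
for emp in q:
    print(emp.name)

# An ancestor query confines results to one entity group:
q2 = db.GqlQuery("SELECT __key__ FROM Employee WHERE ANCESTOR IS :1", org)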
29. Datastore is Based on BigTable. BigTable provides scalable, structured storage, implemented as a sharded, sorted array. Sharded: each block (tablet) lives on its own server. Sorted: engineered to fetch the results of range queries with the fewest disk reads (a toy illustration follows). Operations, only these six: read, write, delete, update (atomic), prefix scan, range scan. Row names (keys) are up to 64KB; columns are unlimited in size and divided into “column families.”
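A toy illustration of why sorted storage makes prefix and range scans cheap: one binary search to the start of the range, then a sequential read, touching only the blocks that hold the answer. The keys are invented; real BigTable rows live in on-disk tablets, not a Python list.

import bisect

keys = sorted(["com.cnn/a", "com.cnn/b", "com.google/x",
               "com.google/y", "org.wiki/z"])

def range_scan(lo, hi):
    """All keys in [lo, hi): one seek (bisect) + a sequential read."""
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_left(keys, hi)
    return keys[i:j]

def prefix_scan(prefix):
    """A prefix scan is just a range scan ending past the prefix."""
    return range_scan(prefix, prefix + chr(0x10FFFF))

print(prefix_scan("com.google"))  # ['com.google/x', 'com.google/y']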
30. Datastore Entity Implementation. Shown: an index of entities by Kind; an index of entities by property, ascending. Not shown: an index of entities by property, descending; an index by composite property.
32. Datastore Application at Google: for more info, see the video of Ryan Barrett’s talk at Google I/O.
38. Example: Reverse index of words in docs. Input: the crawler yields (url, content) pairs. Map function: map(key = url, value = content): for each word w in content, emit (w, [url, offset]). Reduce function: reduce(key = word, values = list of [url, offset]): sort values, then emit (word, sorted list of [url, offset]). A runnable toy version follows.
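A runnable toy of the slide's map/reduce pair, with a sequential loop standing in for the framework's shuffle; the document contents are invented.

from collections import defaultdict

def map_fn(url, content):
    # Emit (word, (url, offset)) for each word occurrence.
    for offset, word in enumerate(content.split()):
        yield word, (url, offset)

def reduce_fn(word, values):
    # Sort each word's posting list of (url, offset) pairs.
    return word, sorted(values)

def run(docs):
    """Sequential stand-in: shuffle groups all map output by key."""
    groups = defaultdict(list)
    for url, content in docs:
        for word, loc in map_fn(url, content):
            groups[word].append(loc)
    return dict(reduce_fn(w, vs) for w, vs in groups.items())

docs = [("u1", "to be or not to be"), ("u2", "be here now")]
print(run(docs)["be"])  # [('u1', 1), ('u1', 5), ('u2', 0)]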
39. Implementation Questions. Map: how many processors should we use? 4? 32? 1024? Reduce: how many processors? What’s the allocation algorithm for assigning words to processors? These are all design decisions, driven by the size and other characteristics of the problem.
40. Implementation Steps. Split the key/value pairs into M chunks and run a map task on each chunk in parallel. Partition the output of the map tasks into R regions. After all map tasks complete, run a reduce task on each of the R regions. (Why do we need to wait? A reduce task must see every value for its keys before it can start; see the partitioner sketch below.)
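A minimal partitioner sketch; using a stable hash is my assumption here (any deterministic hash shared by all workers would do; Python's built-in hash() of strings is randomized per process, so it would not).

import hashlib

R = 8

def partition(key, r=R):
    # Map output for a key always lands in the same one of R regions,
    # so exactly one reduce task sees all of that key's values.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % r

assert partition("apple") == partition("apple")  # deterministic routing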
41. Fault Tolerance. Problem detection: heartbeat. Remedy: re-execute in-progress and completed map tasks (a completed map task’s output lives on the failed worker’s local disk, so it is lost), and re-execute in-progress reduce tasks only (completed reduce output is already in the global file system).
42. Not limited to data analysis tasks: any task that is parallelizable, e.g., prepare a report for each user. The importance of idempotence: it must be possible to rerun a task. Applications:
- name: prepUserReport
  mapper:
    input_reader: mapreduce.input_readers.DatastoreInputReader
    handler: Service.prepUserReport
    params:
    - name: entity_kind
      default: DataModel.Org
    - name: done_callback
      default: /mr_done/prep
43. Refinements: Usability. In the App Engine environment, automation has been the focus: automatic sharding for faster execution; automatic rate limiting for slow execution; status pages (demo in a minute); counters; parameterized mappers; batching datastore operations; iterating over blob data.
44. Refinement: Redundant Execution. Slow workers significantly delay completion time: other jobs consuming resources on the machine; bad disks with soft errors that transfer data slowly; weird things, like processor caches disabled (!!). Solution: near the end of a phase, spawn backup tasks; whichever one finishes first “wins.” Dramatically shortens job completion time.
45. Refinement: Locality Optimization. Master scheduling policy: ask GFS for the locations of the replicas of the input file’s blocks. Map inputs are typically split into 64MB chunks (the GFS block size), and map tasks are scheduled so that a GFS input-block replica is on the same machine or the same rack. Effect: thousands of machines read input at local-disk speed; without this, rack switches limit the read rate.
46. Refinement: Skipping Bad Records. Map/reduce functions sometimes fail for particular inputs. The best solution is to debug and fix, but that is not always possible (e.g., third-party source libraries). On a segmentation fault: send a UDP packet to the master from the signal handler, including the sequence number of the record being processed. If the master sees two failures for the same record, the next worker is told to skip that record (sketched below).
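A sketch of the master-side bookkeeping this implies; the function names are invented for illustration (the real master is internal to Google's implementation).

from collections import Counter

failures = Counter()
skip_set = set()

def report_crash(record_seqno):
    # Called when the UDP packet from a worker's signal handler arrives.
    failures[record_seqno] += 1
    if failures[record_seqno] >= 2:
        skip_set.add(record_seqno)  # two strikes: skip it from now on

def should_skip(record_seqno):
    # The next worker consults this before processing each record.
    return record_seqno in skip_set

report_crash(1042); report_crash(1042)
assert should_skip(1042) and not should_skip(7)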
47. Refinement: Pipelining. The Hadoop Online Prototype (HOP) supports pipelining within and between MapReduce jobs: push rather than pull. It preserves the simple fault-tolerance scheme, improves job completion time (better cluster utilization), and improves detection and handling of stragglers. The MapReduce programming model is unchanged: clients supply the same job parameters, and the Hadoop client interface is backward compatible, so no changes are required to existing clients (e.g., Pig, Hive, Sawzall, Jaql). Extended to take a series of jobs.
48. MapReduce Statistics @ GOOG Take-away message: MapReduce is not a “new-fangled technology of the future” It is here, it is proven, use it!
49. MapReduce is Still Controversial. Arg 1: MapReduce is a step backwards in database access. But MapReduce is not a database or a data storage and management system; it is an algorithmic technique for the distributed processing of large amounts of data. Arg 2: MapReduce is a poor implementation. But MapReduce is one way to generate indexes from a large volume of data; it is not a data storage and retrieval system. Arg 3: MapReduce is not novel. Hashing, parallel processing, data partitioning, and user-defined functions are all old hat in the RDBMS world, but so what? The big innovation MapReduce enables is distributing data processing across a network of cheap and possibly unreliable computers.
50. MapReduce is Still Controversial (p2). Arg 4: MapReduce is missing features. Arg 5: MapReduce is incompatible with DBMS tools. But the ability to process a huge volume of data quickly, as in web crawling and log analysis, is more important than guaranteeing 100% data integrity and completeness. Arg 6: Even Google is replacing MapReduce. Not much has been written about the replacement; it seems more focused on pipelining and incremental processing.
51. Next meetings. February 21: Mid-Term Exam. February 28: Presentations. Due then: the presentation report and the in-class presentation; please bring the presentation on a thumb drive in PDF or PPT format; the report accompanying the presentation is up to 5 pages; submit it electronically via Turn-In, deadline 2/27 midnight. Also due: a proposal for your project, no more than 1 page and no less than 300 words, including an initial bibliography; it will not be graded independently, but feedback will be provided, and it will feed into your project grade.