Overview of Big data zoo

Data Analysis as a Service
Iou Fag(halv)dag, 2014
Gurvinder Singh, Uninett

What the hype is ...
Cheap commodity hardware with amazing computing and storage capacity
... but this time software has also caught up with the hardware

Hype ingredient list is ...
Cheap commodity hardware
Good network capacity
Software based on the principle of "divide and conquer"
... thus scaling out horizontally

Unstructured Storage
Store data reliably, cheaply and scalably
Hadoop Distributed File System (HDFS)
Divide data into smaller chunks
Heterogeneous storage media support
Similar distributed file systems, e.g. Lustre, IBM GPFS, Ceph, MooseFS
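
The "divide data into smaller chunks" idea can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the function names, the node names, the tiny block size, and the round-robin placement are all assumptions made for the example. Real HDFS uses far larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list[str], replicas: int) -> dict[int, list[str]]:
    """Assign every block to `replicas` distinct nodes, round-robin (toy policy)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replicas)]
        for block in range(num_blocks)
    }


# Hypothetical 1000-byte file, 256-byte blocks, four made-up storage nodes.
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
placement = place_blocks(len(blocks), ["node-a", "node-b", "node-c", "node-d"], replicas=3)

for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on several nodes, losing one node loses no data, and different blocks can be read in parallel from different machines.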

Structured Storage
Store structured data reliably, scalably and with indexing
NoSQL databases to store structured data
HBase and Accumulo store their underlying data in HDFS
Many more in the big data zoo: Cassandra, VoltDB, NuoDB...
BlinkDB offers a tradeoff between accuracy & response time
Full-text search offered by Elasticsearch, Solr
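
The accuracy versus response time tradeoff mentioned for BlinkDB can be illustrated with plain random sampling: answer an aggregate from a sample instead of the full data set, and report an error estimate with it. This is a generic sketch and not BlinkDB's API; the function name, the synthetic data, and the sample sizes are assumptions for the example.

```python
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample; return (estimate, standard error)."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / sample_size ** 0.5
    return estimate, std_error


# Synthetic column of 100,000 values (made up for the example).
values = [float(i % 100) for i in range(100_000)]
exact = statistics.fmean(values)

# Larger samples cost more work but give a tighter error estimate.
results = {n: approximate_mean(values, n) for n in (100, 1_000, 10_000)}
for n, (estimate, std_error) in results.items():
    print(f"sample={n:>6}  estimate={estimate:.2f}  +/- {std_error:.2f}  exact={exact:.2f}")
```

A small sample touches only a fraction of the rows, so it answers faster but with a wider error band.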

Processing
MapReduce methodology to process data in a distributed fashion
Data locality with Hadoop MapReduce and HDFS
Spark supports MapReduce and utilizes the RAM of the system and cluster
Supports machine learning algorithms
Supports Python, Scala, Java
Supports R, a framework for data scientists
Hive, Shark, Pig to process structured data in a distributed way
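
The MapReduce methodology can be shown on a single machine with a word count. This sketch uses no Hadoop or Spark API; the function names and the input lines are assumptions for the example. In a real cluster the map and reduce steps would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data zoo", "big data", "zoo of data"])
print(counts)
```

Each input line is mapped independently and each key is reduced independently, which is what lets the work be divided across machines.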

Some performance numbers to guide ...
L1 cache reference 0.5 ns
L2 cache reference 7 ns
RAM reference 100 ns (Queen)
Flash IO card reference 75,000 ns (Princess)
RTT within same datacenter 500,000 ns
Disk reference 10,000,000 ns
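
To make the numbers above easier to compare, the sketch below expresses each latency as a multiple of a RAM reference. The values are copied from the list above; the dictionary and function names are assumptions for the example.

```python
# Latencies in nanoseconds, as listed above.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "RAM reference": 100,
    "Flash IO card reference": 75_000,
    "RTT within same datacenter": 500_000,
    "Disk reference": 10_000_000,
}


def relative_to(baseline):
    """Return every latency as a multiple of the chosen baseline entry."""
    base = LATENCY_NS[baseline]
    return {name: ns / base for name, ns in LATENCY_NS.items()}


ratios = relative_to("RAM reference")
for name, ratio in ratios.items():
    print(f"{name:<28} {ratio:>12,.3f} x RAM")
```

With these figures a disk reference costs 100,000 RAM references and a round trip inside the datacenter costs 5,000, which is why keeping working data in cluster RAM pays off.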
