Big Data Loading:
Project Voldemort
Big Data Loading
●   So you've processed your data...
●   Now, how do you get it to people quickly?

●   Project Voldemort's Read-Only stores
    ●   Simple key-value store
    ●   Based upon Amazon Dynamo
    ●   Simple Java interface and operation
    ●   Immutable read-only stores
Read-Only Stores
●   Precompute in Hadoop or elsewhere
●   Creates an indexed key-value store
    ●   One reducer (or file) per node
    ●   Replicated data for failover


●   Atomically loads into nodes
    ●   Copy from HDFS or another HTTP source
    ●   Very fast, limited only by network or storage I/O
    ●   Can be throttled so it does not affect live services
●   Can also roll back to previous versions
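
The atomic load and rollback behaviour listed above can be pictured with a small in-memory sketch. This is only an illustration under assumptions, not Voldemort's implementation or API: the class `VersionedReadOnlyStore` and its `swap` and `rollback` methods are hypothetical names, and real read-only stores keep their versions on disk rather than in a map.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; this is not Voldemort's code or API.
final class VersionedReadOnlyStore {
    // Readers always see one complete, immutable version.
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Collections.<String, String>emptyMap());
    // Earlier versions, most recent first, kept for rollback.
    private final Deque<Map<String, String>> previous = new ArrayDeque<>();

    String get(String key) {
        return current.get().get(key);
    }

    // Publish a fully built version with a single reference swap.
    synchronized void swap(Map<String, String> freshlyBuilt) {
        Map<String, String> next =
                Collections.unmodifiableMap(new HashMap<>(freshlyBuilt));
        previous.push(current.getAndSet(next));
    }

    // Restore the most recent earlier version; returns false if none is left.
    synchronized boolean rollback() {
        if (previous.isEmpty()) {
            return false;
        }
        current.set(previous.pop());
        return true;
    }
}
```

Because readers only ever dereference `current`, they see either the old version or the new one, never a half-loaded mix, which is the property the slide calls atomic loading.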
Example Hadoop Store Builder
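// Imports assume json-simple and Voldemort's Hadoop store builder module.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import voldemort.store.readonly.mr.AbstractHadoopStoreBuilderMapper;

public class JsonStoreBuilder
   extends AbstractHadoopStoreBuilderMapper<LongWritable, Text> {

    private final JSONParser parser = new JSONParser();

    // Key: the "name" field of the JSON object on this input line.
    @Override
    public Object makeKey(LongWritable lineNo, Text line) {
        try {
            JSONObject json = (JSONObject) parser.parse(line.toString());
            return json.get("name");
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid JSON at " + lineNo, e);
        }
    }

    // Value: the whole JSON line, stored unchanged.
    @Override
    public Object makeValue(LongWritable lineNo, Text line) {
        return line.toString();
    }
}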
Example Hadoop Job
$VOLDEMORT_HOME/bin/hadoop-build-readonly-store.sh \
  --input hdfs/JsonFile.json \
  --output hdfs/StoreOut \
  --tmpdir hdfs/temp_dir \
  --mapper uk.co.danharvey.hadoop.JsonStoreBuilder \
  --jar hadoop-core.jar \
  --cluster config/cluster.xml \
  --storename example_store \
  --storedefinitions config/store.xml \
  --chunksize 1073741824 \
  --replication 1
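
The `--chunksize` value above is 2^30 bytes, that is, 1 GiB. As a rough sketch, and assuming the flag caps the size of each output chunk file (an assumption about the tool that is worth checking against its own help output), the number of chunk files needed for a given amount of data is a ceiling division. `ChunkMath` and `chunksNeeded` are hypothetical names used only for this illustration.

```java
// Hypothetical helper for illustration; not part of Voldemort.
final class ChunkMath {
    // The value passed to --chunksize in the job above: 2^30 bytes (1 GiB).
    static final long CHUNK_SIZE_BYTES = 1_073_741_824L;

    // Number of files of at most chunkSize bytes needed to hold dataBytes,
    // assuming chunkSize is an upper bound on each file's size.
    static long chunksNeeded(long dataBytes, long chunkSize) {
        return (dataBytes + chunkSize - 1) / chunkSize; // ceiling division
    }
}
```

Under that assumption, a node holding just over 5 GiB of data would end up with six chunk files.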
Pig to JSON Index
●   Output JSON from Pig
        STORE bag INTO 'data.json' USING JsonStorage();


●   JsonStoreBuilder
    ●   Extends Voldemort StoreBuilder
    ●   Easily index any field


●   Code is available here:
    http://github.com/danharvey/pigJsonUtils
