Next Revolution
Toward Open Platform




         Terapot: Massive Email Archiving
         with Hadoop & Friends
             - Commercial Hadoop Application




                              Jaesun Han
                              Founder & CEO of NexR
                              jshan@nexrcorp.com
#2
About NexR

  Offering Hadoop & Cloud Computing Platform and Services

  Hadoop & Cloud Computing Services
    - Hadoop Provisioning & Management
    - Massive Email Archiving
    - MapReduce Workflow
    - Academic Support Program

  Massive Data Storage & Processing Platform

  Cloud Computing Platform (Compatible with Amazon AWS)
    - icube-cc (Compute)
    - icube-sc (Storage)
#3
What is Email Archiving?

     The Objectives of Email Archiving
       -   Regulatory compliance
       -   e-Discovery: Litigation and legal discovery
       -   E-mail backup and disaster recovery
       -   Messaging system & storage optimization
       -   Monitoring of internal and external e-mail content
#4
The Architecture of Email Archiving

    Data Acquisition              Data Processing                      Data Access
        Journaling                       Indexing                            Search
     Mailbox Crawling                    Filtering                          Discovery




    [Diagram: email servers feed the email archiving server through journaling
     and crawling. The server keeps email data in archival storage and builds
     indexes. Employees use search; auditors and administrators use discovery.]
#5
The Challenges of Email Archiving

             Explosive growth of digital data
               - 6 times larger in 2010 (988 EB) than in 2006
               - 95% (939 EB) is unstructured data, including email
               - Increasing the cost and complexity of archiving
               => Requiring scalable & low-cost archiving



             Reinforcement of data retention regulation
               - Retention, Disposal, e-Discovery, Security
               - HIPAA (Healthcare) 21-23 yrs, SEC17 (Trading) 6 yrs,
                 OSHA (Toxic) 30 yrs, SOX (Finance) 5 yrs, J-SOX, K-SOX
               => Requiring scalable archiving & fast discovery


             Needs for intelligent data management
               - Knowledge management from email data
               - Filtering, monitoring, data mining, etc
               => Requiring integration with intelligent systems
#6
New Requirements of Email Archiving



           High Scalability

           Low Cost

           High Performance

           Intelligence
#7
Terapot: When Hadoop Met Email Archiving…
               Scale-out architecture with Hadoop
                 - Hadoop HDFS for archiving email data
                 - Hadoop MapReduce for crawling & indexing
                 - Apache Lucene for search & discovery


     [Diagram: email servers reach the Hadoop cluster by two paths, distributed
      crawling and journaling through a journaling server. Hadoop MapReduce
      handles crawling, indexing, etc.; Hadoop HDFS handles archiving;
      distributed search & discovery runs on top.]
#8
Features of Terapot

  Distributed Massive Email Archiving
  High Scalability by Shared-Nothing Architecture
    - Thousands of servers, billions of emails
  Low Cost by Inexpensive Hardware
    - Entry servers under $5,000
  High Performance by Parallelism
    - Fast search: results within 1-2 seconds for each user account
    - Fast discovery in parallel with MapReduce
  Intelligence by Data Mining
    - Contact network analysis, content analysis, statistics
  Support Both On-premise Version and Cloud (hosted) Version
  Development with Various Open Source Software
#9
The Architecture of Terapot

  Terapot Clients
    - SOAP, REST, JSON
  Email Sources
    - POP3 Server, HTTP/FTP/SFTP Server, Mail Server, NAS/NFS

  Terapot Frontend
    - MR Workflow Manager, MailServer, Search Gateway, Analyzer

  4 key components
    - Batch processing: Crawling, Indexing, Merging
    - Real-Time Indexing
    - Searching
    - Analysis: ETL, Mining

  Built on Hadoop MapReduce, Lucene, & Hive

  Storage
    - HDFS (email)
    - Local (index)
#10
  Batch Processing Component

  Pipeline (runs at a configured period)
    Email Sources -> Crawling (MR) -> Indexing (MR) -> Merging -> Search

  Outputs in HDFS
    - Crawling (MR): an archive file per user (sequence file)
    - Indexing (MR): a temporary index file per user (Lucene index file)
    - Merging: a merged index file (for backing up)

  Outputs on the local file system
    - Index shards (shard 0, shard 1) with 3-copy replication, used by Search

  Archiving policies
    - An archive file per user
    - Several archive files per crawling
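
The crawl, index, and merge stages on this slide can be illustrated with a small single-process sketch. It is not Terapot's code: the slide runs these stages as MapReduce jobs over sequence files and Lucene indexes, while the sketch uses plain Python dictionaries. The function names, the sample emails, and the CRC32-based shard assignment are all assumptions made for illustration.

```python
from collections import defaultdict
import zlib


def crawl(emails):
    """Group raw emails into one archive (here a list) per user."""
    archives = defaultdict(list)
    for email in emails:
        archives[email["user"]].append(email)
    return dict(archives)


def index_user(archive):
    """Build a temporary per-user inverted index: term -> set of message ids."""
    index = defaultdict(set)
    for email in archive:
        for term in email["body"].lower().split():
            index[term].add(email["id"])
    return dict(index)


def shard_of(user, num_shards):
    """Stable shard assignment (hypothetical policy: CRC32 of the user name)."""
    return zlib.crc32(user.encode("utf-8")) % num_shards


def merge(user_indexes, num_shards):
    """Merge per-user indexes into shards keyed by (user, term)."""
    shards = [dict() for _ in range(num_shards)]
    for user, index in user_indexes.items():
        shard = shards[shard_of(user, num_shards)]
        for term, ids in index.items():
            shard[(user, term)] = sorted(ids)
    return shards


# Hypothetical sample data.
emails = [
    {"id": 1, "user": "alice", "body": "Quarterly audit report"},
    {"id": 2, "user": "alice", "body": "Audit follow-up"},
    {"id": 3, "user": "bob", "body": "Lunch on Friday"},
]
archives = crawl(emails)
user_indexes = {user: index_user(a) for user, a in archives.items()}
shards = merge(user_indexes, num_shards=2)
```

Keeping every entry of one user in the same shard means a per-user search only has to touch one shard, which is one plausible reading of the "archive file per user" policy.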
#11
Real-Time Indexing Component



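  - The Journaling Server forwards incoming email to Real-Time Indexing
  - Real-Time Indexing indexes each email into an in-memory real-time index
  - Real-Time Indexing archives each email into a database
  - The Batch Processing Component crawls the database and writes
    archive and index files to HDFS
  - A "Flushing" step links the real-time index and the
    Batch Processing Component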
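
A minimal sketch of the real-time path, assuming a simple in-memory inverted index. The class name, method names, and the flush rule (drop in-memory entries once the batch side reports those message ids as indexed) are assumptions; the slide only shows that flushing exists, not how it is triggered.

```python
from collections import defaultdict


class RealTimeIndex:
    """In-memory index for mail that batch processing has not indexed yet."""

    def __init__(self):
        self.terms = defaultdict(set)  # term -> set of message ids
        self.archived = []             # stand-in for the database on the slide

    def on_journaled(self, email):
        """Handle one email forwarded by the journaling server."""
        self.archived.append(email)    # archiving
        for term in email["body"].lower().split():
            self.terms[term].add(email["id"])  # real-time indexing

    def search(self, term):
        """Return the ids of not-yet-flushed emails containing the term."""
        return sorted(self.terms.get(term.lower(), ()))

    def flush(self, batch_indexed_ids):
        """Drop entries for emails now covered by the batch index (assumed rule)."""
        done = set(batch_indexed_ids)
        for term in list(self.terms):
            self.terms[term] -= done
            if not self.terms[term]:
                del self.terms[term]


# Hypothetical usage.
rt = RealTimeIndex()
rt.on_journaled({"id": 1, "body": "Quarterly audit report"})
rt.on_journaled({"id": 2, "body": "Audit follow-up"})
before = rt.search("audit")
rt.flush([1])
after = rt.search("audit")
```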
#12
Search & Discovery Component


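  - The Search Gateway locates index shards through Zookeeper and runs
    a distributed search over the Search Nodes
  - Zookeeper is used for assigning shards to Search Nodes
  - Search Nodes update shard status in Zookeeper
  - Search Nodes copy index shards from HDFS to the local file system
  - Real-Time Indexing Nodes sit alongside the Search Nodes in the diagram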
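
The shard lookup and fan-out on this slide can be sketched as follows. The registry here is a plain in-memory stand-in for the shard metadata the slide keeps in Zookeeper; it does not use the ZooKeeper client API. All class names, node names, and shard contents are hypothetical.

```python
class ShardRegistry:
    """In-memory stand-in for shard assignment and status metadata."""

    def __init__(self):
        self.assignments = {}  # shard id -> node name
        self.status = {}       # shard id -> "loading" or "ready"

    def assign(self, shard_id, node_name):
        self.assignments[shard_id] = node_name
        self.status[shard_id] = "loading"

    def mark_ready(self, shard_id):
        self.status[shard_id] = "ready"

    def locate(self):
        """Return only shards that a node has finished loading."""
        return {s: n for s, n in self.assignments.items()
                if self.status[s] == "ready"}


class SearchNode:
    def __init__(self, name):
        self.name = name
        self.local_shards = {}

    def load_shard(self, shard_id, shard_index, registry):
        """Copy a shard to local storage, then update its status."""
        self.local_shards[shard_id] = dict(shard_index)
        registry.mark_ready(shard_id)

    def search(self, shard_id, term):
        return self.local_shards[shard_id].get(term, [])


def gateway_search(registry, nodes, term):
    """Fan the query out to every ready shard and merge the hits."""
    hits = []
    for shard_id, node_name in registry.locate().items():
        hits.extend(nodes[node_name].search(shard_id, term))
    return sorted(set(hits))


# Hypothetical shards standing in for index files stored in HDFS.
hdfs_shards = {0: {"audit": [1, 2]}, 1: {"audit": [7], "lunch": [3]}}
registry = ShardRegistry()
nodes = {"node-a": SearchNode("node-a"), "node-b": SearchNode("node-b")}
registry.assign(0, "node-a")
registry.assign(1, "node-b")

nodes["node-a"].load_shard(0, hdfs_shards[0], registry)
partial = gateway_search(registry, nodes, "audit")   # shard 1 not ready yet
nodes["node-b"].load_shard(1, hdfs_shards[1], registry)
full = gateway_search(registry, nodes, "audit")
```

Tracking a status per shard lets the gateway skip shards that are still being copied instead of failing the whole query.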
#13
Data Analysis Component


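  Data flow (all data sets kept in HDFS)
    - ETL (MR, Extract-Transform-Load): email archive files -> Hive table
    - Hive queries run as MR jobs over the Hive table -> analysis results
    - Generating reports: analysis results -> database

  Mining Engine (issues Hive queries)
    - Personal contact network analysis
    - Domain statistics

  Analyzer Web Reporter
    - Serves reports from the database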
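
The two analyses named on this slide, personal contact network analysis and domain statistics, can be illustrated in plain Python. The slide runs them as Hive queries compiled to MapReduce jobs; this sketch only shows the logic on a tiny in-memory data set. The record layout, function names, and addresses are assumptions.

```python
from collections import Counter


def etl(raw_emails):
    """Extract one (sender, recipient) row per recipient, lower-cased."""
    rows = []
    for email in raw_emails:
        sender = email["from"].strip().lower()
        for rcpt in email["to"]:
            rows.append((sender, rcpt.strip().lower()))
    return rows


def contact_network(rows, person):
    """Count how often `person` exchanged mail with each contact."""
    contacts = Counter()
    for sender, rcpt in rows:
        if sender == person:
            contacts[rcpt] += 1
        elif rcpt == person:
            contacts[sender] += 1
    return contacts


def domain_statistics(rows):
    """Count delivered messages per recipient domain."""
    return Counter(rcpt.split("@", 1)[1] for _, rcpt in rows)


# Hypothetical sample data.
raw = [
    {"from": "alice@example.com", "to": ["bob@example.com", "carol@example.org"]},
    {"from": "Bob@example.com", "to": ["alice@example.com"]},
    {"from": "alice@example.com", "to": ["bob@example.com"]},
]
rows = etl(raw)
alice_contacts = contact_network(rows, "alice@example.com")
domains = domain_statistics(rows)
```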
#14
Installation & Quantitative Analysis

  Installation
    - 2 master nodes (HA)
    - 10 worker nodes (datanode, tasktracker, searcher, etc)

  Quantitative Analysis
    Assuming
      - 1000 employees
      - 16 emails per day for each person
      - average email size of 215 KB (content 142 KB + attachment 73 KB)
      - 1.25 GB per year for 1 employee
    Storage
      - index size: about 80% of email
      - compression ratio: about 50%
    Disk volume required for 1 year
      - email archive (HDFS): 1881 GB
      - indexes (HDFS + Local): 4559 GB
      - total: about 6.4 TB per year
    40 TB may cover 6 years of archiving

  Node hardware
    Component   Description                           Qty
    CPU         Intel Xeon Nehalem E5504 2.0GHz       2 (8 cores)
    Memory      DDR3 2GB PC3-10600 Registered DIMM    9 (18GB)
    HDD         1TB 7200 RPM SATA2                    4 (4TB)
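
The sizing figures on this slide can be checked with a short calculation. The inputs come from the slide; decimal units (1 GB = 1,000,000 KB) and an HDFS replication factor of 3 are assumptions, as is reading the 4 TB of disk as a per-worker-node figure. The index volume is taken from the slide as given rather than derived, because the slide does not state how many index copies it counts.

```python
# Inputs stated on the slide.
EMPLOYEES = 1000
EMAILS_PER_DAY = 16
AVG_EMAIL_KB = 215
COMPRESSION_RATIO = 0.5
SLIDE_ARCHIVE_GB = 1881
SLIDE_INDEX_GB = 4559
WORKER_NODES = 10

# Assumptions not stated on the slide.
DAYS_PER_YEAR = 365
HDFS_REPLICATION = 3   # assumed replication factor
DISK_TB_PER_NODE = 4   # assumes the hardware table describes one worker node

# Yearly volume per employee, in decimal GB.
per_employee_gb = EMAILS_PER_DAY * AVG_EMAIL_KB * DAYS_PER_YEAR / 1e6

# Compressed and replicated archive volume for the whole company.
raw_gb = per_employee_gb * EMPLOYEES
archive_gb = raw_gb * COMPRESSION_RATIO * HDFS_REPLICATION

# Total yearly volume using the slide's own archive and index figures.
total_tb = (SLIDE_ARCHIVE_GB + SLIDE_INDEX_GB) / 1000
cluster_tb = WORKER_NODES * DISK_TB_PER_NODE
years_covered = cluster_tb / total_tb

print(f"per employee: {per_employee_gb:.2f} GB/year")
print(f"archive estimate: {archive_gb:.0f} GB (slide: {SLIDE_ARCHIVE_GB} GB)")
print(f"total: {total_tb:.2f} TB/year, {cluster_tb} TB lasts {years_covered:.1f} years")
```

Under these assumptions the archive estimate lands within a few GB of the slide's 1881 GB, and 40 TB of raw disk covers a little over 6 years at about 6.4 TB per year.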
#15
Demonstration
For more information
  -   www.nexrcorp.com
  -   www.terapot.com
  -   jshan@nexrcorp.com
  -   @jaesun_han




       www.nexrcorp.com
  Hadoop & Cloud Computing
          Company

Más contenido relacionado

La actualidad más candente

Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xNPN Training
 
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic
 
Cidr11 paper32
Cidr11 paper32Cidr11 paper32
Cidr11 paper32jujukoko
 
2012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum22012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum2Wilfried Hoge
 
Liquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANALiquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANASAP Technology
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningJoão Gabriel Lima
 
The 25 Most Promising Open Source Projects
The 25 Most Promising Open Source ProjectsThe 25 Most Promising Open Source Projects
The 25 Most Promising Open Source Projectsaf83
 
Implementation of nosql for robotics
Implementation of nosql for roboticsImplementation of nosql for robotics
Implementation of nosql for roboticsJoão Gabriel Lima
 
"A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ..."A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ...Lucidworks (Archived)
 
Realtime hadoopsigmod2011
Realtime hadoopsigmod2011Realtime hadoopsigmod2011
Realtime hadoopsigmod2011iammutex
 
Building Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with CascadingBuilding Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with CascadingPaco Nathan
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceTed Dunning
 
How Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDBHow Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDBJeremy Taylor
 
MongoDB on Windows Azure
MongoDB on Windows AzureMongoDB on Windows Azure
MongoDB on Windows AzureJeremy Taylor
 
MongoDB on Windows Azure
MongoDB on Windows AzureMongoDB on Windows Azure
MongoDB on Windows AzureJeremy Taylor
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 
Dynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 PresentationDynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 PresentationShanley Kane
 

La actualidad más candente (20)

Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.xModule 01 - Understanding Big Data and Hadoop 1.x,2.x
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
 
Sasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation DefenseSasa Nesic - PhD Dissertation Defense
Sasa Nesic - PhD Dissertation Defense
 
Cidr11 paper32
Cidr11 paper32Cidr11 paper32
Cidr11 paper32
 
2012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum22012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum2
 
Liquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANALiquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANA
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learning
 
The 25 Most Promising Open Source Projects
The 25 Most Promising Open Source ProjectsThe 25 Most Promising Open Source Projects
The 25 Most Promising Open Source Projects
 
Implementation of nosql for robotics
Implementation of nosql for roboticsImplementation of nosql for robotics
Implementation of nosql for robotics
 
"A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ..."A Study of I/O and Virtualization Performance with a Search Engine based on ...
"A Study of I/O and Virtualization Performance with a Search Engine based on ...
 
Realtime hadoopsigmod2011
Realtime hadoopsigmod2011Realtime hadoopsigmod2011
Realtime hadoopsigmod2011
 
Building Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with CascadingBuilding Enterprise Apps for Big Data with Cascading
Building Enterprise Apps for Big Data with Cascading
 
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected IntelligenceHadoop summit EU - Crowd Sourcing Reflected Intelligence
Hadoop summit EU - Crowd Sourcing Reflected Intelligence
 
Science & technology (s&t) cloud2
Science & technology (s&t) cloud2Science & technology (s&t) cloud2
Science & technology (s&t) cloud2
 
How Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDBHow Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDB
 
MongoDB on Windows Azure
MongoDB on Windows AzureMongoDB on Windows Azure
MongoDB on Windows Azure
 
MongoDB on Windows Azure
MongoDB on Windows AzureMongoDB on Windows Azure
MongoDB on Windows Azure
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Dynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 PresentationDynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 Presentation
 

Destacado

[Hadoop] NexR Terapot: Massive Email Archiving
[Hadoop] NexR Terapot: Massive Email Archiving[Hadoop] NexR Terapot: Massive Email Archiving
[Hadoop] NexR Terapot: Massive Email ArchivingJinho Jung
 
Aacte Junio 2008
Aacte Junio 2008Aacte Junio 2008
Aacte Junio 2008roke
 
Serious Games und Social Media: Ein Zukunftsmarkt
Serious Games und Social Media: Ein ZukunftsmarktSerious Games und Social Media: Ein Zukunftsmarkt
Serious Games und Social Media: Ein ZukunftsmarktJohannes Konert
 
CFW Domestic Sprinkler Regulations - Bafsa Fire Sprinkler Wales
CFW Domestic Sprinkler Regulations - Bafsa Fire Sprinkler WalesCFW Domestic Sprinkler Regulations - Bafsa Fire Sprinkler Wales
CFW Domestic Sprinkler Regulations - Bafsa Fire Sprinkler WalesRae Davies
 
Eventi GEOWEB
Eventi GEOWEBEventi GEOWEB
Eventi GEOWEBGEOWEB
 
Aprendiendo A Ver La Escultura
Aprendiendo A Ver La EsculturaAprendiendo A Ver La Escultura
Aprendiendo A Ver La Esculturacarolinaperez_76
 
Hoja de afiliación
Hoja de afiliaciónHoja de afiliación
Hoja de afiliaciónfontaine18
 
Syllabus propedéutica y terapéutica ocular ciclo 2 2015
Syllabus propedéutica y terapéutica ocular ciclo 2 2015Syllabus propedéutica y terapéutica ocular ciclo 2 2015
Syllabus propedéutica y terapéutica ocular ciclo 2 2015Universidad Técnica de Manabí
 
Skills Portfolio 2010
Skills Portfolio 2010Skills Portfolio 2010
Skills Portfolio 2010JacquiBIUK
 
Presentación Progestion Occidente 2009 Gcv Mp3
Presentación Progestion Occidente 2009   Gcv   Mp3Presentación Progestion Occidente 2009   Gcv   Mp3
Presentación Progestion Occidente 2009 Gcv Mp3GECIVI
 
Ethical hacking Chapter 2 - TCP/IP - Eric Vanderburg
Ethical hacking   Chapter 2 - TCP/IP - Eric VanderburgEthical hacking   Chapter 2 - TCP/IP - Eric Vanderburg
Ethical hacking Chapter 2 - TCP/IP - Eric VanderburgEric Vanderburg
 
Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF
Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SFAmeet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF
Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SFMLconf
 
De conversation manager extended oct 10
De conversation manager extended oct 10De conversation manager extended oct 10
De conversation manager extended oct 10Steven Van Belleghem
 
EV.Cloud Email Archiving
EV.Cloud Email ArchivingEV.Cloud Email Archiving
EV.Cloud Email Archivingcrussell79
 

Destacado (20)

[Hadoop] NexR Terapot: Massive Email Archiving
[Hadoop] NexR Terapot: Massive Email Archiving[Hadoop] NexR Terapot: Massive Email Archiving
[Hadoop] NexR Terapot: Massive Email Archiving
 
ElasticInbox
ElasticInboxElasticInbox
ElasticInbox
 
Aacte Junio 2008
Aacte Junio 2008Aacte Junio 2008
Aacte Junio 2008
 
Madres y blogs
Madres y blogsMadres y blogs
Madres y blogs
 
Serious Games und Social Media: Ein Zukunftsmarkt
Serious Games und Social Media: Ein ZukunftsmarktSerious Games und Social Media: Ein Zukunftsmarkt
Serious Games und Social Media: Ein Zukunftsmarkt
 
CFW Domestic Sprinkler Regulations - Bafsa Fire Sprinkler Wales
CFW Domestic Sprinkler Regulations - Bafsa Fire Sprinkler WalesCFW Domestic Sprinkler Regulations - Bafsa Fire Sprinkler Wales
CFW Domestic Sprinkler Regulations - Bafsa Fire Sprinkler Wales
 
Eventi GEOWEB
Eventi GEOWEBEventi GEOWEB
Eventi GEOWEB
 
Aprendiendo A Ver La Escultura
Aprendiendo A Ver La EsculturaAprendiendo A Ver La Escultura
Aprendiendo A Ver La Escultura
 
Hoja de afiliación
Hoja de afiliaciónHoja de afiliación
Hoja de afiliación
 
Syllabus propedéutica y terapéutica ocular ciclo 2 2015
Syllabus propedéutica y terapéutica ocular ciclo 2 2015Syllabus propedéutica y terapéutica ocular ciclo 2 2015
Syllabus propedéutica y terapéutica ocular ciclo 2 2015
 
Skills Portfolio 2010
Skills Portfolio 2010Skills Portfolio 2010
Skills Portfolio 2010
 
EXPEDIA.ES
EXPEDIA.ESEXPEDIA.ES
EXPEDIA.ES
 
Presentación Progestion Occidente 2009 Gcv Mp3
Presentación Progestion Occidente 2009   Gcv   Mp3Presentación Progestion Occidente 2009   Gcv   Mp3
Presentación Progestion Occidente 2009 Gcv Mp3
 
Ethical hacking Chapter 2 - TCP/IP - Eric Vanderburg
Ethical hacking   Chapter 2 - TCP/IP - Eric VanderburgEthical hacking   Chapter 2 - TCP/IP - Eric Vanderburg
Ethical hacking Chapter 2 - TCP/IP - Eric Vanderburg
 
Paseo Por Granada
Paseo Por GranadaPaseo Por Granada
Paseo Por Granada
 
Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF
Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SFAmeet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF
Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF
 
De conversation manager extended oct 10
De conversation manager extended oct 10De conversation manager extended oct 10
De conversation manager extended oct 10
 
Plan de convivencia nebrija
Plan de convivencia nebrijaPlan de convivencia nebrija
Plan de convivencia nebrija
 
Diagnóstico y tratamiento del acné
Diagnóstico y tratamiento del acnéDiagnóstico y tratamiento del acné
Diagnóstico y tratamiento del acné
 
EV.Cloud Email Archiving
EV.Cloud Email ArchivingEV.Cloud Email Archiving
EV.Cloud Email Archiving
 

Similar a Hw09 Terapot Email Archiving With Hadoop

Enterprise linked data clouds
Enterprise linked data cloudsEnterprise linked data clouds
Enterprise linked data cloudsdamienjoyce
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic WebNuxeo
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrDataWorks Summit
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlKhanderao Kand
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache HadoopHortonworks
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopDataWorks Summit
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopCloudera, Inc.
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...Cloudera, Inc.
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkLaxmi8
 
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Cloudera, Inc.
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Cloudera, Inc.
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summitdrewz lin
 

Similar a Hw09 Terapot Email Archiving With Hadoop (20)

Enterprise linked data clouds
Enterprise linked data cloudsEnterprise linked data clouds
Enterprise linked data clouds
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosqlBig data hadoop ecosystem and nosql
Big data hadoop ecosystem and nosql
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
Crowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over HadoopCrowd-Sourced Intelligence Built into Search over Hadoop
Crowd-Sourced Intelligence Built into Search over Hadoop
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...
HBaseCon 2012 | Overcoming Data Deluge with HBase to Help Save the Environmen...
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
 
Solr -
Solr - Solr -
Solr -
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summit
 





Hw09 Terapot Email Archiving With Hadoop

  • 1. Next Revolution Toward Open Platform
      Terapot: Massive Email Archiving with Hadoop & Friends (Commercial Hadoop Application)
      Jaesun Han, Founder & CEO of NexR, jshan@nexrcorp.com
  • 2. About NexR: Offering Hadoop & Cloud Computing Platform and Services
      - Hadoop & Cloud Computing Services: Hadoop Provisioning & Management, Massive Email Archiving, MapReduce Workflow, Academic Support Program
      - Massive Data Storage & Processing Platform
      - Cloud Computing Platform (compatible with Amazon AWS): icube-cc (Compute), icube-sc (Storage)
  • 3. What is Email Archiving?
      The objectives of email archiving:
      - Regulatory compliance
      - e-Discovery: litigation and legal discovery
      - E-mail backup and disaster recovery
      - Messaging system & storage optimization
      - Monitoring of internal and external e-mail content
  • 4. The Architecture of Email Archiving
      - Data Acquisition: Journaling, Mailbox Crawling
      - Data Processing: Indexing, Filtering
      - Data Access: Search, Discovery
      - Diagram labels: Email Servers (journaling, crawling), Email Archiving Server (indexing), Indexes, Archival Storage (email data), Search (employee), Discovery (auditor, administrator)
  • 5. The Challenges of Email Archiving
      - Explosive growth of digital data
          - Six times more data in 2010 (988 XB) than in 2006
          - 95% (939 XB) is unstructured data, including email
          - Increasing cost and complexity of archiving
          - Requires scalable, low-cost archiving
      - Reinforcement of data retention regulation
          - Retention, disposal, e-Discovery, security
          - HIPAA (healthcare) 21 to 23 years, SEC17 (trading) 6 years, OSHA (toxic) 30 years, SOX (finance) 5 years, J-SOX, K-SOX
          - Requires scalable archiving and fast discovery
      - Need for intelligent data management
          - Knowledge management from email data
          - Filtering, monitoring, data mining, etc.
          - Requires integration with intelligent systems
  • 6. New Requirements of Email Archiving
      - High Scalability
      - Low Cost
      - High Performance
      - Intelligence
  • 7. Terapot: When Hadoop Met Email Archiving…
      - Scale-out architecture with Hadoop
          - Hadoop HDFS for archiving email data
          - Hadoop MapReduce for crawling & indexing
          - Apache Lucene for search & discovery
      - Diagram labels: Email Servers, Journaling Server, Distributed Crawling, Journaling, Hadoop MapReduce (Crawling, Indexing, etc.), Hadoop HDFS (Archiving), Distributed Search & Discovery
  • 8. Features of Terapot
      - Distributed massive email archiving
      - High scalability through a shared-nothing architecture
          - Thousands of servers, billions of emails
      - Low cost through inexpensive hardware
          - Entry servers under $5,000
      - High performance through parallelism
          - Fast search, under 1-2 seconds for each user account
          - Fast discovery in parallel with MapReduce
      - Intelligence through data mining
          - Contact network analysis, content analysis, statistics
      - Supports both an on-premise version and a cloud (hosted) version
      - Developed with various open source software
  • 9. The Architecture of Terapot
      - Terapot Clients: HTTP/SOAP, REST, JSON
      - Email Sources: POP3 Mail Server, NAS/NFS Server, FTP/SFTP Server
      - Terapot Frontend: MR Workflow Manager, MailServer, Search Gateway, Analyzer
      - 4 key components: Batch Processing (Crawling, Indexing, Merging), Real-Time Indexing, Searching, Analysis (ETL, Mining)
      - Platform: Hadoop MapReduce, Lucene, & Hive
      - Storage: HDFS (email), Local (index)
  • 10. Batch Processing Component
      - Crawling (MR): reads from the email sources and writes an archive file per user (sequence file) to HDFS
      - Archiving policies
          - An archive file per user
          - Several archive files per configured period
      - Indexing (MR): a temporary index file per user (Lucene index file)
      - Merging: a merged index file (for backing up) and index shards (shard 0, shard 1; 3-copy replication)
      - Search: runs against index shards on the local file system
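The batch flow on slide 10 (crawl into per-user archives, build a temporary index per user, merge into shards) can be illustrated with a small in-memory Python sketch. It uses neither Hadoop nor Lucene; the function names, the whitespace tokenizer, and the CRC32-based shard assignment are all assumptions made for illustration, not Terapot's actual implementation.

```python
from collections import defaultdict
import zlib

def crawl(emails):
    """Group raw (user, message_id, body) tuples into one archive per user."""
    archives = defaultdict(list)
    for user, message_id, body in emails:
        archives[user].append((message_id, body))
    return archives

def index_user(archive):
    """Build a temporary inverted index (term -> message ids) for one user."""
    index = defaultdict(set)
    for message_id, body in archive:
        for term in body.lower().split():
            index[term].add(message_id)
    return index

def shard_of(user, num_shards):
    """Hypothetical shard assignment: stable CRC32 hash of the user name."""
    return zlib.crc32(user.encode("utf-8")) % num_shards

def merge(user_indexes, num_shards):
    """Merge per-user indexes into shards keyed by (user, term)."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for user, index in user_indexes.items():
        shard = shards[shard_of(user, num_shards)]
        for term, ids in index.items():
            shard[(user, term)] |= ids
    return shards

emails = [
    ("alice", "m1", "Quarterly audit report"),
    ("alice", "m2", "Audit follow-up"),
    ("bob", "m3", "Lunch on Friday"),
]
archives = crawl(emails)
user_indexes = {user: index_user(archive) for user, archive in archives.items()}
shards = merge(user_indexes, num_shards=2)
```

Keeping all of a user's terms in one shard mirrors the per-user archive and index files on the slide: a query for a single account then only has to touch one shard.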
  • 11. Real-Time Indexing Component
      - Diagram labels: Journaling Server, Forwarding, Database, Memory Indexing, Real-Time Indexing, Real-Time Archiving, Index Flushing, HDFS (archive, index), Crawling, Batch Processing Component
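Slide 11 names memory indexing and index flushing without giving details. One common way to combine the two is sketched below in Python, purely as an illustration: the class name, the count-based flush threshold, and the list that stands in for index files on HDFS are assumptions, not something the slides describe.

```python
from collections import defaultdict

class RealTimeIndex:
    """In-memory index that is flushed once it holds `flush_threshold` emails."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.memory_index = defaultdict(set)  # term -> message ids
        self.pending = 0                      # emails indexed since last flush
        self.flushed_segments = []            # stand-in for index files on HDFS

    def add(self, message_id, body):
        """Index one journaled email in memory, flushing when the buffer is full."""
        for term in body.lower().split():
            self.memory_index[term].add(message_id)
        self.pending += 1
        if self.pending >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Move the in-memory index into the list of flushed segments."""
        if self.pending == 0:
            return
        self.flushed_segments.append(dict(self.memory_index))
        self.memory_index = defaultdict(set)
        self.pending = 0

    def search(self, term):
        """Look up a term in both the memory index and the flushed segments."""
        term = term.lower()
        hits = set(self.memory_index.get(term, set()))
        for segment in self.flushed_segments:
            hits |= segment.get(term, set())
        return hits

rt = RealTimeIndex(flush_threshold=2)
rt.add("m1", "Server outage notice")
rt.add("m2", "Outage resolved")
rt.add("m3", "New outage ticket")
```

Searching both the buffer and the flushed segments is what keeps freshly journaled mail visible before the next flush.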
  • 12. Search & Discovery Component
      - Search Gateway: locating index shards, distributed search
      - Search Nodes: copy index shards from HDFS to the local file system
      - Real-Time Indexing Nodes
      - Zookeeper: assigning shards, updating shard status
      - HDFS: index shards
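The gateway role on slide 12 (locate the index shards, fan a query out, combine the answers) follows a scatter-gather pattern. The Python sketch below illustrates that pattern only: the in-memory `SHARDS` dictionary stands in for both the ZooKeeper shard registry and the search nodes, and the function names and scores are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard registry: shard id -> {term: [(message_id, score), ...]}.
SHARDS = {
    "shard-0": {"audit": [("m1", 0.9), ("m7", 0.4)]},
    "shard-1": {"audit": [("m3", 0.7)], "lunch": [("m4", 0.5)]},
}

def locate_shards(registry):
    """Stand-in for looking up the live shards (ZooKeeper on the slide)."""
    return sorted(registry)

def search_shard(registry, shard_id, term):
    """Search a single shard and return its local hits for the term."""
    return registry[shard_id].get(term, [])

def distributed_search(registry, term, top_k=10):
    """Fan the query out to every shard and merge hits by descending score."""
    shard_ids = locate_shards(registry)
    with ThreadPoolExecutor(max_workers=max(1, len(shard_ids))) as pool:
        partial = pool.map(lambda sid: search_shard(registry, sid, term), shard_ids)
        hits = [hit for shard_hits in partial for hit in shard_hits]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:top_k]

results = distributed_search(SHARDS, "audit")
```

Because each shard is searched independently, adding search nodes spreads the work without changing the gateway logic.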
  • 13. Data Analysis Component
      - Mining Engine: personal contact network analysis, domain statistics
      - ETL (Extract-Transform-Load, MR): email archive files on HDFS are loaded into a Hive table
      - Analyzer: Hive queries, run as MR jobs, produce analysis results
      - Web Reporter: generating reports
      - Diagram labels: email archive files, Hive table, analysis results, reports, database, HDFS
  • 14. Installation & Quantitative Analysis
      - Quantitative analysis, assuming:
          - 1000 employees
          - 16 emails per day for each person
          - 215 KB average email size (content 142 KB + attachment 73 KB)
          - 1.25 GB per year for 1 employee
      - Storage
          - Index size: about 80% of email
          - Compression ratio: about 50%
      - Disk volume required for 1 year
          - Email archive (HDFS): 1881 GB
          - Indexes (HDFS + Local): 4559 GB
          - Total: about 6.4 TB per year
      - 40 TB may cover 6 years of archiving
      - Installation: 2 HA master nodes, 10 worker nodes (datanode, tasktracker, searcher, etc.)
      - Hardware (Description, Qty)
          - CPU: Intel Xeon Nehalem E5504 2.0GHz (8 cores), Qty 2
          - Memory: DDR3 2GB PC3-10600 Registered DIMM (18GB), Qty 9
          - HDD: 1TB 7200 RPM SATA2 (4TB), Qty 4
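The sizing figures on slide 14 can be partly re-derived with a few lines of Python. This is a rough check, not the author's calculation: the replication factor of 3 (the HDFS default) and decimal units are my assumptions, and the slide does not say how its 1881 GB archive figure was computed. The index volume is taken from the slide as-is rather than derived.

```python
EMPLOYEES = 1000
EMAILS_PER_DAY = 16
AVG_EMAIL_KB = 215
DAYS_PER_YEAR = 365
COMPRESSION_RATIO = 0.5      # from the slide: about 50 %
REPLICATION = 3              # assumption: HDFS default replication factor
KB_PER_GB = 1000 * 1000      # assumption: decimal units

# Raw email volume per employee and for the whole company, per year.
per_employee_gb = EMAILS_PER_DAY * AVG_EMAIL_KB * DAYS_PER_YEAR / KB_PER_GB
raw_gb = per_employee_gb * EMPLOYEES

# Compressed, replicated archive volume on HDFS.
archive_gb = raw_gb * COMPRESSION_RATIO * REPLICATION

# Figures taken directly from the slide.
SLIDE_ARCHIVE_GB = 1881
SLIDE_INDEX_GB = 4559
CLUSTER_TB = 40

total_tb_per_year = (SLIDE_ARCHIVE_GB + SLIDE_INDEX_GB) / 1000
years_covered = CLUSTER_TB / total_tb_per_year

print(f"per employee: {per_employee_gb:.2f} GB/year")
print(f"archive on HDFS: {archive_gb:.0f} GB/year (slide: {SLIDE_ARCHIVE_GB} GB)")
print(f"total: {total_tb_per_year:.2f} TB/year, {CLUSTER_TB} TB lasts {years_covered:.1f} years")
```

Under these assumptions the derived archive volume lands within about one percent of the slide's figure, and the slide's own totals are consistent with 40 TB covering roughly six years.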
  • 16. For more information
      - www.nexrcorp.com
      - www.terapot.com
      - jshan@nexrcorp.com
      - @jaesun_han
      Hadoop & Cloud Computing Company