29.03.12                       SysFera




              Big Data
            Technologies
                SysFera
           Benjamin Depardon




SysFera
• 2001: Research project from the Graal team
  (Inria/ENS)
      – DIET: grid middleware
• 2007: SysFera-DS used within the Décrypthon
  project
      – Used in production 24/7/365 since then
      – Selected by IBM to replace Univa-UD
• 2010: Creation of SysFera, INRIA spin-off
• 2012: A team of 14 (R&D: 4 engineers and 5 PhDs)
      – Supported by two experts from INRIA and ENS
      – SysFera-DS





What is Big Data?
• All kinds of data
• Valuable insight, but difficult to
  extract
• Several dimensions
      – Variety
           • Structured/unstructured
           • Text, audio, video…
      – Velocity
           • Time sensitivity
           • Streaming
      – Volume
           • Large files
           • Small files in large quantities
      – Variability
           • Different meanings/formats over
             different time periods






What can you do with Big Data?
• Analyze Information in Motion
      – Smart Grid management
      – Multimodal surveillance
      – Real-time promotions
      – Cyber security
      – ICU monitoring
      – Options trading
      – Click-stream analysis
      – CDR processing
      – IT log analysis
      – RFID tracking & analysis
• Analyze a Variety of Information
      – Social media/sentiment analysis
      – Geospatial analysis
      – Brand strategy
      – Scientific research
      – Epidemic early warning system
      – Market analysis
      – Video analysis
      – Audio analysis
• Discovery & Experimentation
      – Sentiment analysis
      – Brand strategy
      – Scientific research
      – Ad-hoc analysis
      – Model development
      – Hypothesis testing
      – Transaction analysis to create insight-based product/service offerings
• Analyze Extreme Volumes of Information
      – Transaction analysis to create insight-based product/service offerings
      – Fraud modeling & detection
      – Risk modeling & management
      – Social media/sentiment analysis
      – Environmental analysis
• Manage and Plan
      – Operational analytics – BI reporting
      – Planning and forecasting analysis
      – Predictive analysis
      – …




What can you do with Big Data?
• Financial Services
      – Fraud detection
      – Risk management
      – 360° View of the Customer
• Utilities
      – Weather impact analysis on power generation
      – Transmission monitoring
      – Smart grid management
• Transportation
      – Weather and traffic impact on logistics and fuel consumption
• IT
      – Transition log analysis for multiple transactional systems
      – Cybersecurity
• Health & Life Sciences
      – Epidemic early warning system
      – ICU monitoring
      – Remote healthcare monitoring
• Retail
      – 360° View of the Customer
      – Click-stream analysis
      – Real-time promotions
• Telecommunications
      – CDR processing
      – Churn prediction
      – Geomapping / marketing
      – Network monitoring
• Law Enforcement
      – Real-time multimodal surveillance
      – Situational awareness
      – Cyber security detection




What do you need?
• Hardware
      – Storage capacity
      – Computing power
• Software
      – Storage
           • Filesystems
           • Databases
      – Computation framework






   DISTRIBUTED FILESYSTEMS








HDFS
• Hadoop Distributed File System
• Open source (Apache)
• Design
      –    High throughput instead of low latency
      –    Large data sets (large files), data locality
      –    Fault tolerance (replication)
      –    Write-once, read-many (WORM)
      –    Userspace
• Limitations
      –    Write-once model
      –    Cannot be mounted by existing OS
      –    No quotas/access permissions
      –    Name node is a single point of failure
• Used by Yahoo, Twitter, Rackspace, LinkedIn, Facebook…
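
The "large files" and "replication" design points can be sketched as follows. This is a minimal illustration, not HDFS's actual placement policy (which is rack-aware): the function `place_blocks`, the datanode names, the 64 MiB block size and the replication factor of 3 are all assumptions chosen for the example.

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Split a file into blocks and give each block `replication` distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(n_blocks):
        first = block_id % len(datanodes)
        # Consecutive nodes starting at `first`, so replicas never share a node.
        placement[block_id] = [
            datanodes[(first + i) % len(datanodes)] for i in range(replication)
        ]
    return placement

MiB = 1 << 20
layout = place_blocks(200 * MiB, 64 * MiB, ["dn1", "dn2", "dn3", "dn4"])
for block_id, replicas in layout.items():
    print(block_id, replicas)
```

Losing any single datanode in this sketch still leaves two copies of every block, which is the point of the replication bullet.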






GlusterFS
•    Open source (GPLv3) NAS file system
•    Runs in userspace
•    File-based distributed mirroring,
     replication, striping, load balancing
•    FUSE, POSIX compliant
•    Storage quotas
•    No meta-data server (fully distributed
     architecture, elastic hash)
•    Unified global namespace:
     aggregation of disk and memory in a
     single pool
•    Data is stored in logical volumes that
     are abstracted from the hardware and
     logically partitioned from each other
•    Multiprotocol client support:
     GlusterFS native, NFS, CIFS, HTTP,
     WebDAV, FTP
•    Real-time self-healing
•    VM live replication
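
The "no meta-data server" point can be illustrated with a small sketch: every client computes a file's location from its path alone, so no central lookup is needed. This is not GlusterFS's actual elastic hash algorithm; the brick names and the function `locate` are hypothetical.

```python
import hashlib

# Hypothetical brick names, for illustration only.
BRICKS = ["server1:/brick", "server2:/brick", "server3:/brick"]

def locate(path, bricks=BRICKS):
    """Pick the brick for a file from a hash of its path (no lookup table)."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # Interpret the first 4 bytes as an unsigned integer and map it to a brick.
    bucket = int.from_bytes(digest[:4], "big") % len(bricks)
    return bricks[bucket]

for name in ["/data/a.csv", "/data/b.csv", "/logs/2012-03-29.log"]:
    print(name, "->", locate(name))
```

Because the mapping is a pure function of the path, any two clients agree on where a file lives without talking to each other.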





LUSTRE
• Open Source (GPL)
• Object based: separate metadata
  and file data
      – Metadata Server (MDS) nodes
      – Object Storage Server (OSS)
        nodes
• Consistency: Lustre distributed
  lock manager (MDS and OSS)
• Performance:
      – data can be striped
      – the metadata target (MDT) is only
        involved in pathname and permission
        checks, not in file I/O operations
• POSIX interface
• Lustre Network (LNET):
  InfiniBand, TCP/IP, Myrinet…
• Targeted to manage large files
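
"Data can be striped" can be made concrete with a little arithmetic. The sketch below assumes plain round-robin striping over a fixed number of objects; the function name `stripe_location` and the 1 MiB stripe size are assumptions for the example, not Lustre API.

```python
def stripe_location(offset, stripe_size, stripe_count):
    """Return (object index, offset inside that object) for a file offset."""
    stripe_number = offset // stripe_size        # which stripe of the file
    obj_index = stripe_number % stripe_count     # round-robin over the objects
    # Each object stores every stripe_count-th stripe, packed back to back.
    obj_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset

MiB = 1 << 20
print(stripe_location(0, MiB, 4))
print(stripe_location(5 * MiB + 10, MiB, 4))
```

Consecutive stripes land on different objects, so a large sequential read can be served by several storage servers at once.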





   DATABASES








CAP theorem (Brewer’s theorem)
It is impossible for a distributed computer
system to simultaneously provide all three of
the following guarantees:
• Consistency
• Availability
• Partition tolerance








NoSQL
• Relax ACID constraints
• 4 types of NoSQL databases
      – Key-value
        (Memcached, Voldemort):
        data agnostic
      – Document oriented
        (CouchDB, MongoDB) :
        data conscious
      – Column oriented
        (Bigtable, HBase, Cassandra)
      – Graph (Neo4j)
• Requires more work on
  the client side




Memcached
• Free & open source, high-performance, distributed
  memory object caching system, generic in nature, but
  intended for use in speeding up dynamic web
  applications by alleviating database load.
• Simple Key/Value Store
• Smarts Half in Client, Half in Server
• Servers are Disconnected From Each Other
• O(1) Everything
• Forgetting Data is a Feature
• Used by
  LiveJournal, Flickr, WordPress.org, Wikipedia, YouTube
  …
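
"Smarts half in client" and "servers are disconnected from each other" can be sketched with a toy client that alone decides which server owns a key. This is not a real memcached client library: the class `TinyCacheClient` and the server names are hypothetical, the "servers" are in-process dictionaries, and real clients often use consistent hashing rather than a plain modulo.

```python
import zlib

class TinyCacheClient:
    """Toy client: the client alone decides which server owns a key."""

    def __init__(self, server_names):
        # One dict per "server"; the servers never talk to each other.
        self.names = sorted(server_names)
        self.servers = {name: {} for name in self.names}

    def _pick(self, key):
        index = zlib.crc32(key.encode("utf-8")) % len(self.names)
        return self.servers[self.names[index]]

    def set(self, key, value):
        self._pick(key)[key] = value

    def get(self, key, default=None):
        # A miss is normal: the caller falls back to the database.
        return self._pick(key).get(key, default)

client = TinyCacheClient(["cache-a", "cache-b", "cache-c"])
client.set("user:42", {"name": "Ada"})
print(client.get("user:42"))
print(client.get("user:99"))
```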




MongoDB
• Document oriented
• Transport and storage: BSON format (derived
  from JSON, but binary)
• Queries
      – no join
      – Map/reduce
• Database contains collections
• Collections contain documents
• Master-slave replication
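
The collection/document model and the map/reduce query style can be sketched in plain Python. This does not use MongoDB or its API (MongoDB's own map/reduce takes JavaScript functions); the collection `orders` and the helper names are made up for the example.

```python
from collections import defaultdict

# A "collection" modelled as a list of documents (plain dicts).
orders = [
    {"_id": 1, "customer": "alice", "total": 30},
    {"_id": 2, "customer": "bob", "total": 12},
    {"_id": 3, "customer": "alice", "total": 8},
]

def map_order(doc):
    """Emit one (key, value) pair per document."""
    yield doc["customer"], doc["total"]

def reduce_totals(key, values):
    """Combine all values emitted for one key."""
    return sum(values)

def map_reduce(collection, mapper, reducer):
    groups = defaultdict(list)
    for doc in collection:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

totals = map_reduce(orders, map_order, reduce_totals)
print(totals)
```

The aggregation is expressed per document and per key, which is why it works without joins.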




Cassandra
• Column oriented (inspired by Bigtable &
  Dynamo)
• Notion of super-columns
      – (sorted) associative array of columns
•    Range queries on keys
•    Low latency: sequential access to disk
•    O(1) DHT
•    Eventual Consistency
•    Values limited to 2GB
•    RPC with Thrift
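
The super-column idea (a sorted associative array of columns) and range queries on keys can be sketched with nested dictionaries. This is only a model of the data layout, not the Cassandra or Thrift API; the row keys, the helper names, and the assumption that row keys are kept in sorted order are all illustrative.

```python
from bisect import bisect_left, bisect_right

# row key -> super-column name -> column name -> value (hypothetical data)
rows = {
    "user:001": {"address": {"zip": "69007", "city": "Lyon"}},
    "user:002": {"address": {"zip": "75013", "city": "Paris"}},
    "user:010": {"address": {"zip": "06000", "city": "Nice"}},
}

def sorted_columns(row_key, super_column):
    """Columns of one super-column, in sorted column-name order."""
    return sorted(rows[row_key][super_column].items())

def key_range(start, end):
    """Row keys k with start <= k <= end, assuming keys are kept ordered."""
    keys = sorted(rows)
    return keys[bisect_left(keys, start):bisect_right(keys, end)]

print(sorted_columns("user:001", "address"))
print(key_range("user:001", "user:005"))
```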





Neo4j
• Graph oriented
• Fully ACID transactions
• Data is stored as a graph/network
      – Nodes and relationships with properties
      – "Property graph" or "edge-labeled multidigraph"
• Queries
      – Indexing of nodes and properties
      – Graph traversal
•    Disk-based, native storage
•    Java, REST API
•    Master-slave load balancing
•    Use case: social network
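
For the social-network use case, a property graph and a traversal can be sketched in a few lines. This is plain Python, not the Neo4j Java or REST API; the node data, the `KNOWS` relationship type and the function `friends_within` are hypothetical.

```python
from collections import deque

# Nodes and relationships both carry properties (hypothetical data).
nodes = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
    4: {"name": "Dave"},
}
# (start node, relationship type, end node, properties)
relationships = [
    (1, "KNOWS", 2, {"since": 2009}),
    (2, "KNOWS", 3, {"since": 2011}),
    (3, "KNOWS", 4, {"since": 2012}),
]

def friends_within(start, max_depth):
    """Breadth-first traversal along outgoing KNOWS relationships."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for src, rel_type, dst, _props in relationships:
            if src == node and rel_type == "KNOWS" and dst not in seen:
                seen.add(dst)
                found.append(nodes[dst]["name"])
                queue.append((dst, depth + 1))
    return found

print(friends_within(1, 2))
```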





PaaS Databases
• Different providers
      – Amazon: RDS, SimpleDB
      – Google: AppEngine (GQL)
      – Microsoft: SQL Azure
• Different cost models
      – CPU hour
      – CPU hour + traffic
      – Monthly fee + CPU hour + traffic
      All depend on the load (number of users)
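
The three cost models differ only in which terms are charged, which a one-line formula makes explicit. All rates and load figures below are made up for illustration and do not correspond to any provider's pricing; the function `monthly_cost` is hypothetical.

```python
def monthly_cost(cpu_hours, traffic_gb, cpu_rate, traffic_rate=0.0, monthly_fee=0.0):
    """Monthly bill = fixed fee + CPU hours * rate + traffic * rate."""
    return monthly_fee + cpu_hours * cpu_rate + traffic_gb * traffic_rate

# Made-up load and prices, for illustration only.
load = {"cpu_hours": 200, "traffic_gb": 50}
print(monthly_cost(**load, cpu_rate=0.10))                                    # CPU hour
print(monthly_cost(**load, cpu_rate=0.08, traffic_rate=0.12))                 # CPU hour + traffic
print(monthly_cost(**load, cpu_rate=0.05, traffic_rate=0.10, monthly_fee=10)) # fee + CPU hour + traffic
```

Since every term except the fixed fee scales with usage, the cheapest model depends on the expected load.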





   SOLUTIONS








    GO-Transfer: Data transfer as SaaS
Reliable file transfer.
        Easy “fire-and-forget” transfers
        Automatic fault recovery
        High performance
        Across multiple security domains
No IT required.
        Software as a Service (SaaS)
           No client software installation
           New features automatically available
        Consolidated support & troubleshooting
        Works with existing GridFTP servers
        Globus Connect solves the “last mile problem”
GO-Transfer is the initial offering of the US National
Science Foundation’s XSEDE User Access Services (XUAS)
                                                         © Ian Foster




Hadoop environment
• PIG (Data Flow), HIVE (Batch SQL), SQOOP (Data Import)
• ZOOKEEPER (Coordination)
• AVRO (Serialization)
• CHUKWA (Displaying, Monitoring, Analysing Logs)
• MAP REDUCE (Job scheduling – Raw processing)
• HBASE (Real Time Query)
• HDFS (Hadoop Distributed File System – Unstructured Storage)








IBM Big Data Platform
• Hadoop
      – InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
• Information Integration
      – InfoSphere Information Server: High volume data integration and transformation
• Stream Computing
      – InfoSphere Streams: Low Latency Analytics for streaming data
• MPP Data Warehouse
      – IBM InfoSphere Warehouse: Large volume structured data analytics
      – IBM Netezza High Capacity Appliance: Queryable Archive Structured Data
      – IBM Netezza 1000: BI+Ad Hoc Analytics Structured Data
      – IBM Smart Analytics System: Operational Analytics on Structured Data
      – IBM Informix Timeseries: Time-structured analytics




SysFera-DS








    Dataflows
    • Iteration strategies
    • Automatic parallelism
    • Control structure
      (if/then/else, do/while)
    • Fault tolerant
    • Multi-workflow scheduling
    [Figure: example dataflow with stages GRAFIC2, RAMSES (MPI), HALOMAKER, TREEMAKER, GALAXYMAKER and MOMAF, producing mock catalogues. Labels: n snapshots, x tree files, parameter sweep.]
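
Iteration strategies and automatic parallelism can be sketched with the Python standard library. This is not the SysFera-DS workflow language or API: the functions `halo_finder` and `galaxy_model` are stand-ins for real tasks, and the snapshot, tree-file and parameter values are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def halo_finder(snapshot):
    """Stand-in for one independent task applied to one snapshot."""
    return f"halos({snapshot})"

def galaxy_model(tree_file, parameter):
    """Stand-in for a task run once per (tree file, parameter) pair."""
    return f"galaxies({tree_file}, p={parameter})"

snapshots = ["snap0", "snap1", "snap2"]
tree_files = ["tree0", "tree1"]
parameters = [0.1, 0.2]

with ThreadPoolExecutor(max_workers=4) as pool:
    # One task per snapshot: the items are independent, so they may run in parallel.
    halos = list(pool.map(halo_finder, snapshots))
    # "Cross" iteration: every tree file combined with every parameter value.
    pairs = list(product(tree_files, parameters))
    galaxies = list(pool.map(lambda pair: galaxy_model(*pair), pairs))

print(halos)
print(len(galaxies))
```

The parameter sweep is the cross product of inputs; a workflow engine can schedule each combination as a separate task.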




DAGDA
• Meta data-manager
• Data management from end to
  end
• Data replication
      – Explicit
      – Implicit
• Data persistency
• Memory and disk quotas
• Replacement algorithms (LRU,
  LFU, FIFO)
• Best source selection
• Strong link with task manager
• Pluggable policies, local data
  managers
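
Quotas combined with a replacement algorithm can be sketched as a small store that evicts data when a new item would exceed the quota. This is not DAGDA's implementation or API; the class `QuotaStore`, its methods and the sizes used are hypothetical, and only the LRU and FIFO policies are shown.

```python
from collections import OrderedDict

class QuotaStore:
    """Toy data store with a byte quota and an LRU or FIFO replacement policy."""

    def __init__(self, quota_bytes, policy="LRU"):
        if policy not in ("LRU", "FIFO"):
            raise ValueError("policy must be 'LRU' or 'FIFO'")
        self.quota = quota_bytes
        self.policy = policy
        self.items = OrderedDict()  # data id -> size in bytes, next victim first
        self.used = 0

    def get(self, data_id):
        if data_id not in self.items:
            return False
        if self.policy == "LRU":
            self.items.move_to_end(data_id)  # mark as most recently used
        return True

    def put(self, data_id, size):
        if size > self.quota:
            raise ValueError("item larger than the whole quota")
        if data_id in self.items:
            self.used -= self.items.pop(data_id)
        # Evict from the front until the new item fits.
        while self.used + size > self.quota:
            _, freed = self.items.popitem(last=False)
            self.used -= freed
        self.items[data_id] = size
        self.used += size

store = QuotaStore(quota_bytes=100, policy="LRU")
store.put("a", 40)
store.put("b", 40)
store.get("a")      # "a" becomes the most recently used item
store.put("c", 40)  # does not fit: the least recently used item is evicted
print(list(store.items))
```

Switching `policy` to `"FIFO"` makes reads irrelevant to eviction order, which is the kind of choice a pluggable policy exposes.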






                          Thank you!

                            Questions?
Benjamin.Depardon@SysFera.com
http://www.sysfera.com








Bibliography
• « Big Data & Open Source: Une convergence inévitable ? », Stefane Fermigier, http://www.fermigier.com/blog/2012/03/new-whitepaper-big-data-open-source/
• « Visual Guide to NoSQL Systems », http://blog.beany.co.kr/archives/275
• « The Cassandra Distributed Database », Eric Evans, http://www.parleys.com/#st=5&id=1866&sl=40
• « Big Data Architecture », Julio Philippe, http://www.slideshare.net/PhilippeJulio/big-data-architecture
• « Big Data in Real-Time analysis at Twitter », Nick Allen, http://www.slideshare.net/nkallen/q-con-3770885
• …




 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Big Data - SysFera presentation at the CSCI

  • 1. 29.03.12 SysFera Big Data Technologies SysFera Benjamin Depardon
  • 2. SysFera
      • 2001: Research project from the Graal team (Inria/ENS)
          – DIET: grid middleware
      • 2007: SysFera-DS used within the Décrypthon project
          – Used in production 24/7/365 since then
          – Selected by IBM to replace Univa-UD
      • 2010: Creation of SysFera, INRIA spin-off
      • 2012: A team of 14 (R&D: 4 engineers and 5 PhD)
          – Supported by two experts from INRIA and ENS
          – SysFera-DS
  • 3. What is Big Data?
      • All kinds of data
      • Valuable insight, but difficult to extract
      • Several dimensions
          – Variety
              • Structured/unstructured
              • Text, audio, video…
          – Velocity
              • Time sensitivity
              • Streaming
          – Volume
              • Large files
              • Small files in large quantities
          – Variability
              • Different meanings/formats over different time periods
  • 4. What can you do with Big Data?
      • Analyze a Variety of Information
          – Social media/sentiment analysis
          – Geospatial analysis
          – Brand strategy
          – Scientific research
          – Epidemic early warning system
          – Market analysis
          – Video analysis
          – Audio analysis
      • Analyze Information in Motion
          – Smart Grid management
          – Multimodal surveillance
          – Real-time promotions
          – Cyber security
          – ICU monitoring
          – Options trading
          – Click-stream analysis
          – CDR processing
          – IT log analysis
          – RFID tracking & analysis
      • Discovery & Experimentation
          – Sentiment analysis
          – Brand strategy
          – Scientific research
          – Ad-hoc analysis
          – Model development
          – Hypothesis testing
          – Transaction analysis to create insight-based product/service offerings
      • Analyze Extreme Volumes of Information
          – Transaction analysis to create insight-based product/service offerings
          – Fraud modeling & detection
          – Risk modeling & management
          – Social media/sentiment analysis
          – Environmental analysis
      • Manage and Plan
          – Operational analytics – BI reporting
          – Planning and forecasting analysis
          – Predictive analysis
          – …
  • 5. What can you do with Big Data?
      • Financial Services
          – Fraud detection
          – Risk management
          – 360° View of the Customer
      • Utilities
          – Weather impact analysis on power generation
          – Transmission monitoring
          – Smart grid management
      • Transportation
          – Weather and traffic impact on logistics and fuel consumption
      • IT
          – Transition log analysis for multiple transactional systems
          – Cybersecurity
      • Health & Life Sciences
          – Epidemic early warning system
          – ICU monitoring
          – Remote healthcare monitoring
      • Retail
          – 360° View of the Customer
          – Click-stream analysis
          – Real-time promotions
      • Telecommunications
          – CDR processing
          – Churn prediction
          – Geomapping / marketing
          – Network monitoring
      • Law Enforcement
          – Real-time multimodal surveillance
          – Situational awareness
          – Cyber security detection
  • 6. What do you need?
      • Hardware
          – Storage capacity
          – Computing power
      • Software
          – Storage
              • Filesystems
              • Databases
          – Computation framework
  • 7. DISTRIBUTED FILESYSTEMS
  • 8. HDFS
      • Hadoop Distributed File System
      • Open source (Apache)
      • Design
          – High throughput instead of low latency
          – Large data sets (large files), data locality
          – Fault tolerance (replication)
          – Write-once, read-many (WORM)
          – Userspace
      • Limitations
          – Write-once model
          – Cannot be mounted by existing OS
          – No quotas/access permissions
          – Name node is a single point of failure
      • Used by Yahoo, Twitter, Rackspace, LinkedIn, Facebook…
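
The "large files" and "fault tolerance (replication)" points can be illustrated with a small sketch. It assumes a file is cut into fixed-size blocks and each block is copied to several data nodes. The function names, the round-robin placement and the node names are hypothetical simplifications chosen for the illustration; they are not HDFS's actual placement policy.

```python
import math


def place_blocks(file_size, block_size, datanodes, replication=3):
    """Split a file into fixed-size blocks and pick `replication` distinct nodes per block."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(n_blocks):
        # Hypothetical round-robin choice, used here only to keep the sketch short.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, nodes in placement.items()
        if any(node not in failed for node in nodes)
    ]


# A 200 MB file with 64 MB blocks on four nodes, three replicas per block.
placement = place_blocks(200, 64, ["dn1", "dn2", "dn3", "dn4"])
print(placement)
print(readable_blocks(placement, failed={"dn1", "dn2"}))
```

With three replicas per block, losing any two nodes in this toy setup still leaves every block readable.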
  • 9. GlusterFS
      • Open source (GPLv3) NAS file system
      • Runs in userspace
      • File-based distributed mirroring, replication, striping, load balancing
      • FUSE, POSIX compliant
      • Storage quotas
      • No meta-data server (fully distributed architecture, elastic hash)
      • Unified global namespace: aggregation of disk and memory in a single pool
      • Data is stored in logical volumes that are abstracted from the hardware and logically partitioned from each other
      • Multi-protocol client support: GlusterFS native, NFS, CIFS, HTTP, WebDAV, FTP
      • Real-time self-healing
      • VM live replication
  • 10. LUSTRE
      • Open Source (GPL)
      • Object based: separate metadata and file data
          – Meta Data Servers (MDS) nodes
          – Object Storage Servers (OSS) nodes
      • Consistency: Lustre distributed lock manager (MDS and OSS)
      • Performance:
          – data can be striped
          – MDT is only involved in pathname and permission checks, and is not involved in any file IO operations
      • POSIX interface
      • Lustre Network (LNET): InfiniBand, TCP/IP, Myrinet…
      • Targeted to manage large files
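
"Data can be striped" can be made concrete with a little arithmetic. The sketch below assumes plain round-robin striping of a file over a fixed number of storage objects. The function name `locate` and the parameters `stripe_size` and `stripe_count` are names chosen for this illustration, not Lustre API calls.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (object index, offset inside that object)."""
    stripe_number = offset // stripe_size
    object_index = stripe_number % stripe_count
    # Each object stores every stripe_count-th stripe back to back.
    object_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return object_index, object_offset


MIB = 1024 * 1024
for offset in (0, MIB, 4 * MIB + 10, 5 * MIB + 5):
    print(offset, locate(offset, stripe_size=MIB, stripe_count=4))
```

Consecutive stripes land on different objects, so a large sequential read can be served by several storage servers in parallel.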
  • 11. DATABASES
  • 12. CAP theorem (Brewer’s theorem)
      It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
      • Consistency
      • Availability
      • Partition tolerance
  • 13. NoSQL
      • Relaxes ACID constraints
      • 4 types of NoSQL databases
          – Key-value (Memcached, Voldemort): data agnostic
          – Document oriented (CouchDB, MongoDB): data conscious
          – Column oriented (Bigtable, HBase, Cassandra)
          – Graph (Neo4j)
      • Requires more work on the client side
  • 14. Memcached
      • Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
      • Simple Key/Value Store
      • Smarts Half in Client, Half in Server
      • Servers are Disconnected From Each Other
      • O(1) Everything
      • Forgetting Data is a Feature
      • Used by LiveJournal, Flickr, WordPress.org, Wikipedia, YouTube…
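
A minimal sketch of "smarts half in client" and "servers are disconnected from each other": the client hashes the key to choose a server, and the application falls back to the database on a cache miss. The classes `Server` and `Client` and the helper `cached` are in-memory stand-ins written for this illustration. They are not the memcached protocol or any real client library.

```python
import zlib


class Server:
    """Stand-in for one cache server: an isolated in-memory key/value store."""

    def __init__(self):
        self.store = {}


class Client:
    """The client alone decides which server owns a key; servers never talk to each other."""

    def __init__(self, servers):
        self.servers = servers

    def _pick(self, key):
        return self.servers[zlib.crc32(key.encode("utf-8")) % len(self.servers)]

    def get(self, key):
        return self._pick(key).store.get(key)

    def set(self, key, value):
        self._pick(key).store[key] = value


def cached(client, key, load):
    """Return the cached value, calling the slow `load` function only on a miss."""
    value = client.get(key)
    if value is None:
        value = load(key)
        client.set(key, value)
    return value


servers = [Server(), Server(), Server()]
client = Client(servers)
db_calls = []


def load_from_db(key):
    db_calls.append(key)  # stands in for an expensive database query
    return key.upper()


print(cached(client, "user:1", load_from_db))
print(cached(client, "user:1", load_from_db))
print(len(db_calls))  # the second call is served from the cache
```

Because a lost entry only costs one extra database query, the cache is free to forget data at any time.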
  • 15. MongoDB
      • Document oriented
      • Transport and storage: BSON format (derived from JSON, but binary)
      • Queries
          – no join
          – Map/reduce
      • Database contains collections
      • Collections contain documents
      • Master-slave replication
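
The map/reduce query style can be sketched in plain Python over a list of dictionaries standing in for a collection of documents. The `map_reduce` helper, the `orders` collection and its field names are hypothetical. This shows the idea only and does not use MongoDB's own API.

```python
from collections import defaultdict


def map_reduce(documents, mapper, reducer):
    """Group the (key, value) pairs emitted by `mapper`, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# A "collection" of "documents"; there is no join, each document is self-contained.
orders = [
    {"customer": "alice", "total": 30},
    {"customer": "bob", "total": 12},
    {"customer": "alice", "total": 8},
]


def emit_total(doc):
    yield doc["customer"], doc["total"]


def sum_values(key, values):
    return sum(values)


print(map_reduce(orders, emit_total, sum_values))
```

Since the mapper sees one document at a time, the map phase can run independently on each node holding part of the collection.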
  • 16. Cassandra
      • Column oriented (inspired by Bigtable and Dynamo)
      • Notion of super-columns
          – (sorted) associative array of columns
      • Range queries on keys
      • Low latency: sequential access to disk
      • O(1) DHT
      • Eventual Consistency
      • Values limited to 2GB
      • RPC with Thrift
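
The DHT idea can be sketched as a consistent-hashing ring: every node knows the whole ring, so finding the owner of a key needs no routing hops. The `Ring` class, the MD5-based `token` function and the node names are assumptions made for this sketch. Cassandra's real partitioners and replication strategies are more elaborate.

```python
import bisect
import hashlib


def token(value):
    """Hash a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key, n=1):
        """Return the first `n` distinct nodes found clockwise from the key's token."""
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.replicas("row-42", n=2))
```

Adding a node to such a ring only moves the keys that fall just before the new node's token; all other keys keep their owner.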
  • 17. Neo4j
      • Graph oriented
      • Fully ACID transactions
      • Data is stored as a graph/network
          – Nodes and relationships with properties
          – "Property graph" or "edge-labeled multidigraph"
      • Queries
          – Indexing of nodes and properties
          – Graph traversal
      • Disk-based, native storage
      • Java, REST API
      • Master-slave load balancing
      • Use case: social network
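
To make "nodes and relationships with properties" and "graph traversal" concrete, here is a small in-memory property graph with a breadth-first traversal, applied to the social-network use case. The `Graph` class, its methods and the sample data are hypothetical and do not reflect Neo4j's Java or REST API.

```python
from collections import deque


class Graph:
    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.out = {}    # node id -> list of (relationship type, target id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.out.setdefault(node_id, [])

    def add_rel(self, src, rel_type, dst, **props):
        self.out[src].append((rel_type, dst, props))

    def traverse(self, start, rel_type, max_depth):
        """Breadth-first walk along `rel_type`; returns (node id, depth) pairs."""
        seen = {start}
        queue = deque([(start, 0)])
        found = []
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue
            for rtype, dst, _props in self.out[node]:
                if rtype == rel_type and dst not in seen:
                    seen.add(dst)
                    found.append((dst, depth + 1))
                    queue.append((dst, depth + 1))
        return found


g = Graph()
for name in ("alice", "bob", "carol", "dave", "acme"):
    g.add_node(name, label=name.title())
g.add_rel("alice", "KNOWS", "bob", since=2010)
g.add_rel("bob", "KNOWS", "carol", since=2011)
g.add_rel("carol", "KNOWS", "dave", since=2012)
g.add_rel("alice", "WORKS_AT", "acme")

# Friends and friends-of-friends of alice.
print(g.traverse("alice", "KNOWS", max_depth=2))
```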
  • 18. PaaS Databases
      • Different providers
          – Amazon: RDS, SimpleDB
          – Google: AppEngine (GQL)
          – Microsoft: SQL Azure
      • Different cost models
          – CPU hour
          – CPU hour + traffic
          – Monthly fee + CPU hour + traffic
      All depend on the load (number of users)
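
The three cost models differ only in which terms they charge for, which a single formula can express. All rates below are made-up numbers for illustration and do not correspond to any provider's pricing.

```python
def monthly_cost(cpu_hours, traffic_gb, cpu_rate, traffic_rate=0.0, fixed_fee=0.0):
    """Monthly bill = fixed fee + CPU hours * CPU rate + traffic * traffic rate."""
    return fixed_fee + cpu_hours * cpu_rate + traffic_gb * traffic_rate


# Hypothetical load: 100 CPU hours and 20 GB of traffic in a month.
load = {"cpu_hours": 100, "traffic_gb": 20}

print(monthly_cost(**load, cpu_rate=0.5))                                      # CPU hour
print(monthly_cost(**load, cpu_rate=0.5, traffic_rate=0.25))                   # CPU hour + traffic
print(monthly_cost(**load, cpu_rate=0.5, traffic_rate=0.25, fixed_fee=10.0))   # monthly fee + CPU hour + traffic
```

In every model the bill grows with `cpu_hours` and `traffic_gb`, which is why the cost depends on the number of users.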
  • 19. SOLUTIONS
  • 20. GO-Transfer: Data transfer as SaaS
      • Reliable file transfer.
          – Easy “fire-and-forget” transfers
          – Automatic fault recovery
          – High performance
          – Across multiple security domains
      • No IT required.
          – Software as a Service (SaaS)
          – No client software installation
          – New features automatically available
          – Consolidated support & troubleshooting
          – Works with existing GridFTP servers
          – Globus Connect solves “last mile problem”
      GO-Transfer is the initial offering of the US National Science Foundation’s XSEDE User Access Services (XUAS)
      © Ian Foster
  • 21. Hadoop environment
      • PIG (Data Flow)
      • HIVE (Batch SQL)
      • SQOOP (Data Import)
      • ZOOKEEPER (Coordination)
      • AVRO (Serialization)
      • CHUKWA (Displaying, Monitoring, Analysing Logs)
      • MAP REDUCE (Job scheduling – Raw processing)
      • HBASE (Real Time Query)
      • HDFS (Hadoop Distributed File System – Unstructured Storage)
  • 22. IBM Big Data Platform
      • Hadoop
          – InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
      • Information Integration
          – InfoSphere Information Server: High volume data integration and transformation
      • Stream Computing
          – InfoSphere Streams: Low Latency Analytics for streaming data
      • MPP Data Warehouse
          – IBM InfoSphere Warehouse: Large volume structured data analytics
          – IBM Netezza High Capacity Appliance: Queryable Archive Structured Data
          – IBM Netezza 1000: BI+Ad Hoc Analytics Structured Data
          – IBM Smart Analytics System: Operational Analytics on Structured Data
          – IBM Informix Timeseries: Time-structured analytics
  • 23. SysFera-DS
  • 24. Dataflows
      • Iteration strategies
      • Automatic parallelism
      • Control structure (if/then/else, do/while)
      • Fault tolerant
      • Multi-workflow scheduling
      [Figure: workflow diagram with tasks GRAFIC2, RAMSES, HALOMAKER, TREEMAKER, GALAXYMAKER and MOMAF; labels: MPI, n snapshots, x tree files, mock catalogues, parameter sweep]
  • 25. DAGDA
      • Meta data-manager
      • Data management from end to end
      • Data replication
          – Explicit
          – Implicit
      • Data persistency
      • Memory and disk quotas
      • Replacement algorithms (LRU, LFU, FIFO)
      • Best source selection
      • Strong link with task manager
      • Pluggable policies, local data managers
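
The three replacement algorithms differ only in how they rank stored data when a quota is reached. The sketch below picks the item to evict under each policy. The `Entry` record, its fields and the `choose_victim` function are hypothetical and are not DAGDA's interface.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    inserted: int   # logical time the data was stored
    last_used: int  # logical time of the most recent access
    uses: int       # number of accesses so far


# Each policy evicts the entry with the smallest value of its ranking key.
POLICIES = {
    "FIFO": lambda e: e.inserted,   # oldest stored
    "LRU": lambda e: e.last_used,   # least recently used
    "LFU": lambda e: e.uses,        # least frequently used
}


def choose_victim(entries, policy):
    """Return the name of the entry that `policy` would evict first."""
    return min(entries, key=POLICIES[policy]).name


entries = [
    Entry("a", inserted=1, last_used=9, uses=2),
    Entry("b", inserted=2, last_used=3, uses=7),
    Entry("c", inserted=3, last_used=5, uses=1),
]

for policy in ("FIFO", "LRU", "LFU"):
    print(policy, choose_victim(entries, policy))
```

Keeping the ranking key separate from the eviction logic is one way to make such policies pluggable.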
  • 26. Thank you! Questions?
      Benjamin.Depardon@SysFera.com
      http://www.sysfera.com
  • 27. Bibliography
      • « Big Data & Open Source: Une convergence inévitable ? », Stefane Fermigier, http://www.fermigier.com/blog/2012/03/new-whitepaper-big-data-open-source/
      • « Visual Guide to NoSQL Systems », http://blog.beany.co.kr/archives/275
      • « The Cassandra Distributed Database », Eric Evans, http://www.parleys.com/#st=5&id=1866&sl=40
      • « Big Data Architecture », Julio Philippe, http://www.slideshare.net/PhilippeJulio/big-data-architecture
      • « Big Data in Real-Time analysis at Twitter », Nick Allen, http://www.slideshare.net/nkallen/q-con-3770885
      • …