SlideShare una empresa de Scribd logo
1 de 25
Descargar para leer sin conexión
Applications in Social Networks
Jason Riedy, David Bader, David Ediger, ...
Outline
• Background on applications in social
  network analysis
  – Static applications
  – Growing need for dynamic analysis
• Quick, high-level description of two
  algorithm areas with potential for
  acceleration.
  – k-Betweenness Centrality
  – Agglomerative clustering / community
    identification
Graph –theoretic problems in social networks




                                             – Community identification:
                                               clustering
                                             – Targeted advertising: centrality
                                             – Information spreading: modeling




Image Source: Nexus (Facebook application)


                                                                                  3
Driving Forces in Social Network Analysis
• Facebook          has more than 500 million active users


3 orders of
magnitude
growth in 3
years!


• Note the graph is changing as well as growing.
• Traditional graph partitioning often fails:
    –   Topology: Interaction graph is low-diameter, and has no good separators
    –   Irregularity: Communities are not uniform in size
    –   Overlap: individuals are members of one or more communities
• Currently recompute ad targeting once per hour. Accelerate?
    – Now consider targeting usage on a power grid, etc.
    – Similar size, but static. Dynamic identification of issues
      definitely needs accelerated.

                                                                                  4
Massive Data Analytics in Health/EMS                              ENSG00
                                                                  0001453
                                                                    32.2
                                                                   Kelch-
       Public Health                                                 like
                                                                   protein
                                                                       8
• CDC / Nation-scale                                              implicat
                                                                    ed in
  surveillance of public health                                    breast
                                                                   cancer
• Cancer genomics and drug
  design
    – computed Betweenness
      Centrality of Human Proteome

Rank    H1N1                       atlflood           ●
                                                       Identify locally
1       @CDCFlu                    @ajc               important news and
2       @addthis                   @driveafastercar   information sources.
3       @Official_PAX              @ATLCheap
                                                          ●
                                                           Spread correct
                                                          information.
4       @FluGov                    @TWCi                  ●
                                                           Prevent
5       @nytimes                   @HelloNorthGA
                                                          misinformation.
6       @tweetmeme                 @11AliveNews       ●
                                                       Similar uses: Identify
7       @mercola                   @WSB_TV            regions being
8       @CNN                       @shaunking         affected by disaster /
               (Collaboration w/PNNL)                 disease.
                                                                             5
Social/Economic Policy
• NYSE “Flash crash” of 6 May 2010:
  – Dropped 700 pts in 20 minutes.
  – Simple “circuit breakers” were of no use.
    • "A number of the [regulatory pauses] in effect on
      May 6 were resolved in less than one second, […]"
      – Congressional Testimony of Larry Leibowitz, CEO
      NYSE Euronext
  – Breakers based on levels, not structure.
• 1 Oct: Finally announce the reason:
  – One single large trade triggered a network
    of reactions.
• Regulators need accelerated analysis.
Current Example Data Rates
• Financial / regulatory:
   – NYSE processes 1.5TB daily, maintains 8PB
• Social:
   – Facebook adds >100k users, 55M “status” updates,
     80M photos daily; report more than 500M active
     users with an average of 130 “friend” connections
     each.
   – Foursquare, a new service, reports 1.2M location
     check-ins per week
• Scientific:
   – MEDLINE adds from 1 to 140 publications a day
 Shared features: All data is rich, irregularly connected
   to other data. All is a mix of “good” and “bad” data...
    And much real data may be missing or inconsistent.
                                                             7
Current Unserved Applications
• Separate the “good” from the “bad”
   – Spam. Frauds. Irregularities.
   – Pick news from world-wide events tailored to
     interests as the events & interests change.
• Identify and track changes
   – Disease outbreaks. Social trends. Utility & service
     changes during weather events.
• Discover new relationships
   – Similarities in scientific publications.
• Predict upcoming events
   – Present advertisements before a user searches.
 Shared features: Relationships are abstract. Physical
   locality is only one aspect, unlike physical simulation.
                                                              8
Streaming Data Analysis
Current needs, future
                        • Massive, irregularly structured
knowledge:               input data.
                        • New simulation, analysis methods

    ?                   • Widely varied, unexplored
                         response / control methods




                                                 ?

                                                        9
Streaming Data Analysis
Current needs, future
                           • Massive, irregularly structured
knowledge:                  input data.
                           • New simulation, analysis methods

    ?                      • Widely varied, unexplored
                            response / control methods




                                                     ?
          ...            Analysts need us here. Yesterday.
                        Closing the loop needs acceleration.
                                                               10
Streaming Data Analysis
Current needs, future
                        • Massive, irregularly structured
knowledge:               input data.
                        • New simulation, analysis methods

    ?                   • Widely varied, unexplored
                         response / control methods




                                     Facebook
                                     friendship
                                     graph: 30k
                                     edges per pixel


          ...
                                     on 1600x1200
                                     screen!

                                                       11
Algorithms
• Just to set the stage for discussing
  accelerators:
  – k-Betweenness Centrality
    • (showing effectiveness on one alternative
      architecture, the Cray XMT)
  – Agglomerative community identification
    • (could be very useful to assist “acceleration”
      by data filtering)
K-Betweenness Centrality
• Count short paths (shortest +
  k) relative to all paths.
• Maintain multiple BFS fronts
  (single: Dijkstra's algorithm).
• Each vertex has a forward,
  lock-free queue of relevant
  edges.
• Distances and BCk values can
  be computed by a second pass
  along queued edges.
• Not directly expressible (by our
  attempts) in linear algebra or
  mathematical optimization.
• Cost exponential in k, O(mn)
  for fixed k. Can be approx.

     Brandes, 2001; Bader, et al. 2009 & 2009.
                                                 13
IMDB Movie Actor Network (Approx BC0)
            An undirected graph of 1.54 million vertices (movie actors) and 78 million
            edges. An edge corresponds to a link between two actors if they have acted
            together in a movie.




                                                                             Betweenness
Frequency




                                                                                                                            Kevin
                                                                                                                            Bacon



                                        Degree                                                              Degree


                   Cray XMT (16)




            Intel Xeon 2.4GHz (4)                                                                      Disparity grows with k...
                                                                                                       So what is the XMT?
                                    0    100     200      300     400       500            600   700
                                                       Running Time (sec)
                                                                                                                                   14 14
Alternative architecture: Cray XMT
• Tolerates latency by extreme multithreading
   –   Each processor supports 128 hardware threads
   –   Context switch in a single tick
   –   No cache or local memory
   –   Context switch on memory request
   –   Multiple outstanding loads
• Remote memory requests do not stall processors
   – Other streams work while the request
     gets fulfilled
• Light-weight, word-level synchronization
   – Minimizes access conflicts
• Hashed global shared memory
   – 64-byte granularity, minimizes hotspots
• High-productivity graph analysis!

                                                      15
Community Identification
• Implicit communities in large-
  scale networks are of interest in
  many cases.
   – WWW
   – Social networks
   – Biological networks
• Formulated as a graph
  clustering problem.
   – Informally, identify/extract “dense”
     sub-graphs.
• Several different objective
  functions exist.
   – Metrics based on intra-cluster vs. inter-
     cluster edges, community sizes,
     number of communities, overlap …
• Agglomerative, bottom-up:
   – Evaluate metric change, merge
     (independent) sets to maximize
     change.
                                                 16
Seed Set Expansion
• Useful to find communities to which
  several vertices belong.
• Blue vertices are
  are seeds, red vertices
  belong to a community
  of interest.
• Selection for viz,
  analysis...
• Consider agglomerative.
                                        17
Possible accelerators
• Map-Reduce: A large
  aggregate disk capacity
  with data replication
  support (Hadoop File
  System)....
• Netezza Twin-Fin: FPGAs
  to filter/reduce data...



                             ●
                               Currently experimenting with
                             agglomerative methods on
                             multithreaded architectures.
                             ●
                               Can we express clustering /
                             community detection as a
                             selection rule instead?
      Source: Netezza
                                                          18
Acknowledgment of Support




     David A. Bader         19
Extra information on sizes / rates
Data Volumes in Commercial Sector
• EBay (April 2009) has a pair of data warehouses
   – >2 PB, traditional
   – 6.5PB, 17 trillion records, 1.5B records/day, each web click is 50-150 details
   – Source: http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/ 


• Facebook (May 2009):
   – Estimate of 2.5PB of user data
   – 15 TB of new data per day
   – Queries to develop targeted ads are run hourly
   – Source: http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/


•  In   2008: http://www.dbms2.com/2008/10/15/teradatas-petabyte-power-players/
   –    Walmart: 2.5PB
   –    Bank of America: 1.5PB
   –    Dell: 1PB




                                                                                      21
Data Volumes: Current data sets
• NYSE: 1.5TB daily, 8PB maintained
• Google: “Several dozen” 1PB sets (CACM Jan 2010)
• LHC: 15PB per year (avg 41TB/day)
   • (http://public.web.cern.ch/public/en/lhc/Computing-en.html)
• LSST: 13TB nightly
   • (http://www.lsst.org/Project/docs/data-challenge.pdf)
• Wal-Mart: 536TB, 1B entries daily (2006)
• Facebook: 350M users, 3.5B shared items/week



                           
                               All data is rich.
                           
                               Data rates do not include
                               building relationships.
          David A. Bader                                           22
Data Volumes: Current data rates

• NYSE: 1.5TB daily               • 1 Gb Ethernet: 8.7TB daily at
• LHC: 41TB daily                   100%, 5-6TB daily realistic
• LSST: 13TB daily
                                  • Multi-TB storage on 10GE:
                                    300TB daily read, 90TB daily
                                    write

                            
                                Current data is at the
                                limit of current systems.
                            
                                Not counting
                                relationships...
           David A. Bader                                      23
Data Volumes: Future data rates
• Facebook: >2x yearly            • Ethernet: 4x in next 2 years.
• Twitter: >10x yearly              Maybe.
• Growing sources:
  – Bioinformatics
                                  • Flash storage, direct: 10x
  – Nano-scale devices              write, 4x read. Huge cost for
  – Security                        multi-PB storage.

                            
                                Data rate growth is
                                outstripping technology.
                            
                                Then consider: latency,
                                ingest, processing,
                                response...
           David A. Bader                                       24
Data Volumes: Current data sets
• NYSE: 8PB                  • CPU ↔ Memory:
                               – QPI,HT: 2PB/day@100%
• Google: >12PB
                               – Power7: 8.7PB/day
• LHC: >15PB
                             • Mem:
                               – NCSA Blue Waters target:
                                 2PB
                       
                           Even with parallelism,
                           current (in-progress)
                           systems cannot handle
                           more than a few passes...
                           per day.
      David A. Bader                                    25

Más contenido relacionado

Similar a Social Network Applications

Introduction to Big Data and its Potential for Dementia Research
Introduction to Big Data and its Potential for Dementia ResearchIntroduction to Big Data and its Potential for Dementia Research
Introduction to Big Data and its Potential for Dementia ResearchDavid De Roure
 
BIMCV: The Perfect "Big Data" Storm.
BIMCV: The Perfect "Big Data" Storm. BIMCV: The Perfect "Big Data" Storm.
BIMCV: The Perfect "Big Data" Storm. maigva
 
Maps and data esri health care 2012
Maps and data   esri health care 2012Maps and data   esri health care 2012
Maps and data esri health care 2012J T "Tom" Johnson
 
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...Jason Riedy
 
2016 09 cxo forum
2016 09 cxo forum2016 09 cxo forum
2016 09 cxo forumChris Dwan
 
Big Data and Smart Healthcare
Big Data and Smart Healthcare Big Data and Smart Healthcare
Big Data and Smart Healthcare Sujan Perera
 
Biostatistics and Statistical Bioinformatics
Biostatistics and Statistical BioinformaticsBiostatistics and Statistical Bioinformatics
Biostatistics and Statistical BioinformaticsSetia Pramana
 
Databases and Ontologies: Where do we go from here?
Databases and Ontologies:  Where do we go from here?Databases and Ontologies:  Where do we go from here?
Databases and Ontologies: Where do we go from here?Maryann Martone
 
Using Big Data for Improved Healthcare Operations and Analytics
Using Big Data for Improved Healthcare Operations and AnalyticsUsing Big Data for Improved Healthcare Operations and Analytics
Using Big Data for Improved Healthcare Operations and AnalyticsPerficient, Inc.
 
Trends in Annotation of Genomic Data
Trends in Annotation of Genomic DataTrends in Annotation of Genomic Data
Trends in Annotation of Genomic Databiobase
 
Ontology-enabled Healthcare Applications exploiting Physical-Cyber-Social Big...
Ontology-enabled Healthcare Applications exploiting Physical-Cyber-Social Big...Ontology-enabled Healthcare Applications exploiting Physical-Cyber-Social Big...
Ontology-enabled Healthcare Applications exploiting Physical-Cyber-Social Big...Amit Sheth
 
Genomics: Big Data Leading to Big Opportunities
Genomics: Big Data Leading to Big OpportunitiesGenomics: Big Data Leading to Big Opportunities
Genomics: Big Data Leading to Big OpportunitiesHannes Smárason
 
COMPUTERS IN PHARMACEUTICAL DEVELOPMENT
COMPUTERS IN PHARMACEUTICAL DEVELOPMENTCOMPUTERS IN PHARMACEUTICAL DEVELOPMENT
COMPUTERS IN PHARMACEUTICAL DEVELOPMENTArunpandiyan59
 
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la IglesiaBIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la IglesiaMaria de la Iglesia
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemWarren Kibbe
 
Big Data for Development
Big Data for DevelopmentBig Data for Development
Big Data for DevelopmentJoud Khattab
 
Big Data in Healthcare and Medical Devices
Big Data in Healthcare and Medical DevicesBig Data in Healthcare and Medical Devices
Big Data in Healthcare and Medical DevicesPremNarayanan6
 

Similar a Social Network Applications (20)

Introduction to Big Data and its Potential for Dementia Research
Introduction to Big Data and its Potential for Dementia ResearchIntroduction to Big Data and its Potential for Dementia Research
Introduction to Big Data and its Potential for Dementia Research
 
Hadoop Enabled Healthcare
Hadoop Enabled HealthcareHadoop Enabled Healthcare
Hadoop Enabled Healthcare
 
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
2019 Triangle Machine Learning Day - Biomedical Image Understanding and EHRs ...
 
BIMCV: The Perfect "Big Data" Storm.
BIMCV: The Perfect "Big Data" Storm. BIMCV: The Perfect "Big Data" Storm.
BIMCV: The Perfect "Big Data" Storm.
 
Maps and data esri health care 2012
Maps and data   esri health care 2012Maps and data   esri health care 2012
Maps and data esri health care 2012
 
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
 
2016 09 cxo forum
2016 09 cxo forum2016 09 cxo forum
2016 09 cxo forum
 
Big Data and Smart Healthcare
Big Data and Smart Healthcare Big Data and Smart Healthcare
Big Data and Smart Healthcare
 
EHR - The Killer App ?
EHR - The Killer App ?EHR - The Killer App ?
EHR - The Killer App ?
 
Biostatistics and Statistical Bioinformatics
Biostatistics and Statistical BioinformaticsBiostatistics and Statistical Bioinformatics
Biostatistics and Statistical Bioinformatics
 
Databases and Ontologies: Where do we go from here?
Databases and Ontologies:  Where do we go from here?Databases and Ontologies:  Where do we go from here?
Databases and Ontologies: Where do we go from here?
 
Using Big Data for Improved Healthcare Operations and Analytics
Using Big Data for Improved Healthcare Operations and AnalyticsUsing Big Data for Improved Healthcare Operations and Analytics
Using Big Data for Improved Healthcare Operations and Analytics
 
Trends in Annotation of Genomic Data
Trends in Annotation of Genomic DataTrends in Annotation of Genomic Data
Trends in Annotation of Genomic Data
 
Ontology-enabled Healthcare Applications exploiting Physical-Cyber-Social Big...
Ontology-enabled Healthcare Applications exploiting Physical-Cyber-Social Big...Ontology-enabled Healthcare Applications exploiting Physical-Cyber-Social Big...
Ontology-enabled Healthcare Applications exploiting Physical-Cyber-Social Big...
 
Genomics: Big Data Leading to Big Opportunities
Genomics: Big Data Leading to Big OpportunitiesGenomics: Big Data Leading to Big Opportunities
Genomics: Big Data Leading to Big Opportunities
 
COMPUTERS IN PHARMACEUTICAL DEVELOPMENT
COMPUTERS IN PHARMACEUTICAL DEVELOPMENTCOMPUTERS IN PHARMACEUTICAL DEVELOPMENT
COMPUTERS IN PHARMACEUTICAL DEVELOPMENT
 
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la IglesiaBIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana. María de la Iglesia
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
 
Big Data for Development
Big Data for DevelopmentBig Data for Development
Big Data for Development
 
Big Data in Healthcare and Medical Devices
Big Data in Healthcare and Medical DevicesBig Data in Healthcare and Medical Devices
Big Data in Healthcare and Medical Devices
 

Más de Jason Riedy

Lucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFLucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFJason Riedy
 
LAGraph 2021-10-13
LAGraph 2021-10-13LAGraph 2021-10-13
LAGraph 2021-10-13Jason Riedy
 
Lucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFLucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFJason Riedy
 
Graph analysis and novel architectures
Graph analysis and novel architecturesGraph analysis and novel architectures
Graph analysis and novel architecturesJason Riedy
 
GraphBLAS and Emus
GraphBLAS and EmusGraphBLAS and Emus
GraphBLAS and EmusJason Riedy
 
Reproducible Linear Algebra from Application to Architecture
Reproducible Linear Algebra from Application to ArchitectureReproducible Linear Algebra from Application to Architecture
Reproducible Linear Algebra from Application to ArchitectureJason Riedy
 
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...Jason Riedy
 
ICIAM 2019: Reproducible Linear Algebra from Application to Architecture
ICIAM 2019: Reproducible Linear Algebra from Application to ArchitectureICIAM 2019: Reproducible Linear Algebra from Application to Architecture
ICIAM 2019: Reproducible Linear Algebra from Application to ArchitectureJason Riedy
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisJason Riedy
 
Novel Architectures for Applications in Data Science and Beyond
Novel Architectures for Applications in Data Science and BeyondNovel Architectures for Applications in Data Science and Beyond
Novel Architectures for Applications in Data Science and BeyondJason Riedy
 
Characterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with MicrobenchmarksCharacterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with MicrobenchmarksJason Riedy
 
CRNCH 2018 Summit: Rogues Gallery Update
CRNCH 2018 Summit: Rogues Gallery UpdateCRNCH 2018 Summit: Rogues Gallery Update
CRNCH 2018 Summit: Rogues Gallery UpdateJason Riedy
 
Augmented Arithmetic Operations Proposed for IEEE-754 2018
Augmented Arithmetic Operations Proposed for IEEE-754 2018Augmented Arithmetic Operations Proposed for IEEE-754 2018
Augmented Arithmetic Operations Proposed for IEEE-754 2018Jason Riedy
 
Graph Analysis: New Algorithm Models, New Architectures
Graph Analysis: New Algorithm Models, New ArchitecturesGraph Analysis: New Algorithm Models, New Architectures
Graph Analysis: New Algorithm Models, New ArchitecturesJason Riedy
 
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsCRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsJason Riedy
 
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsCRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsJason Riedy
 
A New Algorithm Model for Massive-Scale Streaming Graph Analysis
A New Algorithm Model for Massive-Scale Streaming Graph AnalysisA New Algorithm Model for Massive-Scale Streaming Graph Analysis
A New Algorithm Model for Massive-Scale Streaming Graph AnalysisJason Riedy
 
High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs Jason Riedy
 
High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming GraphsHigh-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming GraphsJason Riedy
 
Updating PageRank for Streaming Graphs
Updating PageRank for Streaming GraphsUpdating PageRank for Streaming Graphs
Updating PageRank for Streaming GraphsJason Riedy
 

Más de Jason Riedy (20)

Lucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFLucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoF
 
LAGraph 2021-10-13
LAGraph 2021-10-13LAGraph 2021-10-13
LAGraph 2021-10-13
 
Lucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoFLucata at the HPEC GraphBLAS BoF
Lucata at the HPEC GraphBLAS BoF
 
Graph analysis and novel architectures
Graph analysis and novel architecturesGraph analysis and novel architectures
Graph analysis and novel architectures
 
GraphBLAS and Emus
GraphBLAS and EmusGraphBLAS and Emus
GraphBLAS and Emus
 
Reproducible Linear Algebra from Application to Architecture
Reproducible Linear Algebra from Application to ArchitectureReproducible Linear Algebra from Application to Architecture
Reproducible Linear Algebra from Application to Architecture
 
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
PEARC19: Wrangling Rogues: A Case Study on Managing Experimental Post-Moore A...
 
ICIAM 2019: Reproducible Linear Algebra from Application to Architecture
ICIAM 2019: Reproducible Linear Algebra from Application to ArchitectureICIAM 2019: Reproducible Linear Algebra from Application to Architecture
ICIAM 2019: Reproducible Linear Algebra from Application to Architecture
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
 
Novel Architectures for Applications in Data Science and Beyond
Novel Architectures for Applications in Data Science and BeyondNovel Architectures for Applications in Data Science and Beyond
Novel Architectures for Applications in Data Science and Beyond
 
Characterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with MicrobenchmarksCharacterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with Microbenchmarks
 
CRNCH 2018 Summit: Rogues Gallery Update
CRNCH 2018 Summit: Rogues Gallery UpdateCRNCH 2018 Summit: Rogues Gallery Update
CRNCH 2018 Summit: Rogues Gallery Update
 
Augmented Arithmetic Operations Proposed for IEEE-754 2018
Augmented Arithmetic Operations Proposed for IEEE-754 2018Augmented Arithmetic Operations Proposed for IEEE-754 2018
Augmented Arithmetic Operations Proposed for IEEE-754 2018
 
Graph Analysis: New Algorithm Models, New Architectures
Graph Analysis: New Algorithm Models, New ArchitecturesGraph Analysis: New Algorithm Models, New Architectures
Graph Analysis: New Algorithm Models, New Architectures
 
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsCRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
 
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing PlatformsCRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
CRNCH Rogues Gallery: A Community Core for Novel Computing Platforms
 
A New Algorithm Model for Massive-Scale Streaming Graph Analysis
A New Algorithm Model for Massive-Scale Streaming Graph AnalysisA New Algorithm Model for Massive-Scale Streaming Graph Analysis
A New Algorithm Model for Massive-Scale Streaming Graph Analysis
 
High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs
 
High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming GraphsHigh-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs
 
Updating PageRank for Streaming Graphs
Updating PageRank for Streaming GraphsUpdating PageRank for Streaming Graphs
Updating PageRank for Streaming Graphs
 

Último

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 

Último (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 

Social Network Applications

  • 1. Applications in Social Networks Jason Riedy, David Bader, David Ediger, ...
  • 2. Outline • Background on applications in social network analysis – Static applications – Growing need for dynamic analysis • Quick, high-level description of two algorithm areas with potential for acceleration. – k-Betweenness Centrality – Agglomerative clustering / community identification
  • 3. Graph –theoretic problems in social networks – Community identification: clustering – Targeted advertising: centrality – Information spreading: modeling Image Source: Nexus (Facebook application) 3
  • 4. Driving Forces in Social Network Analysis • Facebook has more than 500 million active users 3 orders of magnitude growth in 3 years! • Note the graph is changing as well as growing. • Traditional graph partitioning often fails: – Topology: Interaction graph is low-diameter, and has no good separators – Irregularity: Communities are not uniform in size – Overlap: individuals are members of one or more communities • Currently recompute ad targeting once per hour. Accelerate? – Now consider targeting usage on a power grid, etc. – Similar size, but static. Dynamic identification of issues definitely needs accelerated. 4
  • 5. Massive Data Analytics in Health/EMS ENSG00 0001453 32.2 Kelch- Public Health like protein 8 • CDC / Nation-scale implicat ed in surveillance of public health breast cancer • Cancer genomics and drug design – computed Betweenness Centrality of Human Proteome Rank H1N1 atlflood ● Identify locally 1 @CDCFlu @ajc important news and 2 @addthis @driveafastercar information sources. 3 @Official_PAX @ATLCheap ● Spread correct information. 4 @FluGov @TWCi ● Prevent 5 @nytimes @HelloNorthGA misinformation. 6 @tweetmeme @11AliveNews ● Similar uses: Identify 7 @mercola @WSB_TV regions being 8 @CNN @shaunking affected by disaster / (Collaboration w/PNNL) disease. 5
  • 6. Social/Economic Policy • NYSE “Flash crash” of 6 May 2010: – Dropped 700 pts in 20 minutes. – Simple “circuit breakers” were of no use. • "A number of the [regulatory pauses] in effect on May 6 were resolved in less than one second, […]" – Congressional Testimony of Larry Leibowitz, CEO NYSE Euronext – Breakers based on levels, not structure. • 1 Oct: Finally announce the reason: – One single large trade triggered a network of reactions. • Regulators need accelerated analysis.
  • 7. Current Example Data Rates • Financial / regulatory: – NYSE processes 1.5TB daily, maintains 8PB • Social: – Facebook adds >100k users, 55M “status” updates, 80M photos daily; report more than 500M active users with an average of 130 “friend” connections each. – Foursquare, a new service, reports 1.2M location check-ins per week • Scientific: – MEDLINE adds from 1 to 140 publications a day Shared features: All data is rich, irregularly connected to other data. All is a mix of “good” and “bad” data... And much real data may be missing or inconsistent. 7
  • 8. Current Unserved Applications • Separate the “good” from the “bad” – Spam. Frauds. Irregularities. – Pick news from world-wide events tailored to interests as the events & interests change. • Identify and track changes – Disease outbreaks. Social trends. Utility & service changes during weather events. • Discover new relationships – Similarities in scientific publications. • Predict upcoming events – Present advertisements before a user searches. Shared features: Relationships are abstract. Physical locality is only one aspect, unlike physical simulation. 8
  • 9. Streaming Data Analysis Current needs, future • Massive, irregularly structured knowledge: input data. • New simulation, analysis methods ? • Widely varied, unexplored response / control methods ? 9
  • 10. Streaming Data Analysis Current needs, future • Massive, irregularly structured knowledge: input data. • New simulation, analysis methods ? • Widely varied, unexplored response / control methods ? ... Analysts need us here. Yesterday. Closing the loop needs acceleration. 10
  • 11. Streaming Data Analysis Current needs, future • Massive, irregularly structured knowledge: input data. • New simulation, analysis methods ? • Widely varied, unexplored response / control methods Facebook friendship graph: 30k edges per pixel ... on 1600x1200 screen! 11
  • 12. Algorithms • Just to set the stage for discussing accelerators: – k-Betweenness Centrality • (showing effectiveness on one alternative architecture, the Cray XMT) – Agglomerative community identification • (could be very useful to assist “acceleration” by data filtering)
  • 13. K-Betweenness Centrality • Count short paths (shortest + k) relative to all paths. • Maintain multiple BFS fronts (single: Dijkstra's algorithm). • Each vertex has a forward, lock-free queue of relevant edges. • Distances and BCk values can be computed by a second pass along queued edges. • Not directly expressible (by our attempts) in linear algebra or mathematical optimization. • Cost exponential in k, O(mn) for fixed k. Can be approx. Brandes, 2001; Bader, et al. 2009 & 2009. 13
  • 14. IMDB Movie Actor Network (Approx BC0) An undirected graph of 1.54 million vertices (movie actors) and 78 million edges. An edge corresponds to a link between two actors if they have acted together in a movie. Betweenness Frequency Kevin Bacon Degree Degree Cray XMT (16) Intel Xeon 2.4GHz (4) Disparity grows with k... So what is the XMT? 0 100 200 300 400 500 600 700 Running Time (sec) 14 14
  • 15. Alternative architecture: Cray XMT • Tolerates latency by extreme multithreading – Each processor supports 128 hardware threads – Context switch in a single tick – No cache or local memory – Context switch on memory request – Multiple outstanding loads • Remote memory requests do not stall processors – Other streams work while the request gets fulfilled • Light-weight, word-level synchronization – Minimizes access conflicts • Hashed global shared memory – 64-byte granularity, minimizes hotspots • High-productivity graph analysis! 15
  • 16. Community Identification • Implicit communities in large- scale networks are of interest in many cases. – WWW – Social networks – Biological networks • Formulated as a graph clustering problem. – Informally, identify/extract “dense” sub-graphs. • Several different objective functions exist. – Metrics based on intra-cluster vs. inter- cluster edges, community sizes, number of communities, overlap … • Agglomerative, bottom-up: – Evaluate metric change, merge (independent) sets to maximize change. 16
  • 17. Seed Set Expansion • Useful to find communities to which several vertices belong. • Blue vertices are are seeds, red vertices belong to a community of interest. • Selection for viz, analysis... • Consider agglomerative. 17
  • 18. Possible accelerators • Map-Reduce: A large aggregate disk capacity with data replication support (Hadoop File System).... • Netezza Twin-Fin: FPGAs to filter/reduce data... ● Currently experimenting with agglomerative methods on multithreaded architectures. ● Can we express clustering / community detection as a selection rule instead? Source: Netezza 18
  • 19. Acknowledgment of Support David A. Bader 19
  • 20. Extra information on sizes / rates
  • 21. Data Volumes in Commercial Sector • EBay (April 2009) has a pair of data warehouses – >2 PB, traditional – 6.5PB, 17 trillion records, 1.5B records/day, each web click is 50-150 details – Source: http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/  • Facebook (May 2009): – Estimate of 2.5PB of user data – 15 TB of new data per day – Queries to develop targeted ads are run hourly – Source: http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/ •  In 2008: http://www.dbms2.com/2008/10/15/teradatas-petabyte-power-players/ – Walmart: 2.5PB – Bank of America: 1.5PB – Dell: 1PB 21
  • 22. Data Volumes: Current data sets • NYSE: 1.5TB daily, 8PB maintained • Google: “Several dozen” 1PB sets (CACM Jan 2010) • LHC: 15PB per year (avg 41TB/day) • (http://public.web.cern.ch/public/en/lhc/Computing-en.html) • LSST: 13TB nightly • (http://www.lsst.org/Project/docs/data-challenge.pdf) • Wal-Mart: 536TB, 1B entries daily (2006) • Facebook: 350M users, 3.5B shared items/week  All data is rich.  Data rates do not include building relationships. David A. Bader 22
  • 23. Data Volumes: Current data rates • NYSE: 1.5TB daily • 1 Gb Ethernet: 8.7TB daily at • LHC: 41TB daily 100%, 5-6TB daily realistic • LSST: 13TB daily • Multi-TB storage on 10GE: 300TB daily read, 90TB daily write  Current data is at the limit of current systems.  Not counting relationships... David A. Bader 23
  • 24. Data Volumes: Future data rates • Facebook: >2x yearly • Ethernet: 4x in next 2 years. • Twitter: >10x yearly Maybe. • Growing sources: – Bioinformatics • Flash storage, direct: 10x – Nano-scale devices write, 4x read. Huge cost for – Security multi-PB storage.  Data rate growth is outstripping technology.  Then consider: latency, ingest, processing, response... David A. Bader 24
  • 25. Data Volumes: Current data sets • NYSE: 8PB • CPU ↔ Memory: – QPI,HT: 2PB/day@100% • Google: >12PB – Power7: 8.7PB/day • LHC: >15PB • Mem: – NCSA Blue Waters target: 2PB  Even with parallelism, current (in-progress) systems cannot handle more than a few passes... per day. David A. Bader 25