Using Graphs to Enable National-Scale Analytics


  1. 1. Using Graphs to Enable National-Scale Analytics http://www.cs.njit.edu/~bader Connections: Graphs in Government
  2. 2. David A. Bader Distinguished Professor and Director, Institute for Data Science • IEEE Fellow, SIAM Fellow, AAAS Fellow • Recent Service: • White House's National Strategic Computing Initiative (NSCI) panel • Computing Research Association Board • NSF Advisory Committee on Cyberinfrastructure • Council on Competitiveness HPC Advisory Committee • IEEE Computer Society Board of Governors • IEEE IPDPS Steering Committee • Editor-in-Chief, ACM Transactions on Parallel Computing • Editor-in-Chief, IEEE Transactions on Parallel and Distributed Systems • Over $184M of research awards • 250+ publications, ≥ 9,700 citations, h-index ≥ 57 • National Science Foundation CAREER Award recipient • Directed: Facebook AI Systems • Directed: NVIDIA GPU Center of Excellence, NVIDIA AI Lab (NVAIL) • Directed: Sony-Toshiba-IBM Center for the Cell/B.E. Processor • Founder: Graph500 List benchmarking “Big Data” platforms • Recognized as a “RockStar” of High Performance Computing by InsideHPC in 2012 and as HPCwire’s People to Watch in 2012 and 2014. 15 September 2020 David Bader 2
  3. 3. Solving real-world challenges • Urban sustainability • Healthcare analytics • Trustworthy, Free and Fair Elections • Insider threat detection • Utility infrastructure protection • Cyberattack defense • Disease outbreak and epidemic monitoring 15 September 2020 David Bader 3
  4. 4. Strategic Intelligence “Advances in communications and the democratization of other technologies have also generated an ability to create and share vast and exponentially growing amounts of information farther and faster than ever before. This abundance of data provides significant opportunities for the IC, including new avenues for collection and the potential for greater insight, but it also challenges the IC’s ability to collect, process, evaluate, and analyze such enormous volumes of data quickly enough to provide relevant and useful insight to its customers.” → “Develop and maintain capabilities to acquire and evaluate data to obtain a deep understanding of the global political, diplomatic, military, economic, security, and informational environment.”
  5. 5. Data Science: Discovery and Innovation The National Strategic Computing Initiative (NSCI) The NSCI was launched by Executive Order (EO) 13702 in July 2015 to advance U.S. leadership in high performance computing (HPC). McKinsey predicts that data-driven technologies will bring an additional $300 billion of value to the U.S. health care sector alone, and by 2020, 1.5 million more “data-savvy managers” will be needed to capitalize on the potential of data, “big” and otherwise. Manyika, J. et al. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. Retrieved from http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation The ability to manipulate data and understand Data Science is becoming increasingly critical to current and future discovery and innovation. REALIZING THE POTENTIAL OF DATA SCIENCE Final Report from the National Science Foundation Computer and Information Science and Engineering Advisory Committee Data Science Working Group. Francine Berman and Rob Rutenbar, co-Chairs Henrik Christensen, Susan Davidson, Deborah Estrin, Michael Franklin, Brent Hailpern, Margaret Martonosi, Padma Raghavan, Victoria Stodden, Alex Szalay. December 2016
  6. 6. National Strategic Computing Initiative (NSCI) Update 14 Nov 2019 In recognition of the fast-changing computing landscape, the updated plan places new emphasis on the following areas as compared to the 2016 plan: • Computing hardware, with a focus on the 10-year horizon and beyond; • Software infrastructure that will enable effective and sustainable use of new computing; • Overall infrastructure, from data usage and management to cybersecurity, foundries, and prototypes; • And the development of new real-world applications, systems, and opportunities for future computing.
  7. 7. The Reality • This image is a visualization of a personal Friendster network (circa February 2004) out to 3 hops. The network consists of 47,471 people connected by 432,430 edges. Credit: Jeffrey Heer, UC Berkeley
  8. 8. Advantages of Graph Analytics • Much smaller than raw data; can fit in the memory of a large computer • Fast response to queries • Pre-join of database • Combine data from different sources and of different types • Some common intelligence and law enforcement queries are naturally posed on graphs • Particularly for the terrorist threat
  9. 9. Query Example I: Short Paths David Bader 9
  10. 10. Query Example II: Motif Finding Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47 David Bader 10
  11. 11. The Big Picture Analyst makes queries. Massive Databases Fast Graph Query Extract “Window” High Latency Query Graph resides in memory. 15 September 2020 David Bader 11
  12. 12. Data-Quad Known Known Unknown Unknown Objects Patterns 15 September 2020 David Bader 12
  13. 13. Graph Data Science: Real-world challenges All involve exascale streaming graphs: • Health care → disease spread, detection and prevention of epidemics/pandemics (e.g. SARS, Avian flu, H1N1 “swine” flu) • Massive social networks → understanding communities, intentions, population dynamics, pandemic spread, transportation and evacuation • Intelligence → business analytics, anomaly detection, security, knowledge discovery from massive data sets • Systems Biology → understanding complex life systems, drug design, microbial research, unraveling the mysteries of the HIV virus; understanding life and disease • Electric Power Grid → communication, transportation, energy, water, food supply • Modeling and Simulation → performing full-scale economic-social-political simulations REQUIRES PREDICTING / INFLUENCING CHANGE IN REAL-TIME AT SCALE
  14. 14. Graphs are pervasive in large-scale data analysis • Sources of massive data: peta- and exa-scale simulations, experimental devices, the Internet, scientific applications. • New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality. Astrophysics Problem: Outlier detection. Challenges: massive datasets, temporal variations. Graph problems: clustering, matching. Bioinformatics Problem: Identifying drug target proteins. Challenges: Data heterogeneity, quality. Graph problems: centrality, clustering. Social Informatics Problem: Discover emergent communities, model spread of information. Challenges: new analytics routines, uncertainty in data. Graph problems: clustering, shortest paths, flows. Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com
  15. 15. 15 September 2020 David Bader 15
  16. 16. Massive Data Analytics: Infrastructure • The U.S. high-voltage transmission grid has >150,000 miles of line. • Real-time detection of changes and anomalies in the grid is a large-scale problem. • May mitigate impact of widespread blackouts due to equipment failure or intentional damage. 15 September 2020 David Bader 16
  17. 17. Network Analysis for Intelligence and Surveillance • [Krebs ’04] Post 9/11 Terrorist Network Analysis from public domain information • Plot masterminds correctly identified from interaction patterns: centrality • A global view of entities is often more insightful • Detect anomalous activities by exact/approximate graph matching Image Source: http://www.orgnet.com/hijackers.html Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47 15 September 2020 David Bader 17
  18. 18. Massive Data Analytics: Public Health • CDC/national-scale surveillance of public health • Cancer genomics and drug design • Computed Betweenness Centrality of Human Proteome [Figure: human genome core protein interactions, degree vs. betweenness centrality (x-axis: Degree; y-axis: Betweenness Centrality). Highlighted point: ENSG00000145332.2, Kelch-like protein implicated in breast cancer]
  19. 19. Massive Streaming Graph Analytics • Stream of interaction tuples: (A, B, t1, poke) (A, C, t2, msg) (A, D, t3, view wall) (A, D, t4, post) (B, A, t2, poke) (B, A, t3, view wall) (B, A, t4, msg) [Figure: the interaction stream, a graph with billions of nodes (… n9 n8 n7 n6 n5 n4 n3 n2 n1 …), and analysts]
  20. 20. Centrality in Massive Social Network Analysis • Centrality metrics: Quantitative measures that capture the importance of a person in a social network • Betweenness is a global index based on the shortest paths that pass through the person • Can be used for community detection as well • Identifying central nodes in large complex networks is key in several applications: • Biological networks, protein-protein interactions • Sexual networks and AIDS • Identifying key actors in terrorist networks • Organizational behavior • Supply chain management • Transportation networks
  21. 21. Betweenness Centrality (BC) • Key metric in social network analysis [Freeman ’77, Goh ’02, Newman ’03, Brandes ’03] • $\sigma_{st}$: Number of shortest paths between vertices s and t • $\sigma_{st}(v)$: Number of shortest paths between vertices s and t passing through v • $BC(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$
  22. 22. Mining Twitter for Social Good ICPP 2010 Image credit: bioethicsinstitute.org 15 September 2020 David Bader 22
  23. 23. Conclusions • Graph Data Science is an important technique for solving real-world grand challenges • Graphs are a natural abstraction for Big Data and connect people, places, and things • Graphs are useful in problems such as Data Tagging, Triage, Exploratory Data Analysis, Anomaly Detection, Finding Patterns, Insider Threats, Fraud Detection, and Advanced Analytics • Graph technologies such as Neo4j provide Enterprise-class performance • Getting started with Graph Databases is easy
  24. 24. Graph500 Benchmark, www.graph500.org Defining a new set of benchmarks to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads. Executive Committee: D.A. Bader, R. Murphy, M. Snir, A. Lumsdaine • Five Business Area Data Sets: • Cybersecurity • 15 Billion Log Entries/Day (for large enterprises) • Full Data Scan with End-to-End Join Required • Medical Informatics • 50M patient records, 20-200 records/patient, billions of individuals • Entity Resolution Important • Social Networks • Example: Facebook, Twitter • Nearly Unbounded Dataset Size • Data Enrichment • Easily PB of data • Example: Maritime Domain Awareness • Hundreds of Millions of Transponders • Tens of Thousands of Cargo Ships • Tens of Millions of Pieces of Bulk Cargo • May involve additional data (images, etc.) • Symbolic Networks • Example: the Human Brain • 25B Neurons • 7,000+ Connections/Neuron
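
Slide 9 ("Query Example I: Short Paths") names short-path queries without showing a method. The sketch below is one possible illustration, not the approach used in the talk: it assumes an unweighted graph held in memory as an adjacency dictionary, and the `shortest_path` helper and the toy graph are hypothetical names introduced here. Breadth-first search returns one shortest path between two entities.

```python
from collections import deque


def shortest_path(adj, source, target):
    """Return one shortest path from source to target as a list of
    vertices, or None if target cannot be reached (unweighted graph)."""
    parent = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk parent links back to the source, then reverse.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for w in adj.get(u, ()):
            if w not in parent:
                parent[w] = u
                queue.append(w)
    return None


# Hypothetical toy graph (illustrative only, not data from the slides).
adj = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
path = shortest_path(adj, "A", "E")
print(path)
```

Because the graph resides in memory, as slide 11 describes, each such query touches only the vertices within the search radius rather than scanning the underlying database.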

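The betweenness centrality definition on slide 21 can be computed without enumerating every shortest path. The sketch below follows the dependency-accumulation scheme commonly attributed to Brandes, under stated assumptions: the graph is unweighted, every vertex appears as a key of the adjacency dictionary, and the sum runs over ordered pairs (s, t) as in the slide's formula, so undirected graphs are not halved. The function name and toy graph are hypothetical, and the talk does not state which implementation was used.

```python
from collections import deque


def betweenness_centrality(adj):
    """BC(v) = sum over ordered pairs s != v != t of sigma_st(v) / sigma_st,
    for an unweighted graph given as {vertex: [neighbors]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) and
        # recording shortest-path predecessors of each vertex.
        order = []                        # vertices in non-decreasing distance
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:           # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies from the farthest vertices back.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Hypothetical path graph A - B - C: B lies on the only shortest path
# between A and C, counted once per direction.
path_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
scores = betweenness_centrality(path_graph)
print(scores)
```

This runs one breadth-first search per source vertex, so the cost grows with the number of vertices times the number of edges; at the scales the slides describe, approximation or parallel implementations would likely be needed.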