The Role of Data Management in Science Anand Deshpande Persistent Systems September  7,  2011
Good references Microsoft Faculty Summit 2011 http://research.microsoft.com/en-us/events/fs2011/ Tony Hey’s presentations at the event http://research.microsoft.com/en-us/events/fs2011/welcome_introduction_hey_faculitysummit_071811.pdf The Fourth Paradigm book http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf Jim Gray’s work http://research.microsoft.com/en-us/um/people/gray/ Alex Szalay’s work on Large Databases and Science http://www.sdss.jhu.edu/~szalay/servers.html 2
Availability and ability to handle very large volumes of storage and complex computing is redefining how we do Science 3
Galileo and his telescope First Paradigm: For thousands of years, Science was about empirically describing natural phenomena 4
Second Paradigm: Theoretical Science using models and generalization Newton Kepler Maxwell 5
Third Paradigm: Computational Science: Simulating Complex Phenomena 6 Over the last 25 years, scientists have used computer simulation to validate theories. A hurricane computer simulation.
Are You Living In a Computer Simulation? Nick Bostrom. Philosophical Quarterly, 2003, Vol. 53, No. 211, pp. 243-255. (An earlier draft was circulated in 2001.) ABSTRACT. This paper argues that at least one of the following propositions is true: (1) the human species is very likely to go extinct before reaching a “posthuman” stage; (2) any posthuman civilization is extremely unlikely to run a significant number of simulations of their evolutionary history (or variations thereof); (3) we are almost certainly living in a computer simulation. It follows that the belief that there is a significant chance that we will one day become posthumans who run ancestor-simulations is false, unless we are currently living in a simulation. A number of other consequences of this result are also discussed. 7
Fourth Paradigm: Data Intensive Science The scientific method was traditionally driven by hypothesis. Scientists first predict an outcome, then collect experimental data to test that prediction. In the new data-driven approach, however, researchers start by collecting data and analyze it later. 8
Scientists are collecting data How to codify data and extract insights and knowledge? 9 Experiments and Instruments Simulations Question Answer Literature Other Archives
Collaboration is the key to good science Isaac Newton famously remarked in a letter to his rival Robert Hooke dated February 5, 1676 that: "What Descartes did was a good step. You have added much several ways, and especially in taking the colours of thin plates into philosophical consideration. If I have seen a little further it is by standing on the shoulders of Giants." 10
Collaboration in Science in the Facebook age. 11
Examples of Microsoft Environmental Informatics Framework 12
13
14
15
16
17
Virtual Observatory - India VOPlot VOPlot3D VOMegaPlot VOStat VOConvert VOCat
caBIG: Cancer and Biomedical Informatics Grid 28
caBIG network Persistent is a participant in the caBIG network 29
Getting All Scientific Data Online 30 Many disciplines overlap and use data from other sciences The Internet can unify all literature and data Increase Scientific Information Velocity From Jim Gray’s last talk.
How much data are we collecting? 31
32
The SKA radio telescope will be a virtual time machine, able to look back more than 10 billion years. But it will require us to process 1 terabyte every second at a speed of 3000 teraflops (3 × 10^15 floating-point operations per second)! 33
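To get a feel for the SKA figures quoted on this slide, the short sketch below turns the per-second rate into a daily volume and a compute budget per byte. It assumes decimal (SI) units, so 1 terabyte is taken as 10^12 bytes and 1 teraflop as 10^12 floating-point operations per second; the function names are illustrative only.

```python
# Back-of-the-envelope arithmetic for the SKA numbers on the slide.
# Assumption: SI prefixes (1 TB = 10**12 bytes, 1 TFLOPS = 10**12 FLOP/s).

TERABYTE = 10**12          # bytes
PETABYTE = 10**15          # bytes
TERAFLOPS = 10**12         # floating-point operations per second
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_bytes(rate_bytes_per_second: int) -> int:
    """Bytes accumulated in one day at a constant ingest rate."""
    return rate_bytes_per_second * SECONDS_PER_DAY


def flops_per_byte(flops: float, rate_bytes_per_second: float) -> float:
    """Floating-point operations available for each incoming byte."""
    return flops / rate_bytes_per_second


ska_rate = 1 * TERABYTE        # 1 terabyte every second
ska_compute = 3000 * TERAFLOPS  # 3000 teraflops

per_day = daily_volume_bytes(ska_rate)
budget = flops_per_byte(ska_compute, ska_rate)

print(f"Daily volume: {per_day / PETABYTE:.1f} petabytes")
print(f"Compute budget: {budget:.0f} floating-point operations per byte")
```

Under these assumptions the telescope would produce tens of petabytes per day, which is why the data has to be reduced as it streams in rather than stored raw.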
Scientists are not the only ones collecting data! 34
The impact of aggregate data http://data.mint.com/ 35
Concur:  Aggregating Travel Data 36
37
38 http://www.technologyreview.com/blog/arxiv/26097/
39
Internet and historical snapshots
Internet Archive / Wayback machine: The Internet Archive offers permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format. Founded in 1996, the Internet Archive now includes texts, audio, moving images, and software as well as archived Web pages.
Wikipedia: Wikipedia is the most famous cooperatively edited encyclopedia. Since every change is stored, Web pages' history can offer a detailed subject-based overview of the most important references of the past.
The Knowledge Centers: A collection of links to other resources for finding Web pages as they used to exist in the past.
Whenago: Whenago provides quick access to historical information about what happened in the past on a given day.
World Digital Library: The World Digital Library (WDL) makes available on the Internet, free of charge and in multilingual format, significant primary materials from countries and cultures around the world.

Information retrieval engines
Freebase: Freebase is an open, Creative Commons licensed repository of structured data of more than 12 million entities. It provides collaborative tools to link entities together and keep them updated.
Wolfram Alpha Computational Knowledge Engine: An attempt to compute whatever can be computed about anything. It aims to provide a single source that can be relied on by everyone for definitive answers to factual queries.

Text mining on the Web
Google Trends: Google Trends shows visual statistics about how often keywords have been searched on Google over time. Google Trends also shows how frequently topics have appeared in Google News stories, and in which geographic regions people have searched for them most.
Google Flu Trends: Google Flu Trends uses aggregated Google search data to estimate flu activity. Data are available for download as well.
The Observatorium: The Observatorium project focuses on complex network dynamics in the Internet, proposing to monitor its evolution in real time, with the general objective of better understanding the processes of knowledge generation and opinion dynamics.
We Feel Fine: A database of several million human feelings, harvested from blogs and social pages on the Web. Using a series of playful interfaces, the feelings can be searched and sorted across a number of demographic slices. A Web API is available as well.
CyberEmotions: The CyberEmotions project focuses on the role of collective emotions in creating, forming and breaking up e-communities. It makes available for download three datasets containing news and comments from the BBC News forum, Digg and MySpace, only for academic research and only after the submission of an application form.

Social data sharing
Linked Data: Linked Data is about using the Web to connect related data that was not previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.
Dataverse Network Project: The Dataverse Network is an application to publish, share, reference, extract and analyze research data. It facilitates making data available to others, and allows others' work to be replicated. Researchers and data authors get credit, publishers and distributors get credit, affiliated institutions get credit.
Data360: Data360 is an open-source, collaborative and free Web site. The site hosts a common and shared database, which any person or organization committed to neutrality and non-partisanship (meaning let the data speak) can use for presentations and visualizations.
Swivel: Swivel is a Web site where people share reports of charts and numbers. It is free for public data, and charges a monthly fee to people who want to use it in private.
Many Eyes: An IBM initiative that allows users to upload their datasets and use a collection of tools to obtain meaningful visualizations from them. Each visualization is publicly stored on a dedicated page, where users can comment on, rate and tag it. Reuse of the data is possible and encouraged.

Conflict data
CSCW Data on Armed Conflict: CSCW and the Uppsala Conflict Data Program (UCDP) at the Department of Peace and Conflict Research, Uppsala University, have collaborated in the production of a dataset of armed conflicts, both internal and external, in the period 1946 to the present. Currently, probably the most extensive dataset repository available, in particular for historic data.
WarViews: The aim of the WarViews project is to create an easy-to-use front-end for the exploration of GIS data on conflict. It can run in a Web browser or it can be displayed using Google Earth. The following are civil-war-specific datasets with additional empirical information: Ethnic group location dataset; Ethnic power balances dataset; Collection of updated datasets and codebooks from the Uppsala Conflict Data Program (UCDP).
ACLED: Partially contained in the PRIO dataset, ACLED (Armed Conflict Location and Events Dataset) is designed for disaggregated conflict analysis and crisis mapping. This dataset codes the location of all reported conflict events in 50 countries in the developing world. Data are currently being coded from 1997 to 2009 and the project continues to backdate conflict information for African states to the year of independence.
CERAC: The Conflict Analysis Resource Center hosts several cross-country conflict datasets and a few datasets of particular countries. Repositories also have datasets of political instability and conflict.
The Cross-National Time-Series Data Archive: The Cross-National Time-Series Data Archive provides annual data for a range of countries from 1815 to the present. Frequently cited, it is one of the "leading datasets on political violence", according to Robert Bates at Harvard University. It is "possibly the most widely used event dataset" according to Henrik Urdal, International Peace Research Institute, Oslo (PRIO).
Country-specific repositories: Iraq, Afghanistan: Collection of datasets of terrorist acts.

Data in economics and finance
Bloomberg: International real-time data provider for decision makers in finance, business and government.
Maddison Data: Historical statistics about GDP and population data.
UNCTAD Statistics: The UNCTAD Handbook of Statistics on-line provides time series of economic data and development indicators, in some cases going back as far as 1950; the Commodity Price Statistics Online Database; the UNCTAD-TRAINS on the Internet (Trade Analysis and Information System) for trade control measures as well as import flows by origin for over 130 countries; the Foreign Direct Investment database (FDI).
OECD Statistics Portal: Large collection of datasets covering economics and demographics. Extractions are freely available, full access requires subscription.
EUROSTAT: Detailed statistics on the EU and candidate countries, and various statistical publications for sale.
Where's George?: Spatial tracking system for U.S. and Canadian dollars.
Eurobilltracker: Spatial tracking system for Euro banknotes.

Scientific collaboration data
ISI Web of Knowledge: Comprehensive source of information in the sciences, social sciences, arts, and humanities. It encompasses several datasets, among which the following are maybe the most noteworthy: Journal Citation Reports, which allows one to evaluate and compare journals using citation data drawn from over 7,500 scholarly and technical journals; Web of Science, which consists of seven databases containing information gathered from thousands of scholarly journals, books, book series, reports, conferences, and more.
Google Scholar: Google Scholar is a search engine specialized in scholarly literature. It indexes different sources (articles, books, abstracts, theses, etc.) from several disciplines and sorts them according to number of citations, author and journal impact factor.
Scholarometer: Scholarometer is a social tool to facilitate citation analysis and help evaluate the impact of an author's publications. It works as a software plug-in for the Firefox browser.
Scopus: Scopus is a very large abstract and citation database of research literature. It is available only for registered users.
Living Science: Living Science is a real-time global science observatory based on publications submitted to arXiv.org. It covers real-time (daily) submissions of publications in areas as diverse as Physics, Astronomy, Computer Science, Mathematics and Quantitative Biology. Currently, contents are dynamically updated each day. Living Science is a powerful analysis tool to identify the magnitude and impact of scientific work worldwide.

Social sciences
ICPSR of the University of Michigan: ICPSR offers more than 500,000 digital files containing social science research data. Disciplines represented include political science, sociology, demography, economics, history, gerontology, criminal justice, public health, foreign policy, terrorism, health and medical care, early education, education, racial and ethnic minorities, psychology, law, substance abuse and mental health, and more.
UK Data Center of the University of Essex: The UK's largest collection of digital research data in the social sciences and humanities.
Berkeley's UC DATA Archive: UC DATA's data holdings are primarily in the areas of Political, Social and Health Sciences.
The Economic and Social Data Service (ESDS): The Economic and Social Data Service (ESDS) is a national data service providing access and support for an extensive range of key economic and social data, both quantitative and qualitative, spanning many disciplines and themes. It contains a map of additional datasets from several European countries.
CESSDA: Wide data collections including sociological surveys, election studies, longitudinal studies, opinion polls, and census data. Among the materials are international and European data such as the European Social Survey, the Eurobarometers, and the International Social Survey Programme.
Gapminder Data: Gapminder is a popular technology and Web application for cross-visualisation of trends in time series of data. It also opens an archive of multiple datasets on diverse socio-economic indicators.
World Value Survey: The World Value Survey provides data about values and cultural changes in societies all over the world.

Urban data
Global Urban Observatory database: The Global Urban Observatory (GUO) offers policy-oriented urban indicators, statistics and other urban information.
Urban Observatory: U.S.-based datasets about wealth, innovation and crime across cities.

Traffic data
NGSIM: The Next Generation Simulation (NGSIM) program was initiated by the United States Department of Transportation (US DOT). The program developed a core of open behavioral algorithms in support of traffic simulation, and collected high-quality primary traffic and trajectory data intended to support the research and testing of the new algorithms.
Swiss Federal Roads Office FEDRO: The Swiss Federal Roads Office offers a comprehensive overview of traffic flows in Switzerland. Data are collected by permanent automatic traffic counting stations and have been complemented by regular manual checking since 1961.
TrafficData: The aim of the International Traffic Database (ITDb) project is to provide traffic data to various groups (researchers, practitioners, public entities) in a format according to their particular needs, ranging from raw measurement data to statistical analysis. ITDb promotes a flexible traffic data provision format based on user needs and standard habits.
Clearing House for Transport Data: The Clearing House for Transport Data in the German Aerospace Center is the first point of contact for a quick overview of the available data. It is targeted at both organizations who gather transport-relevant data and those who wish to use the results of such research. The information offered includes the preparation of detailed metadata on the datasets, as well as notes on possible uses and sources.
Regiolab Delft: The regiolab-delft initiative started just after 2000 as a joint project led by TU Delft in association with the Municipality of Delft, the TRAIL research school, the Province of South Holland, the Ministry of Transport and several industrial partners. The archived dataset consists of over 6 years of 1-minute averaged speed and aggregate flow data from densely spaced inductive loops on the freeway network in the province of South Holland, and other data from intersection controllers, license plate detection cameras and much more.
RITA: The Research and Innovative Technology Administration (RITA) of the U.S. Department of Transportation offers several datasets with traffic statistics on maritime, freight, airline and passenger transport, among others.
ETH Travel Data Archive (ETHTDA): The ETH Travel Data Archive (ETHTDA) is a virtual platform allowing end users to browse the archived travel data over the Web and enabling simple statistical analysis.
Metropolitan Travel Survey Archive: The Metropolitan Travel Survey Archive exists to store, preserve, and make publicly available, via the Internet, travel surveys conducted by metropolitan areas, states and localities.
Infoblu: Infoblu is a private company providing real-time traffic monitoring services for Italy. All services are available for a fee.

Open maps
Google Maps: World-famous map service. It offers several additional services such as Street View, user-uploaded content (photos, comments and ratings) and personalized overlays through service APIs.
OpenStreetMap: OpenStreetMap (by UCL) is a free editable map of the whole world. OpenStreetMap allows you to view, edit and use geographical data in a collaborative way from anywhere on Earth.
Tracksource Brasil: Tracksource is a collaborative project aimed at creating and distributing free maps of Brazil.

Logistics data
National Household Travel Survey: The National Household Travel Survey (NHTS) collects data on both long-distance and local travel by the American public. The joint survey gathers trip-related data such as mode of transportation, duration, distance and purpose of trip. It also gathers demographic, geographic, and economic data for analysis purposes. It is part of RITA.
Commodity Flow Survey: The Commodity Flow Survey (CFS) is the primary source of national and state-level data on domestic freight shipments by American establishments in mining, manufacturing, wholesale, auxiliaries, and selected retail industries. Data are provided on the types, origins and destinations, values, weights, modes of transport, distance shipped, and ton-miles of commodities shipped. It is part of RITA and it is conducted every five years (last sampling in 2007).

Climate data
Julich: Climate data from Julich Research Center.
Google.org: Google introduces its data-driven philanthropic projects, among which are two environmental satellite observatories: the Earth Engine, for monitoring trends in world deforestation; and the Crisis Response, for monitoring the oil spill from the sunken Deepwater Horizon platform.

Reality mining
Reality Mining: Behavioral data collected from 100 mobile phones over 9 months. Includes both proximity and phone usage statistics. Two anonymized datasets are available: single user (MySQL) and global (Matlab).

Other open data initiatives
Data.gov: Wide collection of public US datasets available for research.
Data.gov.uk: Wide collection of public UK datasets available for research.
Digging Into Data: Launched by the National Science Foundation (NSF), it offers a collection of diverse data repositories.
Guardian Data Blog: Data journalism initiative that posts public interest (primarily UK relevant) datasets together with their analysis. A few collaborations with data visualization artists are present as well.
Google Public Data: Google offers several large datasets on diverse world socio-economic indicators and provides tools for easy visualization.
40
We are getting overwhelmed by data! Sensors are generating new kinds of data in real time. 80% of the new data is unstructured. Analyzing large volumes of data is challenging and critical. 41
3.3 Exabytes of Digital Information Will Be Created Every Day This Year One exabyte is the equivalent of about 50,000 years of DVD-quality video (1 exabyte = 10^18 bytes) 42
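As a sanity check on the "50,000 years of DVD-quality video" comparison, the sketch below works out the video bitrate that the claim implies. It assumes 1 exabyte = 10^18 bytes (as stated on the slide) and a 365.25-day year; the helper name is illustrative.

```python
# What bitrate does "1 exabyte = 50,000 years of video" imply?
# Assumptions: 1 EB = 10**18 bytes, 1 year = 365.25 days.

EXABYTE = 10**18                      # bytes
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def implied_bitrate_mbps(total_bytes: float, years: float) -> float:
    """Average bitrate in megabits per second for video of the given length."""
    seconds = years * SECONDS_PER_YEAR
    return total_bytes * 8 / seconds / 1e6


rate = implied_bitrate_mbps(EXABYTE, 50_000)
video_years_per_day = 3.3 * 50_000    # 3.3 EB created every day

print(f"Implied bitrate: {rate:.2f} Mbit/s")
print(f"Daily data as video: about {video_years_per_day:,.0f} years")
```

The implied rate comes out at a few megabits per second, which is in the range commonly quoted for DVD video, so the slide's comparison is at least self-consistent.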
What are petabytes and exabytes? 43
44 The Storage Pyramid
Disks are growing in capacity and reducing in price 45
1956:  5MB Hard Disk 46
1980:  1GB Hard Disk 47
2008: 1TB Hard Disk 48
49
The Continued Explosion of Information 80% of new information growth is unstructured content, with 90% of that unmanaged. The volume, variety, and velocity of information are driving unprecedented complexity and opportunity. 2009: 800,000 petabytes. 2020: 35 zettabytes, 44x as much data and content over the coming decade. Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010 50
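The "44x" figure on this slide follows directly from the two IDC data points, and the same numbers give an implied annual growth rate. The sketch below assumes SI prefixes (1 petabyte = 10^15 bytes, 1 zettabyte = 10^21 bytes) and steady compound growth between 2009 and 2020, which is a simplification.

```python
# Growth implied by the IDC figures: 800,000 PB (2009) to 35 ZB (2020).
# Assumptions: SI prefixes and constant compound growth between the two years.

PETABYTE = 10**15    # bytes
ZETTABYTE = 10**21   # bytes

start_bytes = 800_000 * PETABYTE   # 2009
end_bytes = 35 * ZETTABYTE         # 2020
years = 2020 - 2009

growth_factor = end_bytes / start_bytes
annual_growth = growth_factor ** (1 / years) - 1

print(f"Growth factor: {growth_factor:.2f}x")
print(f"Implied compound annual growth: {annual_growth:.1%}")
```

The growth factor rounds to the 44x shown on the slide; the annual rate is only as good as the constant-growth assumption.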
Need for Semantic Computing  51
52
"Scalable, fast access to over 100 petabytes of LSST data is essential to enable exploration, experimentation, and discovery by professional astronomers, students, and the public. SciDB's MPP architecture and array data model are a good match."...LSST "Atmospheric data from the DOE-funded Atmospheric Radiation Measurement Program's ground-based sensors and other atmospheric programs' satellites and models is well suited for the SciDB design."...PNNL 53
New analytical database for massive datasets Massively scalable advanced analytics with integrated & transparent data management First class support for scientific data and scientific research 54 What is SciDB?
Broad range of use cases  large scale statistical analysis is fundamental to all Pharma/Biotech  Agro-tech  Healthcare Analytics  Oil/Gas  Smart sensors  Insurance  Weblog Analytics  Quantitative Finance 55
O(100) petabytes LSST 56
Technical problems being solved Scale-up the decision-making tools, not just the data storage Organize the data optimally for analysis  Especially with machine-generated data  Enable increased productivity Less code, less data movement: more analysis  Provide massive speed and scalability on commodity HW 57
Designed for Scientific Research Data is updatable but never overwritten  Uncertainty stored with data and can be propagated through calculations Error bars, confidence metrics, normal probability distribution functions  Support for versioning and time series Time is an automatically supported extra dimension Provenance maintained for reproducibility  Keep the raw data, the derived data, and the derivation 58
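The design points on this slide (updates that never overwrite, uncertainty carried with each value, versions and derivations kept) can be illustrated with a small sketch. This is not SciDB's API or implementation: the `Measurement` and `VersionedCell` classes are hypothetical, and the error propagation assumes independent errors combined to first order.

```python
# Illustrative sketch only; class and method names are hypothetical, not SciDB's.
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Measurement:
    """A value stored together with its one-sigma uncertainty."""
    value: float
    sigma: float

    def __add__(self, other: "Measurement") -> "Measurement":
        # Independent errors add in quadrature.
        return Measurement(self.value + other.value,
                           math.hypot(self.sigma, other.sigma))

    def __mul__(self, other: "Measurement") -> "Measurement":
        # First-order propagation for a product of independent values.
        sigma = math.hypot(other.value * self.sigma, self.value * other.sigma)
        return Measurement(self.value * other.value, sigma)


@dataclass
class VersionedCell:
    """Append-only cell: every update creates a new version with its derivation."""
    history: List[Tuple[int, Measurement, str]] = field(default_factory=list)

    def update(self, measurement: Measurement, derivation: str) -> int:
        version = len(self.history) + 1
        self.history.append((version, measurement, derivation))
        return version

    def latest(self) -> Measurement:
        return self.history[-1][1]

    def at(self, version: int) -> Measurement:
        return self.history[version - 1][1]


raw = Measurement(3.0, 0.3)
offset = Measurement(4.0, 0.4)

cell = VersionedCell()
v1 = cell.update(raw, "raw instrument reading")
v2 = cell.update(raw + offset, "raw + calibration offset")

print(cell.at(v1), cell.latest())
```

Because old versions are never overwritten, the raw reading stays retrievable next to the derived value and the note describing how it was derived, which is the reproducibility property the slide emphasizes.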
Visualization is important Better displays and devices GPUs provide acceleration through specialized hardware at affordable prices Game boxes are available and inexpensive Gamification is getting interesting  59
60
61
http://www.visual-literacy.org/periodic_table/periodic_table.html 62
The Black Swan: The Impact of the Highly Improbable  Nassim Nicholas Taleb 63 The book focuses on the extreme impact of certain kinds of rare and unpredictable events (outliers) and humans' tendency to find simplistic explanations for these events retrospectively, after the fact. This theory has since become known as the black swan theory.
 "the only thing I know is that I do not know“ Socrates 64
Join 65 http://india.acm.org/
Thank you 66

Más contenido relacionado

La actualidad más candente

wireless sensor network
wireless sensor networkwireless sensor network
wireless sensor network
parry prabhu
 
Enabling Computational Journalism: Automated Fact-Checking and Story-Finding
Enabling Computational Journalism: Automated Fact-Checking and Story-FindingEnabling Computational Journalism: Automated Fact-Checking and Story-Finding
Enabling Computational Journalism: Automated Fact-Checking and Story-Finding
The Innovative Data Intelligence Research (IDIR) Laboratory, University of Texas at Arlington
 

La actualidad más candente (20)

Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...
Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...
Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...
 
wireless sensor network
wireless sensor networkwireless sensor network
wireless sensor network
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
Designing a second generation of open data platforms
Designing a second generation of open data platformsDesigning a second generation of open data platforms
Designing a second generation of open data platforms
 
Massive Data Analysis- Challenges and Applications
Massive Data Analysis- Challenges and ApplicationsMassive Data Analysis- Challenges and Applications
Massive Data Analysis- Challenges and Applications
 
An Open Data Story
An Open Data StoryAn Open Data Story
An Open Data Story
 
Cs501 dm intro
Cs501 dm introCs501 dm intro
Cs501 dm intro
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017
 
Advanced web searching
Advanced web searchingAdvanced web searching
Advanced web searching
 
A Paradox but a Possibility: Modern Data Technologies in the Humanitarian World
A Paradox but a Possibility: Modern Data Technologies in the Humanitarian WorldA Paradox but a Possibility: Modern Data Technologies in the Humanitarian World
A Paradox but a Possibility: Modern Data Technologies in the Humanitarian World
 
Geographic Information Management Transformation
Geographic Information Management TransformationGeographic Information Management Transformation
Geographic Information Management Transformation
 
Providing geospatial information as Linked Open Data
Providing geospatial information as Linked Open DataProviding geospatial information as Linked Open Data
Providing geospatial information as Linked Open Data
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
 
The GDELT project
The GDELT project The GDELT project
The GDELT project
 
Broad Data
Broad DataBroad Data
Broad Data
 
What does “BIG DATA” mean for official statistics?
What does “BIG DATA” mean for official statistics?What does “BIG DATA” mean for official statistics?
What does “BIG DATA” mean for official statistics?
 
Enabling Computational Journalism: Automated Fact-Checking and Story-Finding
Enabling Computational Journalism: Automated Fact-Checking and Story-FindingEnabling Computational Journalism: Automated Fact-Checking and Story-Finding
Enabling Computational Journalism: Automated Fact-Checking and Story-Finding
 
Big data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing PlatformsBig data Mining Using Very-Large-Scale Data Processing Platforms
Big data Mining Using Very-Large-Scale Data Processing Platforms
 
Tim Berners-Lee's 5-Star Open Data Scheme
Tim Berners-Lee's 5-Star Open Data SchemeTim Berners-Lee's 5-Star Open Data Scheme
Tim Berners-Lee's 5-Star Open Data Scheme
 

Similar a Data and science

The Semantic Web Exists. What Next?
The Semantic Web Exists. What Next?The Semantic Web Exists. What Next?
The Semantic Web Exists. What Next?
Anna Fensel
 
Sensory transformation
Sensory transformationSensory transformation
Sensory transformation
Karlos Svoboda
 

Similar a Data and science (20)

The Semantic Web Exists. What Next?
The Semantic Web Exists. What Next?The Semantic Web Exists. What Next?
The Semantic Web Exists. What Next?
 
Linked Data Tutorial (Florianópolis)
Linked Data Tutorial (Florianópolis)Linked Data Tutorial (Florianópolis)
Linked Data Tutorial (Florianópolis)
 
Foresight Analytics
Foresight AnalyticsForesight Analytics
Foresight Analytics
 
Data science innovations
Data science innovations Data science innovations
Data science innovations
 
Data Science Innovations : Democratisation of Data and Data Science
Data Science Innovations : Democratisation of Data and Data Science  Data Science Innovations : Democratisation of Data and Data Science
Data Science Innovations : Democratisation of Data and Data Science
 
Zeng marcia ifla-subjectaccesssmartdatadh
Zeng marcia ifla-subjectaccesssmartdatadhZeng marcia ifla-subjectaccesssmartdatadh
Zeng marcia ifla-subjectaccesssmartdatadh
 
Sensory transformation
Sensory transformationSensory transformation
Sensory transformation
 
Ongoing Research in Data Studies
Ongoing Research in Data StudiesOngoing Research in Data Studies
Ongoing Research in Data Studies
 
OPEN KNOWLEDGE PLATFORM USE-CASES - TugaIT 2018
OPEN KNOWLEDGE PLATFORM USE-CASES - TugaIT 2018OPEN KNOWLEDGE PLATFORM USE-CASES - TugaIT 2018
OPEN KNOWLEDGE PLATFORM USE-CASES - TugaIT 2018
 
Bigdatacooltools
BigdatacooltoolsBigdatacooltools
Bigdatacooltools
 
Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
 
Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
 
Data science Innovations January 2018
Data science Innovations January 2018Data science Innovations January 2018
Data science Innovations January 2018
 
Emerging Forms of Data and Analytics
Emerging Forms of Data and AnalyticsEmerging Forms of Data and Analytics
Emerging Forms of Data and Analytics
 
Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?
 
Drowning in information – the need of macroscopes for research funding
Drowning in information – the need of macroscopes for research fundingDrowning in information – the need of macroscopes for research funding
Drowning in information – the need of macroscopes for research funding
 
Decomposing Social and Semantic Networks in Emerging “Big Data” Research
Decomposing Social and Semantic Networks in Emerging “Big Data” ResearchDecomposing Social and Semantic Networks in Emerging “Big Data” Research
Decomposing Social and Semantic Networks in Emerging “Big Data” Research
 
Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
 
Building COVID-19 Knowledge Graph at CoronaWhy
Building COVID-19 Knowledge Graph at CoronaWhyBuilding COVID-19 Knowledge Graph at CoronaWhy
Building COVID-19 Knowledge Graph at CoronaWhy
 
Introduction to Linked Data
Introduction to Linked DataIntroduction to Linked Data
Introduction to Linked Data
 

Data and science

  • 1. The Role of Data Management in Science. Anand Deshpande, Persistent Systems. September 7, 2011
  • 2. Good references: Microsoft Faculty Summit 2011, http://research.microsoft.com/en-us/events/fs2011/; Tony Hey's presentations at the event, http://research.microsoft.com/en-us/events/fs2011/welcome_introduction_hey_faculitysummit_071811.pdf; The Fourth Paradigm book, http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf; Jim Gray's work, http://research.microsoft.com/en-us/um/people/gray/; Alex Szalay's work on Large Databases and Science, http://www.sdss.jhu.edu/~szalay/servers.html
  • 3. The availability of very large volumes of storage and complex computing, and the ability to handle them, is redefining how we do science.
  • 4. Galileo and his telescope. First Paradigm: For thousands of years, science was about empirically describing natural phenomena.
  • 5. Second Paradigm: Theoretical science using models and generalization. Newton, Kepler, Maxwell.
  • 6. Third Paradigm: Computational science: simulating complex phenomena. Over the last 25 years, scientists have used computer simulation to validate theories. Image: a hurricane computer simulation.
  • 7. Are You Living In a Computer Simulation? Nick Bostrom. Philosophical Quarterly, 2003, Vol. 53, No. 211, pp. 243-255. (An earlier draft was circulated in 2001.) ABSTRACT. This paper argues that at least one of the following propositions is true: (1) the human species is very likely to go extinct before reaching a "posthuman" stage; (2) any posthuman civilization is extremely unlikely to run a significant number of simulations of their evolutionary history (or variations thereof); (3) we are almost certainly living in a computer simulation. It follows that the belief that there is a significant chance that we will one day become posthumans who run ancestor-simulations is false, unless we are currently living in a simulation. A number of other consequences of this result are also discussed.
  • 8. Fourth Paradigm: Data-Intensive Science. The scientific method was traditionally driven by hypothesis: scientists first make a prediction, then collect experimental data to test that prediction. In the new data-driven approach, however, researchers start by collecting data and analyze it later.
  • 9. Scientists are collecting data. How to codify data and extract insights and knowledge? Diagram labels: Experiments and Instruments; Simulations; Literature; Other Archives; Question; Answer.
  • 10. Collaboration is the key to good science. Isaac Newton famously remarked in a letter to his rival Robert Hooke dated February 5, 1676 that: "What Descartes did was a good step. You have added much several ways, and especially in taking the colours of thin plates into philosophical consideration. If I have seen a little further it is by standing on the shoulders of Giants."
  • 11. Collaboration in science in the Facebook age.
  • 12. Examples of Microsoft Environmental Informatics Framework
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18. Virtual Observatory - India: VOPlot, VOPlot3D, VOMegaPlot, VOStat, VOConvert, VOCat
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26. R4 R3 R0 R1 R2
  • 27.
  • 28. caBIG: Cancer and Biomedical Informatics Grid
  • 29. caBIG network: Persistent is a participant in the caBIG network.
  • 30. Getting All Scientific Data Online. Many disciplines overlap and use data from other sciences. The Internet can unify all literature and data. Increase scientific information velocity. From Jim Gray's last talk.
  • 31. How much data are we collecting?
  • 32.
  • 33. The SKA radio telescope will be a virtual time machine, able to look back more than 10 billion years. But it will require us to process 1 terabyte every second at a speed of 3000 TeraFlops (trillions of floating-point operations per second)!
  • 34. Scientists are not the only ones collecting data!
  • 35. The impact of aggregate data: http://data.mint.com/
  • 36. Concur: Aggregating Travel Data
  • 37.
  • 39. Internet and historical snapshots
      Internet Archive / Wayback Machine: The Internet Archive offers permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format. Founded in 1996, the Internet Archive now includes texts, audio, moving images, and software as well as archived Web pages.
      Wikipedia: Wikipedia is the most famous cooperatively edited encyclopedia. Since every change is stored, Web pages' history can offer a detailed subject-based overview of the most important references of the past.
      The Knowledge Centers: A collection of links to other resources for finding Web pages as they used to exist in the past.
      Whenago: Whenago provides quick access to historical information about what happened in the past on a given day.
      World Digital Library: The World Digital Library (WDL) makes available on the Internet, free of charge and in multilingual format, significant primary materials from countries and cultures around the world.
    Information retrieval engines
      Freebase: Freebase is an open, Creative Commons licensed repository of structured data of more than 12 million entities. It provides collaborative tools to link entities together and keep them updated.
      Wolfram Alpha Computational Knowledge Engine: An attempt to compute whatever can be computed about anything. It aims to provide a single source that can be relied on by everyone for definitive answers to factual queries.
    Text mining on the Web
      Google Trends: Google Trends shows visual statistics about how often keywords have been searched on Google over time. Google Trends also shows how frequently topics have appeared in Google News stories, and in which geographic regions people have searched for them most.
      Google Flu Trends: Google Flu Trends uses aggregated Google search data to estimate flu activity. Data are available for download as well.
      The Observatorium: The Observatorium project focuses on complex network dynamics in the Internet, proposing to monitor its evolution in real time, with the general objective of better understanding the processes of knowledge generation and opinion dynamics.
      We Feel Fine: A database of several million human feelings, harvested from blogs and social pages on the Web. Using a series of playful interfaces, the feelings can be searched and sorted across a number of demographic slices. A Web API is available as well.
      CyberEmotions: The CyberEmotions project focuses on the role of collective emotions in creating, forming and breaking up e-communities. It makes available for download three datasets containing news and comments from the BBC News forum, Digg and MySpace, only for academic research and only after the submission of an application form.
    Social data sharing
      Linked Data: Linked Data is about using the Web to connect related data that was not previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.
      Dataverse Network Project: The Dataverse Network is an application to publish, share, reference, extract and analyze research data. It facilitates making data available to others, and allows others' work to be replicated. Researchers and data authors get credit, publishers and distributors get credit, affiliated institutions get credit.
      Data360: Data360 is an open-source, collaborative and free Web site. The site hosts a common and shared database, which any person or organization committed to neutrality and non-partisanship (meaning let the data speak) can use for presentations and visualizations.
      Swivel: Swivel is a Web site where people share reports of charts and numbers. It is free for public data, and charges a monthly fee to people who want to use it in private.
      Many Eyes: An IBM initiative that allows users to upload their datasets and use a collection of tools to obtain meaningful visualizations from them. Each visualization is publicly stored on a dedicated page, where users can comment on, rate and tag it. Reuse of the data is possible and encouraged.
    Conflict data
      CSCW Data on Armed Conflict: CSCW and the Uppsala Conflict Data Program (UCDP) at the Department of Peace and Conflict Research, Uppsala University, have collaborated in the production of a dataset of armed conflicts, both internal and external, in the period 1946 to the present. Currently, probably the most extensive dataset repository available, in particular for historic data.
      WarViews: The aim of the WarViews project is to create an easy-to-use front-end for the exploration of GIS data on conflict. It can run in a Web browser or it can be displayed using Google Earth.
      The following are civil-war-specific datasets with additional empirical information: Ethnic group location dataset; Ethnic power balances dataset.
      Collection of updated datasets and codebooks from the Uppsala Conflict Data Program (UCDP).
      ACLED: Partially contained in the PRIO dataset, ACLED (Armed Conflict Location and Events Dataset) is designed for disaggregated conflict analysis and crisis mapping. This dataset codes the location of all reported conflict events in 50 countries in the developing world. Data are currently being coded from 1997 to 2009 and the project continues to backdate conflict information for African states to the year of independence.
      CERAC: The Conflict Analysis Resource Center hosts several cross-country conflict datasets and a few datasets of particular countries. Repositories also have datasets of political instability and conflict.
      The Cross-National Time-Series Data Archive: The Cross-National Time-Series Data Archive provides annual data for a range of countries from 1815 to the present. Frequently cited, it is "one of the leading datasets on political violence", according to Robert Bates at Harvard University. It is "possibly the most widely used event dataset" according to Henrik Urdal, International Peace Research Institute, Oslo (PRIO).
      Country-specific repositories: Iraq, Afghanistan. Collection of datasets of terrorist acts.
    Data in economics and finance
      Bloomberg: International real-time data provider for decision makers in finance, business and government.
      Maddison Data: Historical statistics about GDP and population data.
      UNCTAD Statistics: The UNCTAD Handbook of Statistics on-line provides time series of economic data and development indicators, in some cases going back as far as 1950; the Commodity Price Statistics Online Database; the UNCTAD-TRAINS on the Internet (Trade Analysis and Information System) for trade control measures as well as import flows by origin for over 130 countries; the Foreign Direct Investment database (FDI).
      OECD Statistics Portal: Large collection of datasets covering economics and demographics. Extractions are freely available; full access requires subscription.
      EUROSTAT: Detailed statistics on the EU and candidate countries, and various statistical publications for sale.
      Where's George?: Spatial tracking system for U.S. and Canadian dollars.
      Eurobilltracker: Spatial tracking system for Euro banknotes.
    Scientific collaboration data
      ISI Web of Knowledge: Comprehensive source of information in the sciences, social sciences, arts, and humanities. It encompasses several datasets, among which the following are perhaps the most noteworthy: Journal Citation Reports, which allows one to evaluate and compare journals using citation data drawn from over 7,500 scholarly and technical journals; and Web of Science, which consists of seven databases containing information gathered from thousands of scholarly journals, books, book series, reports, conferences, and more.
      Google Scholar: Google Scholar is a search engine specialized in scholarly literature. It indexes different sources (articles, books, abstracts, theses, etc.) from several disciplines and sorts them according to number of citations, author and journal impact factor.
      Scholarometer: Scholarometer is a social tool to facilitate citation analysis and help evaluate the impact of an author's publications. It works as a software plug-in for the Firefox browser.
      Scopus: Scopus is a very large abstract and citation database of research literature. It is available only for registered users.
      Living Science: Living Science is a real-time global science observatory based on publications submitted to arXiv.org. It covers real-time (daily) submissions of publications in areas as diverse as Physics, Astronomy, Computer Science, Mathematics and Quantitative Biology. Currently, contents are dynamically updated each day. Living Science is a powerful analysis tool to identify the magnitude and impact of scientific work worldwide.
    Social sciences
      ICPSR of the University of Michigan: ICPSR offers more than 500,000 digital files containing social science research data. Disciplines represented include political science, sociology, demography, economics, history, gerontology, criminal justice, public health, foreign policy, terrorism, health and medical care, early education, education, racial and ethnic minorities, psychology, law, substance abuse and mental health, and more.
      UK Data Center of the University of Essex: The UK's largest collection of digital research data in the social sciences and humanities.
      Berkeley's UC DATA Archive: UC DATA's data holdings are primarily in the areas of Political, Social and Health Sciences.
      The Economic and Social Data Service (ESDS): The Economic and Social Data Service (ESDS) is a national data service providing access and support for an extensive range of key economic and social data, both quantitative and qualitative, spanning many disciplines and themes. It contains a map of additional datasets from several European countries.
      CESSDA: Wide data collections including sociological surveys, election studies, longitudinal studies, opinion polls, and census data. Among the materials are international and European data such as the European Social Survey, the Eurobarometers, and the International Social Survey Programme.
      Gapminder Data: Gapminder is a popular technology and Web application for cross-visualisation of trends in time series of data. It also opens an archive of multiple datasets on diverse socio-economic indicators.
      World Value Survey: The World Value Survey provides data about values and cultural changes in societies all over the world.
    Urban data
      Global Urban Observatory database: The Global Urban Observatory (GUO) offers policy-oriented urban indicators, statistics and other urban information.
      Urban Observatory: U.S.-based datasets about wealth, innovation and crime across cities.
    Traffic data
      NGSIM: The Next Generation Simulation (NGSIM) program was initiated by the United States Department of Transportation (US DOT). The program developed a core of open behavioral algorithms in support of traffic simulation, and collected high-quality primary traffic and trajectory data intended to support the research and testing of the new algorithms.
      Swiss Federal Roads Office FEDRO: The Swiss Federal Roads Office offers a comprehensive overview of traffic flows in Switzerland. Data are collected by permanent automatic traffic counting stations and have been complemented by regular manual checking since 1961.
      TrafficData: The aim of the International Traffic Database (ITDb) project is to provide traffic data to various groups (researchers, practitioners, public entities) in a format according to their particular needs, ranging from raw measurement data to statistical analysis. ITDb promotes a flexible traffic data provision format based on user needs and standard habits.
      Clearing House for Transport Data: The Clearing House for Transport Data in the German Aerospace Center is the first point of contact for a quick overview of the available data. It is targeted at both organizations who gather transport-relevant data and those who wish to use the results of such research. The information offered includes the preparation of detailed metadata on the datasets, as well as notes on possible uses and sources.
      Regiolab Delft: The regiolab-delft initiative started just after 2000 as a joint project led by TU Delft in association with the Municipality of Delft, the TRAIL research school, the Province of South Holland, the Ministry of Transport and several industrial partners. The archived dataset consists of over 6 years of 1-minute averaged speed and aggregate flow data from densely spaced inductive loops on the freeway network in the province of South Holland, and other data from intersection controllers, license plate detection cameras and much more.
      RITA: The Research and Innovative Technology Administration (RITA) of the U.S. Department of Transportation offers several datasets of traffic statistics covering maritime, freight, airline, passenger and other traffic.
      ETH Travel Data Archive (ETHTDA): The ETH Travel Data Archive (ETHTDA) is a virtual platform allowing end users to browse the archived travel data over the Web and enabling simple statistical analysis.
      Metropolitan Travel Survey Archive: The Metropolitan Travel Survey Archive stores, preserves, and makes publicly available, via the Internet, travel surveys conducted by metropolitan areas, states and localities.
      Infoblu: Infoblu is a private company providing real-time traffic monitoring services for Italy. All services are available for a fee.
    Open maps
      Google Maps: World-famous map service. It offers several additional services such as Street View, user-uploaded content (photos, comments and ratings) and personalized overlays through service APIs.
      OpenStreetMap: OpenStreetMap (by UCL) is a free editable map of the whole world. OpenStreetMap allows you to view, edit and use geographical data in a collaborative way from anywhere on Earth.
      Tracksource Brasil: Tracksource is a collaborative project aimed at creating and distributing free maps of Brazil.
    Logistics data
      National Household Travel Survey: The National Household Travel Survey (NHTS) collects data on both long-distance and local travel by the American public. The joint survey gathers trip-related data such as mode of transportation, duration, distance and purpose of trip. It also gathers demographic, geographic, and economic data for analysis purposes. It is part of RITA.
      Commodity Flow Survey: The Commodity Flow Survey (CFS) is the primary source of national and state-level data on domestic freight shipments by American establishments in mining, manufacturing, wholesale, auxiliaries, and selected retail industries. Data are provided on the types, origins and destinations, values, weights, modes of transport, distance shipped, and ton-miles of commodities shipped. It is part of RITA and is conducted every five years (last sampling in 2007).
    Climate data
      Julich: Climate data from Julich Research Center.
      Google.org: Google introduces its data-driven philanthropic projects, among which are two environmental satellite observatories: the Earth Engine, for monitoring trends in world deforestation; and the Crisis Response, for monitoring the oil spill from the sunken Deepwater Horizon platform.
    Reality mining
      Reality Mining: Behavioral data collected from 100 mobile phones over 9 months. Includes both proximity and phone usage statistics. Two anonymized datasets are available: single user (MySQL) and global (Matlab).
    Other open data initiatives
      Data.gov: Wide collection of public US datasets available for research.
      Data.gov.uk: Wide collection of public UK datasets available for research.
      Digging Into Data: Launched by the National Science Foundation (NSF), it offers a collection of diverse data repositories.
      Guardian Data Blog: Data journalism initiative that posts public interest (primarily UK-relevant) datasets together with their analysis. A few collaborations with data visualization artists are present as well.
      Google Public Data: Google offers several large datasets on diverse world socio-economic indicators and provides tools for easy visualization.
  • 40.
  • 41. We are getting overwhelmed by data! Sensors are generating new kinds of data in real time. 80% of the new data is unstructured. Analyzing large volumes of data is both challenging and critical.
  • 42. 3.3 exabytes of digital information will be created every day this year. One exabyte is the equivalent of about 50,000 years of DVD-quality video (1 exabyte = 10^18 bytes).
  • 43. What are petabytes and exabytes?
  • 44. The Storage Pyramid
  • 45. Disks are growing in capacity and falling in price
  • 46. 1956: 5 MB hard disk
  • 47. 1980: 1 GB hard disk
  • 48. 2008: 1 TB hard disk
  • 49.
  • 50. The Continued Explosion of Information (timeline: 1990, 2000, 2010, 2020). 80% of new information growth is unstructured content, with 90% of that unmanaged. The volume, variety, and velocity of information are driving unprecedented complexity and opportunity. 2009: 800,000 petabytes. 2020: 35 zettabytes, 44x as much data and content over the coming decade. Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
  • 51. Need for Semantic Computing
  • 52.
  • 53. "Scalable, fast access to over 100 petabytes of LSST data is essential to enable exploration, experimentation, and discovery by professional astronomers, students, and the public. SciDB's MPP architecture and array data model are a good match." (LSST) "Atmospheric data from the DOE-funded Atmospheric Radiation Measurement Program's ground-based sensors and other atmospheric programs' satellites and models is well suited for the SciDB design." (PNNL)
  • 54. What is SciDB? A new analytical database for massive datasets. Massively scalable advanced analytics with integrated and transparent data management. First-class support for scientific data and scientific research.
  • 55. Broad range of use cases. Large-scale statistical analysis is fundamental to all of them: Pharma/Biotech, Agro-tech, Healthcare Analytics, Oil/Gas, Smart sensors, Insurance, Weblog Analytics, Quantitative Finance.
  • 57. Technical problems being solved. Scale up the decision-making tools, not just the data storage. Organize the data optimally for analysis, especially with machine-generated data. Enable increased productivity: less code, less data movement, more analysis. Provide massive speed and scalability on commodity hardware.
  • 58. Designed for Scientific Research. Data is updatable but never overwritten. Uncertainty is stored with the data and can be propagated through calculations: error bars, confidence metrics, normal probability distribution functions. Support for versioning and time series: time is an automatically supported extra dimension. Provenance is maintained for reproducibility: keep the raw data, the derived data, and the derivation.
  • 59. Visualization is important. Better displays and devices. GPUs provide acceleration through specialized hardware at affordable prices. Game boxes are available and inexpensive. Gamification is getting interesting.
  • 60.
  • 61.
  • 63. The Black Swan: The Impact of the Highly Improbable, by Nassim Nicholas Taleb. The book focuses on the extreme impact of certain kinds of rare and unpredictable events (outliers) and humans' tendency to find simplistic explanations for these events retrospectively, after the fact. This theory has since become known as the black swan theory.
  • 64. "The only thing I know is that I do not know." Socrates
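
The data-volume figures quoted on slides 33, 42 and 50 can be sanity-checked with simple arithmetic. The Python sketch below is not part of the talk. It assumes decimal (SI) prefixes and a 365.25-day year, and the function names are hypothetical, chosen only for illustration.

```python
# Back-of-the-envelope checks of figures quoted on slides 33, 42 and 50.
# Assumptions (not from the slides): SI decimal prefixes, 365.25-day year.

TERA = 10**12
PETA = 10**15
EXA = 10**18
ZETTA = 10**21

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def ska_flops_per_byte(bytes_per_second=1 * TERA, flops=3000 * TERA):
    """Slide 33: floating-point operations available per incoming byte."""
    return flops / bytes_per_second


def implied_video_bitrate_mbps(total_bytes=EXA, years=50_000):
    """Slide 42: bitrate implied by '1 exabyte = 50,000 years of video'."""
    bytes_per_second = total_bytes / (years * SECONDS_PER_YEAR)
    return bytes_per_second * 8 / 1e6  # megabits per second


def growth_factor(start_bytes=800_000 * PETA, end_bytes=35 * ZETTA):
    """Slide 50: ratio of the 2020 projection to the 2009 volume."""
    return end_bytes / start_bytes


if __name__ == "__main__":
    print("SKA operations per byte:", ska_flops_per_byte())
    print("Implied video bitrate (Mbit/s):", round(implied_video_bitrate_mbps(), 2))
    print("2009 to 2020 growth factor:", growth_factor())
```

Under these assumptions the SKA figures work out to 3,000 floating-point operations per incoming byte, the exabyte comparison implies a video bitrate of roughly 5 Mbit/s (plausible for DVD quality), and the IDC growth factor comes to 43.75, which the slide rounds to 44x.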