1) Big data refers to the immense volume, variety and velocity of data that is now available.
2) As our ability to analyze big data increases, it will lead to changes such as the rise of data scientist roles and more accessible information.
3) These changes will affect media and mindsets and will accelerate the pace of change in society. Governance of big data use is needed to balance business, social and individual interests.
We have been working with an Indian telco client for some time now to help reduce their billing costs and improve customer satisfaction.
Challenge:
- Call Detail Record (CDR) processing within their data warehouse was sub-optimal
- Could not achieve real-time billing, which required handling billions of CDRs per day and de-duplication against 15 days' worth of CDR data
- Unable to support future IT and business needs with real-time analytics
Solution:
- Single platform for mediation and real-time analytics reduces IT complexity
- Offloaded the CDR processing to InfoSphere Streams, resulting in enhanced data warehouse performance and improved TCO
- The PMML standard is used to import data mining models from InfoSphere Warehouse
- Each incoming CDR is analyzed using these data mining models, allowing immediate detection of events (e.g., dropped calls) that might create customer satisfaction issues
Business Benefit:
- Data now processed at the speed of business - from 12 hours to 1 second
- Hardware costs reduced to 1/8th
- Support for future growth (more data, more analysis) without the need to re-architect
- Platform in place for real-time analytics to drive revenue
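To make the de-duplication requirement in the telco challenge concrete, here is a minimal Python sketch of dropping repeated CDRs inside a rolling 15-day window. It is an illustration only, not the client's or InfoSphere Streams' implementation: the class name, the record key, and the assumption that records arrive in roughly timestamp order are all hypothetical, and a system handling billions of CDRs per day would need partitioned, far more compact state than an in-memory dictionary.

```python
from collections import OrderedDict
from datetime import datetime, timedelta


class CdrDeduplicator:
    """Drop CDRs whose key was already seen within a retention window.

    Hypothetical sketch: assumes records arrive in roughly timestamp order,
    so the oldest remembered keys sit at the front of the ordered dict.
    """

    def __init__(self, window=timedelta(days=15)):
        self.window = window
        self._seen = OrderedDict()  # record key -> timestamp first seen

    def _evict(self, now):
        # Forget keys that have aged out of the retention window.
        cutoff = now - self.window
        while self._seen:
            _, oldest_ts = next(iter(self._seen.items()))
            if oldest_ts >= cutoff:
                break
            self._seen.popitem(last=False)

    def is_new(self, key, ts):
        """Return True the first time a key is seen inside the window."""
        self._evict(ts)
        if key in self._seen:
            return False
        self._seen[key] = ts
        return True


# Usage: the key fields (caller, callee, call start) are illustrative.
dedup = CdrDeduplicator()
start = datetime(2012, 1, 1, 9, 30)
key = ("caller-001", "callee-002", "2012-01-01T09:30:00")
first = dedup.is_new(key, start)                           # first sighting
repeat = dedup.is_new(key, start + timedelta(days=1))      # duplicate inside window
expired = dedup.is_new(key, start + timedelta(days=16))    # window has passed
```

Keeping only the keys, and evicting them as they age out, is what bounds the state to 15 days of traffic rather than the full history.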
IBM has been working with one of the leading non-profit research institutes, which is heading a regional project to prove the viability and benefits of smart grid technology and to test the concept of demand-based electrical power pricing.
Background: The project is the largest initiative of its kind in the US and is designed to test and quantify smart grid costs and benefits with over 60,000 consumers in five states - Washington, Oregon, Idaho, Montana and Wyoming. The smart grid technique uses an incentive signal and a feedback signal to help coordinate smart grid resources. The two-way communication of this information between power source and destination allows intelligent devices and consumers to make smart decisions about using this energy. The requirements of the project call for a robust infrastructure that facilitates two-way data flow, and for computing power capable of continuously processing petabytes of data.
Solution: IBM is building the infrastructure to disseminate the project’s transactive incentive signal and interlace it with the participants’ responsive assets. The solution consists of:
- IBM streams computing software running on IBM x86 servers to allow for the effective streaming of data
- An IBM data warehouse appliance to analyze and understand the project data (up to 10 petabytes) in minutes
Benefits:
• Enabled a town to avoid a power outage by using a two-way advanced meter system to shut off home water heaters during peak periods, reducing strain on an unreliable underwater cable
• Empowers consumers to make educated choices about how and when to use electricity, and at what price
• Increases grid efficiency and reliability through system self-monitoring and feedback
Most of you know of Watson, our computing system designed to compete on the Jeopardy! game show. Watson represents a breakthrough in terms of the volume of information stored and the ability to access it quickly (answering natural language questions). I think Watson is impressive because there are many commercial uses for this technology – and the technology exists today! The game Jeopardy! provides the ultimate challenge for Watson because the game’s clues involve analyzing subtle meanings, irony, riddles, and other complexities in which humans excel and computers traditionally do not.
If you think about Deep Blue, the 1997 IBM machine that defeated the reigning world chess champion, Watson is yet another major leap in the capability of IT systems to identify patterns, gain critical insight and enhance decision-making despite daunting complexities. While Deep Blue was amazing, it was an achievement in applying compute power to a computationally well-defined and well-bounded game: chess. Watson, on the other hand, faces a challenge that is open-ended and defies the well-bounded mathematical formulation of a game like chess. Watson has to operate in the near-limitless, ambiguous, and highly contextual domain of human language and knowledge.
Watson answers a Grand Challenge: Can IBM design a computing system that rivals a human’s ability to answer questions posed in natural language by interpreting meaning and context and then retrieving, analyzing and understanding vast amounts of information in real time? IBM Watson is a breakthrough in analytic innovation, proving that it is possible to harness vast amounts of information and rival a human’s ability to answer questions posed in natural language in real time. But it doesn’t matter how good the machine is if we don’t have good information to feed it.
We live in a time where a computer can compete against humans at answering questions in plain English, based on storing, retrieving, analyzing and understanding vast amounts of information at real-time speeds. These same capabilities can enable you to improve and optimize your business, too. IBM just showed the value of putting that information to work by creating a computing system capable of competing on Jeopardy!.
Well, there’s a lot of technology that went into Watson – and a lot of Big Data technology in there as well. Now take a moment and think about how this iconic game show is played: you have to answer a question within three seconds. The technology used to analyze and return answers in Watson was a precursor to the Streams technology; in fact, Streams was invented because the technology used in Watson wasn’t fast enough for some of the in-motion requirements that companies have today. Jeopardy! questions are not straightforward; they have puns and tricks to make them harder – so some of our text analytics technology with natural language processing, which is part of the IBM Big Data platform, is in there too (that’s yet another MAJOR DIFFERENTIATOR for IBM in Big Data: our Text Analytics Toolkit, which you will hear more about later in this presentation). It wasn’t always smooth sailing for Watson. The big breakthrough came when they started to use machine learning (ML), and the IBM Big Data platform will further differentiate itself from the field in 2012 when a corresponding toolkit comes to market, just like the text analytics toolkit. Finally, Watson had to have access to a heck of a lot of data – and Big Data technologies were used to load and index over 200 million pages of data; Watson had everything from encyclopedias, to the Bible, to the world-famous music and movie databases, and so on. All the technologies just mentioned had to work together as well.
So IBM clearly has some inflection-point understanding of these technologies and how to get them working together. In the case of text analytics and machine learning – well, we have to make that easier to consume, because you don’t have the world’s largest commercial research organization for math at your fingertips. So we need to build tooling, optimization, and accelerators around that and put these technologies inside consumable toolkits, which we are doing now.
In order to know we are making progress on scientific problems like open-domain QA, well-defined challenges help demonstrate that we can solve concrete and difficult tasks. As you might know, Jeopardy! is a long-standing, well-regarded and highly challenging television quiz show in the US that demands that human contestants quickly understand and answer richly expressed natural language questions over a staggering array of topics. The Jeopardy! Challenge uniquely provides a palpable, compelling and notable way to drive the technology of Question Answering along key dimensions.
If you are familiar with the quiz show, it asks an incredibly broad range of questions over a huge variety of topics. In a single round there is a grid of 6 categories, and for each category 5 rows with increasing $ values. Once a cell is chosen by 1 of the 3 players, a question, or what is often called a Clue, is revealed. Here you see some example questions. <read some of the questions>
Jeopardy! uses complex and often subtle language to describe what is being asked. To win you have to be extraordinarily precise. You must deliver the exact answer – no more and no less – it is not good enough for it to be somewhere in the top 2, 10 or 20 documents – you must know it exactly and get it in first place – otherwise no credit – in fact you lose points. You must demonstrate accurate confidences: that is, you must know what you know – if you “buzz in” and then get it wrong, you lose the $ value of the question. And you have to do all this very quickly – deeply analyze huge volumes of content, consider many possible answers, compute your confidence and buzz in – all in just seconds. As we shall see, competing with human champions at this game represents a Grand Challenge in Automatic Open-Domain Question Answering. <STOP> <NEXT SLIDE>
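The “know what you know” point in the Jeopardy! notes can be illustrated with a little arithmetic. The Python sketch below is a deliberately simplified, hypothetical model, not a description of how Watson actually decided when to buzz: it assumes a right answer wins the clue value, a wrong answer loses the same amount, and nothing else (opponents, game state, wagering) matters.

```python
def expected_gain(confidence, clue_value):
    """Expected dollar change from buzzing in.

    Simplifying assumption: win clue_value with probability `confidence`,
    lose clue_value otherwise.
    """
    return confidence * clue_value - (1.0 - confidence) * clue_value


def should_buzz(confidence, clue_value, min_gain=0.0):
    """Buzz only when the expected gain exceeds a chosen margin."""
    return expected_gain(confidence, clue_value) > min_gain


# Usage: a confident answer is worth the risk, a coin flip is not.
confident = should_buzz(0.9, 400)
coin_flip = should_buzz(0.5, 400)
```

Under these assumptions the break-even confidence is 50 percent, which is why an accurate confidence estimate matters as much as the answer itself: an overconfident system buzzes on clues where it expects to lose money.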
01/18/12 IOD2011 4/9/12 GS302_ManojSaxena_v7
Main point: At the core of what makes Watson different are three powerful technologies - natural language, hypothesis generation, and evidence-based learning. But Watson is more than the sum of its individual parts. Watson is about bringing these capabilities together in a way that’s never been done before, resulting in a fundamental change in the way businesses look at quickly solving problems.
Further speaking points: Looking at these one by one, understanding natural language and the way we speak breaks down the communication barrier that has stood between people and their machines for so long. Hypothesis generation bypasses the historic deterministic way that computers function and recognizes that there are various probabilities of various outcomes rather than a single definitive ‘right’ response. And adaptation and learning help Watson continuously improve in the same way that humans learn: it keeps track of which of its selections were chosen by users and which responses got positive feedback, thus improving future response generation.
Additional information: The result is a machine that functions alongside us as an assistant rather than something we wrestle with to get an adequate outcome.
Challenge: Reduce the occurrence of high-cost Congestive Heart Failure (CHF) readmissions by proactively identifying patients likely to be readmitted on an emergent basis.
Solution: Target and understand high-risk CHF patients for care management programs using natural language processing. Used predictive models that have demonstrated high positive predictive value against extracted structured and unstructured data.
Results: Proactively targeted care management, which will reduce readmission of CHF patients. Identified patients likely to be readmitted and introduced early interventions, which will reduce cost and mortality rates and improve patient quality of life.
Background: Seton Healthcare is a not-for-profit organization; the Seton Family is the leading provider of healthcare services in Central Texas, serving an 11-county population of 1.8 million. Seton Healthcare identified an opportunity to significantly reduce the occurrence of high-cost CHF readmissions by proactively identifying patients likely to be readmitted on an emergent basis.
Objectives: Seton will partner with IBM to implement content and predictive analytics to identify patients who should receive proactive medical case management and intervention. The expectation is that Seton can reduce the occurrence of costly readmissions, reduce mortality rates and improve the quality of life for these patients.
Project Description: CHF prevention and reduced readmission is a main focus of Seton’s Clinical Design Center. The key clinical, financial, and contextual data for CHF patients span many applications and are stored in both structured and unstructured content.
To achieve the Design Center objectives, the following capabilities are needed:
- Integrate these data into longitudinal patient records
- Identify important information in the unstructured data
- Develop predictive models that show:
  - Likelihood of readmission
  - Likelihood of ambulatory-sensitive ED visits and admissions
  - Forecasted next-year costs
- Display predictive model results along with aggregated patient record data in a visual, easily navigable system
IOD2011_BA KEYNOTE IBM IOD 2011 05/10/12 D1_BA Keynote_v4
Key Points
- Traditional technologies are very well suited to structured, repeatable tasks – when you do something many times it makes sense to structure it
  - They also have controls in place for the accuracy and quality of the data
  - Historical data – trend analysis
- New technologies are complementary – they address speed and flexibility
  - Very good at one-time or ad-hoc analysis
  - Also good at exploration – determining new questions to ask
- The point is that organizations need both sides – and data growth (or big data) is a challenge for both sides. A big data platform has to address both sides to truly address enterprise needs.
Obviously, there are many other forms and sources of data. Let’s start with the hottest topic associated with Big Data today: social networks. Twitter generates about 12 terabytes of tweet data every single day. Now, keep in mind, these numbers are hard to count on, so the point is that they’re big, right? So don’t fixate on the actual numbers, because they change all the time, and realize that even if these numbers are out of date in 2 years, we’re at a point where the volume is too staggering to handle exclusively using traditional approaches. +CLICK+ Facebook over a year ago was generating 25 terabytes of log data every day (Facebook log data reference: http://www.datacenterknowledge.com/archives/2009/04/17/a-look-inside-facebooks-data-center/) and probably about 7 to 8 terabytes of data that goes up on the Internet. +CLICK+ Google, who knows? Look at Google Plus, YouTube, Google Maps, and all that kind of stuff. So that’s the left-hand side of this chart – the social network layer. +CLICK+ Now let’s get back to instrumentation: there are massive amounts of proliferated technologies that allow us to be more interconnected than at any time in the history of the world – and it isn’t just P2P (people to people) interconnections, it’s M2M (machine to machine) as well. Again, with these numbers, who cares what the current number is; I try to keep them updated, but the point is that even if they are out of date, it’s almost unimaginable how large these numbers are. There are over 4.6 billion camera phones that leverage built-in GPS to tag the location of your photos, plus purpose-built GPS devices and smart meters. If you recall the bridge that collapsed in Minneapolis in the USA a number of years ago, it was rebuilt with smart sensors inside it that measure the contraction and flex of the concrete based on weather conditions, ice build-up, and so much more. So I didn’t realise how true it was when Sam P launched Smart Planet: I thought it was a marketing play.
But truly the world is more instrumented, interconnected, and intelligent than it’s ever been, and this capability allows us to address new problems and gain new insight never before thought possible, and that’s what the Big Data opportunity is all about!
We like to define the characteristics of Big Data at IBM as Variety, Velocity and Volume. +CLICK+ If you start at the bottom, volume is pretty simple. We all understand we’re going from terabytes to petabytes and into a zettabyte world; I think most of us understand today just how much data is out there now and what’s coming (at least you should after the first couple of slides in this presentation). The variety aspect is something kind of new to us in the data warehousing world, and it essentially means that our analytics can no longer be just for structured data; moreover, analytics on structured data doesn’t have to happen in a traditional database that requires consistency and integrity (since the data won’t be kept long; for example, a log file). The Big Data era is characterized by the need and desire to explore beyond structured data: we want to fold in unstructured data as well. If you look at a Facebook post or a tweet, it may come in a structured format (JSON), but the true value is in the unstructured part: the text that you tweet, or your Facebook status and your post. That’s really a kind of unstructured data inside a structured wrapper, so we refer to it as semi-structured data. So now we’re looking at all sorts of different kinds of data. Finally, there’s velocity. Other vendors who don’t have as big a Big Data scope as we have at IBM will call velocity the speed at which the volume grows, but I think it’s fair to say that that’s part of volume. We talk about velocity as being how fast the data arrives at the enterprise, and of course, that leads to the question: how long does it take you to do something about it? Velocity in this context is a MAJOR IBM differentiator. Now keep in mind that a Big Data problem could involve solely one of these characteristics, or all of them.
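As a small illustration of the semi-structured point above, the Python sketch below separates the structured envelope of a post from its free-text body. The record and its field names are made up for this example and are not the actual Twitter or Facebook schema; the sketch only assumes that the free text lives in a field called "text".

```python
import json

# Hypothetical post: structured fields wrapped around free text.
raw_post = json.dumps({
    "id": 1,
    "user": "example_user",
    "created_at": "2012-01-18T10:00:00Z",
    "text": "Dropped call again today, third time this week",
})


def split_record(raw_json, text_field="text"):
    """Return (structured_fields, free_text) for one JSON record."""
    record = json.loads(raw_json)
    free_text = record.pop(text_field, "")
    return record, free_text


# The structured part fits a table; the text needs text analytics.
structured, free_text = split_record(raw_post)
```

The structured fields can be loaded and queried with traditional tooling as they are, while the free text is where sentiment, intent, or complaints would have to be extracted.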
We all know there exists a SQL-controlled relational database warehouse, so why are we at this era of Big Data? I think the two images on this slide really sum it up with a decent analogy around gold mining. Think about the guy on the left, where you see this old-timer gold miner sifting for gold in a river, hoping to find big chunks of gold in his sifter. If someone finds big chunks of gold, word spreads and that sparks a big gold rush. The find would pave the way for lots of investment, and eventually a town would spring up around this valuable find. What’s a characteristic of this scenario? When you look at that gold, you can visually see it, and I would refer to that gold (data) as having a visible value (high value per byte data). You can see it. It’s obvious. It’s valuable, and therefore I can build a business case and invest in bringing this obvious high value per byte data into the warehouse – which indeed is a Big Data technology. Now, bringing data into a warehouse is inherently more expensive (for good reasons), because in a warehouse we are taught that this is pristine data, the single version of the truth; it’s got to be enriched, it’s got to be documented, glossarized, transformed; and we do that because we know it is high value per byte data. Now, although mining towns sprang up around a gold find, folks didn’t go and dig up the mountains around the stream. Why? Because there is so much dirt (low value per byte data), and you didn’t have enough information or the right capital equipment to process all that dirt on a hunch. Now think of gold mining today; it’s a very different process from what I outlined on the left. In today’s gold mining, you actually can’t see most of the gold that’s found. Gold has to be 30 parts per million (ppm) ore or greater for you to see it, so most gold mined today isn’t visible to the naked eye.
Instead, today there exists massive capital equipment that’s able to go through and process lots and lots of dirt (low value per byte data) and extract strains of gold (high value data) that are otherwise invisible to the naked eye. So today’s gold mine collects all these strains of gold and brings together value (insight). I was watching a gold mining documentary the other day, and they talked about how, after a recent discovery, they can chemically treat the dirt to find even finer grains of gold. So this particular company was going to go back to the dirt that they’d already processed, chemically treat it, and find more gold (value) than was found in the initial extraction. I think analytics is (or will be) just like that, and that’s yet another reason why Big Data complements the existing warehouse. Five years from now, we’ll be able to do more and more analytically with the data we have today, and we’re going to understand inflection points and trends better than we can today. That’s just one of the reasons why developing a corpus of information, and keeping it, not only makes today’s models more accurate but also presents unknown opportunities for the future. In the end we have to look at ways of synergizing the analysis of data, because producing data is much easier than making sense of it, and that rings more and more true each day in a Big Data era.
Slide #4: IBV-MIT DATA
You see the data on this chart… from a study conducted by our Institute of Business Value and MIT Sloan Management Review.
The number of enterprises using analytics to create a competitive advantage jumped almost 60 percent in just one year… Nearly 6 out of 10 organizations are now differentiating through analytics.
We found that the overall increase in advantage went almost exclusively to organizations that were already experienced users of analytics… so the early adopters are extending their leadership. Those organizations are more than twice as likely to substantially outperform their peers.
So we’re seeing an early bifurcation of the market – leaders and followers. This is reinforced by a separate MIT study that found analytics led to 5-6 percent productivity increases… which is big enough in most industries to separate the winners from the losers.
That’s all change that’s happening within enterprises….