4. What is big data?
More and more data is being collected and stored each day
4
5. Four main components
• Data
– Structured and unstructured
• Databases
– Proprietary and open source
• Query language
– Querying the database
• Analytics
– Analysing the data
5
6. How big is big?
• Large data sets
– Greater than 1,000 Terabytes? (1 Petabyte)
– 1,000,000 Terabytes? (1 Exabyte)
• Excel 2013 can have 1,048,576 rows by 16,384 columns
– About 10 Gigabyte of data
• Only going to get bigger
– 90% of all data produced in the past two years !
– Rate is increasing
• Recall
– Giga = 10⁹
– Tera = 10¹²
– Peta = 10¹⁵
– Exa = 10¹⁸
6
9. Where does the data come from?
• Science – particle, astrophysics
• Industry – oil, finance, telecom
– Actually all verticals
• Social – Facebook, LinkedIn, Twitter
• Medicine – genome, neuroscience
• Government – census, education, police
• Sports – statistics
• Environment – weather, sensors
9
10. Unstructured Data
• 80% of data is unstructured
• NoSQL
• Document based
– Documents
– Texts, tweets
– Emails
– Machine logs
– Blogs
– Web pages
– Photos
– Videos (YouTube)
• Graph based
– Social media sites
– Facebook has 1.1billions users (Microstrategy, July 27, 2013)
10
11. Why do we need to use big data?
Use in public and private sector to:
• Make faster and more accurate business decisions
• Make accurate predictions
• Gain competitive advantage
• Implement smarter marketing – CRM
• Discover new opportunities
• Enhance Business Intelligence
• Enable fraud detection
• Reduce crime
• Improve scientific research
• Quicken analysis (up to real time)
– Weeks, days minutes, seconds
11
12. Big Data Startup - Case Study
• Rocket Fuel
• No. 4 on Forbes' 2013 Most Promising Companies In
America list
• Digital advertising startup
• Screens over 26 billion ads per day
• “Advertising that learns” big data platform
• Distributed planet-scale computing engine
• Hadoop implementation
• Founders from Yahoo!, Salesforce.com, DoubleClick
• Targeting algorithms use lifestyle, purchase intent and
social data
12
19. Big Data Analytics - Incumbents
• Oracle – Exadata, Exalytics
• Microsoft – HDInsight, xVelocity
• IBM – Netezza, Cognos, BigInsights
• SAP – HANA, Business Objects
• EMC – Pivotal (Greenplum)
• HP – Vertica, HAVEn
• All run on Hadoop
19
20. Big Data Analytics – Pure Plays
• Pure plays – definition:
– Been around more than 20 years
– Purely data analytic companies
• Teradata - Aster
• SAS
• Microstrategy
20
21. Big Data Analytics – New Entrants
• Hortonworks
• Cloudera
• MapR
• Acunu
• Pentaho
• Tableau
• Talend
• Splunk
21
22. (Some of) IBM’s Big Data Acquisitions
• Algorithmics
– Oct 2011, $400million
• OpenPages
– Oct 2010, ?
• Netezza
– Sept 2010, $1.7billion
• SPSS
– Jan 2010, $1.2billion
• Cognos
– Jan 2008, $4.9billion
• About $10billion in four years
http://en.wikipedia.org/wiki/List_of_mergers_and_acquisitions_by_IBM
22
23. Big Data Science Tools
• Hadoop
• NoSQL
• MapReduce
• R
• Matlab
• Python
• Statistics
23
24. Big Data Hadoop Stack
• Hadoop is the de facto big data operating system
• Developed from Google and Yahoo! (2005)
• It is distributed, open source and managed by Apache
24
25. Analytic Technologies
• A/B testing
• Genetic algorithms
• Machine learning
• Natural language
processing
• Neural networks
• Pattern recognition
• Anomaly detection
• Decision tree
• Predictive modeling
• Regression testing
• Sentiment analysis
• Signal processing
• Simulations
• Time series analysis
• Visualization
• Multivariate analysis
• Text analytics
25
26. Glossary
• OLTP = On Line Transactional Processing
• OLAP = On Line Analytic Processing
• ODBC = Open DataBase Connectivity
• IMDB = In Memory DataBase
• CRUD = Create, Read, Update, Delete
• ETL = Extract, Transform and Load
• CDO = Chief Data Officer
• NLP = Natural Language Processing
• GQL = Graph Query Language
• AaaS = Analytics as a Service
• EDW = Enterprise Data Warehouse
26
27. References
• Microstrategy website, 27 July, 2013, Michael Saylor
Presentation at Microstrategy World 2013,
http://www.microstrategy.com/
• Teradata website www.teradata.com
• Wikipedia http://en.wikipedia.org/wiki/
• Google images www.google.co.uk
• IBM website www.ibm.com
• Youtube www.youtube.com
• Hadoop www.hortonworks.com
27