The document discusses the growth of data and the field of data science. It begins by noting the large amounts of data being generated daily by various sources like web/e-commerce transactions, social networks, and scientific projects. It then discusses some of the challenges of big data including volume, velocity, and variety. The document provides an overview of the multidisciplinary nature of data science and the skills required of data scientists. It also summarizes different approaches to and job roles in data science.
5. Lots of data is being collected and warehoused
Web data, e-commerce
Financial transactions, bank/credit transactions
Online trading and purchasing
Social Network
6. Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
1000 genomes project: 200 TB
Cost of 1 TB of disk: $35
Time to read a 1 TB disk: about 3 hrs (at 100 MB/s)
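The read-time figure can be checked with simple arithmetic. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sequential throughput; the function name is illustrative only.

```python
# Back-of-the-envelope check of the sequential read time quoted on the slide.
# Assumption: decimal units and a constant 100 MB/s throughput.

TB = 10**12  # bytes in one terabyte (decimal)
MB = 10**6   # bytes in one megabyte (decimal)


def read_time_hours(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to read size_bytes sequentially at a constant throughput."""
    return size_bytes / throughput_bytes_per_s / 3600


hours = read_time_hours(1 * TB, 100 * MB)
print(f"{hours:.2f} hours")  # 10,000 seconds, which the slide rounds to 3 hrs
```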
8. There's certainly a lot of it! Data, data everywhere…
[Figure: data volumes on a logarithmic scale, with marks at 1 Petabyte, 1 Exabyte, and 1 Zettabyte; 1 Petabyte = 1000 TB, 1 TB = 1000 GB]
Data produced each year
2002: 5 EB
2006: 161 EB
2009: 800 EB
2011: 1.8 ZB
2015: 8.0 ZB
Human brain's capacity: 14 PB
100 years of HD video + audio: 60 PB
Also marked on the scale: 120 PB
References
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(life in video) 60 PB: in 4320p resolution, extrapolated from 16 MB for 1:21 of 640x480 video (w/sound); almost certainly a gross overestimate, as sleep can be compressed significantly!
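The yearly figures on this slide imply a steep growth rate. The following is a rough sketch of that calculation, using only the values listed on the slide and assuming decimal units (1 ZB = 1000 EB); the variable and function names are illustrative.

```python
# Data produced per year, in exabytes, as listed on the slide.
# Assumption: 1 ZB = 1000 EB (decimal units).
EB_PER_ZB = 1000
data_per_year_eb = {
    2002: 5,
    2006: 161,
    2009: 800,
    2011: 1.8 * EB_PER_ZB,
    2015: 8.0 * EB_PER_ZB,
}


def annual_growth_rate(start_year: int, end_year: int, volumes: dict) -> float:
    """Compound annual growth rate between two years in the table."""
    years = end_year - start_year
    return (volumes[end_year] / volumes[start_year]) ** (1 / years) - 1


growth_factor = data_per_year_eb[2015] / data_per_year_eb[2002]
rate = annual_growth_rate(2002, 2015, data_per_year_eb)

print(f"Growth factor 2002 to 2015: {growth_factor:.0f}x")
print(f"Compound annual growth: {rate:.0%}")  # roughly three quarters per year
```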
11. Big Data is any data that is expensive to manage and hard to extract value from
Volume: the size of the data
Velocity: the latency of data processing relative to the growing demand for interactivity
Variety and Complexity: the diversity of sources, formats, quality, and structures
16. Make data easier to use ~ by using it!
It may be true that Data Science isn't a science, but that doesn't mean it's not useful!
17. • Naive definition:
• Big data only depends on the data size
• 1 Gigabyte? 1 Terabyte? 1 Petabyte?
• Naive interpretation misses important aspects
• Time:
• Analyzing 1 Gigabyte of data per day is different from analyzing 1 Gigabyte of data per second
• Diversity:
• Analyzing spreadsheets with numeric data is different from analyzing Web pages that contain a mixture of text and images
• Distribution:
• Analyzing data from a single source is different from analyzing data from multiple sources
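The time aspect can be made concrete by comparing the sustained throughput each case demands. This is a small sketch, assuming decimal gigabytes and an evenly spread load; the names are illustrative.

```python
# Sustained throughput needed for 1 GB per day versus 1 GB per second.
# Assumption: 1 GB = 10**9 bytes and the load is spread evenly over time.
GB = 10**9
SECONDS_PER_DAY = 24 * 60 * 60


def required_throughput(bytes_per_period: float, period_seconds: float) -> float:
    """Bytes per second that must be processed to keep up with the input."""
    return bytes_per_period / period_seconds


daily = required_throughput(1 * GB, SECONDS_PER_DAY)  # about 11.6 KB/s
per_second = required_throughput(1 * GB, 1)           # 1 GB/s
ratio = per_second / daily                            # seconds in a day

print(f"1 GB/day needs {daily / 1000:.1f} KB/s")
print(f"1 GB/s is {ratio:,.0f} times more demanding")
```

The same data size therefore spans almost five orders of magnitude in processing demand, which is why size alone is a poor definition.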
18. • Following Gartner's IT Glossary:
• Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
• The three Vs
• Volume
• Velocity
• Variety
Some people actually use 10 Vs to define big data!
• Variability
• Veracity
• Validity
• Vulnerability
• Volatility
• Visualization
• Value
20. Velocity
• Speed at which new data is created
• Speed at which data must be processed and analyzed
• Often close to real-time
21. • Diversity in data types and data sources
• Structured
• Data with defined types and structure
• Example: comma separated values
• Semi-Structured
• Textual data with parseable pattern
• Example: XML files with schema
• Quasi-Structured
• Textual data with erratic formats that can be formatted with effort
• Example: Clickstream data
• Unstructured
• Data that has no inherent structure, often with multiple formats
• Example: Web site, videos
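The difference between the first three categories shows up in how much work parsing takes. The sketch below uses only the Python standard library; the sample records, field names, and the log-line layout are made up for illustration and do not come from any real system.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: comma separated values with a fixed set of columns.
csv_text = "user_id,amount\n1,9.99\n2,24.50\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
amounts = [float(row["amount"]) for row in rows]

# Semi-structured: XML, where the tags supply a parseable pattern.
xml_text = "<orders><order id='1'><amount>9.99</amount></order></orders>"
root = ET.fromstring(xml_text)
order_ids = [order.get("id") for order in root.findall("order")]

# Quasi-structured: a clickstream-like log line (hypothetical layout).
# The fields have to be recovered with a hand-written pattern.
log_line = '203.0.113.7 [2015-03-01 12:00:05] "GET /products/42?ref=home"'
pattern = re.compile(
    r'^(?P<ip>\S+) \[(?P<ts>[^\]]+)\] "(?P<method>\w+) (?P<path>[^?" ]+)'
)
match = pattern.match(log_line)
click = match.groupdict() if match else {}

print(amounts)
print(order_ids)
print(click)
```

Unstructured data such as free text or video has no comparable shortcut; it usually needs dedicated processing before any fields exist to query.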
23. Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can afford to scan the data only once
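The single-scan constraint on streaming data changes how even basic statistics are computed. As one illustration, the sketch below uses Welford's online algorithm (a standard technique, not something the slides prescribe) to get the mean and sample variance in one pass without storing the stream.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) after a single pass over stream."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


# A generator can be consumed only once, which mimics a data stream.
readings = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = streaming_mean_variance(readings)
print(count, mean, round(variance, 4))
```

Memory use stays constant regardless of how many values arrive, which is the property a one-scan setting needs.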
24. Aggregation and Statistics
Data warehousing and OLAP
Indexing, Searching, and Querying
Keyword based search
Pattern matching (XML/RDF)
Knowledge discovery
Data Mining
Statistical Modeling
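Of the tasks listed above, keyword based search is easy to sketch. The example below builds a tiny inverted index, a common structure for this purpose; the documents and function names are hypothetical, and real search systems add tokenization, ranking, and much more.

```python
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    1: "big data volume",
    2: "data science skills",
    3: "streaming data velocity",
}
index = build_index(docs)
hits = search(index, "data science")
print(hits)
```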
25. • “… the sexy job in the next 10 years will be statisticians,” Hal Varian, Google Chief Economist
• The U.S. will need 140,000-190,000 predictive analysts and 1.5 million managers/analysts by 2018 (McKinsey Global Institute, June 2011)
• New Data Science institutes being created or repurposed: NYU, Columbia, Washington, UCB, ...
• New degree programs, courses, boot-camps:
• e.g., at Berkeley: Stats, I-School, CS, Astronomy…
• One proposal (elsewhere) for an MS in “Big Data Science”
26. • An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
• Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges in big data
• Data science principles apply to all data – big and small
27. • Theories and techniques from many fields and disciplines are used to investigate and analyze a large amount of data to help decision makers in many industries such as science, engineering, economics, politics, finance, and education
• Computer Science
• Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
• Mathematics
• Mathematical Modeling
• Statistics
• Statistical and Stochastic modeling, Probability.
31. • Companies learn your secrets, shopping patterns, and preferences
• For example, can we know if a woman is pregnant, even if she doesn’t want us to know? (Target case study)
• Data Science and elections (2008, 2012)
• 1 million people installed the Obama Facebook app that gave access to info on “friends”
32. • Data Scientist
• The Sexiest Job of the 21st Century
• They find stories, extract knowledge. They are not reporters
33. • Data scientists are the key to realizing the opportunities presented by big data. They bring structure to it, find compelling patterns in it, and advise executives on the implications for products, processes, and decisions
34. • National Security
• Cyber Security
• Business Analytics
• Engineering
• Healthcare
• And more ….
35. • Mathematics and Applied Mathematics
• Applied Statistics/Data Analysis
• Solid Programming Skills (R, Python, Julia, SQL)
• Data Mining
• Database Storage and Management
• Machine Learning and discovery
37. • Unfortunately, there is no clear definition (yet?)
• Goal is the extraction of knowledge from data
• Combination of techniques from different disciplines
• Scientific principles guide the data analysis
43. • 1. Understand the statistics and machine learning concepts that are vital for data science
• 2. Learn to statistically analyze a dataset
• 3. Critically evaluate data visualizations based on their design and use for communicating stories from data
44. • The Student will be able to
• 1. Understand the key concepts in data science, its applications and the toolkit used by data scientists;
• 2. Realize how data is collected, managed and stored for data science;
• 3. Apply various machine learning techniques in real-world applications
• 4. Implement data collection and management
• 5. Apply visualization tools for data visualization
• 6. Possess the required knowledge and expertise to become a proficient data scientist
45. • Module 1: Introduction to Data Analytics (7 hrs)
Introduction, Terminology, data science process, data science toolkit, Types of data, Introduction to Python, Data
Analysis in Excel, Analytics Problem Solving, Exploratory Data Analysis, Example applications.
• Module 2: Data collection management and Statistics (7 hrs)
Introduction, Sources of data, Data collection and APIs, Exploring and fixing data, Data storage and management,
Advanced SQL using multiple data sources, Statistics and Hypothesis Testing, Inferential Statistics, Big Data
Storage and Processing Framework, Hadoop
• Module 3: Data analysis (8 hrs)
Introduction, Terminology and concepts, Introduction to statistics, Central tendencies and distributions, Variance,
Distribution properties and arithmetic, Samples/CLT, Basic machine learning algorithms, Linear regression, SVM,
Naive Bayes
46. • Module 4: Data visualization (8 hrs)
Introduction, Types of data visualization, Data for visualization: Data types, Data encodings, Retinal
variables, Mapping variables to encodings, Visual encodings, Data Visualization in Python-Superset or
in Microsoft Power BI
• Module 5: Computing and Applications (8 hrs)
Using Python for Data Science - Using Open Source R for Data Science - Using SQL in Data Science
- Software Applications for Data Science. Applications of Data Science, Technologies for visualization
like Data Visualization in Microsoft Power BI.
• Module 6: Trends and Technologies (7 hrs)
Recent trends in various data collection and analysis techniques, various visualization techniques,
application development methods used in data science, NYC Parking Case Study: Apache Spark
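Several topics named in Module 3 (central tendencies, variance, linear regression) fit in a few lines of code. The following is a minimal sketch using only the Python standard library and made-up data; it is one possible illustration of these topics, not material taken from the course.

```python
from statistics import mean, variance


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

y_mean = mean(ys)          # central tendency
y_variance = variance(ys)  # sample variance (divides by n - 1)
slope, intercept = fit_line(xs, ys)

print(y_mean, y_variance, slope, intercept)
```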
47. Text Books:
• 1. Cathy O’Neil and Rachel Schutt, Doing Data Science: Straight Talk from the Frontline, O’Reilly, 2014. ISBN: 978-1-449-35865-5
• 2. Davy Cielen, Arno D. B. Meysman, Mohamed Ali, Introducing Data Science, Dreamtech Press, 2016. ISBN: 978-93-5119-937-3
Reference Books:
• 1. Joel Grus, Data Science from Scratch, O’Reilly, 2015. ISBN: 978-1-491-90142-7
• 2. Jure Leskovec, Anand Rajaraman and Jeffrey Ullman, Mining of Massive Datasets, v2.1, Cambridge University Press, 2014. ISBN: 9781139924801
• 3. John W. Foreman, Data Smart: Using Data Science to Transform Information into Insight, Wiley, 2014. ISBN: 978-81-265-4614-5
• 4. https://github.com/maximrohit/SPARK-R-SQL-NYC-PArking-Ticket-Analysis
54. • Big data has a high volume, velocity, and variety
• Different data structures
• Structured, semi-structured, quasi-structured, unstructured
• Data science is a very diverse discipline
• Maths, computer science, statistics, applications
→ Data scientists require a diverse skillset
55. • Not computer scientists
• But should know about databases, data structures, algorithms, etc.
• Not mathematicians
• But should know about optimization, stochastics, etc.
• Not statisticians
• But should know about regression, statistical tests, etc.
• Not domain experts
• But must work together with them
56. Data Scientists
Quantitative
• Maths
• Algorithms
• Statistics
Technical
• Programming
• Infrastructures
Skeptical
• Create hypotheses, but be skeptical about them
Collaborative
• Teamwork
• Communication skills
A bit of everything … but actually as much as possible of everything
58. • According to Microsoft Research:
• Polymath
• Do it all
• Data Evangelist
• Data analysis, disseminating and acting on insights
• Data Preparer
• Querying existing data, preparing data for analysis
• Data Shapers
• Analyzing and preparing data
• Data Analyzer
• Analyzing data
• Platform Builder
• Collect data and create infrastructures
• Moonlighters (50%/20%)
• Spare time data scientists
• Insight Actors
• Use the outcome and act on insights.
Miryung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)