SlideShare una empresa de Scribd logo
1 de 58
Descargar para leer sin conexión
19EC2029
Dr. D. Sugumar
Associate Professor/ECE
Karunya
Lots of data is being collected and warehoused
Web data, e-commerce
Financial transactions, bank/credit transactions
Online trading and purchasing
Social Network
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
1000 genomes project: 200 TB
Cost of 1 TB of disk: $35
Time to read 1 TB disk: 3 hrs
(100 MB/s)
There's certainly a lot of it!
2015
1 Zettabyte
1 Exabyte
1 Petabyte
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
1 Petabyte == 1000 TB 2002 2009
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
2006 2011
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm (life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video
(w/sound) – almost certainly a gross overestimate, as sleep can be compressed significantly!
5 EB
161 EB
800 EB
1.8 ZB 8.0 ZB
14 PB
60 PB
Data produced each year
100-years of HD video + audio
Human brain's capacity
References
1 TB = 1000 GB
120 PB
logarithmic
scale
Data, data everywhere…
data
information
knowledge
wisdom
I'd call it data,
not information
2000’s
Content Management
1990’s
Relational Databases
& Data Warehouses
2010’s
Key-Value Storages
& Unstructured Data
VOLUME
OF
INFORMATION
LARGE
SMALL
TERABYTES PETABYTES EXABYTES
Big Data is any data that is expensive to manage and hard to extract value from
Volume
The size of the data
Velocity
The latency of data processing relative to the growing demand for
interactivity
Variety and Complexity
the diversity of sources, formats, quality, structures.
Hmmm… where am I
on this diagram?
the companies are expanding as fast as the data!
Is this really
about size?
Big Data?
I agree with this…
Make data easier to use ~ by using it!
It may be true that
Data Science isn't a
science – but that
doesn't mean it's
not useful!
• Naive definition:
• Big data only depends on the data size
• 1 Gigabyte? 1 Terabyte? 1 Petabyte?
• Naive interpretation misses important aspects
• Time:
• Analyzing 1 Gigabyte of data per day is different from analyzing 1 Gigabyte of data per
second
• Diversity:
• Analyzing spread sheets with numeric data is different from analyzing Web pages that
contain a mixture of text and images
• Distribution:
• Analyzing data from a single source is different from analyzing data from multiple
sources
• Following Gartner‘s IT Glossary:
• Big data is high-volume, high-velocity and/or high-variety information assets that demand
cost-effective, innovative forms of information processing that enable enhanced insight,
decision making, and process automation.
• The three Vs
• Volume
• Velocity
• Variety
Some people actually use 10 Vs to define
big data!
• Variability
• Veracity
• Validity
• Vulnerability
• Volatility
• Visualization
• Value
• Scale of the data must be „big“
• No clear definition
• „that demand […] innovative forms of information processing“ (Gartner)
© Statista 2018
Data center storage worldwide
• Speed at which new data is created
• Speed at which data must be processed and analyzed
• Often close to real-time
• Diversity in data types and data sources
Structured
Semi-
Structured
Quasi-Structured
Unstructured
• Data with defined types and structure
• Example: comma separated values
• Textual data with parseable pattern
• Example: XML files with schema
• Textual data with erratic formats that can be
formated with effort
• Example: Clickstream data
• Data that has no inherent structure, often with
multiple formats
• Example: Web site, videos
Structured Quasi-Structured
Semi-Structured Unstructured
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can afford to scan the data once
Aggregation and Statistics
Data warehousing and OLAP
Indexing, Searching, and Querying
Keyword based search
Pattern matching (XML/RDF)
Knowledge discovery
Data Mining
Statistical Modeling
• “… the sexy job in the next 10 years will be statisticians,” Hal Varian, Google Chief
Economist
• The U.S. will need 140,000-190,000 predictive analysts and 1.5 million
managers/analysts by 2018. McKinsey Global Institute’s June 2011
• New Data Science institutes being created or repurposed – NYU, Columbia,
Washington, UCB,...
• New degree programs, courses, boot-camps:
• e.g., at Berkeley: Stats, I-School, CS, Astronomy…
• One proposal (elsewhere) for an MS in “Big Data Science”
• An area that manages, manipulates, extracts, and interprets knowledge from tremendous
amount of data
• Data science (DS) is a multidisciplinary field of study with goal to address the challenges in big
data
• Data science principles apply to all data – big and small
• Theories and techniques from many fields and disciplines are used to investigate
and analyze a large amount of data to help decision makers in many industries
such as science, engineering, economics, politics, finance, and education
• Computer Science
• Pattern recognition, visualization, data warehousing, High performance
computing, Databases, AI
• Mathematics
• Mathematical Modeling
• Statistics
• Statistical and Stochastic modeling, Probability.
• Gartner’s 2014 Hype Cycle
Is "Data Science"
important or just trendy?
• Companies learn your secrets, shopping patterns, and preferences
• For example, can we know if a woman is pregnant, even if she doesn’t want us
to know? Target case study
• Data Science and election (2008, 2012)
• 1 million people installed the Obama Facebook app that gave access to info on
“friends”
• Data Scientist
• The Sexiest Job of the 21st Century
• They find stories, extract knowledge. They are not reporters
• Data scientists are the key to realizing the opportunities presented by big data.
They bring structure to it, find compelling patterns in it, and advise executives
on the implications for products, processes, and decisions
• National Security
• Cyber Security
• Business Analytics
• Engineering
• Healthcare
• And more ….
• Mathematics and Applied Mathematics
• Applied Statistics/Data Analysis
• Solid Programming Skills (R, Python, Julia, SQL)
• Data Mining
• Data Base Storage and Management
• Machine Learning and discovery
• Unfortunately, there is no clear definition (yet?)
• Goal is the extraction of knowledge from data
• Combination of techniques from different disciplines
• Scientific principles guide the data analysis
Tools? Big Data?
Machine Learning?
Computational
Geometry
Optimization Stochastics
Scientific
Computing
Machine
Learning
Data Structures and
Algorithms
Databases Distributed Computing
Software Engineering Artificial Intelligence Machine Learning
Linear Models Statistical Tests Inference
Time Series Analysis Machine Learning
Intelligent Systems Robotics Marketing
Medicine Autonomous Driving Social Networks
• 1. Understand the statistics and machine learning concepts that are vital for data science
• 2. Learn to statistically analyze a dataset
• 3. Critically evaluate data visualization based on their design and use for communicating stories
from data
• The Student will be able to
• 1. Understand the key concepts in data science, its applications and the toolkit used by data
scientists;
• 2. Realize how data is collected, managed and stored for data science;
• 3. Apply various machine learning techniques in real-world applications
• 4. Implement data collection and management
• 5. Apply visualization tools for data visualization
• 6. Possess the required knowledge and expertise to become a proficient data scientist
• Module 1: Introduction to Data Analytics (7 hrs)
Introduction, Terminology, data science process, data science toolkit, Types of data, Introduction to Python, Data
Analysis in Excel, Analytics Problem Solving, Exploratory Data Analysis, Example applications.
• Module 2: Data collection management and Statistics (7 hrs)
Introduction, Sources of data, Data collection and APIs, Exploring and fixing data, Data storage and management,
Advanced SQL using multiple data sources, Statistics and Hypothesis Testing, Inferential Statistics, Big Data
Storage and Processing Framework, Hadoob
• Module 3: Data analysis (8 hrs)
Introduction, Terminology and concepts, Introduction to statistics, Central tendencies and distributions, Variance,
Distribution properties and arithmetic, Samples/CLT, Basic machine learning algorithms, Linear regression, SVM,
Naive Bayes
• Module 4: Data visualization (8 hrs)
Introduction, Types of data visualization, Data for visualization: Data types, Data encodings, Retinal
variables, Mapping variables to encodings, Visual encodings, Data Visualization in Python-Superset or
in Microsoft Power BI
• Module 5: Computing and Applications (8 hrs)
Using Python for Data Science - Using Open Source R for Data Science - Using SQL in Data Science
- Software Applications for Data Science. Applications of Data Science, Technologies for visualization
like Data Visualization in Microsoft Power BI.
• Module 6: Trends and Technologies (7 hrs)
Recent trends in various data collection and analysis techniques, various visualization techniques,
application development methods used in data science, NYC Parking Case Study: Apache Spark
Text Books:
• 1. Cathy O’Neil and Rachel Schutt, Doing Data Science, Straight Talk from The Frontline. O’Reilly, 2014. ISBN: 978-1-
449-35865-5
• 2. Davy Cielen. Arno D.B Meysman, Mohamed Ali, “Introducing Data Science”, Dreamtech Press, 2016. ISBN: 978-93-
5119-937-3
Reference Books:
• 1. Joel Grus, Data Science from Scratch, O’Reilly, 2015, ISBN: 978-1-491-90142-7
• 2. Jure Leskovek, Anand Rajaraman and Jeffrey Ullman, Mining of Massive Datasets. v2.1, Cambridge University
Press, 2014. ISBN : 9781139924801
• 3. John W. Foreman, Using Data Science to Transform Information into Insight – Data Smart, Wiley, 2014. ISBN:
978-81-265-4614-5
• 4. https://github.com/maximrohit/SPARK-R-SQL-NYC-PArking-Ticket-Analysis
• Big data has a high volume, velocity, and variety
• Different data structures
• Structured, semi-structured, quasi-structured, unstructured
• Data science is a very diverse discipline
• Maths, computer science, statistics, applications
→ Data scientists require a diverse skillset
• Not computer scientists
• But should know about databases, data structures, algorithms, etc.
• Not mathematicians
• But should know about optimization, stochastics, etc.
• Not statisticians
• But should know about regression, statistical tests, etc.
• Not domain experts
• But must work together with them
Data
Scientists
Quantitative
• Maths
• Algorithms
• Statistics
Technical
• Programming
• Infrastructures
Skeptical
• Create hypotheses,
but be skeptical
about them
Collaborative
• Teamwork
• Communication
skills
A bit of everything
… but actually as much as
possible of everything
• According to Microsoft Research:
• Polymath
• Do it all
• Data Evangelist
• Data analysis, disseminating and acting on
insights
• Data Preparer
• Querying existing data, preparing data for
analysis
• Data Shapers
• Analyzing and preparing data
• Data Analyzer
• Analyzing data
• Platform Builder
• Collect data and create infrastructures
• Moonlighters (50%/20%)
• Spare time data scientists
• Insight Actors
• Use the outcome and act on insights.
Miyung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software Teams: State of the Art and Challenges, IEEE Transactions on Software
Engineering (Online First)

Más contenido relacionado

Similar a 00-01 DSnDA.pdf

Introduction Data Science.pptx
Introduction Data Science.pptxIntroduction Data Science.pptx
Introduction Data Science.pptxAkhirulAminulloh2
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huwekineheshete
 
01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...teodroscampaus
 
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION Elvis Muyanja
 
Pemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptxPemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptxelisarosa29
 
Computational intelligence for big data analytics bda 2013
Computational intelligence for big data analytics   bda 2013Computational intelligence for big data analytics   bda 2013
Computational intelligence for big data analytics bda 2013oj08
 
Real-time applications of Data Science.pptx
Real-time applications  of Data Science.pptxReal-time applications  of Data Science.pptx
Real-time applications of Data Science.pptxshalini s
 
Data_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxData_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxwahiba ben abdessalem
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introductionamiyadash
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Data_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxData_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxssuser1a4f0f
 
Unit 1 (DSBDA) PD.pptx
Unit 1 (DSBDA)  PD.pptxUnit 1 (DSBDA)  PD.pptx
Unit 1 (DSBDA) PD.pptxSamiksha880257
 
Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesKimberley Mitchell
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big DataIndu Khemchandani
 
Building Data Scientists
Building Data ScientistsBuilding Data Scientists
Building Data ScientistsMitch Sanders
 
NCME Big Data in Education
NCME Big Data  in EducationNCME Big Data  in Education
NCME Big Data in EducationPhilip Piety
 

Similar a 00-01 DSnDA.pdf (20)

Introduction Data Science.pptx
Introduction Data Science.pptxIntroduction Data Science.pptx
Introduction Data Science.pptx
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 
Big Data for Library Services (2017)
Big Data for Library Services (2017)Big Data for Library Services (2017)
Big Data for Library Services (2017)
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
 
01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...
 
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
DATA SCIENCE IS CATALYZING BUSINESS AND INNOVATION
 
dwdm unit 1.ppt
dwdm unit 1.pptdwdm unit 1.ppt
dwdm unit 1.ppt
 
Pemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptxPemanfaatan Big Data Dalam Riset 2023.pptx
Pemanfaatan Big Data Dalam Riset 2023.pptx
 
Computational intelligence for big data analytics bda 2013
Computational intelligence for big data analytics   bda 2013Computational intelligence for big data analytics   bda 2013
Computational intelligence for big data analytics bda 2013
 
Real-time applications of Data Science.pptx
Real-time applications  of Data Science.pptxReal-time applications  of Data Science.pptx
Real-time applications of Data Science.pptx
 
BigData.pptx
BigData.pptxBigData.pptx
BigData.pptx
 
Data_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxData_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptx
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Data_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptxData_Science_Applications_&_Use_Cases.pptx
Data_Science_Applications_&_Use_Cases.pptx
 
Unit 1 (DSBDA) PD.pptx
Unit 1 (DSBDA)  PD.pptxUnit 1 (DSBDA)  PD.pptx
Unit 1 (DSBDA) PD.pptx
 
Predictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use CasesPredictive Analytics: Context and Use Cases
Predictive Analytics: Context and Use Cases
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 
Building Data Scientists
Building Data ScientistsBuilding Data Scientists
Building Data Scientists
 
NCME Big Data in Education
NCME Big Data  in EducationNCME Big Data  in Education
NCME Big Data in Education
 

Más de SugumarSarDurai (19)

Parking NYC.pdf
Parking NYC.pdfParking NYC.pdf
Parking NYC.pdf
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Power BI.pdf
Power BI.pdfPower BI.pdf
Power BI.pdf
 
Unit 6.pdf
Unit 6.pdfUnit 6.pdf
Unit 6.pdf
 
Unit 5.pdf
Unit 5.pdfUnit 5.pdf
Unit 5.pdf
 
07 Data-Exploration.pdf
07 Data-Exploration.pdf07 Data-Exploration.pdf
07 Data-Exploration.pdf
 
06 Excel.pdf
06 Excel.pdf06 Excel.pdf
06 Excel.pdf
 
05 python.pdf
05 python.pdf05 python.pdf
05 python.pdf
 
03-Data-Analysis-Final.pdf
03-Data-Analysis-Final.pdf03-Data-Analysis-Final.pdf
03-Data-Analysis-Final.pdf
 
UNit4.pdf
UNit4.pdfUNit4.pdf
UNit4.pdf
 
UNit4d.pdf
UNit4d.pdfUNit4d.pdf
UNit4d.pdf
 
Unit 4 Time Study.pdf
Unit 4 Time Study.pdfUnit 4 Time Study.pdf
Unit 4 Time Study.pdf
 
Unit 3 Micro and Memo motion study.pdf
Unit 3 Micro and Memo motion study.pdfUnit 3 Micro and Memo motion study.pdf
Unit 3 Micro and Memo motion study.pdf
 
02 Work study -Part_1.pdf
02 Work study -Part_1.pdf02 Work study -Part_1.pdf
02 Work study -Part_1.pdf
 
02 Method Study part_2.pdf
02 Method Study part_2.pdf02 Method Study part_2.pdf
02 Method Study part_2.pdf
 
01 Production_part_2.pdf
01 Production_part_2.pdf01 Production_part_2.pdf
01 Production_part_2.pdf
 
01 Production_part_1.pdf
01 Production_part_1.pdf01 Production_part_1.pdf
01 Production_part_1.pdf
 
01 Industrial Management_Part_1a .pdf
01 Industrial Management_Part_1a .pdf01 Industrial Management_Part_1a .pdf
01 Industrial Management_Part_1a .pdf
 
01 Industrial Management_Part_1 .pdf
01 Industrial Management_Part_1 .pdf01 Industrial Management_Part_1 .pdf
01 Industrial Management_Part_1 .pdf
 

Último

Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 

Último (20)

Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 

00-01 DSnDA.pdf

  • 1. 19EC2029 Dr. D. Sugumar Associate Professor/ECE Karunya
  • 2.
  • 3.
  • 4.
  • 5. Lots of data is being collected and warehoused Web data, e-commerce Financial transactions, bank/credit transactions Online trading and purchasing Social Network
  • 6. Google processes 20 PB a day (2008) Facebook has 60 TB of daily logs eBay has 6.5 PB of user data + 50 TB/day (5/2009) 1000 genomes project: 200 TB Cost of 1 TB of disk: $35 Time to read 1 TB disk: 3 hrs (100 MB/s)
  • 7.
  • 8. There's certainly a lot of it! 2015 1 Zettabyte 1 Exabyte 1 Petabyte (brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store (2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm 1 Petabyte == 1000 TB 2002 2009 (2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf (2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf 2006 2011 (2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf (2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm (life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound) – almost certainly a gross overestimate, as sleep can be compressed significantly! 5 EB 161 EB 800 EB 1.8 ZB 8.0 ZB 14 PB 60 PB Data produced each year 100-years of HD video + audio Human brain's capacity References 1 TB = 1000 GB 120 PB logarithmic scale Data, data everywhere…
  • 10. 2000’s Content Management 1990’s Relational Databases & Data Warehouses 2010’s Key-Value Storages & Unstructured Data VOLUME OF INFORMATION LARGE SMALL TERABYTES PETABYTES EXABYTES
  • 11. Big Data is any data that is expensive to manage and hard to extract value from Volume The size of the data Velocity The latency of data processing relative to the growing demand for interactivity Variety and Complexity the diversity of sources, formats, quality, structures.
  • 12. Hmmm… where am I on this diagram?
  • 13. the companies are expanding as fast as the data!
  • 15. Big Data? I agree with this…
  • 16. Make data easier to use ~ by using it! It may be true that Data Science isn't a science – but that doesn't mean it's not useful!
  • 17. • Naive definition: • Big data only depends on the data size • 1 Gigabyte? 1 Terabyte? 1 Petabyte? • Naive interpretation misses important aspects • Time: • Analyzing 1 Gigabyte of data per day is different from analyzing 1 Gigabyte of data per second • Diversity: • Analyzing spread sheets with numeric data is different from analyzing Web pages that contain a mixture of text and images • Distribution: • Analyzing data from a single source is different from analyzing data from multiple sources
  • 18. • Following Gartner‘s IT Glossary: • Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. • The three Vs • Volume • Velocity • Variety Some people actually use 10 Vs to define big data! • Variability • Veracity • Validity • Vulnerability • Volatility • Visualization • Value
  • 19. • Scale of the data must be „big“ • No clear definition • „that demand […] innovative forms of information processing“ (Gartner) © Statista 2018 Data center storage worldwide
  • 20. • Speed at which new data is created • Speed at which data must be processed and analyzed • Often close to real-time
  • 21. • Diversity in data types and data sources Structured Semi- Structured Quasi-Structured Unstructured • Data with defined types and structure • Example: comma separated values • Textual data with parseable pattern • Example: XML files with schema • Textual data with erratic formats that can be formated with effort • Example: Clickstream data • Data that has no inherent structure, often with multiple formats • Example: Web site, videos
  • 23. Relational Data (Tables/Transaction/Legacy Data) Text Data (Web) Semi-structured Data (XML) Graph Data Social Network, Semantic Web (RDF), … Streaming Data You can afford to scan the data once
  • 24. Aggregation and Statistics Data warehousing and OLAP Indexing, Searching, and Querying Keyword based search Pattern matching (XML/RDF) Knowledge discovery Data Mining Statistical Modeling
  • 25. • “… the sexy job in the next 10 years will be statisticians,” Hal Varian, Google Chief Economist • The U.S. will need 140,000-190,000 predictive analysts and 1.5 million managers/analysts by 2018. McKinsey Global Institute’s June 2011 • New Data Science institutes being created or repurposed – NYU, Columbia, Washington, UCB,... • New degree programs, courses, boot-camps: • e.g., at Berkeley: Stats, I-School, CS, Astronomy… • One proposal (elsewhere) for an MS in “Big Data Science”
  • 26. • An area that manages, manipulates, extracts, and interprets knowledge from tremendous amount of data • Data science (DS) is a multidisciplinary field of study with goal to address the challenges in big data • Data science principles apply to all data – big and small
  • 27. • Theories and techniques from many fields and disciplines are used to investigate and analyze a large amount of data to help decision makers in many industries such as science, engineering, economics, politics, finance, and education • Computer Science • Pattern recognition, visualization, data warehousing, High performance computing, Databases, AI • Mathematics • Mathematical Modeling • Statistics • Statistical and Stochastic modeling, Probability.
  • 28. • Gartner’s 2014 Hype Cycle
  • 29. Is "Data Science" important or just trendy?
  • 30.
  • 31. • Companies learn your secrets, shopping patterns, and preferences • For example, can we know if a woman is pregnant, even if she doesn’t want us to know? Target case study • Data Science and election (2008, 2012) • 1 million people installed the Obama Facebook app that gave access to info on “friends”
  • 32. • Data Scientist • The Sexiest Job of the 21st Century • They find stories, extract knowledge. They are not reporters
  • 33. • Data scientists are the key to realizing the opportunities presented by big data. They bring structure to it, find compelling patterns in it, and advise executives on the implications for products, processes, and decisions
  • 34. • National Security • Cyber Security • Business Analytics • Engineering • Healthcare • And more ….
  • 35. • Mathematics and Applied Mathematics • Applied Statistics/Data Analysis • Solid Programming Skills (R, Python, Julia, SQL) • Data Mining • Data Base Storage and Management • Machine Learning and discovery
  • 36.
  • 37. • Unfortunately, there is no clear definition (yet?) • Goal is the extraction of knowledge from data • Combination of techniques from different disciplines • Scientific principles guide the data analysis
  • 40. Data Structures and Algorithms Databases Distributed Computing Software Engineering Artificial Intelligence Machine Learning
  • 41. Linear Models Statistical Tests Inference Time Series Analysis Machine Learning
  • 42. Intelligent Systems Robotics Marketing Medicine Autonomous Driving Social Networks
  • 43. • 1. Understand the statistics and machine learning concepts that are vital for data science • 2. Learn to statistically analyze a dataset • 3. Critically evaluate data visualization based on their design and use for communicating stories from data
  • 44. • The Student will be able to • 1. Understand the key concepts in data science, its applications and the toolkit used by data scientists; • 2. Realize how data is collected, managed and stored for data science; • 3. Apply various machine learning techniques in real-world applications • 4. Implement data collection and management • 5. Apply visualization tools for data visualization • 6. Possess the required knowledge and expertise to become a proficient data scientist
  • 45. • Module 1: Introduction to Data Analytics (7 hrs) Introduction, Terminology, data science process, data science toolkit, Types of data, Introduction to Python, Data Analysis in Excel, Analytics Problem Solving, Exploratory Data Analysis, Example applications. • Module 2: Data collection management and Statistics (7 hrs) Introduction, Sources of data, Data collection and APIs, Exploring and fixing data, Data storage and management, Advanced SQL using multiple data sources, Statistics and Hypothesis Testing, Inferential Statistics, Big Data Storage and Processing Framework, Hadoob • Module 3: Data analysis (8 hrs) Introduction, Terminology and concepts, Introduction to statistics, Central tendencies and distributions, Variance, Distribution properties and arithmetic, Samples/CLT, Basic machine learning algorithms, Linear regression, SVM, Naive Bayes
  • 46. • Module 4: Data visualization (8 hrs) Introduction, Types of data visualization, Data for visualization: Data types, Data encodings, Retinal variables, Mapping variables to encodings, Visual encodings, Data Visualization in Python-Superset or in Microsoft Power BI • Module 5: Computing and Applications (8 hrs) Using Python for Data Science - Using Open Source R for Data Science - Using SQL in Data Science - Software Applications for Data Science. Applications of Data Science, Technologies for visualization like Data Visualization in Microsoft Power BI. • Module 6: Trends and Technologies (7 hrs) Recent trends in various data collection and analysis techniques, various visualization techniques, application development methods used in data science, NYC Parking Case Study: Apache Spark
  • 47. Text Books: • 1. Cathy O’Neil and Rachel Schutt, Doing Data Science, Straight Talk from The Frontline. O’Reilly, 2014. ISBN: 978-1- 449-35865-5 • 2. Davy Cielen. Arno D.B Meysman, Mohamed Ali, “Introducing Data Science”, Dreamtech Press, 2016. ISBN: 978-93- 5119-937-3 Reference Books: • 1. Joel Grus, Data Science from Scratch, O’Reilly, 2015, ISBN: 978-1-491-90142-7 • 2. Jure Leskovek, Anand Rajaraman and Jeffrey Ullman, Mining of Massive Datasets. v2.1, Cambridge University Press, 2014. ISBN : 9781139924801 • 3. John W. Foreman, Using Data Science to Transform Information into Insight – Data Smart, Wiley, 2014. ISBN: 978-81-265-4614-5 • 4. https://github.com/maximrohit/SPARK-R-SQL-NYC-PArking-Ticket-Analysis
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
  • 54. • Big data has a high volume, velocity, and variety • Different data structures • Structured, semi-structured, quasi-structured, unstructured • Data science is a very diverse discipline • Maths, computer science, statistics, applications → Data scientists require a diverse skillset
  • 55. • Not computer scientists • But should know about databases, data structures, algorithms, etc. • Not mathematicians • But should know about optimization, stochastics, etc. • Not statisticians • But should know about regression, statistical tests, etc. • Not domain experts • But must work together with them
  • 56. Data Scientists Quantitative • Maths • Algorithms • Statistics Technical • Programming • Infrastructures Skeptical • Create hypotheses, but be skeptical about them Collaborative • Teamwork • Communication skills A bit of everything … but actually as much as possible of everything
  • 57.
  • 58. • According to Microsoft Research: • Polymath • Do it all • Data Evangelist • Data analysis, disseminating and acting on insights • Data Preparer • Querying existing data, preparing data for analysis • Data Shapers • Analyzing and preparing data • Data Analyzer • Analyzing data • Platform Builder • Collect data and create infrastructures • Moonlighters (50%/20%) • Spare time data scientists • Insight Actors • Use the outcome and act on insights. Miyung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)