Background © Jim Kaskade: Big Data 
BIG DATA AND DATA SCIENCE 
study materials and online courses by @dspadawan
WHAT IS DATA SCIENCE 
2 Copyright © 2013-2014 by Teradata. All rights reserved. 
THE DATA SCIENCE VENN DIAGRAM 
DATA SCIENCE DOMAINS 
All links go to Wiki. If you are not sure what something means, you can look it up.
1. Data Science (Fundamentals) 
2. Statistics 
3. Programming languages 
4. Machine Learning / Data Mining 
5. Text Mining / Natural Language Processing 
6. Data Visualization 
7. Big Data (Hadoop, MapReduce, NoSQL) 
8. Data Ingestion 
9. Data Munging or Data Wrangling 
10. Toolbox (Weka, …, Spark, Storm, …, Sqoop, RHIPE, etc.) 
DATA SCIENCE METRO MAP 
BECOMING A DATA SCIENTIST
MASSIVE OPEN ONLINE COURSES (MOOC) 
• Aggregator 
> http://www.mooc-list.com 
• Platforms 
> https://www.coursera.org 
> https://www.edx.org 
> https://www.open2study.com 
> https://www.udacity.com 
> https://www.udemy.com 
> http://online.stanford.edu 
• Interactive platforms 
> http://www.codecademy.com 
> https://www.datacamp.com 
WANT TO WORK AS A DATA SCIENTIST?
DATA SCIENCE & ANALYTICS 
• Coursera 
> Core Concepts in Data Analysis 
https://www.coursera.org/course/datan 
> Introduction to Data Science: 
https://www.coursera.org/course/datasci 
> Data Science Specialization: 
https://www.coursera.org/specialization/jhudatascience/1 
– 9 courses + 1 capstone project 
– Each course or capstone takes 4 weeks 
– You can do it for free or you can pay 49 USD for certification 
> Welcome To Process Mining: Data science in Action! 
https://www.coursera.org/course/procmin 
DATA SCIENCE & ANALYTICS 1 
• edX
> The Analytics Edge
http://www.edx.org/course/mitx/mitx-15-071x-analytics-edge-1416
> Data, Analytics and Learning
http://www.edx.org/course/utarlingtonx/utarlingtonx-link5-10x-data-analytics-2186
• Udacity ($)
> Intro to Data Science 
https://www.udacity.com/course/ud359 
MATH DANCE 
STATISTICS COURSES 
• Coursera 
> Data analysis and statistical inference: 
https://www.coursera.org/course/statistics 
> Statistical inference and exploratory data analysis: 
https://www.coursera.org/specialization/jhudatascience/1/courses 
• edX
> Introduction to Statistics: Descriptive Statistics
http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-1x-introduction-1138
> Introduction to Statistics: Probability
http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-2x-introduction-1534
> Introduction to Statistics: Inference
http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-3x-introduction-1533
STATISTICS COURSES CONT. 2 
• Udacity ($)
> Intro to statistics: 
https://www.udacity.com/course/st101 
> Exploratory data analysis: 
https://www.udacity.com/course/ud651 
> Intro to Inferential Statistics 
https://www.udacity.com/course/ud201 
• Mathematical monk 
> https://www.youtube.com/playlist?list=PL17567A1A3F5DB5E4 
PROGRAMMING LANGUAGES 
• Analysis/Data mining: 
> R language 
> Python 
> SQL 
> (Perl) 
> (Octave) 
• Big Data (Hadoop) 
> Java (!) 
> Python 
• Visualization 
> JavaScript 
R LANGUAGE 
• Basic info and SW 
> R Language: 
http://www.r-project.org 
> R Studio (IDE): 
http://www.rstudio.com 
• Courses 
> R Programming: 
https://www.coursera.org/course/rprog 
• Practice 
> Interactive courses: 
https://www.datacamp.com/courses 
> Data mining examples in R: 
http://www.rdatamining.com 
PYTHON 
• Basic info and SW: 
> Python language: 
https://www.python.org 
> PyDev (Python IDE for Eclipse):
http://pydev.org 
• Python for Java developers: 
> http://www.sthurlow.com/python 
• Google's Python Class 
> https://developers.google.com/edu/python 
• Codecademy Python
> http://www.codecademy.com/tracks/python 
OCTAVE 
• Basic info and SW: 
> http://octave.sourceforge.net 
> https://gnu.org/software/octave 
> http://en.wikipedia.org/wiki/GNU_Octave 
• Coursera: 
> Machine learning: https://www.coursera.org/course/ml 
Octave is mostly compatible with MATLAB.
MACHINE LEARNING COURSES 
A subfield of computer science and artificial intelligence concerned with learning from data.
• Coursera 
> Machine Learning (Stanford): 
https://www.coursera.org/course/ml 
> Machine Learning (University of Washington):
https://www.coursera.org/course/machlearning 
> Practical Machine Learning (Johns Hopkins): 
https://www.coursera.org/course/predmachlearn 
– part of Data Science Specialization 
• Udacity 
> Machine Learning (Supervised, Reinforcement, Unsupervised) 
https://www.udacity.com/course/ud675 
https://www.udacity.com/course/ud820 
https://www.udacity.com/course/ud741 
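"Learning from data" can be illustrated with a minimal sketch: fitting a straight line to a few points by ordinary least squares and using it to predict a new value. The data set (hours studied vs. exam score) and the function name `fit_line` are made up for illustration and are not taken from any of the courses listed.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

slope, intercept = fit_line(hours, scores)

# Use the learned parameters to predict the score for 6 hours of study.
prediction = slope * 6.0 + intercept
print(slope, intercept, prediction)
```

The model here has only two parameters; the courses above cover far richer models, but the idea of estimating parameters from examples and then predicting is the same.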
MACHINE LEARNING VIDEOS 
• Udemy 
> Hilary Mason: An Intro to Machine Learning with Web Data 
https://www.udemy.com/hilary-mason-an-intro-to-machine-learning-with-web-data
> Hilary Mason: Advanced Machine Learning 
https://www.udemy.com/hilary-mason-advanced-machine-learning/ 
• Mathematical monk 
> https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA 
• Videolectures.net 
> http://blog.videolectures.net/100-most-popular-machine-learning-talks-at-videolectures-net/
DATA MINING COURSES 
The process of discovering patterns in large data sets using machine learning or statistics.
• Coursera 
> Mining Massive Datasets (Stanford)
https://www.coursera.org/course/mmds 
• Udemy 
> Matthew Russell on Mining the Social Web 
https://www.udemy.com/matthew-russell-on-mining-the-social-web/ 
> Data Mining 
https://www.udemy.com/data-mining 
• Web page 
> http://www.rdatamining.com 
DATA MINING COURSES & TOOLS 
• Courses: 
> Data Mining with Weka: 
https://weka.waikato.ac.nz/dataminingwithweka/preview 
> More Data Mining with Weka: 
https://weka.waikato.ac.nz/moredataminingwithweka 
• Weka 
> SW: http://www.cs.waikato.ac.nz/ml/weka 
• Knime 
> SW: https://www.knime.org/downloads/overview 
• RapidMiner 
> Official site: http://rapidminer.com 
> SW: http://sourceforge.net/projects/rapidminer 
TEXT MINING 5A 
• R Data Mining (Word Cloud)
> http://www.rdatamining.com/examples/text-mining
[Figure: word cloud, "Top recurring themes about Big Data"]
• Videolectures.net 
> http://videolectures.net/Top/Computer_Science/Text_Mining 
• Tool (Word Cloud) 
> Wordle.net 
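A word cloud is essentially a picture of word frequencies. A minimal sketch of that counting step, assuming a tiny hand-picked stop-word list and a made-up sample sentence (neither comes from the linked examples):

```python
import re
from collections import Counter

# Illustrative stop-word list; real text-mining tools ship much longer ones.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "is", "for"}


def word_frequencies(text):
    """Lower-case the text, split it into words, and count non-stop-words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


sample = "Big data is big. Data science turns big data into insight."
freqs = word_frequencies(sample)

# The most frequent words are the ones a word cloud would draw largest.
top_words = freqs.most_common(2)
print(top_words)
```

A tool such as Wordle then maps each count to a font size.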
NATURAL LANGUAGE PROCESSING COURSES 
A subfield of computer science, artificial intelligence, and linguistics.
• Coursera
> Natural Language Processing (Columbia University):
https://www.coursera.org/course/nlangp
> Natural Language Processing (Stanford): 
https://www.coursera.org/course/nlp 
• Deeper Learning MOOC 
> http://dlmooc.deeper-learning.org/ 
• Wikipedia 
> http://en.wikipedia.org/wiki/Natural_language_processing 
VISUALIZATION TOOLS 6 
• Tableau 
> http://www.tableausoftware.com 
> Commercial visualization software 
• D3.js 
> http://d3js.org 
> Data-Driven Documents: JavaScript visualization library
• GraphViz 
> http://www.graphviz.org 
> Graph visualization tools 
• Gephi 
> https://gephi.github.io 
> Visualization platform 
TABLEAU 6 
• Trainings 
> http://www.tableausoftware.com/learn/training 
> On demand 
> Live online sessions scheduled for specific topics
• Download 
> Tableau Public: http://www.tableausoftware.com/public 
> Tableau Trial: http://www.tableausoftware.com/products/trial 
• Certification 
> Desktop (Qualified Associate, Certified Professional)
> Server (Qualified Associate, Certified Professional)
> http://www.tableausoftware.com/support/certification 
HOW BIG IS BIG ENOUGH?
BIG DATA STUDY 7 
• MOOC 
> http://bigdatauniversity.com 
> http://bigdatacourse.appspot.com 
• Coursera 
> Web Intelligence and Big Data 
https://www.coursera.org/course/bigdata 
• Udemy ($)
> Big Data and Hadoop Essentials 
https://www.udemy.com/big-data-and-hadoop-essentials-free-tutorial 
• Open2Study 
> Big Data for Better Performance 
http://www.open2study.com/courses/big-data-for-better-performance 
BIG DATA TOOLS 
• Hadoop – Big Data framework
• Hive – Data warehouse infrastructure built on top of Hadoop
• HBase – Non-relational, distributed DB
• Pig – Hadoop programming tool
• Storm – Real-time computation system for Hadoop
• Solr – Search platform
• Falcon – Data management and processing for Hadoop
• Sqoop – Command-line application for transferring data between relational databases and Hadoop
• Flume – Large-scale log aggregation framework
• Oozie – Workflow scheduler for Hadoop
• Ambari – Simpler management for Hadoop clusters
• Mahout – Machine learning algorithms implemented on Hadoop
• ZooKeeper – Coordination service for distributed applications
• Knox – REST API gateway for interacting with Hadoop clusters
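Most of these tools build on Hadoop's MapReduce model (listed under domain 7). The idea can be imitated in a single Python process as a rough sketch; this is not the Hadoop API, and the function names `map_phase`, `shuffle` and `reduce_phase` as well as the input lines are invented for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


# Hypothetical input; on a real cluster each line could live on a different node.
lines = ["hadoop stores data", "hive queries data", "pig transforms data"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

On a cluster the map and reduce steps run in parallel on many machines and the framework performs the shuffle; the sketch only shows the data flow.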
HADOOP STUDY 
• Hadoop providers 
> http://www.cloudera.com 
> http://hortonworks.com 
> http://www.mapr.com 
> http://www.teradata.com/aster 
• Udacity 
> Intro to Hadoop and MapReduce 
https://www.udacity.com/course/ud617 
• Udemy 
> Become a Certified Hadoop Developer | Training | Tutorial 
https://www.udemy.com/hadoop-tutorial 
There are more Hadoop providers: IBM, Pivotal, etc.
NOT ONLY SQL DATABASES 
• MongoDB – JSON document store 
> http://www.mongodb.com 
> https://university.mongodb.com 
• CouchDB – JSON document store 
> http://couchdb.apache.org 
• Cassandra – High-performance column-oriented DB
> http://cassandra.apache.org 
• VoltDB – In-memory database 
> http://voltdb.com 
• Redis – In-memory key-value store
> http://redis.io 
• NuoDB – Distributed SQL DB 
> http://www.nuodb.com 
BIG DATA UNIVERSITY 7 
• Big Data Courses path: 
> Big Data Fundamentals 
> Hadoop Fundamentals 
> Moving Data into Hadoop (Sqoop and Flume tools) 
> Query languages for Hadoop (Hive, Pig and Jaql) 
> SQL Access for Hadoop 
> Using HBase for Real-time Access to your Big Data 
> Accessing Hadoop Data Using Hive 
> Introduction to Pig 
> Controlling Hadoop Jobs using Oozie 
> Hadoop Reporting and Analysis 
> Introduction to MapReduce Programming 
• Courses are provided by IBM 
IT IS EVEN BETTER, DON’T YOU THINK? 
CLOUDERA HADOOP 7 
• Tutorials 
> 8 different paths 
> On demand and free 
> Taught jointly with Udacity (paid on a monthly basis)
> http://cloudera.com/content/cloudera/en/training/courses.html 
> http://cloudera.com/content/cloudera/en/training/library.html 
• Sandbox 
> http://cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html
• Certification 
> 200 USD per exam 
> http://cloudera.com/content/cloudera/en/training/certification.html
HORTONWORKS HADOOP 7 
• Tutorials 
> http://hortonworks.com/tutorials 
> 3 paths for 
– Developers 
– Administrators 
– Data Scientists 
• Sandbox 
> http://hortonworks.com/hdp/downloads 
• Certifications 
> 200 USD per exam 
> http://hortonworks.com/training/certification 
MAPR HADOOP 7 
• Tutorials 
> https://www.mapr.com/services/mapr-academy/training-videos 
> 3 paths for 
– Developers 
– Administrators 
– Business users 
• Sandbox 
> https://www.mapr.com/products/mapr-sandbox-hadoop 
• Certification 
> For administrators only
> You must pass the "Hadoop Cluster Administration on MapR" course
> https://www.mapr.com/services/mapr-academy/certification 
STREAMING – NO BIG DEAL 
STREAMING DATA PROCESSING 
• Storm (https://storm.incubator.apache.org)
> Open source (ASF) real-time Hadoop
> Twitter project
• Spark (https://spark.apache.org)
> Open source (ASF) in-memory Hadoop
> Apache project
• S4 (http://incubator.apache.org/s4)
> Open source (ASF) processing of stream data
> Yahoo project
• Samza (http://samza.incubator.apache.org)
> Open source processing of messaging data
> LinkedIn project
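What these systems share is that records are processed one at a time as they arrive, instead of in a batch after everything is stored. A minimal single-process sketch of that idea, using a Python generator and a rolling average; it does not use the Storm, Spark, S4 or Samza APIs, and the readings are made up:

```python
from collections import deque


def rolling_average(stream, window_size):
    """Yield the average of the last `window_size` values after each new value."""
    window = deque(maxlen=window_size)  # old values drop out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


# Hypothetical sensor readings; a real stream would never end.
readings = iter([10, 20, 30, 40])
averages = list(rolling_average(readings, window_size=2))
print(averages)
```

Only the current window is held in memory, which is what makes the approach usable on unbounded input.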
DATA INGESTION 8 
Process of obtaining, importing and processing data for later use or storage.
• Techniques
> Data import and export
> Data fusion – integration of multiple data sources
> Data sampling – selection of a data subset (rows)
> Data discovery – detection of patterns in data
> Exploratory data analysis – summary of the main data characteristics
> Feature extraction – selection of a data subset (columns)
> Data scrubbing – correction of data errors
> Missing data values – data correction
> Etc.
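Three of the techniques above (sampling rows, selecting columns, correcting missing values) can be sketched with plain Python lists of dictionaries. The records, column names and helper functions are hypothetical, and filling gaps with the column mean is only one of several possible correction strategies.

```python
import random

# Hypothetical raw records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "city": "Prague", "income": 52000},
    {"id": 2, "age": None, "city": "Brno", "income": 48000},
    {"id": 3, "age": 29, "city": "Ostrava", "income": None},
    {"id": 4, "age": 41, "city": "Prague", "income": 61000},
]


def sample_rows(data, k, seed=0):
    """Data sampling: pick k rows at random (seeded for repeatability)."""
    return random.Random(seed).sample(data, k)


def select_columns(data, columns):
    """Selection of a column subset: keep only the named fields."""
    return [{column: row[column] for column in columns} for row in data]


def fill_missing(data, column):
    """Missing values: replace None with the mean of the known values."""
    known = [row[column] for row in data if row[column] is not None]
    mean = sum(known) / len(known)
    return [
        {**row, column: mean if row[column] is None else row[column]}
        for row in data
    ]


subset = sample_rows(rows, k=2)
features = select_columns(rows, ["age", "income"])
cleaned = fill_missing(features, "age")
print(cleaned)
```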
DATA WRANGLING / DATA MUNGING 9 
Converting or mapping data from one "raw" form into another format.
• Coursera
> Getting and Cleaning Data (part of Data Science Specialization)
https://www.coursera.org/course/getdata
• Udacity ($)
> Data Wrangling with MongoDB 
https://www.udacity.com/course/ud032 
• School of Data 
> Many different courses http://schoolofdata.org 
• Tools 
> OpenRefine, DataWrangler – clean up and transform tools 
> Talend, Pentaho – integration 
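As a small illustration of mapping "raw" data into another format, the sketch below turns a messy semicolon-separated export into clean JSON records using only the Python standard library. The input text, its column names and the `wrangle` function are invented for the example.

```python
import csv
import io
import json

# Hypothetical raw export: semicolon-separated, stray spaces,
# inconsistent capitalization, decimal commas.
raw = "name;signup;spend\n alice ;2014-03-01;12,50\nBOB;2014-03-05;7,00\n"


def wrangle(text):
    """Parse the raw text and normalize every field into a clean record."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    records = []
    for row in reader:
        records.append({
            "name": row["name"].strip().title(),              # " alice " -> "Alice"
            "signup": row["signup"],                          # already ISO formatted
            "spend": float(row["spend"].replace(",", ".")),   # "12,50" -> 12.5
        })
    return records


records = wrangle(raw)
as_json = json.dumps(records, indent=2)
print(as_json)
```

Tools such as OpenRefine apply the same kind of per-column clean-up interactively.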
TOOLBOX 10 
• Hadoop and realtime 
> Scribe – log aggregation server open-sourced by Facebook
• Machine Learning 
> H2O – In memory machine learning 
• Data Mining 
> Rattle – GUI for DM using R 
• Python and NLP 
> NLTK = Natural Language ToolKit for Python 
• R and Hadoop 
> RHIPE = R + Hadoop Integrated Programming Environment 
• Visualization 
> Many Eyes – Online visualization system from IBM 
ONLINE SOURCES 
• Data science websites:
> http://www.datasciencecentral.com 
> http://www.hadoop360.com 
> http://www.datascienceweekly.org 
• Aggregators 
> https://trello.com/b/rbpEfMld/data-science 
• Blogs
> http://datasciencemasters.org
> http://www.kdnuggets.com
> http://www.zipfianacademy.com/blog/post/46864003608/a-practical-intro-to-data-science
> http://datascience101.wordpress.com
> http://fivethirtyeight.blogs.nytimes.com
FREE BOOKS 
• Data Science 
> Doing Data Science 
> Agile Data Science 
> Data Science for Business 
• Statistics 
> Think Stats 
• Programming 
> R language 
– 25 Recipes for Getting Started with R 
– Learning R 
> Python 
– Learning Python, 5th Edition 
– Think Python 
FREE BOOKS CONTINUED 
• Machine Learning / Data Mining 
> Machine Learning for Hackers 
> Mining the Social Web 
• Visualization 
> Visualizing Data 
> Getting Started with D3 
> Communicating Data with Tableau 
• Text mining / Natural Language Processing 
> 21 Recipes for Mining Twitter 
> Natural Language Processing with Python 
> Natural Language Annotation for Machine Learning 
FREE BOOKS CONTINUED 
• Big Data 
> Hadoop: The Definitive Guide, 3rd Edition 
> Ethics of Big Data 
> Big Data Analytics with R and Hadoop 
• Data Ingestion 
> Data Analysis with Open Source Tools 
> Python for Data Analysis 
• Data Wrangling and Munging 
> Using OpenRefine 
• Toolbox ($)
> Getting Started with Storm 
> Fast Data Processing with Spark 
QUESTIONS AND ANSWERS 
By Tara Laskowski 
Contact me at datasciencepadawan@gmail.com
Follow me on Twitter: @dspadawan
Read my blog http://datasciencepadawan.blogspot.com

Big data and data science study

  • 1. Background © Jim Kaskade: Big Data BIG DATA AND DATA SCIENCE study materials and online courses by @dspadawan
  • 2. WHAT IS DATA SCIENCE 2 Copyright © 2013-2014 by Teradata. All rights reserved. THE DATA SCIENCE VENN DIAGRAM @dspadawan
  • 3. DATA SCIENCE DOMAINS All links go to Wiki. If you are not sure what something means you can learn. 1. Data Science (Fundamentals) 2. Statistics 3. Programming languages 4. Machine Learning / Data Mining 5. Text Mining / Natural Language Processing 6. Data Visualization 7. Big Data (Hadoop, MapReduce, NoSQL) 8. Data Ingestion 9. Data Munging or Data Wrangling 10. Toolbox (Weka, …, Spark, Storm, …, Sqoop, RHIPE, etc.) 3 Copyright © 2013-2014 by Teradata. All rights reserved. @dspadawan
  • 4. DATA SCIENCE METRO MAP 4 Copyright © 2013-2014 by Teradata. All rights reserved. BECOMING A DATA SCIENTIST
  • 5. MASSIVE OPEN ONLINE COURSES (MOOC) • Aggregator > http://www.mooc-list.com • Platforms > https://www.coursera.org > https://www.edx.org > https://www.open2study.com > https://www.udacity.com > https://www.udemy.com > http://online.stanford.edu • Interactive platforms > http://www.codecademy.com > https://www.datacamp.com 5 Copyright © 2013-2014 by Teradata. All rights reserved. @dspadawan
  • 6. WANT TO WORK AS DATA SCIENTIST? 6 Copyright © 2013-2014 by Teradata. All rights reserved. @dspadawan
  • 7. DATA SCIENCE & ANALYTICS • Coursera > Core Concepts in Data Analysis https://www.coursera.org/course/datan > Introduction to Data Science: https://www.coursera.org/course/datasci > Data Science Specialization: https://www.coursera.org/specialization/jhudatascience/1 – 9 courses + 1 capstone project – Each course or capstone takes 4 weeks – You can do it for free or you can pay 49 USD for certification > Welcome To Process Mining: Data science in Action! https://www.coursera.org/course/procmin 7 Copyright © 2013-2014 by Teradata. All rights reserved. 1 @dspadawan
  • 8. DATA SCIENCE & ANALYTICS 1 • Edx > The Analytics Edge http://www.edx.org/course/mitx/mitx-15-071x-analytics-edge- 1416 > Data, Analytics and Learning http://www.edx.org/course/utarlingtonx/utarlingtonx-link5-10x-data- analytics-2186 • Udacity $ > Intro to Data Science https://www.udacity.com/course/ud359 8 Copyright © 2013-2014 by Teradata. All rights reserved. @dspadawan
  • 9. MATH DANCE 9 Copyright © 2013-2014 by Teradata. All rights reserved. @dspadawan
  • 10. STATISTICS COURSES • Coursera > Data analysis and statistical inference: https://www.coursera.org/course/statistics > Statistical inference and exploratory data analysis: https://www.coursera.org/specialization/jhudatascience/1/courses • EdX > Introduction to Statistics: Descriptive Statistics http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-1x-introduction- 1138 > Introduction to Statistics: Probability http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-2x-introduction- 1534 > Introduction to Statistics: Inference http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-3x-introduction- 1533 10 Copyright © 2013-2014 by Teradata. All rights reserved. 2 @dspadawan
  • 11. STATISTICS COURSES CONT. 2 • Udacity $ > Intro to statistics: https://www.udacity.com/course/st101 > Exploratory data analysis: https://www.udacity.com/course/ud651 > Intro to Inferential Statistics https://www.udacity.com/course/ud201 • Mathematical monk > https://www.youtube.com/playlist?list=PL17567A1A3F5DB5E4 11 Copyright © 2013-2014 by Teradata. All rights reserved. @dspadawan
  • 12. PROGRAMMING LANGUAGES • Analysis/Data mining: > R language > Python > SQL > (Perl) > (Octave) • Big Data (Hadoop) > Java (!) > Python • Visualization > JavaScript 12 Copyright © 2013-2014 by Teradata. All rights reserved. 3 @dspadawan
  • 13. R LANGUAGE • Basic info and SW > R Language: http://www.r-project.org > R Studio (IDE): http://www.rstudio.com • Courses > R Programming: https://www.coursera.org/course/rprog • Practice > Interactive courses: https://www.datacamp.com/courses > Data mining examples in R: http://www.rdatamining.com 13 Copyright © 2013-2014 by Teradata. All rights reserved. 3 @dspadawan
  • 14. PYTHON • Basic info and SW: > Python language: https://www.python.org > Eclipse Python: http://pydev.org • Python for Java developers: > http://www.sthurlow.com/python • Google's Python Class > https://developers.google.com/edu/python • Code Academy Python > http://www.codecademy.com/tracks/python 14 Copyright © 2013-2014 by Teradata. All rights reserved. 3 @dspadawan
  • 15. OCTAVE • Basic info and SW: > http://octave.sourceforge.net > https://gnu.org/software/octave > http://en.wikipedia.org/wiki/GNU_Octave • Coursera: > Machine learning: https://www.coursera.org/course/ml 15 Copyright © 2013-2014 by Teradata. All rights reserved. 3 Octave is mostly compatible with MatLab. @dspadawan
  • 16. MACHINE LEARNING COURSES Subfield of computer science and artificial intelligence about learn from data. • Coursera > Machine Learning (Stanford): https://www.coursera.org/course/ml > Machine Learning: (University of Washington) https://www.coursera.org/course/machlearning > Practical Machine Learning (Johns Hopkins): https://www.coursera.org/course/predmachlearn – part of Data Science Specialization • Udacity > Machine Learning (Supervised, Reinforcement, Unsupervised) https://www.udacity.com/course/ud675 https://www.udacity.com/course/ud820 https://www.udacity.com/course/ud741 16 Copyright © 2013-2014 by Teradata. All rights reserved. 4A $ @dspadawan
  • 17. MACHINE LEARNING VIDEOS • Udemy > Hilary Mason: An Intro to Machine Learning with Web Data https://www.udemy.com/hilary-mason-an-intro-to-machine-learning- with-web-data > Hilary Mason: Advanced Machine Learning https://www.udemy.com/hilary-mason-advanced-machine-learning/ • Mathematical monk > https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA • Videolectures.net > http://blog.videolectures.net/100-most-popular-machine-learning- talks-at-videolectures-net/ 17 Copyright © 2013-2014 by Teradata. All rights reserved. 4A $ @dspadawan
  • 18. DATA MINING COURSES Process of discovery patterns in large data sets via machine learning or statistics. • Coursera > Mining Massive Datasets (Stanford) https://www.coursera.org/course/mmds • Udemy > Matthew Russell on Mining the Social Web https://www.udemy.com/matthew-russell-on-mining-the-social-web/ > Data Mining https://www.udemy.com/data-mining • Web page > http://www.rdatamining.com 18 Copyright © 2013-2014 by Teradata. All rights reserved. 4B $ @dspadawan
  • 19. DATA MINING COURSES & TOOLS • Courses: > Data Mining with Weka: https://weka.waikato.ac.nz/dataminingwithweka/preview > More Data Mining with Weka: https://weka.waikato.ac.nz/moredataminingwithweka • Weka > SW: http://www.cs.waikato.ac.nz/ml/weka • Knime > SW: https://www.knime.org/downloads/overview • RapidMiner > Official site: http://rapidminer.com > SW: http://sourceforge.net/projects/rapidminer 19 Copyright © 2013-2014 by Teradata. All rights reserved. 4B @dspadawan
  • 20. TEXT MINING 5A • R Data Mining (Word Cloud) TOP RECURRING THEMES ABOUT BIG DATA > http://www.rdatamining.com/examples/text-mining • Videolectures.net > http://videolectures.net/Top/Computer_Science/Text_Mining • Tool (Word Cloud) > Wordle.net 20 Copyright © 2013-2014 by Teradata. All rights reserved. @dspadawan
  • 21. NATURAL LANGUAGE PROCESSING COURSES 5B
    Subfield of computer science and artificial intelligence and linguistics.
    • Coursera
      > Natural Language Processing (Columbia University): https://www.coursera.org/course/nlangp
      > Natural Language Processing (Stanford): https://www.coursera.org/course/nlp
    • Deeper Learning MOOC
      > http://dlmooc.deeper-learning.org/
    • Wikipedia
      > http://en.wikipedia.org/wiki/Natural_language_processing
  • 22. VISUALIZATION TOOLS 6
    • Tableau
      > http://www.tableausoftware.com
      > Commercial visualization software
    • D3.js
      > http://d3js.org
      > Data-Driven Documents visualization library
    • GraphViz
      > http://www.graphviz.org
      > Graph visualization tools
    • Gephi
      > https://gephi.github.io
      > Visualization platform
  • 23. TABLEAU 6
    • Trainings
      > http://www.tableausoftware.com/learn/training
      > On demand
      > Live online sessions scheduled for specific topics
    • Download
      > Tableau Public: http://www.tableausoftware.com/public
      > Tableau Trial: http://www.tableausoftware.com/products/trial
    • Certification
      > Desktop (Qualified Associate, Certified Professional)
      > Server (Qualified Associate, Certified Professional)
      > http://www.tableausoftware.com/support/certification
  • 24. HOW BIG IS BIG ENOUGH?
  • 25. BIG DATA STUDY 7
    • MOOC
      > http://bigdatauniversity.com
      > http://bigdatacourse.appspot.com
    • Coursera
      > Web Intelligence and Big Data: https://www.coursera.org/course/bigdata
    • Udemy $
      > Big Data and Hadoop Essentials
        https://www.udemy.com/big-data-and-hadoop-essentials-free-tutorial
    • Open2Study
      > Big Data for Better Performance
        http://www.open2study.com/courses/big-data-for-better-performance
  • 26. BIG DATA TOOLS 7
    • Hadoop – Big Data framework
    • Hive – DWH infrastructure built on top of Hadoop
    • HBase – Non-relational, distributed DB
    • Pig – Hadoop programming tool
    • Storm – Real-time computation system for Hadoop
    • Solr – Search platform
    • Falcon – Data management and processing for Hadoop
    • Sqoop – Command-line application for transferring data into Hadoop
    • Flume – Large-scale log aggregation framework
    • Oozie – Workflow scheduler for Hadoop
    • Ambari – Simpler management for Hadoop clusters
    • Mahout – Machine learning algorithms implemented on Hadoop
    • ZooKeeper – Coordination service for distributed applications
    • Knox – REST API gateway for interacting with Hadoop clusters
  • 27. HADOOP STUDY 7 $ $
    • Hadoop providers
      > http://www.cloudera.com
      > http://hortonworks.com
      > http://www.mapr.com
      > http://www.teradata.com/aster
      There are more Hadoop providers: IBM, Pivotal, etc.
    • Udacity
      > Intro to Hadoop and MapReduce: https://www.udacity.com/course/ud617
    • Udemy
      > Become a Certified Hadoop Developer | Training | Tutorial
        https://www.udemy.com/hadoop-tutorial
  • 28. NOT ONLY SQL DATABASES 7
    • MongoDB – JSON document store
      > http://www.mongodb.com
      > https://university.mongodb.com
    • CouchDB – JSON document store
      > http://couchdb.apache.org
    • Cassandra – High-performance column-oriented DB
      > http://cassandra.apache.org
    • VoltDB – In-memory database
      > http://voltdb.com
    • Redis – In-memory key-value store
      > http://redis.io
    • NuoDB – Distributed SQL DB
      > http://www.nuodb.com
  • 29. BIG DATA UNIVERSITY 7
    • Big Data courses path:
      > Big Data Fundamentals
      > Hadoop Fundamentals
      > Moving Data into Hadoop (Sqoop and Flume tools)
      > Query languages for Hadoop (Hive, Pig and Jaql)
      > SQL Access for Hadoop
      > Using HBase for Real-time Access to your Big Data
      > Accessing Hadoop Data Using Hive
      > Introduction to Pig
      > Controlling Hadoop Jobs using Oozie
      > Hadoop Reporting and Analysis
      > Introduction to MapReduce Programming
    • Courses are provided by IBM
  • 30. IT IS EVEN BETTER, DON’T YOU THINK?
  • 31. CLOUDERA HADOOP 7
    • Tutorials
      > 8 different paths
      > On demand and free
      > Taught together with Udacity (paid on a monthly basis)
      > http://cloudera.com/content/cloudera/en/training/courses.html
      > http://cloudera.com/content/cloudera/en/training/library.html
    • Sandbox
      > http://cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html
    • Certification
      > 200 USD per exam
      > http://cloudera.com/content/cloudera/en/training/certification.html
  • 32. HORTONWORKS HADOOP 7
    • Tutorials
      > http://hortonworks.com/tutorials
      > 3 paths for
        – Developers
        – Administrators
        – Data Scientists
    • Sandbox
      > http://hortonworks.com/hdp/downloads
    • Certifications
      > 200 USD per exam
      > http://hortonworks.com/training/certification
  • 33. MAPR HADOOP 7
    • Tutorials
      > https://www.mapr.com/services/mapr-academy/training-videos
      > 3 paths for
        – Developers
        – Administrators
        – Business users
    • Sandbox
      > https://www.mapr.com/products/mapr-sandbox-hadoop
    • Certification
      > For administrators only
      > You must pass the Hadoop Cluster Administration on MapR course
      > https://www.mapr.com/services/mapr-academy/certification
  • 34. STREAMING – NO BIG DEAL
  • 35. STREAMING DATA PROCESSING 7
    • Storm (https://storm.incubator.apache.org)
      > Open source (ASF) real-time Hadoop
      > Twitter project
    • Spark (https://spark.apache.org)
      > Open source (ASF) in-memory Hadoop
      > Apache project
    • S4 (http://incubator.apache.org/s4)
      > Open source (ASF) processing of stream data
      > Yahoo project
    • Samza (http://samza.incubator.apache.org)
      > Open source processing of messaging data
      > LinkedIn project
  • 36. DATA INGESTION 8
    Process of obtaining, importing and processing data for later use or storage.
    • Techniques
      > Data import and export
      > Data fusion – integration of multiple data sources
      > Data sampling – selection of a data subset (rows)
      > Data discovery – detection of patterns in data
      > Exploratory data analysis – summary of the main data characteristics
      > Feature extraction – selection of a data subset (columns)
      > Data scrubbing – data error correction
      > Missing data values – data correction
      > Etc.
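Several of the ingestion techniques listed on slide 36 can be sketched in a few lines of standard-library Python. The records, column names and helper functions below are made up for illustration; mean imputation is only one of many ways to handle missing values, and a seeded random sample stands in for whatever sampling scheme a real pipeline would use.

```python
import random
import statistics

# Hypothetical records with inconsistent spelling and missing values.
records = [
    {"id": 1, "city": " Prague ", "age": 34, "income": 52000},
    {"id": 2, "city": "brno", "age": None, "income": 41000},
    {"id": 3, "city": "PRAGUE", "age": 29, "income": None},
    {"id": 4, "city": "Brno", "age": 45, "income": 63000},
]

def scrub(rows):
    """Data scrubbing: normalize whitespace and capitalization of city names."""
    return [{**r, "city": r["city"].strip().title()} for r in rows]

def fill_missing(rows, column):
    """Missing data values: replace None with the mean of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    mean = statistics.mean(known)
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]

def select_columns(rows, columns):
    """Column subset (what the slide calls feature extraction)."""
    return [{c: r[c] for c in columns} for r in rows]

def sample_rows(rows, k, seed=0):
    """Data sampling: random row subset, seeded so the result is repeatable."""
    return random.Random(seed).sample(rows, k)

clean = fill_missing(fill_missing(scrub(records), "age"), "income")
features = select_columns(clean, ["city", "age"])
subset = sample_rows(features, k=2)
```

Each helper returns new rows instead of modifying its input, so the steps can be reordered or tested one at a time.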
  • 37. DATA WRANGLING / DATA MUNGING 9
    Converting or mapping data from one "raw" form into another format.
    • Coursera
      > Getting and Cleaning Data – part of Data Science Specialization
        https://www.coursera.org/course/getdata
    • Udacity $
      > Data Wrangling with MongoDB: https://www.udacity.com/course/ud032
    • School of Data
      > Many different courses: http://schoolofdata.org
    • Tools
      > OpenRefine, DataWrangler – clean up and transform tools
      > Talend, Pentaho – integration
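As a rough illustration of mapping data from a "raw" form into another format, the sketch below turns a messy semicolon-delimited export into typed records and then into JSON. The sample data, the `wrangle` function and the column names are assumptions made up for this example; tools such as OpenRefine do the same kind of work interactively.

```python
import csv
import io
import json

# Hypothetical raw export: semicolon-delimited, stray spaces, numbers stored as text.
RAW = """name; signup_date ;visits
Alice;2014-03-01; 12
Bob ;2014-03-05;7
"""

def wrangle(raw_text):
    """Parse the raw text into a list of clean dictionaries."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
    rows = []
    for row in reader:
        # Trim whitespace from both the column names and the values.
        cleaned = {key.strip(): value.strip() for key, value in row.items()}
        # Convert the visit count from text to an integer.
        cleaned["visits"] = int(cleaned["visits"])
        rows.append(cleaned)
    return rows

records = wrangle(RAW)
as_json = json.dumps(records, indent=2)
```

The resulting JSON documents are in a shape that a document store such as MongoDB could ingest.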
  • 38. TOOLBOX 10
    • Hadoop and realtime
      > Apache Scribe
    • Machine Learning
      > H2O – In-memory machine learning
    • Data Mining
      > Rattle – GUI for DM using R
    • Python and NLP
      > NLTK = Natural Language ToolKit for Python
    • R and Hadoop
      > RHIPE = R + Hadoop Integrated Programming Environment
    • Visualization
      > Many Eyes – Online visualization system from IBM
  • 39. ONLINE SOURCES
    • Data Science Servers
      > http://www.datasciencecentral.com
      > http://www.hadoop360.com
      > http://www.datascienceweekly.org
    • Aggregators
      > https://trello.com/b/rbpEfMld/data-science
    • Blogs
      > http://datasciencemasters.org
      > http://www.kdnuggets.com
      > http://www.zipfianacademy.com/blog/post/46864003608/a-practical-intro-to-data-science
      > http://datascience101.wordpress.com
      > http://fivethirtyeight.blogs.nytimes.com
  • 40. FREE BOOKS
    • Data Science
      > Doing Data Science
      > Agile Data Science
      > Data Science for Business
    • Statistics
      > Think Stats
    • Programming
      > R language
        – 25 Recipes for Getting Started with R
        – Learning R
      > Python
        – Learning Python, 5th Edition
        – Think Python
  • 41. FREE BOOKS CONTINUED
    • Machine Learning / Data Mining
      > Machine Learning for Hackers
      > Mining the Social Web
    • Visualization
      > Visualizing Data
      > Getting Started with D3
      > Communicating Data with Tableau
    • Text mining / Natural Language Processing
      > 21 Recipes for Mining Twitter
      > Natural Language Processing with Python
      > Natural Language Annotation for Machine Learning
  • 42. FREE BOOKS CONTINUED
    • Big Data
      > Hadoop: The Definitive Guide, 3rd Edition
      > Ethics of Big Data
      > Big Data Analytics with R and Hadoop
    • Data Ingestion
      > Data Analysis with Open Source Tools
      > Python for Data Analysis
    • Data Wrangling and Munging
      > Using OpenRefine
    • Toolbox $
      > Getting Started with Storm
      > Fast Data Processing with Spark
  • 43. QUESTIONS AND ANSWERS
    By Tara Laskowski
    Contact me at datasciencepadawan@gmail.com
    Follow me on Twitter @dspadawan
    Read my blog http://datasciencepadawan.blogspot.com

Editor's notes

  1. OK, done
  2. OK, done
  3. OK, done
  4. OK, done
  5. OK, done, code: REFcb75
  6. OK, done
  7. OK, done
  8. OK, done
  9. OK, done
  10. OK, done
  11. OK, done
  12. OK, done
  13. K-Mart predictive analysis