Big data and data science study
- 1. Background © Jim Kaskade: Big Data
BIG DATA AND DATA SCIENCE
study materials and online courses by @dspadawan
- 2. WHAT IS DATA SCIENCE
2 Copyright © 2013-2014 by Teradata. All rights reserved.
THE DATA SCIENCE VENN DIAGRAM
@dspadawan
- 3. DATA SCIENCE DOMAINS
All links go to Wiki. If you are not sure what something means, you can learn it there.
1. Data Science (Fundamentals)
2. Statistics
3. Programming languages
4. Machine Learning / Data Mining
5. Text Mining / Natural Language Processing
6. Data Visualization
7. Big Data (Hadoop, MapReduce, NoSQL)
8. Data Ingestion
9. Data Munging or Data Wrangling
10. Toolbox (Weka, …, Spark, Storm, …, Sqoop, RHIPE, etc.)
- 4. DATA SCIENCE METRO MAP
BECOMING A DATA SCIENTIST
- 5. MASSIVE OPEN ONLINE COURSES (MOOC)
• Aggregator
> http://www.mooc-list.com
• Platforms
> https://www.coursera.org
> https://www.edx.org
> https://www.open2study.com
> https://www.udacity.com
> https://www.udemy.com
> http://online.stanford.edu
• Interactive platforms
> http://www.codecademy.com
> https://www.datacamp.com
- 6. WANT TO WORK AS A DATA SCIENTIST?
- 7. DATA SCIENCE & ANALYTICS
• Coursera
> Core Concepts in Data Analysis
https://www.coursera.org/course/datan
> Introduction to Data Science:
https://www.coursera.org/course/datasci
> Data Science Specialization:
https://www.coursera.org/specialization/jhudatascience/1
– 9 courses + 1 capstone project
– Each course or capstone takes 4 weeks
– You can do it for free or you can pay 49 USD for certification
> Welcome To Process Mining: Data science in Action!
https://www.coursera.org/course/procmin
- 8. DATA SCIENCE & ANALYTICS 1
• edX
> The Analytics Edge
http://www.edx.org/course/mitx/mitx-15-071x-analytics-edge-1416
> Data, Analytics and Learning
http://www.edx.org/course/utarlingtonx/utarlingtonx-link5-10x-data-analytics-2186
• Udacity
$
> Intro to Data Science
https://www.udacity.com/course/ud359
- 9. MATH DANCE
- 10. STATISTICS COURSES
• Coursera
> Data analysis and statistical inference:
https://www.coursera.org/course/statistics
> Statistical inference and exploratory data analysis:
https://www.coursera.org/specialization/jhudatascience/1/courses
• edX
> Introduction to Statistics: Descriptive Statistics
http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-1x-introduction-1138
> Introduction to Statistics: Probability
http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-2x-introduction-1534
> Introduction to Statistics: Inference
http://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-stat2-3x-introduction-1533
- 11. STATISTICS COURSES CONT. 2
• Udacity
$
> Intro to statistics:
https://www.udacity.com/course/st101
> Exploratory data analysis:
https://www.udacity.com/course/ud651
> Intro to Inferential Statistics
https://www.udacity.com/course/ud201
• Mathematical monk
> https://www.youtube.com/playlist?list=PL17567A1A3F5DB5E4
- 12. PROGRAMMING LANGUAGES
• Analysis/Data mining:
> R language
> Python
> SQL
> (Perl)
> (Octave)
• Big Data (Hadoop)
> Java (!)
> Python
• Visualization
> JavaScript
- 13. R LANGUAGE
• Basic info and SW
> R Language:
http://www.r-project.org
> R Studio (IDE):
http://www.rstudio.com
• Courses
> R Programming:
https://www.coursera.org/course/rprog
• Practice
> Interactive courses:
https://www.datacamp.com/courses
> Data mining examples in R:
http://www.rdatamining.com
- 14. PYTHON
• Basic info and SW:
> Python language:
https://www.python.org
> Eclipse Python:
http://pydev.org
• Python for Java developers:
> http://www.sthurlow.com/python
• Google's Python Class
> https://developers.google.com/edu/python
• Codecademy Python
> http://www.codecademy.com/tracks/python
- 15. OCTAVE
• Basic info and SW:
> http://octave.sourceforge.net
> https://gnu.org/software/octave
> http://en.wikipedia.org/wiki/GNU_Octave
• Coursera:
> Machine learning: https://www.coursera.org/course/ml
Octave is mostly compatible with MATLAB.
- 16. MACHINE LEARNING COURSES
A subfield of computer science and artificial intelligence concerned with learning from data.
• Coursera
> Machine Learning (Stanford):
https://www.coursera.org/course/ml
> Machine Learning (University of Washington):
https://www.coursera.org/course/machlearning
> Practical Machine Learning (Johns Hopkins):
https://www.coursera.org/course/predmachlearn
– part of Data Science Specialization
• Udacity
> Machine Learning (Supervised, Reinforcement, Unsupervised)
https://www.udacity.com/course/ud675
https://www.udacity.com/course/ud820
https://www.udacity.com/course/ud741
- 17. MACHINE LEARNING VIDEOS
• Udemy
> Hilary Mason: An Intro to Machine Learning with Web Data
https://www.udemy.com/hilary-mason-an-intro-to-machine-learning-with-web-data
> Hilary Mason: Advanced Machine Learning
https://www.udemy.com/hilary-mason-advanced-machine-learning/
• Mathematical monk
> https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
• Videolectures.net
> http://blog.videolectures.net/100-most-popular-machine-learning-talks-at-videolectures-net/
- 18. DATA MINING COURSES
The process of discovering patterns in large data sets using machine learning or statistics.
• Coursera
> Mining Massive Datasets (Stanford)
https://www.coursera.org/course/mmds
• Udemy
> Matthew Russell on Mining the Social Web
https://www.udemy.com/matthew-russell-on-mining-the-social-web/
> Data Mining
https://www.udemy.com/data-mining
• Web page
> http://www.rdatamining.com
- 19. DATA MINING COURSES & TOOLS
• Courses:
> Data Mining with Weka:
https://weka.waikato.ac.nz/dataminingwithweka/preview
> More Data Mining with Weka:
https://weka.waikato.ac.nz/moredataminingwithweka
• Weka
> SW: http://www.cs.waikato.ac.nz/ml/weka
• Knime
> SW: https://www.knime.org/downloads/overview
• RapidMiner
> Official site: http://rapidminer.com
> SW: http://sourceforge.net/projects/rapidminer
- 20. TEXT MINING 5A
• R Data Mining (Word Cloud)
> http://www.rdatamining.com/examples/text-mining
[Figure: word cloud titled "Top recurring themes about Big Data"]
• Videolectures.net
> http://videolectures.net/Top/Computer_Science/Text_Mining
• Tool (Word Cloud)
> Wordle.net
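A word cloud like the ones these tools produce is driven by term frequencies: the more often a word occurs, the larger it is drawn. The following is only a minimal sketch of that counting step in Python, not the method used by the R example or by Wordle. The sample sentence, the tiny stop-word list, and the function name `term_frequencies` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for"}


def term_frequencies(text):
    """Count how often each non-stop-word occurs in the text."""
    # Lowercase the text and keep runs of letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


# Hypothetical sample text.
sample = "Big data is big. Data science is the science of data."
freqs = term_frequencies(sample)

# The most frequent terms would be drawn largest in a word cloud.
for word, count in freqs.most_common(3):
    print(word, count)
```

A real text-mining pipeline would usually add stemming and a fuller stop-word list before counting.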
- 21. NATURAL LANGUAGE PROCESSING COURSES
A subfield of computer science, artificial intelligence and linguistics.
• Coursera
> Natural Language Processing (Columbia University):
https://www.coursera.org/course/nlangp
> Natural Language Processing (Stanford):
https://www.coursera.org/course/nlp
• Deeper Learning MOOC
> http://dlmooc.deeper-learning.org/
• Wikipedia
> http://en.wikipedia.org/wiki/Natural_language_processing
- 22. VISUALIZATION TOOLS 6
• Tableau
> http://www.tableausoftware.com
> Commercial visualization software
• D3.js
> http://d3js.org
> Data-Driven Documents, a JavaScript visualization library
• GraphViz
> http://www.graphviz.org
> Graph visualization tools
• Gephi
> https://gephi.github.io
> Visualization platform
- 23. TABLEAU 6
• Trainings
> http://www.tableausoftware.com/learn/training
> On demand
> Live online sessions scheduled for specific topics
• Download
> Tableau Public: http://www.tableausoftware.com/public
> Tableau Trial: http://www.tableausoftware.com/products/trial
• Certification
> Desktop (Qualified Associate, Certified Professional)
> Server (Qualified Associate, Certified Professional)
> http://www.tableausoftware.com/support/certification
- 24. HOW BIG IS BIG ENOUGH?
- 25. BIG DATA STUDY 7
• MOOC
> http://bigdatauniversity.com
> http://bigdatacourse.appspot.com
• Coursera
> Web Intelligence and Big Data
https://www.coursera.org/course/bigdata
• Udemy
$
> Big Data and Hadoop Essentials
https://www.udemy.com/big-data-and-hadoop-essentials-free-tutorial
• Open2Study
> Big Data for Better Performance
http://www.open2study.com/courses/big-data-for-better-performance
- 26. BIG DATA TOOLS
• Hadoop – Big Data framework
• Hive – Data warehouse infrastructure built on top of Hadoop
• HBase – Non-relational, distributed DB
• Pig – Hadoop programming tool
• Storm – Real-time computation system for Hadoop
• Solr – Search platform
• Falcon – Data management and processing for Hadoop
• Sqoop – Command-line application for transferring data into Hadoop
• Flume – Large-scale log aggregation framework
• Oozie – Workflow scheduler for Hadoop
• Ambari – Simpler management for Hadoop clusters
• Mahout – Machine learning algorithms implemented on Hadoop
• ZooKeeper – Coordination service for distributed applications
• Knox – REST API gateway for interacting with Hadoop clusters
- 27. HADOOP STUDY
• Hadoop providers
> http://www.cloudera.com
> http://hortonworks.com
> http://www.mapr.com
> http://www.teradata.com/aster
• Udacity
> Intro to Hadoop and MapReduce
https://www.udacity.com/course/ud617
• Udemy
> Become a Certified Hadoop Developer | Training | Tutorial
https://www.udemy.com/hadoop-tutorial
There are more Hadoop providers: IBM, Pivotal, etc.
- 28. NOT ONLY SQL DATABASES
• MongoDB – JSON document store
> http://www.mongodb.com
> https://university.mongodb.com
• CouchDB – JSON document store
> http://couchdb.apache.org
• Cassandra – High-performance column-oriented DB
> http://cassandra.apache.org
• VoltDB – In-memory database
> http://voltdb.com
• Redis – In-memory key-value store
> http://redis.io
• NuoDB – Distributed SQL DB
> http://www.nuodb.com
- 29. BIG DATA UNIVERSITY 7
• Big Data Courses path:
> Big Data Fundamentals
> Hadoop Fundamentals
> Moving Data into Hadoop (Sqoop and Flume tools)
> Query languages for Hadoop (Hive, Pig and Jaql)
> SQL Access for Hadoop
> Using HBase for Real-time Access to your Big Data
> Accessing Hadoop Data Using Hive
> Introduction to Pig
> Controlling Hadoop Jobs using Oozie
> Hadoop Reporting and Analysis
> Introduction to MapReduce Programming
• Courses are provided by IBM
- 30. IT IS EVEN BETTER, DON’T YOU THINK?
- 31. CLOUDERA HADOOP 7
• Tutorials
> 8 different paths
> On demand and free
> Taught together with Udacity (paid on a monthly basis)
> http://cloudera.com/content/cloudera/en/training/courses.html
> http://cloudera.com/content/cloudera/en/training/library.html
• Sandbox
> http://cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html
• Certification
> 200 USD per exam
> http://cloudera.com/content/cloudera/en/training/certification.html
- 32. HORTONWORKS HADOOP 7
• Tutorials
> http://hortonworks.com/tutorials
> 3 paths for
– Developers
– Administrators
– Data Scientists
• Sandbox
> http://hortonworks.com/hdp/downloads
• Certifications
> 200 USD per exam
> http://hortonworks.com/training/certification
- 33. MAPR HADOOP 7
• Tutorials
> https://www.mapr.com/services/mapr-academy/training-videos
> 3 paths for
– Developers
– Administrators
– Business users
• Sandbox
> https://www.mapr.com/products/mapr-sandbox-hadoop
• Certification
> For administrators only
> You must pass the Hadoop Cluster Administration on MapR course
> https://www.mapr.com/services/mapr-academy/certification
- 34. STREAMING – NO BIG DEAL
- 35. STREAMING DATA PROCESSING
• Storm (https://storm.incubator.apache.org)
> Open source (ASF) real-time Hadoop
> Twitter project
• Spark (https://spark.apache.org)
> Open source (ASF) in-memory Hadoop
> Apache project
• S4 (http://incubator.apache.org/s4)
> Open source (ASF) processing of stream data
> Yahoo project
• Samza (http://samza.incubator.apache.org)
> Open source processing of messaging data
> LinkedIn project
- 36. DATA INGESTION 8
Process of obtaining, importing and processing data for later use or storage.
• Techniques
> Data import and export
> Data fusion – integration of multiple data sources
> Data sampling – selection of a data subset (rows)
> Data discovery – detection of patterns in data
> Exploratory data analysis – summary of the main data characteristics
> Feature extraction – selection of a data subset (columns)
> Data scrubbing – data error correction
> Missing data values – data correction
> Etc.
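Three of the techniques listed above can be sketched in a few lines of plain Python: data sampling (a subset of rows), selecting a subset of columns (which the slide lists under feature extraction), and correcting missing values. This is a rough illustration under stated assumptions, not a prescribed implementation. The records, the column names, the helper function names, and the choice of mean imputation are all hypothetical.

```python
import random
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "city": "Prague", "income": 52000},
    {"id": 2, "age": None, "city": "Brno", "income": 48000},
    {"id": 3, "age": 29, "city": "Prague", "income": None},
    {"id": 4, "age": 41, "city": "Ostrava", "income": 61000},
]


def sample_rows(data, k, seed=0):
    """Data sampling: pick k rows at random (seeded for repeatability)."""
    return random.Random(seed).sample(data, k)


def select_columns(data, columns):
    """Keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in data]


def fill_missing_with_mean(data, column):
    """Missing data values: replace None with the mean of the known values."""
    known = [row[column] for row in data if row[column] is not None]
    fill = mean(known)
    # Build new dicts so the input rows are left unchanged.
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in data
    ]


subset = sample_rows(rows, 2)
narrow = select_columns(rows, ["age", "income"])
cleaned = fill_missing_with_mean(narrow, "age")
print(cleaned)
```

Mean imputation is only one possible correction; dropping incomplete rows or using the median are common alternatives.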
- 37. DATA WRANGLING / DATA MUNGING 9
Converting or mapping data from one "raw" form into another format.
• Coursera
> Getting and Cleaning Data
https://www.coursera.org/course/getdata
– part of Data Science Specialization
• Udacity
$
> Data Wrangling with MongoDB
https://www.udacity.com/course/ud032
• School of Data
> Many different courses http://schoolofdata.org
• Tools
> OpenRefine, DataWrangler – clean up and transform tools
> Talend, Pentaho – integration
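As a small illustration of converting data from a "raw" form into another format, the sketch below parses messy CSV text into clean, typed records and prints them as JSON. It uses only the Python standard library and is not taken from any of the courses or tools listed. The raw input, the column names, and the function name `wrangle` are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical raw input: inconsistent spacing and casing, numbers as text.
raw = "Name, Age\n alice ,30\nBOB, 25\n"


def wrangle(raw_csv):
    """Parse raw CSV text into a list of clean, typed records."""
    reader = csv.reader(io.StringIO(raw_csv))
    # Normalize the header: trim whitespace and lowercase the names.
    header = [h.strip().lower() for h in next(reader)]
    records = []
    for fields in reader:
        values = [f.strip() for f in fields]
        record = dict(zip(header, values))
        record["name"] = record["name"].title()  # consistent capitalization
        record["age"] = int(record["age"])       # text to integer
        records.append(record)
    return records


records = wrangle(raw)
print(json.dumps(records))
```

Tools such as OpenRefine apply the same kind of trimming, case normalization and type conversion interactively.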
- 38. TOOLBOX 10
• Hadoop and real time
> Scribe – log aggregation server (open-sourced by Facebook)
• Machine Learning
> H2O – In memory machine learning
• Data Mining
> Rattle – GUI for DM using R
• Python and NLP
> NLTK = Natural Language ToolKit for Python
• R and Hadoop
> RHIPE = R + Hadoop Integrated Programming Environment
• Visualization
> Many Eyes – Online visualization system from IBM
- 39. ONLINE SOURCES
• Data Science Servers:
> http://www.datasciencecentral.com
> http://www.hadoop360.com
> http://www.datascienceweekly.org
• Aggregators
> https://trello.com/b/rbpEfMld/data-science
• Blogs
> http://datasciencemasters.org
> http://www.kdnuggets.com
> http://www.zipfianacademy.com/blog/post/46864003608/a-practical-intro-to-data-science
> http://datascience101.wordpress.com
> http://fivethirtyeight.blogs.nytimes.com
- 40. FREE BOOKS
• Data Science
> Doing Data Science
> Agile Data Science
> Data Science for Business
• Statistics
> Think Stats
• Programming
> R language
– 25 Recipes for Getting Started with R
– Learning R
> Python
– Learning Python, 5th Edition
– Think Python
- 41. FREE BOOKS CONTINUED
• Machine Learning / Data Mining
> Machine Learning for Hackers
> Mining the Social Web
• Visualization
> Visualizing Data
> Getting Started with D3
> Communicating Data with Tableau
• Text mining / Natural Language Processing
> 21 Recipes for Mining Twitter
> Natural Language Processing with Python
> Natural Language Annotation for Machine Learning
- 42. FREE BOOKS CONTINUED
• Big Data
> Hadoop: The Definitive Guide, 3rd Edition
> Ethics of Big Data
> Big Data Analytics with R and Hadoop
• Data Ingestion
> Data Analysis with Open Source Tools
> Python for Data Analysis
• Data Wrangling and Munging
> Using OpenRefine
• Toolbox
$
> Getting Started with Storm
> Fast Data Processing with Spark
- 43. QUESTIONS AND ANSWERS
By Tara Laskowski
Contact me at datasciencepadawan@gmail.com
Follow me on Twitter: @dspadawan
Read my blog http://datasciencepadawan.blogspot.com
Editor's notes
- OK, done
- OK, done, code: REFcb75
- K-Mart predictive analysis