SlideShare una empresa de Scribd logo
1 de 30
Descargar para leer sin conexión
Los Angeles R Users’ Group




              NumPy and SciPy for Data Mining and Data Analysis
                  Including iPython, SciKits, and matplotlib

                                            Ryan R. Rosario




                                            March 29, 2011




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                           Los Angeles R Users’ Group
What is NumPy?




      NumPy is the standard package for scientific and numerical computing in
      Python:

              powerful N-dimensional array object.
              sophisticated (broadcasting) functions.
              tools for integrating C/C++ and Fortran code (R is similar)
              useful linear algebra, Fourier transform and random number capabilities.

      Additionally...
              data structures can be used as efficient multi-dimensional container of
              arbitrary data types.

Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                                 Los Angeles R Users’ Group
So, What is SciPy?




      Often (incorrectly) used as a synonym of NumPy.
              SciPy depends on NumPy’s data structures.
              OSS for mathematics, science, and engineering.
              provides efficient numerical routines for numerical integration and
              optimization.

      “NumPy provides the data structures, SciPy provides the application layer.”


Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                                 Los Angeles R Users’ Group
Getting NumPy
      The ease of installation is highly dependent on OS and CPU.
          Official releases are recommended and available for
                  1   32 bit MacOS X
                            Caveat: Must use Python Python, not Apple Python.
                  2   32 bit Windows
              Unofficial/third-party binary releases for
                  1   All Linux (use platform package system)
                  2   64 bit MacOS X (SciPy Superpack)
                            Requires Apple Python 2.6.1
                  3   64-bit Windows (Christoph Gohlke, UC Irvine)
              Source Code
              Enthought is to SciPy/Numpy as Revolution is to R.
      Main download link is http://www.scipy.org/Download
Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                               Los Angeles R Users’ Group
Disclaimer



      Time is even more limited than usual, so the goal is to illustrate
      what NumPy is, and what NumPy, SciPy and its colleagues can
      achieve. Some goals:
              Brief NumPy tutorial via demo.
              Uses of NumPy and SciPy, including SciKits.
              Brief data analysis demo with NumPy, SciPy and matplotlib.
              Concluding remarks.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                      Los Angeles R Users’ Group
NumPy Basics

      The main data structure is a homogeneous multidimensional array
      – a table of elements all of the same type, indexed by a tuple of
      positive integers. Some examples are vectors, matrices, images and
      spreadsheets.


      Why not just a use a list of lists? NumPy provides functions
      that operate on an entire array at once (like R), Python does
      not...really.


      NumPy is full of syntax and time is limited, so let’s jump straight
      into a demo...

Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                      Los Angeles R Users’ Group
NumPy Basics




      Demo




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis   Los Angeles R Users’ Group
Keep in Mind: Broadcasting vs. Recycling



      In NumPy, if two arrays do not have the same length and we try
      to, say, add them, 1 is appended to the smaller array so that the
      dimensions match.


      In R, if two vectors do not have the same length, the values in the
      shorter vector are recycled and a warning is generated.
      Personally, recycling seems more practical.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                     Los Angeles R Users’ Group
Keep in Mind: Indexing!



      R starts its indices at 1 and ranges are inclusive (i.e. seq(1,3) ->
      1, 2, 3)


      NumPy (and Python) starts its indices at 0 and ranges are
      exclusive (i.e. arange(1,3) -> 1, 2).




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                     Los Angeles R Users’ Group
Keep in Mind: Data Type Matters!




      R does a good job of using the correct data type.


      NumPy requires you to make decisions about data type. Using
      int64 for 1 is typically wasteful.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                  Los Angeles R Users’ Group
ipython: a replacement CLI I




      IPython is the recommended CLI for NumPy/SciPy. I will use
      IPython in my demo.
          1   an enhanced interactive Python shell.
          2   an architecture for interactive parallel computing.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                            Los Angeles R Users’ Group
ipython: a replacement CLI II


      Some advantages over the standard CLI:
          1   tab completion for object attributes and filenames, auto
              parentheses and quotes for function calls.
          2   simple object querying
                     object name? prints details about an object.
                     %pdoc (docstring), %pdef (function definition), %psource (full
                     source code).
          3   %run to run a code file, much like source in R. Differs from
              import because the code file is reread each time %run is
              called. Also provides special flags for profiling and debugging.



Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                             Los Angeles R Users’ Group
ipython: a replacement CLI III
          4   %pdb for spawning a pdb debug session.
          5   Output cache saves the results of executing each line of
              code. , ,      save the last three results.
          6   Input cache allows re-execution of previously executed code
              using a snapshot of the original input data in variable In. (to
              rexecute lines 22 through 28, and line 34: exec In[22:29]
              + In[34].
          7   Command history with(out) line numbers using %hist. Can
              also toggle journaling with %logstart.
          8   Execute system commands from within IPython as if you are
              in the terminal by prefixing with !. Use !!, %sc, %sx to get
              output into a Python variable (var = !ls -la). Use Python
              variables when calling the shell.
Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                         Los Angeles R Users’ Group
ipython: a replacement CLI IV

          9   Easy access to the profiler, %prun.
         10   Easy to integrate with matplotlib.
         11   Custom profiles (options, modules to load, function
              definitions). Similar to R workspace.
         12   Run other programs (that use pipes) and extract output
              (gnuplot).
         13   Easily run doctests.
         14   Lightweight version control.
         15   Save/restore session, like in R (the authors specifically said
              “like R”).


Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                          Los Angeles R Users’ Group
matplotlib: SciPy’s Graphics System I




      2D graphics library for Python that produces publication quality
      figures. Quality with respect to R is in the eye of the beholder.
              Graphics in R are much easier to construct, and require less
              code.
              Graphics in matplotlib may look nicer; require much more
              code.


Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                        Los Angeles R Users’ Group
matplotlib: SciPy’s Graphics System II




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis   Los Angeles R Users’ Group
matplotlib: SciPy’s Graphics System III




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis   Los Angeles R Users’ Group
matplotlib: SciPy’s Graphics System IV




      Show matplotlib gallery.


      I will do a demo of matplotlib in the data analysis demo, if time.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                    Los Angeles R Users’ Group
Quick Linear Regression Demo




      Demo here.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis   Los Angeles R Users’ Group
SciKits I


      SciKits are modules or “plug-ins” for SciPy; they are too
      specialized to be included in SciPy itself. Below is a list of some
      relevant SciKits.
          1   statsmodels: OLS, GLS, WLS, GLM, M-estimators for
              robust regression.
          2   timeseries: time series analysis.
          3   bootstrap: bootstrap error-estimation.
          4   learn: machine learning and data mining




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                       Los Angeles R Users’ Group
SciKits II



          5   sparse: sparse matrices
          6   cuda: Python interface to GPU powered libraries.
          7   image: image processing
          8   optimization: numerical optimization
      For more information, see
      http://scikits.appspot.com/scikits.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                         Los Angeles R Users’ Group
Scientific Software Relying on SciPy I

      Many exciting software packages rely on SciPy

                              Python Based Reinforcement Learning, Artificial Intelligence and
                              Neural Network Library. Built around a neural networks kernel.
                              Provides unsupervised learning and evolution.
                              http://www.pybrain.org


                              PyMC Pythonic Markov Chain Monte Carlo. MCMC, Gibbs
                              sampling, diagnostics




                              PyML: SVM, nearest neighbor, ridge regression, multiclass methods,
                              model selection, filtering, CV, ROC.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                                       Los Angeles R Users’ Group
Scientific Software Relying on SciPy II


                              Monte: gradient based learning machines, logistic regression,
                              conditional random fields.




                              MILK: SVM, k-NN, random forests, decision trees, feature selection,
                              affinity propagation.




                              Modular toolkit for Data Processing: signal processing, PCA, ICA,
                              manifold learning, factor analysis, regression, neural networks,
                              mixture models.



Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                                         Los Angeles R Users’ Group
Scientific Software Relying on SciPy III


                              PyEvolve: Genetic algorithms




                              Theano: Theano is a Python library that allows you to define,
                              optimize, and evaluate mathematical expressions involving
                              multi-dimensional arrays efficiently: GPUs, symbolic expressions,
                              dynamic C code generation.


                              Orange: Data mining, visualization and machine learning. Full GUI
                              and Python scripting.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                                        Los Angeles R Users’ Group
Scientific Software Relying on SciPy IV




                              SAGE: a hodgepodge of open-source software designed to be one
                              analysis platform.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                                     Los Angeles R Users’ Group
Pros/Cons I

      What NumPy/SciPy does better:
          1   Much larger community (SciPy + Python).
          2   Better, and cleaner graphics...slightly.
          3   Sits on atop a programming language rather than being one.
          4   Dictionary data type is very useful (Python).
          5   Adheres to a system of thought (“Pythonic”) that is more flexible.
          6   Memory mapped files and extensive sparse matrix packages.
          7   Easier to incorporate into a company/development workflow.
          8   Opinion: Developers don’t have to dive into C as soon as they do with
              R.



Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                                Los Angeles R Users’ Group
Pros/Cons II


      What R does better:
          1   One word data.frame.
          2   Installation is simple and universal.
          3   Application is self-contained (NumPy consists of several moving parts).
          4   Very simple, user friendly syntax.
          5   Graphics and analysis are fairly quick to code.
          6   Much easier data input.
          7   Handles missing values better.




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                                 Los Angeles R Users’ Group
The SciPy community is very useful, prolific and friendly. Project
      has very high morale. Some places to get help:
          1   http://www.numpy.org: extensive documentation, examples
              and cookbook.
          2   http://www.scipy.org: extensive documentation, examples
              and cookbook.
          3   NumPy and SciPy mailing lists are helpful.
          4   StackOverflow
          5   IRC channel #scipy
          6   Frequent sprints and hackathons.
          7   SciPyCon (Austin, TX)
          8   SoCalPiggies and LA Python (first meeting April 1!)


Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                      Los Angeles R Users’ Group
Keep in Touch!




      My email: ryan@stat.ucla.edu

      My blog: http://www.bytemining.com

      Follow me on Twitter: @datajunkie




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis   Los Angeles R Users’ Group
References



          1   NumPy http://www.numpy.org
          2   SciPy http://www.scipy.org
          3   SciKits http://scikits.appspot.com
          4   iPython http://ipython.scipy.org/moin/
          5   matplotlib http://matplotlib.sourceforge.net/




Ryan R. Rosario
NumPy/SciPy for Data Mining and Analysis                Los Angeles R Users’ Group

Más contenido relacionado

La actualidad más candente

Python interview questions
Python interview questionsPython interview questions
Python interview questionsPragati Singh
 
Python libraries for data science
Python libraries for data sciencePython libraries for data science
Python libraries for data sciencenilashri2
 
Python interview questions and answers
Python interview questions and answersPython interview questions and answers
Python interview questions and answersRojaPriya
 
Most Asked Python Interview Questions
Most Asked Python Interview QuestionsMost Asked Python Interview Questions
Most Asked Python Interview QuestionsShubham Shrimant
 
Introduction to r bddsil meetup
Introduction to r bddsil meetupIntroduction to r bddsil meetup
Introduction to r bddsil meetupNimrod Priell
 
Python interview questions
Python interview questionsPython interview questions
Python interview questionsPragati Singh
 
Getting started with Linux and Python by Caffe
Getting started with Linux and Python by CaffeGetting started with Linux and Python by Caffe
Getting started with Linux and Python by CaffeLihang Li
 
Python and Machine Learning
Python and Machine LearningPython and Machine Learning
Python and Machine Learningtrygub
 
Python interview question for students
Python interview question for studentsPython interview question for students
Python interview question for studentsCorey McCreary
 
Python Interview Questions And Answers
Python Interview Questions And AnswersPython Interview Questions And Answers
Python Interview Questions And AnswersH2Kinfosys
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programminghemasri56
 
summer training report on python
summer training report on pythonsummer training report on python
summer training report on pythonShubham Yadav
 
Python interview questions for experience
Python interview questions for experiencePython interview questions for experience
Python interview questions for experienceMYTHILIKRISHNAN4
 
Python Summer Internship
Python Summer InternshipPython Summer Internship
Python Summer InternshipAtul Kumar
 
R language tutorial
R language tutorialR language tutorial
R language tutorialDavid Chiu
 
Sequence Learning with CTC technique
Sequence Learning with CTC techniqueSequence Learning with CTC technique
Sequence Learning with CTC techniqueChun Hao Wang
 

La actualidad más candente (20)

Python interview questions
Python interview questionsPython interview questions
Python interview questions
 
Math synonyms
Math synonymsMath synonyms
Math synonyms
 
Python libraries for data science
Python libraries for data sciencePython libraries for data science
Python libraries for data science
 
Python interview questions and answers
Python interview questions and answersPython interview questions and answers
Python interview questions and answers
 
Most Asked Python Interview Questions
Most Asked Python Interview QuestionsMost Asked Python Interview Questions
Most Asked Python Interview Questions
 
Introduction to r bddsil meetup
Introduction to r bddsil meetupIntroduction to r bddsil meetup
Introduction to r bddsil meetup
 
Python interview questions
Python interview questionsPython interview questions
Python interview questions
 
Getting started with Linux and Python by Caffe
Getting started with Linux and Python by CaffeGetting started with Linux and Python by Caffe
Getting started with Linux and Python by Caffe
 
Python and Machine Learning
Python and Machine LearningPython and Machine Learning
Python and Machine Learning
 
R crash course
R crash courseR crash course
R crash course
 
Python interview question for students
Python interview question for studentsPython interview question for students
Python interview question for students
 
Python Interview Questions And Answers
Python Interview Questions And AnswersPython Interview Questions And Answers
Python Interview Questions And Answers
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programming
 
Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga
 
summer training report on python
summer training report on pythonsummer training report on python
summer training report on python
 
Basic lisp
Basic lispBasic lisp
Basic lisp
 
Python interview questions for experience
Python interview questions for experiencePython interview questions for experience
Python interview questions for experience
 
Python Summer Internship
Python Summer InternshipPython Summer Internship
Python Summer Internship
 
R language tutorial
R language tutorialR language tutorial
R language tutorial
 
Sequence Learning with CTC technique
Sequence Learning with CTC techniqueSequence Learning with CTC technique
Sequence Learning with CTC technique
 

Destacado

Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining AreaMahamudHasanCSE
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataPier Luca Lanzi
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an IntroductionAli Abbasi
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data MiningAmritanshu Mehra
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data miningSnehali Chake
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAmdocs
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsMotaz Saad
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDataminingTools Inc
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data MiningSushil Kulkarni
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 

Destacado (19)

Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining Area
 
Lecture 01 Data Mining
Lecture 01 Data MiningLecture 01 Data Mining
Lecture 01 Data Mining
 
Data mining and_big_data_web
Data mining and_big_data_webData mining and_big_data_web
Data mining and_big_data_web
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web Data
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data mining
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence Tools
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data Mining
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data mining
Data miningData mining
Data mining
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 

Similar a NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits, and matplotlib

Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysisPramod Toraskar
 
Turbocharge your data science with python and r
Turbocharge your data science with python and rTurbocharge your data science with python and r
Turbocharge your data science with python and rKelli-Jean Chun
 
Intellectual technologies
Intellectual technologiesIntellectual technologies
Intellectual technologiesPolad Saruxanov
 
The Joy of SciPy
The Joy of SciPyThe Joy of SciPy
The Joy of SciPykammeyer
 
Introduction_to_Python.pptx
Introduction_to_Python.pptxIntroduction_to_Python.pptx
Introduction_to_Python.pptxVinay Chowdary
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationTravis Oliphant
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document usefulssuser3c3f88
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchainJie-Han Chen
 
Python standard library & list of important libraries
Python standard library & list of important librariesPython standard library & list of important libraries
Python standard library & list of important librariesgrinu
 
Introduction to python along with the comparitive analysis with r
Introduction to python   along with the comparitive analysis with r Introduction to python   along with the comparitive analysis with r
Introduction to python along with the comparitive analysis with r Ashwini Mathur
 
Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?Rodrigo Urubatan
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 
APPLICATION OF PYTHON IN GEOSCIENCE
APPLICATION OF  PYTHON IN GEOSCIENCEAPPLICATION OF  PYTHON IN GEOSCIENCE
APPLICATION OF PYTHON IN GEOSCIENCEAhasanHabibSajeeb
 
Five python libraries should know for machine learning
Five python libraries should know for machine learningFive python libraries should know for machine learning
Five python libraries should know for machine learningNaveen Davis
 
Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"Fwdays
 

Similar a NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits, and matplotlib (20)

Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
 
Turbocharge your data science with python and r
Turbocharge your data science with python and rTurbocharge your data science with python and r
Turbocharge your data science with python and r
 
Scientific Python
Scientific PythonScientific Python
Scientific Python
 
Numba
NumbaNumba
Numba
 
Intellectual technologies
Intellectual technologiesIntellectual technologies
Intellectual technologies
 
The Joy of SciPy
The Joy of SciPyThe Joy of SciPy
The Joy of SciPy
 
Introduction_to_Python.pptx
Introduction_to_Python.pptxIntroduction_to_Python.pptx
Introduction_to_Python.pptx
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft Presentation
 
DS LAB MANUAL.pdf
DS LAB MANUAL.pdfDS LAB MANUAL.pdf
DS LAB MANUAL.pdf
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 
Data science-toolchain
Data science-toolchainData science-toolchain
Data science-toolchain
 
Python standard library & list of important libraries
Python standard library & list of important librariesPython standard library & list of important libraries
Python standard library & list of important libraries
 
Introduction to python along with the comparitive analysis with r
Introduction to python   along with the comparitive analysis with r Introduction to python   along with the comparitive analysis with r
Introduction to python along with the comparitive analysis with r
 
Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?Data science in ruby, is it possible? is it fast? should we use it?
Data science in ruby, is it possible? is it fast? should we use it?
 
Session 2
Session 2Session 2
Session 2
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 
Pythonfinalppt 170822121204
Pythonfinalppt 170822121204Pythonfinalppt 170822121204
Pythonfinalppt 170822121204
 
APPLICATION OF PYTHON IN GEOSCIENCE
APPLICATION OF  PYTHON IN GEOSCIENCEAPPLICATION OF  PYTHON IN GEOSCIENCE
APPLICATION OF PYTHON IN GEOSCIENCE
 
Five python libraries should know for machine learning
Five python libraries should know for machine learningFive python libraries should know for machine learning
Five python libraries should know for machine learning
 
Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"
 

Último

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Último (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits, and matplotlib

  • 1. Los Angeles R Users’ Group NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits, and matplotlib Ryan R. Rosario March 29, 2011 Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 2. What is NumPy? NumPy is the standard package for scientific and numerical computing in Python: powerful N-dimensional array object. sophisticated (broadcasting) functions. tools for integrating C/C++ and Fortran code (R is similar) useful linear algebra, Fourier transform and random number capabilities. Additionally... data structures can be used as efficient multi-dimensional container of arbitrary data types. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 3. So, What is SciPy? Often (incorrectly) used as a synonym of NumPy. SciPy depends on NumPy’s data structures. OSS for mathematics, science, and engineering. provides efficient numerical routines for numerical integration and optimization. “NumPy provides the data structures, SciPy provides the application layer.” Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 4. Getting NumPy The ease of installation is highly dependent on OS and CPU. Official releases are recommended and available for 1 32 bit MacOS X Caveat: Must use Python Python, not Apple Python. 2 32 bit Windows Unofficial/third-party binary releases for 1 All Linux (use platform package system) 2 64 bit MacOS X (SciPy Superpack) Requires Apple Python 2.6.1 3 64-bit Windows (Christoph Gohlke, UC Irvine) Source Code Enthought is to SciPy/Numpy as Revolution is to R. Main download link is http://www.scipy.org/Download Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 5. Disclaimer Time is even more limited than usual, so the goal is to illustrate what NumPy is, and what NumPy, SciPy and its colleagues can achieve. Some goals: Brief NumPy tutorial via demo. Uses of NumPy and SciPy, including SciKits. Brief data analysis demo with NumPy, SciPy and matplotlib. Concluding remarks. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 6. NumPy Basics The main data structure is a homogeneous multidimensional array – a table of elements all of the same type, indexed by a tuple of positive integers. Some examples are vectors, matrices, images and spreadsheets. Why not just a use a list of lists? NumPy provides functions that operate on an entire array at once (like R), Python does not...really. NumPy is full of syntax and time is limited, so let’s jump straight into a demo... Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 7. NumPy Basics Demo Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 8. Keep in Mind: Broadcasting vs. Recycling In NumPy, if two arrays do not have the same length and we try to, say, add them, 1 is appended to the smaller array so that the dimensions match. In R, if two vectors do not have the same length, the values in the shorter vector are recycled and a warning is generated. Personally, recycling seems more practical. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 9. Keep in Mind: Indexing! R starts its indices at 1 and ranges are inclusive (i.e. seq(1,3) -> 1, 2, 3) NumPy (and Python) starts its indices at 0 and ranges are exclusive (i.e. arange(1,3) -> 1, 2). Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 10. Keep in Mind: Data Type Matters! R does a good job of using the correct data type. NumPy requires you to make decisions about data type. Using int64 for 1 is typically wasteful. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 11. ipython: a replacement CLI I IPython is the recommended CLI for NumPy/SciPy. I will use IPython in my demo. 1 an enhanced interactive Python shell. 2 an architecture for interactive parallel computing. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 12. ipython: a replacement CLI II Some advantages over the standard CLI: 1 tab completion for object attributes and filenames, auto parentheses and quotes for function calls. 2 simple object querying object name? prints details about an object. %pdoc (docstring), %pdef (function definition), %psource (full source code). 3 %run to run a code file, much like source in R. Differs from import because the code file is reread each time %run is called. Also provides special flags for profiling and debugging. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 13. ipython: a replacement CLI III 4 %pdb for spawning a pdb debug session. 5 Output cache saves the results of executing each line of code. , , save the last three results. 6 Input cache allows re-execution of previously executed code using a snapshot of the original input data in variable In. (to rexecute lines 22 through 28, and line 34: exec In[22:29] + In[34]. 7 Command history with(out) line numbers using %hist. Can also toggle journaling with %logstart. 8 Execute system commands from within IPython as if you are in the terminal by prefixing with !. Use !!, %sc, %sx to get output into a Python variable (var = !ls -la). Use Python variables when calling the shell. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 14. ipython: a replacement CLI IV 9 Easy access to the profiler, %prun. 10 Easy to integrate with matplotlib. 11 Custom profiles (options, modules to load, function definitions). Similar to R workspace. 12 Run other programs (that use pipes) and extract output (gnuplot). 13 Easily run doctests. 14 Lightweight version control. 15 Save/restore session, like in R (the authors specifically said “like R”). Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 15. matplotlib: SciPy’s Graphics System I 2D graphics library for Python that produces publication quality figures. Quality with respect to R is in the eye of the beholder. Graphics in R are much easier to construct, and require less code. Graphics in matplotlib may look nicer; require much more code. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 16. matplotlib: SciPy’s Graphics System II Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 17. matplotlib: SciPy’s Graphics System III Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 18. matplotlib: SciPy’s Graphics System IV Show matplotlib gallery. I will do a demo of matplotlib in the data analysis demo, if time. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 19. Quick Linear Regression Demo Demo here. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 20. SciKits I SciKits are modules or “plug-ins” for SciPy; they are too specialized to be included in SciPy itself. Below is a list of some relevant SciKits. 1 statsmodels: OLS, GLS, WLS, GLM, M-estimators for robust regression. 2 timeseries: time series analysis. 3 bootstrap: bootstrap error-estimation. 4 learn: machine learning and data mining Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 21. SciKits II 5 sparse: sparse matrices 6 cuda: Python interface to GPU powered libraries. 7 image: image processing 8 optimization: numerical optimization For more information, see http://scikits.appspot.com/scikits. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 22. Scientific Software Relying on SciPy I Many exciting software packages rely on SciPy Python Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. Built around a neural networks kernel. Provides unsupervised learning and evolution. http://www.pybrain.org PyMC Pythonic Markov Chain Monte Carlo. MCMC, Gibbs sampling, diagnostics PyML: SVM, nearest neighbor, ridge regression, multiclass methods, model selection, filtering, CV, ROC. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 23. Scientific Software Relying on SciPy II Monte: gradient based learning machines, logistic regression, conditional random fields. MILK: SVM, k-NN, random forests, decision trees, feature selection, affinity propagation. Modular toolkit for Data Processing: signal processing, PCA, ICA, manifold learning, factor analysis, regression, neural networks, mixture models. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 24. Scientific Software Relying on SciPy III PyEvolve: Genetic algorithms Theano: Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently: GPUs, symbolic expressions, dynamic C code generation. Orange: Data mining, visualization and machine learning. Full GUI and Python scripting. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 25. Scientific Software Relying on SciPy IV SAGE: a hodgepodge of open-source software designed to be one analysis platform. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 26. Pros/Cons I What NumPy/SciPy does better: 1 Much larger community (SciPy + Python). 2 Better, and cleaner graphics...slightly. 3 Sits on atop a programming language rather than being one. 4 Dictionary data type is very useful (Python). 5 Adheres to a system of thought (“Pythonic”) that is more flexible. 6 Memory mapped files and extensive sparse matrix packages. 7 Easier to incorporate into a company/development workflow. 8 Opinion: Developers don’t have to dive into C as soon as they do with R. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 27. Pros/Cons II What R does better: 1 One word data.frame. 2 Installation is simple and universal. 3 Application is self-contained (NumPy consists of several moving parts). 4 Very simple, user friendly syntax. 5 Graphics and analysis are fairly quick to code. 6 Much easier data input. 7 Handles missing values better. Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 28. The SciPy community is very useful, prolific and friendly. Project has very high morale. Some places to get help: 1 http://www.numpy.org: extensive documentation, examples and cookbook. 2 http://www.scipy.org: extensive documentation, examples and cookbook. 3 NumPy and SciPy mailing lists are helpful. 4 StackOverflow 5 IRC channel #scipy 6 Frequent sprints and hackathons. 7 SciPyCon (Austin, TX) 8 SoCalPiggies and LA Python (first meeting April 1!) Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 29. Keep in Touch! My email: ryan@stat.ucla.edu My blog: http://www.bytemining.com Follow me on Twitter: @datajunkie Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group
  • 30. References 1 NumPy http://www.numpy.org 2 SciPy http://www.scipy.org 3 SciKits http://scikits.appspot.com 4 iPython http://ipython.scipy.org/moin/ 5 matplotlib http://matplotlib.sourceforge.net/ Ryan R. Rosario NumPy/SciPy for Data Mining and Analysis Los Angeles R Users’ Group