Big data analysis requires fast prototyping of the data mining process to gain insight into the data. In these slides, the author shows how to use IPython Notebook to sketch code for each data mining stage and make quick observations.
Fast data mining flow prototyping using IPython Notebook
1. Fast data mining flow prototyping
using IPython Notebook
2013/01/31
Jimmy Lai
r97922028 [at] ntu.edu.tw
2. Outline
1. Workflow for data mining
2. What IPython Notebook provides
3. Exemplified by text classification
4. Demo code and Notebook usage
IPython Notebook 2
3. Workflow for data mining
• Traditional programming workflow:
– Edit -> Compile -> Run
• Data Mining workflow:
– Execute -> Explore
– Consists of many data processing stages, and each
stage may be tried with different methods.
– Stages: data parsing, feature extraction, feature
selection, model training, prediction, post-processing,
etc.
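The stages listed above can be sketched as small, independent functions, so that each one can be re-run or swapped out in its own notebook cell. This is only a minimal illustration: the toy documents, the function names, and the crude count-based "model" are all assumptions made for the example, not the code from the demo repository.

```python
from collections import Counter

# Toy labelled documents, made up for illustration only.
RAW = [
    ("sports", "the team won the game"),
    ("sports", "a great game for the home team"),
    ("tech", "the new chip runs code faster"),
    ("tech", "compile the code on the new machine"),
]

def parse(raw):
    """Data parsing: split records into labels and token lists."""
    labels = [label for label, _ in raw]
    docs = [text.lower().split() for _, text in raw]
    return labels, docs

def extract_features(docs):
    """Feature extraction: bag-of-words counts per document."""
    return [Counter(tokens) for tokens in docs]

def select_features(features, min_df=2):
    """Feature selection: keep words that occur in at least min_df documents."""
    df = Counter(word for f in features for word in f)
    keep = {w for w, n in df.items() if n >= min_df}
    return [Counter({w: c for w, c in f.items() if w in keep}) for f in features]

def train(labels, features):
    """Model training: sum the word counts of each class."""
    model = {}
    for label, f in zip(labels, features):
        model.setdefault(label, Counter()).update(f)
    return model

def predict(model, feature):
    """Prediction: pick the class whose summed counts overlap most."""
    def score(label):
        return sum(model[label][w] * c for w, c in feature.items())
    return max(sorted(model), key=score)

# Each call below could live in its own cell and be re-executed alone.
labels, docs = parse(RAW)
features = extract_features(docs)
selected = select_features(features, min_df=2)
model = train(labels, selected)
prediction = predict(model, Counter("the team played a game".split()))
print(prediction)
```

Because every stage only depends on the output of the previous one, trying a different method in one stage (say, another `min_df` value) means re-running that cell and the ones after it, not the whole program.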
4. What IPython Notebook provides
• Interactive Web IDE
– Display rich data, such as plots by matplotlib and math
symbols by LaTeX
– Code cells for sketching
– Execute pieces of code in arbitrary order
– Browser interface for programming remotely
– Easy to demonstrate code and execution results in HTML
or PDF.
• IPython Notebook makes it easy to sketch data
analysis.
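As one small illustration of rich display: IPython looks for a `_repr_html_` method on the value a cell returns and, in the Notebook, renders the HTML it produces. The class below is a hypothetical helper written for this sketch, and the counts in it are invented.

```python
class LabelCounts:
    """Hypothetical helper that holds per-class article counts."""

    def __init__(self, counts):
        self.counts = dict(counts)

    def _repr_html_(self):
        # The Notebook calls this method to obtain an HTML representation.
        rows = "".join(
            "<tr><td>{}</td><td>{}</td></tr>".format(label, n)
            for label, n in sorted(self.counts.items())
        )
        return "<table><tr><th>label</th><th>articles</th></tr>" + rows + "</table>"

# Invented numbers, for illustration only.
table = LabelCounts({"sci.space": 3, "rec.autos": 5})
html = table._repr_html_()
print(html)
```

In a notebook, ending a cell with just `table` should show the rendered table instead of the raw markup printed here.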
5. Demo code and Notebook usage
• Demo Code: ipython_demo directory in
https://bitbucket.org/noahsark/slideshare
• IPython Notebook:
– Install
$ pip install ipython
– Run (from the ipython_demo directory)
$ ipython notebook --pylab=inline
– Open the notebook in a browser, e.g.
http://127.0.0.1:8888
7. Exemplified by text classification
• Text classification on a newsgroup dataset.
• Dataset:
– Built into sklearn.datasets
– Each article belongs to one of the 20 groups
• Goal: classify each article into one of the
newsgroups.
• Experiment: feature generation using different
n-gram parameters.
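To make the n-gram experiment concrete, here is a pure-Python sketch of word n-gram feature generation. It is not the demo's code: the function name is hypothetical, and the `ngram_range` parameter name is borrowed from scikit-learn's `CountVectorizer` only as a naming assumption. A short made-up sentence stands in for a newsgroup article.

```python
from collections import Counter

def ngram_features(text, ngram_range=(1, 1)):
    """Count word n-grams for every n in the inclusive range (min_n, max_n)."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Stand-in text; a real run would use the newsgroup articles.
article = "the quick brown fox"
for ngram_range in [(1, 1), (1, 2), (2, 3)]:
    feats = ngram_features(article, ngram_range)
    print(ngram_range, len(feats), sorted(feats))
```

Changing the range changes the size of the feature space: unigrams alone give 4 features for this sentence, while unigrams plus bigrams give 7. That trade-off is what an experiment over different n-gram parameters explores.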