Artificial Intelligence for Automating Data Analysis

Artificial Intelligence
for Automating Data Analysis
Manuel Martín Salvador
Smart Technology Research Centre

27th November 2013

Outline
1. Data and KDD Process
2. Support for Analysts
3. Prior Knowledge
4. Types of IDAs
5. Future Directions
6. References
Presentation based on the paper by Serban et al. “A survey of intelligent assistants for data analysis” 2013
http://dx.doi.org/10.1145/2480741.2480748

Data

Many domains: biology, geography,
telecommunications, sales, process industry...
Structured and non-structured
Single source and multiple sources
Imperfect data: missing values, outliers...

KDD process
0. Goal?

Raw Data
1. Selection

Target Data

KDD process
0. Goal?

Raw Data
1. Selection

Target Data
2. Preprocessing

Preprocessed Data

KDD process
0. Goal?

Raw Data
1. Selection

Target Data
2. Preprocessing

Preprocessed Data
3. Transformation

Transformed Data

KDD process
0. Goal?

Raw Data
1. Selection

Target Data
2. Preprocessing

Preprocessed Data
3. Transformation

Transformed Data
4. Data Mining

Patterns

KDD process
0. Goal?

Raw Data
1. Selection

Target Data
2. Preprocessing

Preprocessed Data
3. Transformation

Transformed Data
4. Data Mining

Patterns
5. Interpretation /
Evaluation

Knowledge

KDD process
0. Goal?

Raw Data
1. Selection

Target Data

Refining

2. Preprocessing

Preprocessed Data
3. Transformation

Transformed Data
4. Data Mining

Patterns
5. Interpretation /
Evaluation

Knowledge

Starting a KDD process
Problems: Lack of guidance
Increasing number of techniques
Large volumes of data

Novice Analysts
Overwhelmed
Trial and error

Advanced Analysts
Comfort area
No further exploration

Supporting analysts
Single step of KDD process: Hints and advice for data selection;
support in choosing a suitable algorithm and parameters.
Multiple steps of KDD process: Help regarding the sequence of
operators and their parameters.
Graphical Design of KDD workflows: GUIs for interactively building
the process manually.
Automatic KDD workflow generation: Based on the data and
description of their task, the users receive a set of possible scenarios
for solving a problem.
Explanations: The rationale behind a decision or a result allows the
user to reason about the aid provided.

Prior knowledge

Meta-data of the input dataset:
Data properties such as number of attributes, amount of
missing values, or information-theoretic measures.
Meta-data of operators:
External (inputs, outputs, preconditions and effects) and
Internal (structure and performance).
Case base: Set of successful prior data analysis workflows.

Types of IDAs
Intelligent Discovery Assistant (IDA): System that supports user in the data analysis process.

1. Expert Systems: Apply rules defined by human experts to suggest useful techniques.

Q&A

User

Expert System

Ranking of
useful
techniques

Rules

Experts

Types of IDAs

REX [Gale 1986]: linear regression.
SPRINGEX [Raes 1992]: multivariate and non-parametric statistics.
Statistical Navigator [Raes 1992]: multivariate casual analysis and
classification.
KENS [Hand 1987], NONPAREIL [Hand 1990] and LMG [Hand 1990]: manual
exploration of rules.
Consultant-2 [Craw et al. 1992]: first IDA for machine learning algorithms.

Types of IDAs

Training

2. Meta-Learning Systems: Automatically learn such rules from prior data analysis runs.
Evaluations of
algorithms

Prediction

Meta-data of
datasets

Meta-database

Meta-learner

Model

New dataset
User preferences

Meta-Learning System

Advise/Ranking
of algorithms

Types of IDAs

StatLog [Michie et al. 1994]: A decision tree model is built for each algorithm
predicting whether or not it is applicable on a new dataset.
The Data Mining Advisor [Giraud-Carrier 2005]: A k-NN algorithm is trained to
predict algorithm performance on a new dataset.
NOEMON [Kalousis et al. 2001]: Pairwise models are built and stored in a
knowledge base. Scores based on wins/ties/losses are obtained for each
algorithm in order to create a ranking.

Types of IDAs

3. Case-Based Reasoning Systems: Find and adapt workflows that were successful in
similar cases.
Operators

Experts

Case base

Case-based
reasoner

Workflow editor

User

Workflow

Meta-data

Types of IDAs

similar cases.
CITRUS [Engels 1996]: A case base of operators and workflows was created by
experts. Most similar case is returned based on user needs and data statistics.
MiningMart [Morik et al. 2004]: A case base of workflows in a XML-based
language is available online. Cases are described in an ontology. It offers a
three-tier graphical editor: case, concept and relation editors.
The Hybrid Data Mining Assistant [Charest et al. 2008]: Combines CBR with the
experts rules of expert systems. Apart from meta-features, the case base
includes user satisfaction ratings which are used for case ranking.

Types of IDAs

similar cases.
4. Planning-Based Data Analysis Systems: Use AI planners to generate and rank valid
data analysis workflows.
Experts

Ontology

Dataset
User

Planner

Plans

Ranker

Ranking
of plans

Types of IDAs

similar cases.
AIDE [Amant et al. 1998]: Multi-level planning based on hierarchical task
network planning. A plan library contains subproblems and primitive operators.
IDEA [Bernstein et al. 2005]: Meta-data is encoded in an ontology. Valid plans
are ranked by user preferences.
NExT [Bernstein et al. 2007]: CBR-extension of IDEA approach. Firstly, it
retrieves the most suitable cases and then uses the planner for filling gaps.
1/2

Types of IDAs

similar cases.
KDDVM [Diamantini et al. 2009]: A directed graph of operators is iteratively built
using a custom algorithm. The operators are chosen from an ontology.
RDM [Zakova et al. 2010]: A two-planner system that uses an ontology formed
of knowledge (datasets, constraints...), algorithms and KDD tasks.
eLico-IDA [Kietz et al. 2009]: An ontology with operators and their effects is
queried for creating tasks that are sent to the HTN planner. A second ontology is
2/2
used to rank the resulting plans.

Types of IDAs

similar cases.
5. Workflow Composition Environments: Facilitate manual workflow creation and testing.

Dataset
Operators
User

Workflow editor

Workflow Composition Environment

Workflow

Types of IDAs

similar cases.
5. Workflow Composition Environments: Facilitate manual workflow creation and testing.
Canvas-Based Tools: IBM SPSS Modeler, SAS Enterprise Miner, Weka,
RapidMiner or Knime.
Scripting-Based Tools: MATLAB, R or Python.

Future directions

Cold start problem: A new dataset is not similar to any of the previous cases.
Adaptivity: Current IDAs are not able to adapt the workflows in the presence of
new data.
Predictive models: To predict the effects of the operators given the input data.
Reduce expert dependency: Self-maintenance of case bases.
Combination of approaches: CBR + expert rules, CBR + planning...
Scalability: To deal with large repositories of operators and case bases.

Beware of automatic things!

Click here to see

Thanks
You can get these slides in
http://slideshare.net/draxus
msalvador@bournemouth.ac.uk

References
AMANT, R. AND COHEN, P. 1998. Interaction with a mixed-initiative system for exploratory data analysis. Knowl. Based Syst. 10, 5, 265–273.
BERNSTEIN, A. AND DAENZER, M. 2007. The NExT system: Towards true dynamic adaptations of semantic web service compositions. In The Semantic Web:
Research and Applications, Lecture Notes in Computer Science, vol. 4519, Springer, 739–748.
BERNSTEIN, A., PROVOST, F., AND HILL, S. 2005. Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive
classification. IEEE Trans. Knowl. Data Eng. 17, 4, 503–518.
CHAREST, M.,DELISLE, S.,CERVANTES, O., AND SHEN, Y. 2008. Bridging the gap between data mining and decision support: A case-based reasoning and
ontology approach. Intell. Data Anal. 12, 1–26.
CRAW, S., SLEEMAN, D., GRANER, N., AND RISSAKIS, M. 1992. Consultant: Providing advice for the machine learning toolbox. In Proceedings of the Annual
Technical Conference on Expert Systems (ES). 5–23.
DIAMANTINI, C., POTENA, D., AND STORTI, E. 2009b. Ontology-driven KDD process composition. In Advances in Intelligent Data Analysis VIII, Lecture Notes in
Computer Science, vol. 5772, Springer, 285–296.
ENGELS, R. 1996. Planning tasks for knowledge discovery in databases: Performing task-oriented userguidance. In Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data mining (KDD). 170–175.
GALE,W. 1986. Rex review. In Artificial Intelligence and Statistics. Addison-Wesley Longman Publishing Co.,Inc., Boston, MA. 173–227.
GIRAUD-CARRIER, C. 2005. The data mining advisor: Meta-learning at the service of practitioners. In Proceedings of the International Conference on Machine
Learning and Applications (ICMLA). 113–119.
HAND, D. 1987. A statistical knowledge enhancement system. J. Royal Stat. Soc. Series A (General) 150, 4, 334–345.
HAND, D. 1990. Practical experience in developing statistical knowledge enhancement systems. Ann. Math. Artif. Intell. 2, 1, 197–208.
KALOUSIS, A. AND HILARIO, M. 2001. Model selection via meta-learning: A comparative study. Int. J. Artif. Intell. Tools 10, 4, 525–554.
KIETZ, J., SERBAN, F., BERNSTEIN, A., AND FISCHER, S. 2009. Towards cooperative planning of data mining workflows. In Proceedings of the ECML-PKDD
Workshop on Service-Oriented Knowledge Discovery. 1–12.
MICHIE, D., SPIEGELHALTER, D., AND TAYLOR, C. 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ.
MORIK, K. AND SCHOLZ, M. 2004. The MiningMart approach to knowledge discovery in databases. In Intelligent Technologies for Information Analysis, N.
Zhong, and J. Liu, Eds., Springer, 47–65.
RAES, J. 1992. Inside two commercially available statistical expert systems. Stat. Comput. 2, 2, 55–62.
ZAKOVA, M., KREMEN, P., ZELEZNY, F., AND LAVRAC, N. 2010. Automating knowledge discovery workflow composition through ontology-based planning. IEEE
Tran. Autom. Sci. Eng. 8, 2, 253–264

Acknowledgements
Satellite: http://commons.wikimedia.org/wiki/File:GPS_Satellite_NASA_art-iif.jpg
Industry: http://commons.wikimedia.org/wiki/File:Industry_Texas.jpg
DNA: http://commons.wikimedia.org/wiki/File:DNA_Double_Helix.png
Table: http://www.iconarchive.com/show/ravenna-3d-icons-by-double-j-design/Database-Table-icon.html
Car: http://en.wikipedia.org/wiki/File:Jurvetson_Google_driverless_car_trimmed.jpg
Twitter: http://www.flickr.com/photos/recampaign/5623528621/
Multiple sources: http://www.flickr.com/photos/inl/7895742584/
Thermometer: http://commons.wikimedia.org/wiki/File:Digital_thermometer.jpg
Traffic Control: http://commons.wikimedia.org/wiki/File:Air_Traffic_Control,_Abraham_Lincoln_CVN-72.jpg
Question Mark: http://commons.wikimedia.org/wiki/File:Question_mark_road_sign,_Australia.jpg
Noise: http://www.flickr.com/photos/benleto/3223155821/
Outliers: http://commons.wikimedia.org/wiki/File:Diagrama_de_caixa_com_outliers_and_whisker.png
Bowling: http://en.wikipedia.org/wiki/File:Lawn_Bowling_-_Tim_Mason1.jpg
Baby: http://www.flickr.com/photos/107489497@N06/10671592736/
Library: http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg
Back to the future car: http://lowrider-girl.deviantart.com/art/Back-To-The-Future-206312200
Coquette Icon Set: http://dryicons.com
Roboto font: http://developer.android.com/design/style/typography.html

Artificial Intelligence for Automating Data Analysis

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (6)

Similar a Artificial Intelligence for Automating Data Analysis

Similar a Artificial Intelligence for Automating Data Analysis (20)

Más de Manuel Martín

Más de Manuel Martín (20)

Último

Último (20)

Artificial Intelligence for Automating Data Analysis