2. To design a framework, how many variations we need to protect? How many functionalities we need to provide for supporting all these variations? QUESTIONS
13. Domain Concept: RawCorpus A collection of RawDocument , supporting collection operations: - Add new RawDocument element - Remove existing RawDocument element - Accessing elements in the collection - …
15. Domain Concept: RawDocument An object with one or more string fields, serving as a non-processed, in-memory representation of a document unit - Like Java beans with getter and setter - All fields must be string type, even for numbers
17. Domain Concept: Processor An object that processes RawCorpus and produces Corpus . - Linguistic: Tokenizer, Stemmer, StopRemover, PosTagger, … - Machine learning: Feature-specific, document-specific
18. Domain Concept: Corpus An object representing a collection of Document for use by machine learning side of framework. This object provides a notion of splits which is commonly used (e.g., train, test)
19. Domain Concept: Trainer A representation of a machine learning algorithm, which can learn from a Corpus and produce a Model .
20. Domain Concept: Model An object of what machine learning algorithm (i.e., Trainer ) creates to store parameters that are "learned" from the data (i.e., Corpus )
21. Domain Concept: Classifier An object that maps Documents to target values (label, number, probability). It takes a Corpus and a Model as inputs, and produces a Prediction associated with the Corpus according to the Model .
22. Domain Concept: Prediction A collection of target values (label, number, probability) that associate with a Corpus , i.e., a collection of Document .
23. Domain Concept: Evaluator An object used for comparing the Prediction against its associated Corpus and generating Evaluation
24. Domain Concept: Evaluation A representation of evaluation result given by a Evaluator , in a summarized manner.
26. STAT (brief) Domain Model Note : We ignore texts on connectors for brevity. Some connections are not drawn because of space limitation Corpus Reader Processor RawCorpus Trainer Model Classifier Prediction Evaluator Evaluation Writer Vocabulary
27. STAT Domain Model Note : We ignore texts above lines for brevity Corpus Reader Processor RawCorpus Trainer Model Classifier Prediction Evaluator Evaluation Writer
28. STAT Domain Model Note : We ignore texts above lines for brevity Corpus Reader Processor RawCorpus Trainer Model Classifier Prediction Evaluator Evaluation Document RawDocument