
Dan Sullivan - Data Analytics and Text Mining with MongoDB - NoSQL matters Dublin 2015


Data analysis is an exploratory process that requires a variety of tools and a flexible data store. Data analysis projects are easy to start but quickly become difficult to manage and error-prone when they depend on file-based data storage. Relational databases are poorly equipped to accommodate the dynamic demands of complex analysis. This talk describes best practices for using MongoDB for analytics projects. Examples will be drawn from a large-scale text mining project (approximately 25 million documents) that applies machine learning (neural networks and support vector machines) and statistical analysis. Tools discussed include R, Spark, the Python scientific stack, and custom pre-processing scripts, but the focus is on using these with the document database.

Published in: Data & Analytics


  1. Data Analytics and Text Mining with MongoDB
      Dan Sullivan, Principal, DS Applied Technologies
      NoSQL Matters 2015, Dublin, Ireland, June 4, 2015
  3. My Background
      - Data Architect / Engineer
      - NoSQL and relational data modeler
      - Big data
      - Analytics, machine learning and text mining
      - Cloud computing
      - Author
        - NoSQL for Mere Mortals
      - Contributor to TechTarget
        - SearchDataManagement
        - SearchCloudComputing
        - SearchAWS
  4. Overview
      - Quick Intro to Data and Text Mining
      - Need for Data Management in Data and Text Mining
      - Relational or NoSQL?
      - Document Database Design Patterns
      - MongoDB (Document Database) Model
      - Questions
  6. 3 Key Components
      - Data
      - Representation scheme
      - Algorithms

      Data
      - Positive examples: examples from representative corpus
      - Negative examples: randomly selected from same publications

      Representation
      - Feature Vector
      - Distributed Neural Network

      Algorithms (supervised learning)
      - SVMs
      - Ridge Classifier
      - Perceptrons
      - kNN
      - SGD Classifier
      - Naïve Bayes
      - Random Forest
      - AdaBoost
  7. Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/
  8. [Figure: "All". Training Error and Validation Error curves; x-axis: Training Instances (0 to 10000); y-axis: Error Rate (0 to 0.5)]
  9. [Figure: keyword groups shown alongside news items. Source: http://www.nytimes.com/pages/business/index.html, April 27, 2015]
      - Debt, Law, Graduation
      - Debt, EU, Greece, Euro
      - EU, Greece, Negotiations, Varoufakis
  10. Large volumes of accessible and relevant texts:
      - Social media
      - Email
      - Patents and research
      - Customer communications

      Use Cases
      - Market research
      - Brand monitoring
      - e-Discovery
      - Intellectual property management
  11. - Manual procedures are time consuming and costly
      - Volume of literature continues to grow
      - Commonly used search techniques, such as keyword search, similarity searching, and metadata filtering, can still yield volumes of literature that are difficult to analyze manually
      - Some success with popular tools, but with limitations
  12. Stages: Collect, Extract & Pre-Process, Analyze, Integrate, Utilize
      - Collect
        - Data
        - Documents
      - Extract and Pre-processing
        - Normalization
        - Data Cleansing
        - Case conversion
        - Punctuation removal
        - Stemming
      - Analysis
        - Classification Models
        - Predictive Analytics
        - Term Frequency-Inverse Document Frequency (TF-IDF)
        - Conditional Probabilities and Topic Models
        - Error Evaluation
      - Integration
        - Link to Structured Data
        - Deploy predictive models
      - Utilization
        - Improve information retrieval
        - Identify brand perception problems
        - Assess likelihood of customer churn
        - Predict likelihood of …
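
To make the pre-processing and TF-IDF steps on this slide concrete, here is a minimal sketch using only the Python standard library. It is not the speaker's pipeline: the function names, the three-sentence corpus, and the particular TF-IDF variant (term count divided by document length, times log(N / df)) are all assumptions chosen for illustration. Stemming is left out because it would need an external library.

```python
import math
import string
from collections import Counter


def preprocess(text):
    """Case conversion and punctuation removal, then whitespace tokenization."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical three-document corpus, for illustration only.
corpus = [
    "Greece debt talks stall.",
    "Student debt and the law.",
    "Greece, the EU and the euro.",
]
scores = tf_idf(corpus)
```

In this toy corpus, "stall" appears in one document and "debt" in two, so "stall" receives the higher weight in the first document. That is the behaviour TF-IDF is meant to give: terms that are rare across the corpus count for more.
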
  13. Experiments
      - Type
      - Data sets
      - Algorithms
        - Type
        - Hyper-parameters
        - Implementation software
      - Results
        - Model generation
        - Error evaluation

      Raw Data

      Pre-processing steps

      Image: http://content.timesjobs.com/data-mining-specialist-will-lead-demand-bpo-sector/
  15. - Pragmatic
      - Widely applicable
      - Many options
      - Modeling
        - Reduce risk of data anomalies
        - Separate logical and physical models
  16. Features
      - JSON/XML structures
      - Fields vary between docs
      - No predefined schema
      - Documents analogous to rows
      - Collections analogous to tables
      - Query capabilities

      Limitations
      - No joins
      - No referential integrity checks
      - Object-based query language

          {
            id : <value>,
            <key> : <value>,
            <key> : <embedded document>,
            <key> : <array>
          }
  17. Schema-less <> Model-less
      - Schema-less Document Databases
        - No fixed schema
        - Polymorphic documents
      - ...however, not a Design Free-for-All
        - Queries drive organization
        - Performance Considerations
        - Long-term Maintenance
      - Middle Ground: Data Model Patterns
        - Reusable methods for organizing data
        - Model is implicit in document structures
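
As a rough illustration of "polymorphic documents" and of the model being implicit in document structures, the sketch below keeps documents of different shapes in one plain Python list standing in for a collection. The documents, field names, and helper functions are hypothetical and are not taken from the talk; no database is involved.

```python
from collections import Counter

# Hypothetical collection: documents of different shapes stored side by side.
texts = [
    {"id": 1, "type": "email", "sender": "a@example.com", "body": "Renewal reminder"},
    {"id": 2, "type": "patent", "title": "Glass desk hinge", "claims": ["A hinge", "A desk"]},
    {"id": 3, "type": "email", "sender": "b@example.com", "body": "Cancel my account"},
]


def field_usage(docs):
    """Count how many documents carry each field name (the implicit model)."""
    return Counter(key for doc in docs for key in doc)


def find(docs, **criteria):
    """Return documents whose fields equal every criterion.

    A document that lacks one of the fields never matches.
    """
    missing = object()
    return [
        doc for doc in docs
        if all(doc.get(key, missing) == value for key, value in criteria.items())
    ]


emails = find(texts, type="email")
usage = field_usage(texts)
```

Counting field usage is one inexpensive way to see which structure a schema-less collection has actually taken on, which is where the queries that "drive organization" have to start.
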
  18. Relational:
      - Requirements known at start of project
      - Entities described by common attributes
      - Compliance and audit issues
      - Need normalization
      - Acceptable performance on small number of servers
      - Need server side joins
  19. Key value:
      - Caching
      - Few attributes

      [Diagram: keys key1, key2, key3 paired with values value1, value2, value3]

      Document databases:
      - Varying attributes
      - Integrate diverse data types
      - Use denormalized data

          {
            id : <value>,
            <key> : <value>,
            <key> : <embedded document>,
            <key> : <array>
          }
  21. Pattern 1: One-to-Many
      - Embed Documents
        - Multiple documents embedded
        - "Many" attributes stored with "One" document
      - Pros
        - Single fetch returns primary and related data
        - Might improve performance
        - Simplifies application code
      - Cons
        - Increases document size
        - Might degrade performance

          {
            OrderID: 1837373,
            customer: { Name: 'Jane Lox', Addr: '123 Main St', City: 'Boston', State: 'MA' },
            orderItem: { Sku: 38383838, Descr: 'Black chair' },
            orderItem: { Sku: 2872636, Descr: 'Glass desk' },
            orderItem: { Sku: 4747433, Descr: 'USB Drive 32GB' }
          }
  22. One-to-Many Considerations
      - Query attributes in embedded documents?
      - Support for indexing embedded documents?
      - Potential for arbitrary growth after record created?
      - Need for atomic writes?
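
The embedding pattern and the "query attributes in embedded documents" question can be sketched with plain Python dictionaries. This is an illustration, not MongoDB code: the order reuses the values from the One-to-Many slide, but the order items are held in an array named `orderItems` (an assumption, since a repeated `orderItem` key would collide in a real document), and `get_path` and `find_items` are hypothetical helpers. The dotted path loosely mirrors the dot notation MongoDB uses to address embedded fields.

```python
# Order document with the customer and the items embedded.
# "orderItems" as an array is an assumption made for this sketch.
order = {
    "OrderID": 1837373,
    "customer": {
        "Name": "Jane Lox",
        "Addr": "123 Main St",
        "City": "Boston",
        "State": "MA",
    },
    "orderItems": [
        {"Sku": 38383838, "Descr": "Black chair"},
        {"Sku": 2872636, "Descr": "Glass desk"},
        {"Sku": 4747433, "Descr": "USB Drive 32GB"},
    ],
}


def get_path(doc, dotted_path):
    """Follow a dotted path such as 'customer.City' through nested dicts."""
    value = doc
    for key in dotted_path.split("."):
        value = value[key]
    return value


def find_items(doc, sku):
    """Return the embedded order items whose Sku matches."""
    return [item for item in doc["orderItems"] if item["Sku"] == sku]


city = get_path(order, "customer.City")
desk = find_items(order, 2872636)
```

One read of `order` returns the customer and every item, which is the "single fetch" advantage. The cost is that the document grows with each item added, which is the growth concern raised in the considerations.
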
  23. Pattern 2: Many-to-Many

      Employees

          ({empID: 1783, fname: "Michelle", lname: "Jones", projects: [487, 973, 287]},
           {empID: 9872, fname: "Bob", lname: "Williams", projects: [487, 973, 121]})

      Projects

          ({projID: 121, projName: 'NoSQL Pilot', team: [9872, 2431]},
           {projID: 487, projName: 'Customer Churn Analysis', team: [1783, 9872]})

      References
      - Minimizes redundancy
      - Preserves integrity
      - Reduces document growth
      - Requires multiple reads
  24. Pattern 2: Many-to-Many

      Employee

          {empID: 1783, fname: "Michelle", lname: "Jones",
           projects: [
             {projID: 121, projName: 'NoSQL Pilot'},
             {projID: 487, projName: 'Customer Churn Analysis'}
           ]}

      Project

          {projID: 121, projName: 'NoSQL Pilot',
           team: [
             {empID: 1783, fname: "Michelle", lname: "Jones"},
             {empID: 9872, fname: "Bob", lname: "Williams"}
           ]}

      Embedded Documents
      - Captures point in time data
      - One document read retrieves data
      - Increases document growth
  25. Many-to-Many Considerations
      - References
        - Minimizes redundancy
        - Preserves integrity
        - Reduces document growth
        - Requires multiple reads
      - Embedded Documents
        - Captures point in time data
        - One document read retrieves data
        - Increases document growth
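
The trade-off between references and embedded documents can be seen in a small sketch. The two dictionaries below stand in for collections keyed by ID, with data trimmed from the Many-to-Many slides; the helper functions and the later rename are hypothetical, and nothing here touches a real database.

```python
# Stand-ins for two collections, keyed by ID (simplified from the slides).
employees = {
    1783: {"empID": 1783, "fname": "Michelle", "lname": "Jones", "projects": [487]},
    9872: {"empID": 9872, "fname": "Bob", "lname": "Williams", "projects": [487, 121]},
}
projects = {
    121: {"projID": 121, "projName": "NoSQL Pilot", "team": [9872]},
    487: {"projID": 487, "projName": "Customer Churn Analysis", "team": [1783, 9872]},
}


def project_names(emp_id):
    """Reference style: read the employee, then read each referenced project."""
    employee = employees[emp_id]
    return [projects[pid]["projName"] for pid in employee["projects"]]


def embed_projects(emp_id):
    """Embedded style: copy project data into a copy of the employee document."""
    employee = dict(employees[emp_id])  # shallow copy; the original is untouched
    employee["projects"] = [
        {"projID": pid, "projName": projects[pid]["projName"]}
        for pid in employee["projects"]
    ]
    return employee


snapshot = embed_projects(1783)

# A later rename (hypothetical) is visible through references only.
projects[487]["projName"] = "Churn Analysis v2"
current_names = project_names(1783)
```

After the rename, `current_names` reflects the new project name because it follows the reference, while `snapshot` still holds the name as it was when the copy was made. That is the "point in time" behaviour of embedding, and it is either a feature or an integrity problem depending on the query.
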
  26. Pattern 3: Trees with Parent & Child References
      - Trees
        - Single root document
        - At most one parent
        - No cycles
      - Multiple Types
        - Is-A
        - Part-of
  27. Pattern 3: Trees with References

      Children Refs.

          ({orgUnitID: 178, orgUnitType: "Primary", orgUnitName: "P1", children: [179, 180]},
           {orgUnitID: 179, orgUnitType: "Branch", orgUnitName: "B1", children: [181, 182]},
           {orgUnitID: 180, orgUnitType: "Branch", orgUnitName: "B2", children: [183, 184]})

      Parent Refs.

          ({orgUnitID: 178, orgUnitType: "Primary", orgUnitName: "P1", parent: 177},
           {orgUnitID: 179, orgUnitType: "Branch", orgUnitName: "B1", parent: 178},
           {orgUnitID: 180, orgUnitType: "Branch", orgUnitName: "B2", parent: 178})
  28. Tree Considerations
      - Child references allow for top-down navigation
      - Parent references allow for bottom-up navigation
      - A combination allows for both bottom-up and top-down navigation
      - Avoid large arrays
      - Consider need for point in time data
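
The two navigation directions can be sketched as follows. The dictionary stands in for a collection of org-unit documents that carry both parent and child references; the small tree (including unit 181 and the use of `None` for the root's parent) is a hypothetical example, not the data from the slides.

```python
# Hypothetical org-unit documents carrying both kinds of reference.
org_units = {
    178: {"orgUnitName": "P1", "parent": None, "children": [179, 180]},
    179: {"orgUnitName": "B1", "parent": 178, "children": [181]},
    180: {"orgUnitName": "B2", "parent": 178, "children": []},
    181: {"orgUnitName": "B1a", "parent": 179, "children": []},
}


def descendants(unit_id):
    """Top-down navigation: follow child references depth-first."""
    found = []
    for child_id in org_units[unit_id]["children"]:
        found.append(child_id)
        found.extend(descendants(child_id))
    return found


def ancestors(unit_id):
    """Bottom-up navigation: follow parent references up to the root."""
    path = []
    parent_id = org_units[unit_id]["parent"]
    while parent_id is not None:
        path.append(parent_id)
        parent_id = org_units[parent_id]["parent"]
    return path


below_root = descendants(178)
above_leaf = ancestors(181)
```

With only child references, `ancestors` would need a scan of every document; with only parent references, `descendants` would. Storing both keeps each direction to one lookup per level, at the price of two fields to keep consistent on every update.
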
  29. Anti-Patterns
      - Large arrays
      - Significant growth in document size
      - Fetching more data than needed
      - Fear of data duplication
      - Thinking SQL, using NoSQL
      - Normalizing without need
  31. [Diagram: Corpus 1:M Experiment Corpus 1:M Experiment]
  32. Corpus and Experiment_Corpus documents

          Corpus : {
            corpus_id : ObjectID,
            name : string,
            descr : string,
            create_date : date,
            version : string,
            contents : [ { id, text } ],
            contents_uri : string
          }

          Experiment_Corpus : {
            exp_corpus_id : ObjectID,
            name : string,
            type : string,
            corpus_id : ObjectID,
            descr_stats : {
              count : integer,
              min_len : integer,
              max_len : integer,
              mean_len : integer,
              std_dev : float
            },
            pre_process_opers : {
              lowercase : boolean,
              nopunct : boolean,
              stem : boolean,
              normal : boolean
            },
            contents : [ { id, text } ],
            contents_uri : string
          }
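
One way to read the two structures on this slide is that an Experiment_Corpus is derived from a Corpus by applying the flagged pre-processing operations and recording descriptive statistics. The sketch below shows that derivation with plain dictionaries. It rests on several assumptions that the slide does not state: lengths are measured in characters, `std_dev` is the population standard deviation, string IDs stand in for ObjectIDs, and only the `lowercase` and `nopunct` flags are implemented.

```python
import statistics
import string


def apply_opers(text, opers):
    """Apply the pre-processing flags this sketch supports."""
    if opers.get("lowercase"):
        text = text.lower()
    if opers.get("nopunct"):
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text


def derive_experiment_corpus(corpus, opers, name):
    """Build an Experiment_Corpus-shaped dict from a non-empty Corpus dict."""
    contents = [
        {"id": doc["id"], "text": apply_opers(doc["text"], opers)}
        for doc in corpus["contents"]
    ]
    lengths = [len(doc["text"]) for doc in contents]  # length in characters
    return {
        "name": name,
        "corpus_id": corpus["corpus_id"],
        "descr_stats": {
            "count": len(lengths),
            "min_len": min(lengths),
            "max_len": max(lengths),
            "mean_len": round(statistics.mean(lengths)),
            "std_dev": statistics.pstdev(lengths),
        },
        "pre_process_opers": opers,
        "contents": contents,
    }


# Hypothetical two-document corpus.
corpus = {
    "corpus_id": "corpus-001",
    "name": "demo",
    "contents": [
        {"id": 1, "text": "Hello, World!"},
        {"id": 2, "text": "Debts."},
    ],
}
opers = {"lowercase": True, "nopunct": True, "stem": False, "normal": False}
exp_corpus = derive_experiment_corpus(corpus, opers, "demo-lower-nopunct")
```

Keeping `pre_process_opers` next to the derived contents means every experiment can be traced back to the exact transformations applied to the raw corpus.
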
  33. Experiment document

          Experiment : {
            exp_id : ObjectID,
            type : string,
            exp_corpus_id : ObjectID,
            algorithm : {
              type : string,
              hyperparams : [ { param, val } ],
              implementation : [
                { software : string, version : string, code_uri : string }
              ]
            },
            model_file : string,
            results : [ { metric, val } ],
            model_gen_log : string,
            error_evaluation : [
              { training_size, training_error, validation_error }
            ]
          }
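
Because `error_evaluation` is embedded in the Experiment document, questions about the learning curve can be answered from a single document. The sketch below builds a hypothetical experiment as a Python dict and reads two things from it. Every value in it (IDs, algorithm, hyper-parameter, error rates) is invented for illustration and is not a result from the talk.

```python
# Hypothetical experiment document; all values are made up for illustration.
experiment = {
    "exp_id": "exp-001",              # stand-in for an ObjectID
    "type": "classification",
    "exp_corpus_id": "expcorp-001",   # stand-in for an ObjectID
    "algorithm": {
        "type": "SVM",
        "hyperparams": [{"param": "C", "val": 1.0}],
        "implementation": [
            {"software": "example-lib", "version": "0.0", "code_uri": "file:///example/train.py"}
        ],
    },
    "results": [{"metric": "accuracy", "val": 0.82}],
    "error_evaluation": [
        {"training_size": 2000, "training_error": 0.05, "validation_error": 0.30},
        {"training_size": 6000, "training_error": 0.10, "validation_error": 0.22},
        {"training_size": 10000, "training_error": 0.12, "validation_error": 0.18},
    ],
}


def best_training_size(exp):
    """Training size with the lowest validation error."""
    best = min(exp["error_evaluation"], key=lambda point: point["validation_error"])
    return best["training_size"]


def generalization_gaps(exp):
    """(training_size, validation_error - training_error) for each point."""
    return [
        (point["training_size"],
         round(point["validation_error"] - point["training_error"], 4))
        for point in exp["error_evaluation"]
    ]


best = best_training_size(experiment)
gaps = generalization_gaps(experiment)
```

A shrinking gap between validation and training error as the training size grows is the pattern a learning-curve plot is used to check, and here it comes from one read of one document.
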
  34. - Data and text mining processes are multi-faceted
      - Well suited to advantages of document database models
      - Design patterns provide building blocks of models
      - Query patterns determine choice among patterns
  35. Questions?
