1. 1JaMU – Jakarta 7 Maret 2014
Pentaho and NoSQL
Java Meet Up (JaMU), Jakarta
7th March, 2014
Feris Thia
feris@phi-integration.com
08176-474-525
2. 2JaMU – Jakarta 7 Maret 2014
ABOUT ME
Founder
2007 2013
Feris Thia
PHI-Integration
3. 3JaMU – Jakarta 7 Maret 2014
ABOUT ME
Book Author
Feris Thia
November 2013
4. 4JaMU – Jakarta 7 Maret 2014
ABOUT ME
Community Manager
Feris Thia
Excel Indonesia User
Group (EIUG)
Pentaho User Group
Indonesia (Pentaho-ID)
2008
(~1000 members)
2013
(~5000 members)
5. 5JaMU – Jakarta 7 Maret 2014
ABOUT ME
PHI-Integration Clients
Community Manager
Feris Thia
6. 6JaMU – Jakarta 7 Maret 2014
AGENDA
DATA PREPARATION
What is it, and why is it important?
PENTAHO DATA INTEGRATION
Popular Open Source ETL
NOSQL
An Emerging Non-Relational Database Technology
8. 8JaMU – Jakarta 7 Maret 2014
image source: http://www.huntbigsales.com/winning-in-the-meeting-after-the-meeting/
What caused sales to increase in this area? Did something unusual happen?
We need some time to prepare additional data to answer that.
WHAT?? So we cannot make any decisions until the data is ready?
Yes, sir….
9. 9JaMU – Jakarta 7 Maret 2014
Image Source: http://wrapbootstrap.com/preview/WB0KDM51J/
TYPICAL SOLUTION
SOPHISTICATED REPORTING OR
DASHBOARD APPLICATION!
10. 10JaMU – Jakarta 7 Maret 2014
Image Source: http://reallybadboss.com/wp-content/uploads/2012/02/frustration.jpg
PROBLEMS REMAIN…
11. 11JaMU – Jakarta 7 Maret 2014
Time Spent on Data Preparation: 80%
Data Quality: 50%
Extract, Transformation & Load: 30%
13. 13JaMU – Jakarta 7 Maret 2014
DATA PREPARATION IS THE KEY
1) Entry Systems
2) Data Preparation
3) Reporting (Basic Data Presentation)
4) Performance Dashboard (Visualization)
Note: Data preparation is often underestimated.
14. 14JaMU – Jakarta 7 Maret 2014
DATA WAREHOUSE
1) Entry Systems
2) Data Warehouse
3) Business Intelligence
17. 17JaMU – Jakarta 7 Maret 2014
EXTRACT
INTEGRATION: of many data sources
INCREMENTAL: extract only changes
DATA SIZE: big data
INFRASTRUCTURE: network failure, high latency, slow I/O, etc.
DATA QUALITY: missing data, conversion, etc.
PROTOCOL: driver availability, reliability, etc.
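The "extract only changes" idea can be sketched with a watermark: remember the latest modification time already extracted and ask the source only for newer rows. This is a minimal sketch, not how PDI implements it. The `sales` table, its `updated_at` column, and the function name `extract_changes` are hypothetical names chosen for illustration, and the source is an in-memory SQLite database.

```python
import sqlite3

def extract_changes(conn, last_seen):
    """Return rows modified after the `last_seen` watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM sales "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # If nothing changed, keep the old watermark.
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark

# Hypothetical source table, built in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 100.0, "2014-03-01"), (2, 250.0, "2014-03-05"), (3, 75.0, "2014-03-07")],
)

# First run picks up everything newer than the stored watermark.
first_batch, watermark = extract_changes(conn, "2014-03-01")
# Second run finds no new changes, so it returns an empty batch.
second_batch, watermark = extract_changes(conn, watermark)
```

A watermark only works when the source reliably records a modification time; sources without one usually need a different change-detection approach.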
18. 18JaMU – Jakarta 7 Maret 2014
TRANSFORM
NORMALIZE
DENORMALIZE
SPLIT / MERGE
DATA REDUCTION (aggregate, etc.)
TRANSPOSE
TEXT PARSING
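Two of the transform types listed above, data reduction and transpose, can be illustrated in a few lines of plain Python. This is only a sketch of the concepts. The sample rows and the function names `aggregate` and `transpose` are hypothetical and do not correspond to PDI step names.

```python
from collections import defaultdict

# Hypothetical rows in a "long" layout: one row per (region, month).
rows = [
    {"region": "West", "month": "Jan", "sales": 120},
    {"region": "West", "month": "Feb", "sales": 150},
    {"region": "East", "month": "Jan", "sales": 90},
    {"region": "East", "month": "Feb", "sales": 110},
]

def aggregate(rows, key, value):
    """Data reduction: sum `value` for each distinct `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

def transpose(rows, index, column, value):
    """Transpose: turn distinct `column` values into fields, one entry per `index`."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[index]][row[column]] = row[value]
    return dict(wide)

totals_by_region = aggregate(rows, "region", "sales")
sales_by_month = transpose(rows, "region", "month", "sales")
```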
19. 19JaMU – Jakarta 7 Maret 2014
LOAD
PERFORMANCE: of many data sources
CHANGES: structure, data type, column size, etc.
DATA SIZE: big data
INFRASTRUCTURE: network failure, high latency, slow I/O, etc.
DATA MAPPING: sync with correlated data
OUTPUT FORMAT: Excel, PDF, HTML, RDBMS, etc.
20. 20JaMU – Jakarta 7 Maret 2014
DEMO
Data structure changes to increase SQL query performance.
21. 21JaMU – Jakarta 7 Maret 2014
Pentaho Data Integration
Open Source ETL
22. 22JaMU – Jakarta 7 Maret 2014
FEATURES AND BENEFITS
• Open Source
• Cost Efficient
• More than 200 modules
• Multi-OS Platform
• Works with emerging Big Data platforms
• Gentle Learning Curve
23. 23JaMU – Jakarta 7 Maret 2014
DEMO
1) Basic Extract and Transformation
2) More I/O
3) Helper Table (Closure)
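Assuming "Helper Table (Closure)" refers to the closure-table pattern, the idea is to store every ancestor/descendant pair of a hierarchy in a separate table so that subtree queries need a single join instead of recursion. The sketch below uses an in-memory SQLite database. The table names `node` and `closure` and the sample hierarchy are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)")
conn.execute("CREATE TABLE closure (ancestor INTEGER, descendant INTEGER, depth INTEGER)")
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [(1, "Root", None), (2, "Child 1", 1), (3, "Child 2", 1), (4, "Grandchild", 2)],
)

def build_closure(conn):
    """Fill the helper table with every (ancestor, descendant) pair, including self pairs."""
    parents = dict(conn.execute("SELECT id, parent_id FROM node").fetchall())
    pairs = []
    for node_id in parents:
        ancestor, depth = node_id, 0
        # Walk up the parent chain until the root (parent_id is NULL) is passed.
        while ancestor is not None:
            pairs.append((ancestor, node_id, depth))
            ancestor, depth = parents[ancestor], depth + 1
    conn.executemany("INSERT INTO closure VALUES (?, ?, ?)", pairs)

build_closure(conn)

# All descendants of Root with one join and no recursion.
descendants = [
    name
    for (name,) in conn.execute(
        "SELECT n.name FROM closure c JOIN node n ON n.id = c.descendant "
        "WHERE c.ancestor = 1 AND c.depth > 0 ORDER BY n.id"
    )
]
closure_rows = conn.execute("SELECT COUNT(*) FROM closure").fetchone()[0]
```

The trade-off is extra storage and extra work on load in exchange for simpler and usually faster read queries.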
25. 25JaMU – Jakarta 7 Maret 2014
TIMELINE
Emergence of open source NoSQL
1998: NoSQL coined
2004: Google's MapReduce paper published
2006: Hadoop started
2007: MongoDB started, Neo4J initial release
2008: Apache HBase, Apache Cassandra
2009: Redis initial release
2012: Google Spanner paper published
26. 26JaMU – Jakarta 7 Maret 2014
NOSQL GROUPS
DOCUMENT: MongoDB, CouchDB, Riak
WIDE COLUMN: Cassandra, HBase, Hypertable
GRAPH: Neo4J, OrientDB
KEY-VALUE <K, V>: Redis, MemcacheDB, SimpleDB
27. 27JaMU – Jakarta 7 Maret 2014
NOSQL VS SQL
http://gigaom.com/2010/07/12/nosql-pioneers-are-driving-the-webs-manifest-destiny/
Data Store Type: Key-Value
  Use Cases: In-memory cache, web-site analytics, log file analysis
  Advantages: Simple, replication, versioning, locking, transactions, sorting, web-accessible, schema-less, distributed
  Disadvantages: Simple, small set of data types, limited transaction support
  Key Products: Redis, Scalaris, Tokyo Cabinet

Data Store Type: Tabular or Columnar
  Use Cases: Data mining, analytics
  Advantages: Rapid data aggregation, scalable, versioning, locking, web-accessible, schema-less, distributed
  Disadvantages: Limited transaction support
  Key Products: Google BigTable, HBase or HyperTable, Cassandra

Data Store Type: Document Store
  Use Cases: Document management, CRM, business continuity
  Advantages: Stores and retrieves unstructured documents, map/reduce, web-accessible, schema-less, distributed
  Disadvantages: Limited transaction support
  Key Products: CouchDB, MongoDB, Riak

Data Store Type: Traditional Relational
  Use Cases: Transaction processing, typical corporate workloads
  Advantages: Well documented and supported, mature code, widely implemented in production
  Disadvantages: Cost, vertical scaling, increased complexity
  Key Products: Oracle, Microsoft SQL Server, MySQL Cluster
28. 28JaMU – Jakarta 7 Maret 2014
NOSQL VS SQL
• Schemas are much more flexible
• Non-relational (no joins)
• Horizontal Scalability
• Master-Slave
• Peer-to-peer
• Data Pipeline
– Expressions
– Functional Programming
• ACID (Atomicity, Consistency, Isolation, Durability)
• BASE (Basic Availability, Soft state, Eventual consistency)
• CAP (Consistency, Availability, Partition Tolerance)
29. 29JaMU – Jakarta 7 Maret 2014
DB-ENGINES.COM DB RANKING
PER 7 MARCH 2014
Rank  Last Month  DBMS                  Database Model     Score    Change
1     1           Oracle                Relational DBMS    1491.8   -8.43
2     2           MySQL                 Relational DBMS    1290.21  +1.83
3     3           Microsoft SQL Server  Relational DBMS    1205.28  -8.99
4     4           PostgreSQL            Relational DBMS    235.06   +4.61
5     5           MongoDB               Document store     199.99   +4.81
6     6           DB2                   Relational DBMS    187.32   -1.14
7     7           Microsoft Access      Relational DBMS    146.48   -6.4
8     8           SQLite                Relational DBMS    92.98    -0.03
9     9           Sybase ASE            Relational DBMS    81.55    -6.33
10    10          Cassandra             Wide column store  78.09    -2.23
30. 30JaMU – Jakarta 7 Maret 2014
MongoDB
Document Oriented Database
• Schemaless
• Distributed
• Auto Sharding
• Map Reduce Capabilities
• Multi Platform
• Structures
– Database
– Collections
– Documents
• Document
– A record is a document
– Similar to JSON Objects
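The database, collection, and document structure can be pictured with plain Python containers. The sketch below does not talk to MongoDB. The `customers` collection, its sample documents, and the `find` helper are hypothetical, and only illustrate that documents in one collection need not share the same fields.

```python
import json

# database -> collection -> documents, modelled with plain Python containers.
database = {
    "customers": [  # a collection
        # Two documents with different fields: no fixed schema is enforced.
        {"_id": 1, "name": "Andi", "city": "Jakarta"},
        {"_id": 2, "name": "Budi", "phones": ["0811-000-000"], "address": {"city": "Tangerang"}},
    ]
}

def find(collection, **criteria):
    """Return documents whose top-level fields equal all the given criteria."""
    return [doc for doc in collection if all(doc.get(k) == v for k, v in criteria.items())]

jakarta_customers = find(database["customers"], city="Jakarta")

# A document serializes naturally to JSON, nested values included.
as_json = json.dumps(database["customers"][1], sort_keys=True)
```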
32. 32JaMU – Jakarta 7 Maret 2014
MONGODB DEMO
1) Basic Commands
2) PDI Extract and Load
3) Aggregation Framework
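As a rough illustration of the aggregation framework, a pipeline is an ordered list of stages such as `$match`, `$group`, and `$sort`. The sketch below writes one pipeline as Python data and then computes the same result in plain Python, so it runs without a MongoDB server. The sample documents and the `run_equivalent` function are hypothetical. The slides do not show the pipelines used in the demo.

```python
from collections import defaultdict

# A pipeline as it could be handed to a MongoDB driver's aggregate() call.
pipeline = [
    {"$match": {"year": 2014}},
    {"$group": {"_id": "$region", "total": {"$sum": "$sales"}}},
    {"$sort": {"total": -1}},
]

# Hypothetical documents standing in for a collection.
documents = [
    {"region": "West", "year": 2014, "sales": 120},
    {"region": "West", "year": 2014, "sales": 150},
    {"region": "East", "year": 2014, "sales": 90},
    {"region": "East", "year": 2013, "sales": 500},
]

def run_equivalent(documents):
    """Plain-Python equivalent of the three stages above."""
    matched = [d for d in documents if d["year"] == 2014]           # $match
    totals = defaultdict(int)
    for d in matched:                                                # $group with $sum
        totals[d["region"]] += d["sales"]
    grouped = [{"_id": region, "total": total} for region, total in totals.items()]
    return sorted(grouped, key=lambda d: d["total"], reverse=True)   # $sort descending

result = run_equivalent(documents)
```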
33. 33JaMU – Jakarta 7 Maret 2014
Neo4j
Graph Database
Properties
Relationship
Cypher
Node
34. 34JaMU – Jakarta 7 Maret 2014
Neo4J
Basic Utility, Commands & Expressions
• Neo4J Web Admin
• Create Node
CREATE (n {property_name: "property_value"})
• Create Relation
CREATE (n)-[:RELATION]->(m)
• Where:
– n and m are identifiers
– :RELATION is the relation name
35. 35JaMU – Jakarta 7 Maret 2014
Neo4J
Basic Commands & Expressions
• Matching and Returning Objects
START emil=node:people(name='Emil')
MATCH (emil)-[:MARRIED_TO]-(madde)
RETURN madde
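To make the matching query concrete, the sketch below models a tiny property graph in plain Python and finds the nodes connected to Emil by a MARRIED_TO relationship in either direction, which is what the undirected pattern expresses. It is a conceptual illustration, not Neo4j's API. The `peter` node, the `KNOWS` relationship, and the `match_undirected` helper are hypothetical.

```python
# Nodes carry properties; relationships are (start, type, end) triples.
nodes = {
    "emil": {"name": "Emil"},
    "madde": {"name": "Madde"},
    "peter": {"name": "Peter"},  # hypothetical extra node
}
relationships = [
    ("emil", "MARRIED_TO", "madde"),
    ("emil", "KNOWS", "peter"),  # hypothetical extra relationship
]

def match_undirected(start, rel_type):
    """Return nodes connected to `start` by `rel_type`, ignoring direction."""
    found = []
    for source, kind, target in relationships:
        if kind != rel_type:
            continue
        if source == start:
            found.append(nodes[target])
        elif target == start:
            found.append(nodes[source])
    return found

spouses = match_undirected("emil", "MARRIED_TO")
```

Because direction is ignored, starting from `madde` returns Emil's node as well.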
36. 36JaMU – Jakarta 7 Maret 2014
HIERARCHICAL MODEL
Neo4j Case Demo
Root
Child 1, Child 2, Child 3, Child 4, Child 5
38. 38JaMU – Jakarta 7 Maret 2014
CONTACT ME
Universitas Multimedia Nusantara
New Media Tower, Lv.12
Scientia Boulevard St.
Tangerang, Banten, 15811
+6221-7038-7738 (phone)
+628176-474-525 (mobile)
https://www.facebook.com/feris.thia
@FerisThia
feris@phi-integration.com