SlideShare una empresa de Scribd logo
1 de 25
Content Storage with Apache
Jackrabbit
2009-03-25, Jukka Zitting
Introducing JCR
Content Repository for Java™ Technology API
-> Version 1.0 defined in JSR 170, final in 2006
-> Version 2.0 defined in JSR 283, final (hopefully) in 2009
-> Open source RI and TCK developed in Apache Jackrabbit
Flexible, hierarchically ordered content store with features like
full text search, versioning, transactions, observation, etc.
Combines and extends features often found in file systems and
databases; a "best of both worlds" approach.
Introducing Apache Jackrabbit
Entered Apache Incubator
in 2004, graduated in 2006,
now a TLP with 24
committers
Version 1.0 (and JCR RI) in
2006, currently at 1.5.3.
Version 2.0 (and JCR 2.0 RI)
planned for 2009.
Also: JCR Commons, Sling
Components:
jackrabbit-api
jackrabbit-core
jackrabbit-standalone
jackrabbit-webapp
jackrabbit-jca
jackrabbit-ocm
jackrabbit-jcr-rmi
jackrabbit-webdav
jackrabbit-jcr-server
etc.
Structure of a content repository
Consists of one or more workspaces, one of which is the
default workspace.
A workspace consists of a tree of nodes, each of which can
have zero or more child nodes. Each workspace has a single
root node.
Nodes have properties. Properties are typed (string,
integer, date,  binary, reference, etc.) and can be either
single- or multivalued.
Structure, cont.
Each node or property has a name, and can be accessed
using a path. Names are namespaced.
A referenceable node has a UUID, through which it can be
accessed or referenced. Referential integrity is guaranteed.
Each node has a primary node type and zero or more mixin
types. Node types define the structure of a node. A node
can also be unstructured.
Content repository features
Read/write
Search (XPath and SQL)
XML import/export
Observation
Versioning
Node types
Locking
Access control
Atomic changes
XA transactions
Getting started with Jackrabbit
Download and run the standalone server:
java -jar jackrabbit-standalone-1.5.3.jar
-> Web interface at http://localhost:8080/
-> WebDAV at http://localhost:8080/repository/default
-> JCR access over RMI at http://localhost:8080/rmi
-> Repository data in ./jackrabbit
-> Configuration in ./jackrabbit/repository.xml
If you have a servlet container, use the webapp
If you have a J2EE application server, use jca
Remote access
JCR-RMI layer available
since Jackrabbit 0.9.
Good functional coverage,
not so good performance.
-> administrative tools
Look at clustering for
better performance.
Jackrabbit 1.6: spi2dav
o.a.j.rmi.repository:
new URLRemoteRepository(
    "http://.../rmi");
new RMIRemoteRepository(
    "//.../repository");
Classpath:
jcr-1.0.jar
jackrabbit-jcr-rmi-1.5.3.jar
jackrabbit-api-1.5.3.jar
(also in rmiregistry!)
Jackrabbit as a shared resource
JCA adapter in an
application server
-> accessed through JNDI
Jackrabbit webapp in a
servlet container
-> JNDI or servlet context
-> JNDI: complex setup
-> cross-context access
JNDI configuration with
jackrabbit-servlet:
<servlet>
  <servlet-name>Repository</servlet-name>
  <servlet-class>
  org.apache.jackrabbit.servlet.JNDIRepositoryServlet
  </servlet-class>
  <init-param>
    <param-name>location</param-name>
    <param-value<javax/jcr/Repository</param-value>
  </init-param>
</servlet>
Accessing the repository:
public class MyServlet extends HttpServlet {
    private final Repository repository =
        new ServletRepository(this);
}
Jackrabbit in embedded mode
RepositoryConfig config = RepositoryConfig.create(
      "/path/to/repository", "/path/to/repository.xml");
RepositoryImpl repository = RepositoryImpl.create(config);
try {
    // Use the javax.jcr interfaces to access the repository
} finally {
    repository.shutdown();
}
Embedded mode, cont.
Only a single instance can be
running at a time.
For concurrent access:
- multiple sessions
- RMI for remote access
- clustering
Also: TransientRepository
Maven coordinates
<dependency>
  <groupId>org.apache.jackrabbit</groupId>
  <artifactId>jackrabbit-core</artifactId>
  <version>1.5.3</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>1.5.3</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>jcl-over-slf4j</artifactId>
  <version>1.5.3</version>
</dependency>
Logging - what's this SLF4J thing?
Jackrabbit uses the SLF4J
logging facade for logging.
Benefits: Great for
embedded uses, can adapt
to 
Drawbacks: What do I put
in my classpath?
SLF4J in practice
SLF4J API is automatically
included as a dependency of
jackrabbit-core.
You need to explicitly add
the SLF4J implementation.
Jackrabit webapp, jca and
standalone use slf4j-log4j,
so you can use normal log4j
configuration.
Classpath with log4j:
slf4j-api-1.5.3.jar
slf4j-log4j-1.5.3.jar
log4j-1.2.14.jar
log4j.properties
Classpath with no logging:
slf4j-api-1.5.3.jar
slf4j-nop-1.5.3.jar
Repository configuration
Configuration in a repository.xml file. Default configuration
shipped with Jackrabbit. Structure defined in a DTD.
Contains global settings and a workspace configuration
template. The template is instantiated to a workspace.xml
configuration file for each new workspace.
Main elements: clustering, workspace, versioning, security,
persistence, search index, data store, file system
Persistence managers
Use one of the bundle
persistence managers.
-> select one for your db
Other persistence managers
mostly for backwards
compatibility.
Default based on embedded
Apache Derby.
Configuration:
driver,url,user,password
schema
schemaObjectPrefix
minBlobSize
bundleCacheSize
consistencyCheck/Fix
blockOnConnectionLoss
Needs CREATE TABLE
permissions!
Search index
Per-workspace search
indexes based on Apache
Lucene.
By default everything is
indexed. Use index
configuration to customize.
Clustering: Each cluster
node maintains local search
indexes.
Configuration:
min/maxMergeDocs
mergeFactor
analyzer
textFilterClasses
respectDocumentOrder
resultFetchSize
Performance:
property, type, ft -> fast
path -> slow
Text extraction
Full text indexing of file
contents based on various
parser libraries (POI,
PDFBox, etc.).
Currently only for the
jcr:data property with
correct jcr:mimeType.
Jackrabbit 2.0: indexing of
all binary properties.
textFilterClasses:
PlainTextExtractor
MsWordTextExtractor
MsExcelTextExtractor
MsPowerPointTextExtractor
PdfTextExtractor
OpenOfficeTextExtractor
RTFTextExtractor
HTMLTextExtractor
XMLTextExtractor
Data store - dealing with lots of data
Data store feature available
since Jackrabbit 1.4
Content-addressed storage
of large binary properties.
Completely transparent to
client applications.
Uses garbage collection to
remove unused data.
Implementations:
FileDataStore
DbDataStore
sandbox: S3DataStore
Garbage collection:
gc = si.createDSGC();
gc.scan();
gc.stopScan();
gc.deleteUnused();
Content modeling: David's model
1. Data First. Structure Later. Maybe.
2.Drive the content hierarchy, don't let it happen.
3.Workspaces are for clone(), merge() and update()
4.Beware of Same Name Siblings
5.References considered harmful
6.Files are Files are Files
7.ID's are evil
Content modeling: Example
CREATE TABLE author (
    id INTEGER,
    name VARCHAR,
);
CREATE TABLE post (
    id INTEGER,
    author INTEGER,
    posted DATETIME,
    title VARCHAR,
    body TEXT
);
/blog
  /jukka
    @name = Jukka Zitting
    /2009
      /03
        /25
          /hello
            @author = jukka
            @posted
            @title
            @body
Example, cont.
CREATE TABLE  comment (
    id INTEGER,
    post INTEGER,
    title VARCHAR,
    body TEXT
);
CREATE TABLE media (
    id INTEGER,
    post INTEGER,
    data BLOB,
    caption VARCHAR
);
/hello
  /comments
    /salut
      @title = Salut!
      @body
  /media [nt:folder]
    /image.jpg [nt:file]
    /code [nt:folder]
      /Example.java [nt:file]
Example, cont.
Versioning:
/hello [mix:versionable]
  @jcr:versionHistory
  @jcr:baseVersion
node.checkout();
// ...
node.save();
node.checkin();
Locking:
/hello [mix:lockable]
node.lock();
// ...
node.unlock();
Common issues: Content hierarchy
Jackrabbit doesn't support very flat content hierarchies.
You'll start seeing problems when you put more than 10k
child nodes under a single parent.
Solution: Add more depth to your hierarchy. Divide entries
by date, category, author, etc. If nothing else, use the first
letter(s) of a title, a content hash, or even a random number
to distribute the nodes.
Note: You can still access the entries as a single flat set
through search. The hierarchy is for browsing.
Common issues: Concurrent edits
Three ways to handle concurrent edits:
1. Merge changes
2. Fail conflicting changes 
3. Block concurrent changes
Jackrabbit does 1 by default, and falls back to 2 when
merge fails. You can explicitly opt for 3 by using the JCR
locking feature.
Estimate: How often conflicts would happen? Will the
benefits of locking be worth the overhead.
Questions?
 

Más contenido relacionado

La actualidad más candente

Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
Tommaso Teofili
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Apache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDBApache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDB
MongoDB
 

La actualidad más candente (20)

Apache Solr
Apache SolrApache Solr
Apache Solr
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
 
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
Oracle RAC 12c (12.1.0.2) Operational Best Practices - A result of true colla...
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
MongoDB WiredTiger Internals
MongoDB WiredTiger InternalsMongoDB WiredTiger Internals
MongoDB WiredTiger Internals
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
What is the State of my Kafka Streams Application? Unleashing Metrics. | Neil...
What is the State of my Kafka Streams Application? Unleashing Metrics. | Neil...What is the State of my Kafka Streams Application? Unleashing Metrics. | Neil...
What is the State of my Kafka Streams Application? Unleashing Metrics. | Neil...
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Apache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDBApache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDB
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
ProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQLProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQL
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
 

Destacado

Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
Jukka Zitting
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
Jukka Zitting
 

Destacado (20)

The new repository in AEM 6
The new repository in AEM 6The new repository in AEM 6
The new repository in AEM 6
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
 
The architecture of oak
The architecture of oakThe architecture of oak
The architecture of oak
 
The Zero Bullshit Architecture
The Zero Bullshit ArchitectureThe Zero Bullshit Architecture
The Zero Bullshit Architecture
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
 
File System On Steroids
File System On SteroidsFile System On Steroids
File System On Steroids
 
Apache development with GitHub and Travis CI
Apache development with GitHub and Travis CIApache development with GitHub and Travis CI
Apache development with GitHub and Travis CI
 
Apache Jackrabbit Oak - Scale your content repository to the cloud
Apache Jackrabbit Oak - Scale your content repository to the cloudApache Jackrabbit Oak - Scale your content repository to the cloud
Apache Jackrabbit Oak - Scale your content repository to the cloud
 
Java Content Repository avec Jackrabbit
Java Content Repository avec JackrabbitJava Content Repository avec Jackrabbit
Java Content Repository avec Jackrabbit
 
Hosting huge amount of binaries in JCR
Hosting huge amount of binaries in JCRHosting huge amount of binaries in JCR
Hosting huge amount of binaries in JCR
 
Apache Jackrabbit
Apache JackrabbitApache Jackrabbit
Apache Jackrabbit
 
Rapid JCR Applications Development with Sling
Rapid JCR Applications Development with SlingRapid JCR Applications Development with Sling
Rapid JCR Applications Development with Sling
 
恐るべきApache, Web勉強会@福岡
恐るべきApache, Web勉強会@福岡恐るべきApache, Web勉強会@福岡
恐るべきApache, Web勉強会@福岡
 
Apache Hive 紹介
Apache Hive 紹介Apache Hive 紹介
Apache Hive 紹介
 
Building Content Applications with JCR and OSGi
Building Content Applications with JCR and OSGiBuilding Content Applications with JCR and OSGi
Building Content Applications with JCR and OSGi
 
Demystifying Oak Search
Demystifying Oak SearchDemystifying Oak Search
Demystifying Oak Search
 
Into the TarPit: A TarMK Deep Dive
Into the TarPit: A TarMK Deep DiveInto the TarPit: A TarMK Deep Dive
Into the TarPit: A TarMK Deep Dive
 
Shooting rabbits with sling
Shooting rabbits with slingShooting rabbits with sling
Shooting rabbits with sling
 
Content Management Standards
Content Management StandardsContent Management Standards
Content Management Standards
 

Similar a Content Storage With Apache Jackrabbit

Jackrabbit setup configuration
Jackrabbit setup configurationJackrabbit setup configuration
Jackrabbit setup configuration
prabakaranbrick
 
The Java Content Repository
The Java Content RepositoryThe Java Content Repository
The Java Content Repository
nobby
 
How to start using Scala
How to start using ScalaHow to start using Scala
How to start using Scala
Ngoc Dao
 

Similar a Content Storage With Apache Jackrabbit (20)

Jackrabbit setup configuration
Jackrabbit setup configurationJackrabbit setup configuration
Jackrabbit setup configuration
 
Introduction to Apache Tomcat 7 Presentation
Introduction to Apache Tomcat 7 PresentationIntroduction to Apache Tomcat 7 Presentation
Introduction to Apache Tomcat 7 Presentation
 
GlassFish and JavaEE, Today and Future
GlassFish and JavaEE, Today and FutureGlassFish and JavaEE, Today and Future
GlassFish and JavaEE, Today and Future
 
Jira Rev002
Jira Rev002Jira Rev002
Jira Rev002
 
Hibernate
HibernateHibernate
Hibernate
 
bjhbj
bjhbjbjhbj
bjhbj
 
Java is Container Ready - Vaibhav - Container Conference 2018
Java is Container Ready - Vaibhav - Container Conference 2018Java is Container Ready - Vaibhav - Container Conference 2018
Java is Container Ready - Vaibhav - Container Conference 2018
 
香港六合彩
香港六合彩香港六合彩
香港六合彩
 
Java se7 features
Java se7 featuresJava se7 features
Java se7 features
 
Arquillian in a nutshell
Arquillian in a nutshellArquillian in a nutshell
Arquillian in a nutshell
 
Hibernate java and_oracle
Hibernate java and_oracleHibernate java and_oracle
Hibernate java and_oracle
 
.NET RDF APIs
.NET RDF APIs.NET RDF APIs
.NET RDF APIs
 
Rollin onj Rubyv3
Rollin onj Rubyv3Rollin onj Rubyv3
Rollin onj Rubyv3
 
Dynamic Languages Web Frameworks Indicthreads 2009
Dynamic Languages Web Frameworks Indicthreads 2009Dynamic Languages Web Frameworks Indicthreads 2009
Dynamic Languages Web Frameworks Indicthreads 2009
 
The Java EE 6 platform
The Java EE 6 platformThe Java EE 6 platform
The Java EE 6 platform
 
Java Cloud and Container Ready
Java Cloud and Container ReadyJava Cloud and Container Ready
Java Cloud and Container Ready
 
The Java Content Repository
The Java Content RepositoryThe Java Content Repository
The Java Content Repository
 
How to start using Scala
How to start using ScalaHow to start using Scala
How to start using Scala
 
Jsp and jstl
Jsp and jstlJsp and jstl
Jsp and jstl
 
Practical JRuby
Practical JRubyPractical JRuby
Practical JRuby
 

Más de Jukka Zitting

MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStore
Jukka Zitting
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
Jukka Zitting
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical model
Jukka Zitting
 
Text and metadata extraction with Apache Tika
Text and metadata extraction with Apache TikaText and metadata extraction with Apache Tika
Text and metadata extraction with Apache Tika
Jukka Zitting
 

Más de Jukka Zitting (11)

MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStore
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
 
Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011
 
OSGifying the repository
OSGifying the repositoryOSGifying the repository
OSGifying the repository
 
Repository performance tuning
Repository performance tuningRepository performance tuning
Repository performance tuning
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical model
 
Text and metadata extraction with Apache Tika
Text and metadata extraction with Apache TikaText and metadata extraction with Apache Tika
Text and metadata extraction with Apache Tika
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
NoSQL Oakland
NoSQL OaklandNoSQL Oakland
NoSQL Oakland
 
Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of Jackrabbit
 
Apache Tika
Apache TikaApache Tika
Apache Tika
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Content Storage With Apache Jackrabbit

  • 2. Introducing JCR Content Repository for Java™ Technology API -> Version 1.0 defined in JSR 170, final in 2006 -> Version 2.0 defined in JSR 283, final (hopefully) in 2009 -> Open source RI and TCK developed in Apache Jackrabbit Flexible, hierarchically ordered content store with features like full text search, versioning, transactions, observation, etc. Combines and extends features often found in file systems and databases; a "best of both worlds" approach.
  • 3. Introducing Apache Jackrabbit Entered Apache Incubator in 2004, graduated in 2006, now a TLP with 24 committers Version 1.0 (and JCR RI) in 2006, currently at 1.5.3. Version 2.0 (and JCR 2.0 RI) planned for 2009. Also: JCR Commons, Sling Components: jackrabbit-api jackrabbit-core jackrabbit-standalone jackrabbit-webapp jackrabbit-jca jackrabbit-ocm jackrabbit-jcr-rmi jackrabbit-webdav jackrabbit-jcr-server etc.
  • 4. Structure of a content repository Consists of one or more workspaces, one of which is the default workspace. A workspace consists of a tree of nodes, each of which can have zero or more child nodes. Each workspace has a single root node. Nodes have properties. Properties are typed (string, integer, date,  binary, reference, etc.) and can be either single- or multivalued.
  • 5. Structure, cont. Each node or property has a name, and can be accessed using a path. Names are namespaced. A referenceable node has a UUID, through which it can be accessed or referenced. Referential integrity is guaranteed. Each node has a primary node type and zero or more mixin types. Node types define the structure of a node. A node can also be unstructured.
  • 6. Content repository features Read/write Search (XPath and SQL) XML import/export Observation Versioning Node types Locking Access control Atomic changes XA transactions
  • 7. Getting started with Jackrabbit Download and run the standalone server: java -jar jackrabbit-standalone-1.5.3.jar -> Web interface at http://localhost:8080/ -> WebDAV at http://localhost:8080/repository/default -> JCR access over RMI at http://localhost:8080/rmi -> Repository data in ./jackrabbit -> Configuration in ./jackrabbit/repository.xml If you have a servlet container, use the webapp If you have a J2EE application server, use jca
  • 8. Remote access JCR-RMI layer available since Jackrabbit 0.9. Good functional coverage, not so good performance. -> administrative tools Look at clustering for better performance. Jackrabbit 1.6: spi2dav o.a.j.rmi.repository: new URLRemoteRepository(     "http://.../rmi"); new RMIRemoteRepository(     "//.../repository"); Classpath: jcr-1.0.jar jackrabbit-jcr-rmi-1.5.3.jar jackrabbit-api-1.5.3.jar (also in rmiregistry!)
  • 9. Jackrabbit as a shared resource JCA adapter in an application server -> accessed through JNDI Jackrabbit webapp in a servlet container -> JNDI or servlet context -> JNDI: complex setup -> cross-context access JNDI configuration with jackrabbit-servlet: <servlet>   <servlet-name>Repository</servlet-name>   <servlet-class>   org.apache.jackrabbit.servlet.JNDIRepositoryServlet   </servlet-class>   <init-param>     <param-name>location</param-name>     <param-value<javax/jcr/Repository</param-value>   </init-param> </servlet> Accessing the repository: public class MyServlet extends HttpServlet {     private final Repository repository =         new ServletRepository(this); }
  • 10. Jackrabbit in embedded mode RepositoryConfig config = RepositoryConfig.create(       "/path/to/repository", "/path/to/repository.xml"); RepositoryImpl repository = RepositoryImpl.create(config); try {     // Use the javax.jcr interfaces to access the repository } finally {     repository.shutdown(); }
  • 11. Embedded mode, cont. Only a single instance can be running at a time. For concurrent access: - multiple sessions - RMI for remote access - clustering Also: TransientRepository Maven coordinates <dependency>   <groupId>org.apache.jackrabbit</groupId>   <artifactId>jackrabbit-core</artifactId>   <version>1.5.3</version>   <exclusions>     <exclusion>       <groupId>commons-logging</groupId>       <artifactId>commons-logging</artifactId>     </exclusion>   </exclusions> </dependency> <dependency>   <groupId>org.slf4j</groupId>   <artifactId>slf4j-log4j12</artifactId>   <version>1.5.3</version> </dependency> <dependency>   <groupId>org.slf4j</groupId>   <artifactId>jcl-over-slf4j</artifactId>   <version>1.5.3</version> </dependency>
  • 12. Logging - what's this SLF4J thing? Jackrabbit uses the SLF4J logging facade for logging. Benefits: Great for embedded uses, can adapt to  Drawbacks: What do I put in my classpath?
  • 13. SLF4J in practice SLF4J API is automatically included as a dependency of jackrabbit-core. You need to explicitly add the SLF4J implementation. Jackrabit webapp, jca and standalone use slf4j-log4j, so you can use normal log4j configuration. Classpath with log4j: slf4j-api-1.5.3.jar slf4j-log4j-1.5.3.jar log4j-1.2.14.jar log4j.properties Classpath with no logging: slf4j-api-1.5.3.jar slf4j-nop-1.5.3.jar
  • 14. Repository configuration Configuration in a repository.xml file. Default configuration shipped with Jackrabbit. Structure defined in a DTD. Contains global settings and a workspace configuration template. The template is instantiated to a workspace.xml configuration file for each new workspace. Main elements: clustering, workspace, versioning, security, persistence, search index, data store, file system
  • 15. Persistence managers Use one of the bundle persistence managers. -> select one for your db Other persistence managers mostly for backwards compatibility. Default based on embedded Apache Derby. Configuration: driver,url,user,password schema schemaObjectPrefix minBlobSize bundleCacheSize consistencyCheck/Fix blockOnConnectionLoss Needs CREATE TABLE permissions!
  • 16. Search index Per-workspace search indexes based on Apache Lucene. By default everything is indexed. Use index configuration to customize. Clustering: Each cluster node maintains local search indexes. Configuration: min/maxMergeDocs mergeFactor analyzer textFilterClasses respectDocumentOrder resultFetchSize Performance: property, type, ft -> fast path -> slow
  • 17. Text extraction Full text indexing of file contents based on various parser libraries (POI, PDFBox, etc.). Currently only for the jcr:data property with correct jcr:mimeType. Jackrabbit 2.0: indexing of all binary properties. textFilterClasses: PlainTextExtractor MsWordTextExtractor MsExcelTextExtractor MsPowerPointTextExtractor PdfTextExtractor OpenOfficeTextExtractor RTFTextExtractor HTMLTextExtractor XMLTextExtractor
  • 18. Data store - dealing with lots of data Data store feature available since Jackrabbit 1.4 Content-addressed storage of large binary properties. Completely transparent to client applications. Uses garbage collection to remove unused data. Implementations: FileDataStore DbDataStore sandbox: S3DataStore Garbage collection: gc = si.createDSGC(); gc.scan(); gc.stopScan(); gc.deleteUnused();
  • 19. Content modeling: David's model 1. Data First. Structure Later. Maybe. 2.Drive the content hierarchy, don't let it happen. 3.Workspaces are for clone(), merge() and update() 4.Beware of Same Name Siblings 5.References considered harmful 6.Files are Files are Files 7.ID's are evil
  • 20. Content modeling: Example CREATE TABLE author (     id INTEGER,     name VARCHAR, ); CREATE TABLE post (     id INTEGER,     author INTEGER,     posted DATETIME,     title VARCHAR,     body TEXT ); /blog   /jukka     @name = Jukka Zitting     /2009       /03         /25           /hello             @author = jukka             @posted             @title             @body
  • 21. Example, cont. CREATE TABLE  comment (     id INTEGER,     post INTEGER,     title VARCHAR,     body TEXT ); CREATE TABLE media (     id INTEGER,     post INTEGER,     data BLOB,     caption VARCHAR ); /hello   /comments     /salut       @title = Salut!       @body   /media [nt:folder]     /image.jpg [nt:file]     /code [nt:folder]       /Example.java [nt:file]
  • 22. Example, cont. Versioning: /hello [mix:versionable]   @jcr:versionHistory   @jcr:baseVersion node.checkout(); // ... node.save(); node.checkin(); Locking: /hello [mix:lockable] node.lock(); // ... node.unlock();
  • 23. Common issues: Content hierarchy Jackrabbit doesn't support very flat content hierarchies. You'll start seeing problems when you put more than 10k child nodes under a single parent. Solution: Add more depth to your hierarchy. Divide entries by date, category, author, etc. If nothing else, use the first letter(s) of a title, a content hash, or even a random number to distribute the nodes. Note: You can still access the entries as a single flat set through search. The hierarchy is for browsing.
  • 24. Common issues: Concurrent edits Three ways to handle concurrent edits: 1. Merge changes 2. Fail conflicting changes  3. Block concurrent changes Jackrabbit does 1 by default, and falls back to 2 when merge fails. You can explicitly opt for 3 by using the JCR locking feature. Estimate: How often conflicts would happen? Will the benefits of locking be worth the overhead.