PDF Association Technical Conference, June 18-19, 2013

PDF and Microsoft SharePoint
Hurdles to Overcome

Neil Pitman
Aquaforest Limited

Version 1.120613
Objective

PDF as a SharePoint “First Class Citizen”

Agenda

• Objectives
• SharePoint Overview
• PDF Capture
• PDF Search
  - iFilters
  - Handling Image and Mixed-Mode PDFs
• PDF Metadata
  - Dictionary, XMP and Entity Extraction
• Configuration
  - SharePoint 2010, 2013
• Summary

SharePoint Overview

Microsoft SharePoint Server - 125 million licenses sold
SharePoint to be a natural target for PDF storage

• What is SharePoint?
  - On-Premise and Cloud-based Collaboration & Document Management Platform
• Origin - 2001
• Usage
  - Focus on MS Office Documents
  - Typically distributed capture
• SharePoint Editions (2010, 2013)
  - Foundation
  - Standard
  - Enterprise
• Office 365 / SharePoint Online
• Ecosystem
  - Partner Products
  - Office / SharePoint Marketplace

SharePoint Architecture Overview

• MS Web-based (IIS)
• MS Office Integration
• SQL Server Storage
  - List or library data in a site collection is stored in a SQL Server database table, which uses queries, indexes and locks to maintain overall performance, sharing, and accuracy.
  - Filtered views with column indexes (and other operations) create database queries that identify a subset of columns and rows and return this subset to your computer.
  - Thresholds and limits help throttle operations and balance resources for many simultaneous users.
  - Privileged developers can use object model overrides to temporarily increase thresholds and limits for custom applications.
  - Administrators can specify dedicated time windows for all users to do unlimited operations during off-peak hours.
  - Information workers can use appropriate views, styles, and page limits to speed up the display of data on the page.

Microsoft Technology Stack

• Windows Server 2008/2012
• Internet Information Services (IIS)
• .NET Framework
• SQL Server
• MS Office

PDF Capture for SharePoint

• Options
  - SharePoint UI
  - Acrobat XI
  - Load Tools
  - Custom Code
  - Workflow & Event Receivers

// Requires: using System.IO; using System.Net;
// destUrl is the full URL of the target file in the document library and
// fileBytes holds the PDF content; both are supplied by the caller.
WebRequest request = WebRequest.Create(destUrl);
request.Credentials = CredentialCache.DefaultCredentials;
request.Method = "PUT";

// fileBytes is already in memory, so it can be written in a single call.
using (Stream stream = request.GetRequestStream())
{
    stream.Write(fileBytes, 0, fileBytes.Length);
}

// GetResponse throws a WebException if the server rejects the upload.
using (WebResponse response = request.GetResponse())
{
    Logging.Log("Upload successful");
}

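The upload listing takes `destUrl` as given. A minimal sketch of how that URL could be assembled is shown below, assuming the file goes directly into the root of a document library. `UploadUrl.Build` and its parameter names are hypothetical and not part of the original slides, and the library segment is assumed to be the library's URL name, which can differ from its display title.

```csharp
using System;

public static class UploadUrl
{
    // Hypothetical helper: joins a site URL, a library URL name and a file name
    // into the destination URL for an HTTP PUT, percent-encoding each segment.
    public static string Build(string siteUrl, string library, string fileName)
    {
        if (string.IsNullOrEmpty(siteUrl)) throw new ArgumentException("siteUrl is required", "siteUrl");
        if (string.IsNullOrEmpty(library)) throw new ArgumentException("library is required", "library");
        if (string.IsNullOrEmpty(fileName)) throw new ArgumentException("fileName is required", "fileName");

        // Spaces and other reserved characters must be escaped in each path segment.
        return siteUrl.TrimEnd('/')
            + "/" + Uri.EscapeDataString(library)
            + "/" + Uri.EscapeDataString(fileName);
    }
}
```

For example, `UploadUrl.Build("http://server/sites/docs/", "Shared Documents", "Q1 report.pdf")` yields `http://server/sites/docs/Shared%20Documents/Q1%20report.pdf`, where the server and site names are placeholders.
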
Acrobat XI SharePoint Integration

http://www.adobe.com/uk/products/acrobat/pdf-version-control-sharepoint-integration.html

PDF Search in SharePoint Overview

iFilter Architecture

iFilters scan documents for text and attributes, primarily in support of Microsoft Search technologies.

iFilter Configuration

• Architecture
• Code Sample
• Suppliers
• Issues

PDF Search in SharePoint: iFilters

• iFilter Explorer

Using iFilters directly in Code

https://gist.github.com/jimschubert/1473904

// Caller: extract the text of a PDF through whichever iFilter is registered for it.
StringBuilder Buffer = new StringBuilder();
string PDFFile = @"C:\dev\PDF Conferences.pdf";
FilterCode f = new FilterCode();
f.GetTextFromDocument(PDFFile, ref Buffer);
Console.WriteLine(Buffer);

// Members of FilterCode. Requires: using System; using System.Runtime.InteropServices; using System.Text;
// IFilter, IFILTER_INIT, IFILTER_FLAGS, IFilterReturnCodes and STAT_CHUNK are COM interop
// declarations that must be supplied separately (for example from the gist linked above).
[DllImport("query.dll", SetLastError = true, CharSet = CharSet.Unicode)]
static extern int LoadIFilter(
    string pwcsPath,
    [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter,
    ref IFilter ppIUnk);

public void GetTextFromDocument(string Path, ref StringBuilder Buffer)
{
    IFilter filter = null;
    int hresult;
    IFilterReturnCodes rtn;
    // Initialize the return buffer to 64K.
    Buffer = new StringBuilder(64 * 1024);
    // Try to load the filter for the path given (no outer COM object, hence null).
    hresult = LoadIFilter(Path, null, ref filter);
    if (hresult == 0)
    {
        IFILTER_FLAGS uflags;
        // Init the filter provider.
        rtn = filter.Init(
            IFILTER_INIT.IFILTER_INIT_CANON_PARAGRAPHS |
            IFILTER_INIT.IFILTER_INIT_CANON_HYPHENS |
            IFILTER_INIT.IFILTER_INIT_CANON_SPACES |
            IFILTER_INIT.IFILTER_INIT_APPLY_INDEX_ATTRIBUTES |
            IFILTER_INIT.IFILTER_INIT_INDEXING_ONLY,
            0, new IntPtr(0), out uflags);
        if (rtn == IFilterReturnCodes.S_OK)
        {
            STAT_CHUNK statChunk;
            // The listing on the slide stops here; the chunk and text reading loop is not shown.
        }
    }
}

iFilter Test

[Figure: test PDF containing Text, Image/OCR Text, Bookmark, Annotation, Dictionary Metadata, XMP Metadata and a PDF Attachment]

iFilter Test Results

[Table: columns Adobe iFilter, Foxit iFilter, Microsoft Format Handler, PDFlib iFilter; rows Body Text, Bookmarks, Dictionary Metadata, Annotations, XMP Metadata, PDF Attachment. The per-cell results are not recoverable from this transcript.]

Dealing with Image and Mixed-Mode PDFs

• Classify:
  - Image-Only
  - Born-Digital
  - Part Image-Only, Part Born-Digital
  - Previously OCRed
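
The four classes can be told apart from two per-page facts: whether the page is essentially a scanned image, and whether it carries any text. The sketch below illustrates that decision only. `PageInfo`, `PdfKind` and `PdfClassifier` are hypothetical names, the two flags would have to come from a PDF library, and treating a document whose scanned pages all carry text as "Previously OCRed" is an assumption rather than a rule from the slides.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum PdfKind { ImageOnly, BornDigital, Mixed, PreviouslyOcred }

// Hypothetical per-page summary; a PDF library would have to supply these flags.
public sealed class PageInfo
{
    public bool HasFullPageImage; // the page is essentially one scanned image
    public bool HasText;          // the page carries text (native or an OCR layer)

    public PageInfo(bool hasFullPageImage, bool hasText)
    {
        HasFullPageImage = hasFullPageImage;
        HasText = hasText;
    }
}

public static class PdfClassifier
{
    public static PdfKind Classify(IList<PageInfo> pages)
    {
        if (pages == null || pages.Count == 0)
            throw new ArgumentException("At least one page is required.", "pages");

        int scans = pages.Count(p => p.HasFullPageImage);
        int scansWithoutText = pages.Count(p => p.HasFullPageImage && !p.HasText);

        if (scans == 0) return PdfKind.BornDigital;                     // nothing was scanned
        if (scansWithoutText == 0) return PdfKind.PreviouslyOcred;      // every scan already has text
        if (scansWithoutText == pages.Count) return PdfKind.ImageOnly;  // every page still needs OCR
        return PdfKind.Mixed;                                           // only some pages need OCR
    }
}
```

Classifying before OCR means only the `ImageOnly` and `Mixed` documents need processing, and in the mixed case only the pages without text.
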

• Objectives:
  - Ensure Full Searchability
  - Avoid Text to Image Processing
• Process:
  - Capture Time?
  - Scheduled In-Place?

PDF Metadata in SharePoint

• Text Search vs Metadata Search
  - Crawled vs Managed Properties
• Review Requirements
  - Dictionary Metadata
  - XMP Metadata
  - Entity Extraction
• Consider Automation
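
As an illustration of the XMP option, the sketch below locates an XMP packet in the raw bytes of a file and reads `dc:title` from it. It assumes the metadata stream is stored uncompressed and wrapped in an `x:xmpmeta` element, which is common but not guaranteed. `XmpReader` and its method names are hypothetical, and a production solution would normally use a PDF library instead of scanning bytes.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class XmpReader
{
    static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";
    static readonly XNamespace Rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns the first <x:xmpmeta>...</x:xmpmeta> block in the file, or null if none is found.
    public static string FindPacket(byte[] fileBytes)
    {
        if (fileBytes == null) throw new ArgumentNullException("fileBytes");

        // ISO-8859-1 maps each byte to exactly one char, so string offsets equal byte offsets.
        string raw = Encoding.GetEncoding("ISO-8859-1").GetString(fileBytes);
        int start = raw.IndexOf("<x:xmpmeta", StringComparison.Ordinal);
        if (start < 0) return null;

        const string endTag = "</x:xmpmeta>";
        int end = raw.IndexOf(endTag, start, StringComparison.Ordinal);
        if (end < 0) return null;

        // Assumption: the packet is UTF-8 encoded, so decode only that slice as UTF-8.
        return Encoding.UTF8.GetString(fileBytes, start, end + endTag.Length - start);
    }

    // dc:title is stored as a language alternative: dc:title/rdf:Alt/rdf:li.
    public static string ReadTitle(string xmpPacket)
    {
        XDocument doc = XDocument.Parse(xmpPacket);
        XElement li = doc.Descendants(Dc + "title").Descendants(Rdf + "li").FirstOrDefault();
        return li == null ? null : li.Value;
    }
}
```

A value read this way could be copied into a library column at capture time, which is one way to approach the automation point above.
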

PDF Metadata in SharePoint: Crawled vs Managed Properties

PDF Metadata in SharePoint: Using Event Receivers

• Event Receivers can enable Metadata assignment
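
An event receiver that assigns metadata has to convert PDF values into column types first. Dates in the PDF document information dictionary use the form `D:YYYYMMDDHHmmSSOHH'mm'`, where everything after the year is optional. The sketch below converts such a string to a `DateTimeOffset`. `PdfDate` is a hypothetical helper, missing parts are assumed to default to the first month or day, midnight and UTC, and the SharePoint side (for example an `ItemAdded` override on an `SPItemEventReceiver`) is deliberately left out because it only runs inside a farm.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class PdfDate
{
    // Year is mandatory; month, day, hour, minute, second and the time zone are optional.
    static readonly Regex Pattern = new Regex(
        @"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:([Zz+\-])(?:(\d{2})'?(\d{2})?'?)?)?$");

    // Returns null when the value is not a valid PDF date string.
    public static DateTimeOffset? Parse(string value)
    {
        if (string.IsNullOrEmpty(value)) return null;
        Match m = Pattern.Match(value.Trim());
        if (!m.Success) return null;

        TimeSpan offset = TimeSpan.Zero; // "Z" or no zone is treated as UTC
        string sign = m.Groups[7].Value;
        if (sign == "+" || sign == "-")
        {
            offset = new TimeSpan(Part(m, 8, 0), Part(m, 9, 0), 0);
            if (sign == "-") offset = offset.Negate();
        }

        try
        {
            return new DateTimeOffset(
                Part(m, 1, 1), Part(m, 2, 1), Part(m, 3, 1),
                Part(m, 4, 0), Part(m, 5, 0), Part(m, 6, 0), offset);
        }
        catch (ArgumentException)
        {
            return null; // out-of-range field such as month 13
        }
    }

    static int Part(Match m, int group, int fallback)
    {
        return m.Groups[group].Success
            ? int.Parse(m.Groups[group].Value, CultureInfo.InvariantCulture)
            : fallback;
    }
}
```
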

PDF Metadata in SharePoint: Entity Extraction
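
In its simplest form, entity extraction is pattern matching over the text that an iFilter or OCR step returns. The sketch below pulls e-mail addresses and invoice numbers out of a block of text. `EntityExtractor` is a hypothetical name and both patterns, in particular the `INV-` prefix, are illustrative assumptions; real documents need patterns tuned to their content, and commercial products typically go well beyond regular expressions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EntityExtractor
{
    // Illustrative patterns only.
    static readonly Regex Email = new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}");
    static readonly Regex InvoiceNumber = new Regex(@"\bINV-\d{4,}\b");

    // Returns the distinct matches for each entity type, in order of first appearance.
    public static Dictionary<string, List<string>> Extract(string text)
    {
        var result = new Dictionary<string, List<string>>();
        result["Email"] = Matches(Email, text);
        result["InvoiceNumber"] = Matches(InvoiceNumber, text);
        return result;
    }

    static List<string> Matches(Regex pattern, string text)
    {
        var values = new List<string>();
        foreach (Match m in pattern.Matches(text ?? string.Empty))
        {
            if (!values.Contains(m.Value)) values.Add(m.Value);
        }
        return values;
    }
}
```

Each extracted value could then be written to a column, where it becomes available to metadata search.
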

Configuration

• SharePoint 2010
• SharePoint 2013

SharePoint 2010 PDF Configuration

• Missing icon and iFilter

http://www.adobe.com/devnet-docs/acrobatetk/tools/AdminGuide/Acrobat_Reader_IFilter_configuration.pdf

SharePoint PDF Configuration

• Default for PDF: "X-Download-Options: noopen" added to the HTTP response header
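
Whether a given site still sends this header can be checked from code. The sketch below inspects a set of response headers for `X-Download-Options: noopen`. `InlineViewCheck` is a hypothetical helper, and the headers are assumed to come from the `Headers` property of a `WebResponse` obtained for a PDF URL in the library.

```csharp
using System;
using System.Net;

public static class InlineViewCheck
{
    // True when the headers instruct the browser to download the file rather than open it in place.
    public static bool ForcesDownload(WebHeaderCollection headers)
    {
        if (headers == null) throw new ArgumentNullException("headers");

        string value = headers["X-Download-Options"];
        return value != null
            && value.Trim().Equals("noopen", StringComparison.OrdinalIgnoreCase);
    }
}
```

If the check returns true, PDFs will prompt for download instead of opening in the browser, and the web application's Browser File Handling setting is the usual place to look.
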

SharePoint 2013 and PDF Configuration

• PDF Format Handler Support
• Currently no iFilter Support for PDF!?
• Inline viewing of PDFs in SharePoint 2013

http://stevemannspath.blogspot.co.uk/2012/10/sharepoint-2013-pdf-preview-in-search.html
http://stevemannspath.blogspot.co.uk/2013/04/sharepoint-2013-pdf-support-and.html

Summary

• Microsoft SharePoint Server - 125 million licenses sold
• SharePoint to be a natural target for PDF storage
• PDF as a SharePoint “First Class Citizen”

Contact: neil.pitman@aquaforest.com
