PrivatePond: Outsourced Management of Web Corpuses
Daniel Fabbri, Arnab Nandi, Kristen LeFevre, H.V. Jagadish
University of Michigan

Outsourcing Data to the Cloud
  • Increase in cloud computing
  • Outsource document management to service providers
  • Search and retrieve documents from the cloud
  • Leverage existing search infrastructure
  • High-quality search results

Outsourcing Challenge: Confidentiality
  • Documents may contain private information
  • The service provider and the public should not have access to the contents
  • How can we balance confidentiality and search quality?
  [Figure labels: Web, Intranet, Search Engines]

PrivatePond
  • Create and store a corpus of confidential hyperlinked documents
  • Search confidential documents using an unmodified search engine
  • Balance privacy and searchability with a secure indexable representation
  [Figure labels: Web, Intranet, Search Engines]

PrivatePond Design Goals
  • User experience: document confidentiality, search quality, transparency
  • Search system: minimal overhead, leverage existing search infrastructure
  • Previous work requires modification to the search engine [Song 2000, Bawa 2003, Zerr 2008]

Outsourcing Architecture
  • Outsource the original corpus
  • Does not maintain confidentiality
  [Diagram labels: D, Service, (Unmodified) Search Engine, Ranked Result, Document(s) D, Q, User Search]

Outsourcing Architecture
  • Outsource encrypted documents
  • Local proxy encrypts and decrypts
  • Local proxy performs the searches
  • High search overhead
  [Diagram labels: E(D), Service, (Unmodified) Search Engine, Local Proxy, Ranked Result, Document(s) D, Q, User Search]

PrivatePond Architecture
  • Secure indexable representation
      • Attached to the encrypted document
      • Indexable
      • Searchable
  [Diagram labels: Secure Indexable Representation, E(D), Service, (Unmodified) Search Engine, E(D), Q', Local Proxy, Ranked Result, Document(s) D, Q, User Search]
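
The proxy step in this architecture (user query Q rewritten to Q' before it reaches the unmodified search engine) might look roughly like the sketch below. Everything named here is an assumption made for illustration: the key, the keyed-hash `token` function standing in for word encryption, and `toy_search`, a trivial stand-in for the outsourced engine. The slides do not specify any of these details.

```python
import hashlib
import hmac
import re

# Hypothetical secret held only by the local proxy (assumption for this sketch).
PROXY_KEY = b"demo-key-not-for-production"


def token(word: str, key: bytes = PROXY_KEY) -> str:
    """Deterministic opaque token for a word (keyed hash used as a stand-in)."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def words(text: str) -> list:
    """Split text into simple word terms."""
    return re.findall(r"[A-Za-z0-9']+", text)


def indexable_representation(document: str) -> str:
    """Space-separated tokens that an ordinary text indexer can index."""
    return " ".join(sorted({token(w) for w in words(document)}))


def rewrite_query(query: str) -> str:
    """Q -> Q': replace each query term with its token before outsourcing it."""
    return " ".join(token(w) for w in words(query))


def toy_search(index: dict, q_prime: str) -> list:
    """Toy engine: return ids of documents sharing at least one token with Q'."""
    q_tokens = set(q_prime.split())
    return sorted(doc_id for doc_id, rep in index.items() if q_tokens & set(rep.split()))


# The service only ever sees tokens, never the plaintext documents or queries.
index = {
    "doc1": indexable_representation("Twinkle, twinkle little star"),
    "doc2": indexable_representation("Outsourcing data to the cloud"),
}
hits = toy_search(index, rewrite_query("Star"))
```

In a full system the proxy would also decrypt the returned E(D) documents; that step is left out because the slides do not name an encryption scheme.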

Outsourcing Search: Practical Tradeoffs…
  [Figure axes: Search Quality, Confidentiality]
  • Outsource original corpus: searchable, not confidential
  • Outsource encrypted corpus: confidential, not easily searched
  • Indexable representation

Sample Indexable Representation
  • First, consider encrypting each word in a document
  • Maintain links between indexable representations
  • Vulnerable to attacks:
      • Language structure (e.g., <noun> <verb> <noun>)
      • Frequency of words (e.g., "twinkle" is most frequent) [Kumar 2007]
  [Example: document "Twinkle, twinkle little star"; indexable representation "AAA AAA BBB CCC"]
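
A minimal sketch of this first representation, and of why word frequencies leak through it. The keyed hash below is an assumed stand-in for per-word encryption, and the key and function names are hypothetical; the slides do not say which primitive is used.

```python
import hashlib
import hmac
import re
from collections import Counter

# Hypothetical secret held only by the local proxy (assumption for this sketch).
SECRET_KEY = b"demo-key-not-for-production"


def token(word: str, key: bytes = SECRET_KEY) -> str:
    """Deterministic opaque token for a word (keyed hash used as a stand-in)."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def per_word_representation(text: str) -> list:
    """Replace every word, in its original position, with its token."""
    return [token(w) for w in re.findall(r"[A-Za-z0-9']+", text)]


rep = per_word_representation("Twinkle, twinkle little star")

# Word order and repetition survive, so an observer can count tokens
# without knowing the key.
counts = Counter(rep)
most_common_token, most_common_count = counts.most_common(1)[0]
```

Because equal words always map to equal tokens, the token for "twinkle" appears twice here, which is exactly the frequency signal the slide warns about.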

Sample Indexable Representation
  • Second, represent documents as an encrypted set-of-words
  • Prevents attacks on a single indexable representation
  • Vulnerable to attacks that aggregate word frequencies across all indexable representations in the corpus
  [Figure labels: Doc 1, Doc 2, Doc 3; tokens AAA, BBB, CCC; Corpus of Indexable Representations; Aggregate Document Frequency]
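
The set-of-words idea and the remaining aggregate attack can be sketched as follows. As before, the keyed-hash `token` function and the key are assumptions for illustration, and the three toy documents are invented to mirror the AAA/BBB/CCC figure.

```python
import hashlib
import hmac
import re
from collections import Counter

# Hypothetical secret held only by the local proxy (assumption for this sketch).
SECRET_KEY = b"demo-key-not-for-production"


def token(word: str, key: bytes = SECRET_KEY) -> str:
    """Deterministic opaque token for a word (keyed hash used as a stand-in)."""
    return hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def set_of_words(text: str) -> frozenset:
    """Drop order and repetition: keep each distinct token once."""
    return frozenset(token(w) for w in re.findall(r"[A-Za-z0-9']+", text))


def document_frequency(representations: list) -> Counter:
    """Count, for each token, how many representations contain it."""
    return Counter(t for rep in representations for t in rep)


corpus = ["Twinkle, twinkle little star", "little star", "star"]
reps = [set_of_words(doc) for doc in corpus]

# Within one document every token now appears exactly once, but across the
# corpus the document frequencies still differ and can be observed.
df = document_frequency(reps)
```

Here the token for "star" occurs in three representations and the token for "twinkle" in one, so frequent and rare words remain distinguishable at the corpus level.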

Sample Indexable Representation
  • Third, set-of-words representation + padding (BW = 3)
  [Figure labels: tokens AAA BBB CCC BBB CCC CCC; Corpus of Indexable Representations; Aggregate Document Frequency]

PrivatePond Indexable Representation
  • Set-of-words representation + padding (BW = 3)
  [Figure labels: tokens AAA BBB CCC AAA BBB CCC AAA BBB CCC; Corpus of Indexable Representations; Aggregate Document Frequency]
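
The slides do not define the padding procedure or what BW stands for. The sketch below is one possible reading, stated as an assumption: BW is treated as a bucket width, tokens are grouped into buckets of BW tokens by document frequency, and fake token occurrences are added until every token in a bucket has the same document frequency. The function name `pad_corpus` and its parameters are hypothetical.

```python
import random
from collections import Counter


def pad_corpus(reps: list, bucket_width: int, seed: int = 0) -> list:
    """Pad set-of-words representations so tokens in a bucket share one frequency.

    Assumed interpretation, not confirmed by the slides.
    """
    rng = random.Random(seed)
    df = Counter(t for rep in reps for t in rep)
    # Order tokens by descending document frequency (ties broken by name).
    ordered = sorted(df, key=lambda t: (-df[t], t))
    padded = [set(rep) for rep in reps]
    for start in range(0, len(ordered), bucket_width):
        bucket = ordered[start:start + bucket_width]
        target = max(df[t] for t in bucket)
        for t in bucket:
            # Add the token to randomly chosen documents that lack it.
            missing = [i for i, rep in enumerate(padded) if t not in rep]
            for i in rng.sample(missing, target - df[t]):
                padded[i].add(t)
    return [frozenset(rep) for rep in padded]


reps = [frozenset({"AAA", "BBB", "CCC"}), frozenset({"BBB", "CCC"}), frozenset({"CCC"})]
padded = pad_corpus(reps, bucket_width=3)
padded_df = Counter(t for rep in padded for t in rep)
```

Under this reading, a padded document can match a query token it never contained, which is consistent with the false positives mentioned later in the deck; with a bucket width of 1 nothing is added.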

PrivatePond Indexable Representation: Impact on Search Quality
  • Lose proximity-based search
  • Lose term frequency
  • Padding of tokens introduces false positives
  What is the effect of the indexable representation on search quality?

Evaluation
  • Data:
      • Sample of Simple Wikipedia (small corpus)
      • Full Simple Wikipedia (large corpus)
  • Query workload of 10K queries
  • Evaluation performed with MySQL

Ranking Models
  • Ranking models:
      • TF-IDF (as implemented in MySQL FULLTEXT)
      • PageRank
      • Combination of ranking models
  • Measure the change in search quality due to the indexable representation

Search Quality Metrics
  [Diagram labels: Original Corpus, Search Engine, Ranked Results: Gold List; Indexable Representation, Search Engine, Ranked Results: Pond List]

Example: Search Quality Metrics
  • N: consider documents ranked from 1 to N
  • P(N) = |gold list ∩ pond list| / N, with both lists cut off at rank N
  • P(3) = 2/3
  • Two additional metrics (included in the paper):

      • Rank Perturbation
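
The P(N) metric from this example can be sketched directly from its definition: the overlap between the top N of the gold list (results on the original corpus) and the top N of the pond list (results on the indexable representation), divided by N. The document ids below are made up so that P(3) comes out to 2/3 as on the slide; the function name is an assumption.

```python
def precision_at_n(gold: list, pond: list, n: int) -> float:
    """P(N): fraction of the top-N gold results that also appear in the top-N pond results."""
    if n <= 0:
        raise ValueError("n must be positive")
    return len(set(gold[:n]) & set(pond[:n])) / n


# Hypothetical ranked result lists for one query.
gold_list = ["d1", "d2", "d3", "d4"]
pond_list = ["d1", "d5", "d2", "d3"]

# Top 3 gold: d1, d2, d3. Top 3 pond: d1, d5, d2. Two documents overlap.
p3 = precision_at_n(gold_list, pond_list, 3)
```

The metric ignores the order inside the top N, so it measures which documents survive the representation, not how far each one moved in the ranking.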

PageRank is unaffected by the set-of-words representation

Padding in documents with high PageRank or low document frequency

Conclusion
  • Present the PrivatePond architecture
  • Outsourcing search
  • Goal of balancing searchability and confidentiality
  • Leverages existing search engine infrastructure
  • Future work: alternative indexable representations

More info at www.eecs.umich.edu/db

Editor's notes

  1. Consider a small company's intranet. Offload management responsibilities.
  2. Secure Boolean search on encrypted documents / secure inverted indexes for document retrieval. Transparency: seamless interaction for the user. Query run time.
  3. Traditional search architecture: a query returns a ranked list of documents.
  4. Download each encrypted document to search.
  5. So not confidential?
  6. One example that strikes a balance between searchability and confidentiality.
  7. Impact on search quality: lose proximity-based search; lose term frequency; padding of tokens introduces false positives.
  8. Given a ranking model, examine the change in search quality; we do not determine the best ranking model. N: the N highest-ranked documents.
  9. Meaning of N.
  10. BW = 1
  11. Varying confidentiality and search quality characteristics.