We propose PrivatePond, a system that lets an end user create, store, and search a corpus of web documents hosted by an untrusted service provider, without compromising the confidentiality of the documents in the corpus.
2. Outsourcing Data to the Cloud
- Increase in cloud computing
- Outsource document management to service providers
- Search and retrieve documents from the cloud
- Leverage existing search infrastructure
- High-quality search results
3. Outsourcing Challenge: Confidentiality
- Documents may contain private information
- The service provider and the public should not have access to the contents
- How can we balance confidentiality and search quality?
[Diagram: Web, Intranet, Search Engines]
4. PrivatePond
- Create and store a corpus of confidential hyperlinked documents
- Search confidential documents using an unmodified search engine
- Balance privacy and searchability with a secure indexable representation
[Diagram: Web, Intranet, Search Engines]
5. PrivatePond Design Goals
User Experience:
- Document confidentiality
- Search quality
- Transparency
Search System:
- Minimal overhead
- Leverage existing search infrastructure
Previous work requires modification to the search engine [Song 2000, Bawa 2003, Zerr 2008]
6. Outsourcing Architecture
- Outsource the original corpus
- Does not maintain confidentiality
[Diagram: the User sends documents D and search query Q to the Service, whose (Unmodified) Search Engine returns Ranked Results and Document(s) D]
7. Outsourcing Architecture
- Outsource encrypted documents
- Local proxy encrypts and decrypts
- Local proxy performs the searches
- High search overhead
[Diagram: a Local Proxy sits between the User and the Service; the Service stores E(D), and the User issues query Q and receives Ranked Results and Document(s) D]
8. PrivatePond Architecture
Secure Indexable Representation:
- Attached to encrypted document
- Indexable
- Searchable
[Diagram: the Service stores E(D) together with its Secure Indexable Representation; the Local Proxy turns the User's query Q into Q’ for the (Unmodified) Search Engine and returns Ranked Results and Document(s) D to the User]
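The Q to Q’ step at the local proxy can be sketched as follows. This is a minimal sketch, not the PrivatePond implementation: it assumes that words are mapped to opaque tokens with a keyed HMAC-SHA256 truncated to 16 hex characters, and `SECRET_KEY`, `token`, and `rewrite_query` are hypothetical names introduced for illustration.

```python
import hashlib
import hmac

# Hypothetical key held only by the local proxy (never sent to the service).
SECRET_KEY = b"local-proxy-secret"

def token(word, key=SECRET_KEY):
    """Map a word to a deterministic opaque token (assumed scheme: HMAC-SHA256)."""
    digest = hmac.new(key, word.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def rewrite_query(query, key=SECRET_KEY):
    """Turn the user's query Q into the token query Q' sent to the search engine."""
    return " ".join(token(w, key) for w in query.split())

q_prime = rewrite_query("little star")
print(q_prime)  # two opaque tokens; the plaintext words stay at the proxy
```

Because the same key and mapping are used when documents are indexed, an unmodified keyword search engine can match Q’ against the stored tokens without seeing plaintext.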
9. Outsourcing Search
Practical Tradeoffs… Search Quality vs. Confidentiality
Outsource Original Corpus
- Searchable
- Not confidential
Outsource Encrypted Corpus
- Confidential
- Not easily searched
[Diagram: the Indexable Representation placed on the spectrum between search quality and confidentiality]
10. Sample Indexable Representation
- First, consider encrypting each word in a document
- Maintain links between indexable representations
Vulnerable to attacks:
- Language structure (e.g., <noun> <verb> <noun>)
- Frequency of words (e.g., twinkle is most frequent) [Kumar 2007]
Example: the document "Twinkle, twinkle little star" has the indexable representation "AAA AAA BBB CCC"
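The word-by-word scheme and its frequency leak can be illustrated with a short sketch. The token scheme (truncated HMAC-SHA256) and the names `KEY`, `encrypt_word`, and `word_by_word` are assumptions made for this example; the slides do not specify how words are encrypted.

```python
import hashlib
import hmac
import re
from collections import Counter

KEY = b"demo-key"  # hypothetical secret key

def encrypt_word(word):
    # Deterministic: the same word always yields the same token.
    return hmac.new(KEY, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def word_by_word(document):
    # Keeps word order and repetitions, like AAA AAA BBB CCC on the slide.
    return [encrypt_word(w) for w in re.findall(r"\w+", document)]

tokens = word_by_word("Twinkle, twinkle little star")

# Position and repetition survive encryption, so an observer can count
# tokens exactly as they would count plaintext words.
top_token, top_count = Counter(tokens).most_common(1)[0]
print(tokens)
print(top_token, top_count)
```

The most frequent token corresponds to "twinkle", which is the kind of signal a frequency attack exploits.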
11. Sample Indexable Representation
- Second, represent documents as an encrypted set-of-words
- Prevents attacks on a single indexable representation
- Vulnerable to attacks that aggregate word frequencies across all indexable representations in the corpus
[Diagram: corpus of indexable representations for Doc 1, Doc 2, and Doc 3 over tokens AAA, BBB, CCC, with a chart of aggregate document frequency]
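A small sketch of the set-of-words representation, and of what still leaks across the corpus. As before, the HMAC-based token scheme, the three-document corpus, and the names `encrypt_word` and `set_of_words` are illustrative assumptions, not details from the slides.

```python
import hashlib
import hmac
import re
from collections import Counter

KEY = b"demo-key"  # hypothetical secret key held by the local proxy

def encrypt_word(word):
    # Deterministic keyed token (assumed scheme: truncated HMAC-SHA256).
    return hmac.new(KEY, word.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def set_of_words(document):
    # Each distinct word appears once; order and term frequency are dropped.
    return {encrypt_word(w) for w in re.findall(r"\w+", document)}

corpus = ["Twinkle, twinkle little star", "little star so high", "star light"]
representations = [set_of_words(d) for d in corpus]

# What the service provider can still compute: the number of documents
# in which each token occurs (aggregate document frequency).
doc_freq = Counter(t for rep in representations for t in rep)
for tok, df in doc_freq.most_common(3):
    print(tok, df)
```

Within one document every token now occurs once, but the token for "star" still stands out because it occurs in all three documents.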
12.
13. PrivatePond Indexable Representation
- Set-of-words representation + padding (BW = 3)
[Diagram: corpus of indexable representations over tokens AAA, BBB, CCC, with a chart of aggregate document frequency]
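One way padding could flatten document frequencies is sketched below. The slides do not define BW or the padding procedure, so this sketch rests on assumptions: BW is read as a bucket width, tokens are ranked by document frequency and grouped into buckets of BW tokens, and each token is added to documents that lack it until every token in a bucket has the same document frequency. The function name `pad_corpus` and the choice of which documents receive padding are hypothetical.

```python
from collections import Counter

def pad_corpus(reps, bw):
    """Pad set-of-words representations so tokens in a bucket share one
    document frequency. Assumed procedure; not taken from the slides."""
    padded = [set(r) for r in reps]  # copy so the input is not modified
    df = Counter(t for r in padded for t in r)
    # Rank tokens by document frequency (ties broken by name for determinism).
    ranked = sorted(df, key=lambda t: (-df[t], t))
    for start in range(0, len(ranked), bw):
        bucket = ranked[start:start + bw]
        target = df[bucket[0]]  # highest document frequency in this bucket
        for tok in bucket:
            need = target - df[tok]
            for rep in padded:
                if need == 0:
                    break
                if tok not in rep:
                    rep.add(tok)  # padding token: a false positive at query time
                    need -= 1
    return padded

reps = [{"AAA", "BBB", "CCC"}, {"AAA", "BBB"}, {"AAA"}]
padded = pad_corpus(reps, bw=3)
print(Counter(t for r in padded for t in r))  # every token now has frequency 3
```

Under this reading, BW = 1 puts each token in its own bucket and adds no padding, while larger values of BW hide more frequency information at the cost of more false positives.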
16. Padding of tokens introduces false positives. What is the effect of the indexable representation on search quality?
17. Evaluation
- Data: sample of Simple Wikipedia (Small Corpus) and full Simple Wikipedia (Large Corpus)
- Query workload of 10K queries
- Evaluation performed with MySQL
18. Ranking Models
Ranking models:
- TFIDF (as implemented in MySQL FULLTEXT)
- PageRank
- Combination of ranking models
Measure change in search quality due to the indexable representation
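19. Search Quality Metrics
[Diagram: the Original Corpus and the Indexable Representation are each run through a Search Engine; the ranked results from the Original Corpus form the Gold List, and the ranked results from the Indexable Representation form the Pond List]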
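A comparison of the Gold List and the Pond List over the N highest-ranked documents might look like the sketch below. The slides do not state the exact metric, so the overlap fraction used here is an assumption, and `top_n_overlap` and the sample lists are hypothetical.

```python
def top_n_overlap(gold, pond, n):
    """Fraction of the top-n gold documents that also appear in the top-n pond list."""
    if n <= 0:
        raise ValueError("n must be positive")
    gold_top = set(gold[:n])
    if not gold_top:
        return 0.0
    pond_top = set(pond[:n])
    return len(gold_top & pond_top) / len(gold_top)

# Hypothetical ranked results for one query.
gold_list = ["d1", "d2", "d3", "d4"]  # search over the original corpus
pond_list = ["d2", "d9", "d1", "d3"]  # search over the indexable representation

score = top_n_overlap(gold_list, pond_list, 3)
print(round(score, 3))  # 0.667: two of the top three gold documents are retained
```

A score of 1.0 would mean the indexable representation preserved the top-N results for that query; lower scores reflect lost or displaced documents.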
Consider a small company’s intranet
Offload management responsibilities
Secure boolean search on encrypted documents / Secure inverted indexes for document retrieval
Transparency: seamless interaction for the user
Query run time
Traditional search architecture: a query returns a ranked list of documents
Download each encrypted document to search
So not confidential?
One example of striking a balance between searchability and confidentiality
Impact on Search Quality
- Lose proximity-based search
- Lose term frequency
- Padding of tokens introduces false positives
Given a ranking model, examine the change in search quality; we do not determine the best ranking model
N: the N highest-ranked documents
Meaning of N
BW = 1
Varying confidentiality and search quality characteristics