Biometric Databases and Hadoop (Hadoop Summit 2010)
1. Hadoop for Large-scale Biometric Databases. Jason Trost, Cloud Computing Team, Booz Allen Hamilton
2. Session Agenda: This session shows the application of Hadoop and a large-scale, low-latency distributed fuzzy matching database to biometrics. Background: what you need to know about biometrics. The Problem: Big Data and unordered fuzzy matching. A Solution: Hadoop applications for biometrics.
3. Key Takeaways from this Session: Searching large-scale biometric databases is a hard problem. Hadoop is a potential solution to this problem. Hadoop is a great platform for solving all sorts of Big Data and distributed computing problems, even low-latency searching.
4. Introduction to Biometrics. Biometrics: the science of establishing the identity of an individual based on the physical, chemical, or behavioral attributes of the person.* Modality: a physical or behavioral characteristic of an individual used to establish identity.* Template: a symbolic or numeric representation of a modality optimized for storage and/or matching. Example modalities: iris, face, fingerprint, palm print, gait, hand geometry, signature, ear, voice, keystroke pattern, facial thermogram, vein pattern. * Handbook of Biometrics. A. Jain, P. Flynn, A. Ross.
5.
6. It has many useful applications where establishing identity is important
7. Banks and Financial Services companies are using biometrics to prevent banking and identity fraud
8.
9. Enrollment: Adding New Identities and Biometrics Data to the Database. Collect biographic information from an individual, such as name, address, and SSN. Capture biometric data in raw form (e.g., high-resolution images). Transform the raw biometric data into an encoded biometric template (feature vector). Store all of this information in the biometrics database.
10. Verification: One-to-One Matching. Look up the biometric template for a particular individual. Verify that the stored template and the recently captured template match. Fuzzy matching is used to compare the biometric templates.
11. Identification: One-to-Many Searching. Capture some number of raw biometric features and convert them into biometric templates. Perform fuzzy matching against a large number of stored biometric templates to determine the identity. If latency is not an issue, this is relatively straightforward, especially in MapReduce. It is a hard problem for low-latency applications, and its complexity increases as these databases grow. There is a speed/accuracy tradeoff. The search space can be reduced using clustering techniques, but this only goes so far.
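A minimal Java sketch of the straightforward one-to-many search: a linear scan that scores the probe template against every stored template and keeps the best match at or above a threshold. The class name, the Matcher interface, and the identify method are hypothetical names chosen for illustration; the talk does not show its search code.

```java
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical, not from the talk.
final class LinearIdentification {
    /** Any fuzzy matcher: a higher score means the two templates are more similar. */
    interface Matcher<T> {
        double match(T probe, T stored);
    }

    private LinearIdentification() {}

    /** Returns the id of the best match scoring at least threshold, or null if none does. */
    static <T> String identify(T probe, Map<String, T> gallery,
                               Matcher<T> matcher, double threshold) {
        String bestId = null;
        double bestScore = threshold;
        // Every stored template is scored, so the cost grows linearly with the gallery.
        for (Map.Entry<String, T> entry : gallery.entrySet()) {
            double score = matcher.match(probe, entry.getValue());
            if (score >= bestScore) {
                bestScore = score;
                bestId = entry.getKey();
            }
        }
        return bestId;
    }
}
```

Because every stored template is scored on every query, this approach suits batch jobs but becomes the bottleneck for low-latency identification over hundreds of millions of identities.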
12. What is Fuzzy Matching? Fuzzy matching is an operation performed on two objects that determines how similar the objects are to each other. Typically this operation produces a numeric similarity score. It is necessary when data collected from a sensor is noisy and matching needs to be very accurate. Almost all biometric matching algorithms perform some sort of fuzzy matching: Elastic Bunch Graph Matching (face recognition algorithm); BOZORTH3 (minutiae-based fingerprint matching algorithm); IrisCode (iris matching algorithm). Other examples: image comparison, audio comparison, video comparison.
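To make the idea of a numeric similarity score concrete, here is a small Java sketch that scores two fixed-length bit templates by the fraction of bits that agree, in the spirit of the Hamming-distance comparison used for iris codes. It is a deliberately simplified stand-in, not any of the algorithms named above; the class and method names are hypothetical, and it assumes both templates fit within lengthInBits.

```java
import java.util.BitSet;

// Hypothetical helper, not from the talk: scores two fixed-length bit templates.
final class HammingSimilarity {
    private HammingSimilarity() {}

    /** Returns a similarity in [0, 1]; 1.0 means every bit agrees. */
    static double score(BitSet a, BitSet b, int lengthInBits) {
        if (lengthInBits <= 0) {
            throw new IllegalArgumentException("lengthInBits must be positive");
        }
        BitSet diff = (BitSet) a.clone(); // copy so the inputs are not modified
        diff.xor(b);                      // set bits now mark positions that disagree
        return 1.0 - (double) diff.cardinality() / lengthInBits;
    }

    /** Builds a BitSet from a string of '0' and '1' characters (a convenience for examples). */
    static BitSet fromBits(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                set.set(i);
            }
        }
        return set;
    }
}
```

A caller would accept a match when the score meets a chosen threshold; where that threshold sits is one place the speed/accuracy tradeoff shows up.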
13. Why Fuzzy Matching? Biometric data is inherently noisy and dirty. Conditions are not exactly the same when the original biometric data is captured (Enrollment) and when a new reading occurs (Identification): different types of cameras and sensors made by different companies; partial or smudged fingerprints (e.g., at a crime scene); changes in skin tone, facial hair, and makeup; different lighting conditions; aging and skin damage; weight gain and weight loss; injury. Derived from http://www.flickr.com/photos/glennji/3558118429/. Licensed under Creative Commons.
14. Existing Large-scale Biometric Databases. US Visitor & Immigrant Status Indicator Technology (US-VISIT)*: international travelers’ biometrics (fingerprint and face); collected at US ports of entry, Immigration Services, and the State Department; used to support the Department of Homeland Security's mission. FBI Integrated Automated Fingerprint Identification System (IAFIS)**: used to solve and prevent crime and catch criminals and terrorists; includes fingerprints, criminal histories, mug shots, scar and tattoo photos, physical characteristics such as height, weight, and hair and eye color, and aliases. AllTrust Networks Paycheck Secure System***: uses fingerprints to support secure check cashing; designed to stop fraud and speed check cashing. Plus many more. * One Team, One Mission, Securing our Homeland. US DHS. ** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm *** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx
16. Growth of Biometric Databases. Combined U.S. government biometric databases are expected to grow to hold billions of identities. The DHS’s US-VISIT program has the world’s largest and fastest biometric database (called IDENT), with over 110 million identities and roughly 145,000 identities enrolled or verified daily.* The FBI’s Integrated Automated Fingerprint Identification System (IAFIS) alone holds 66.5 million identities, with 8,000-10,000 more subjects added each day.** India is reportedly creating a biometric database to hold the fingerprints and face images of each of its 1.2 billion citizens as part of its Unique Identification Project.*** The European Union’s Biometric Matching System (EU-BMS) is expected to hold biometric information on 70 million people to support visa applications, border control, and immigration.**** The AllTrust Networks Paycheck Secure system has enrolled over 6 million users and has performed over 70 million transactions.***** * US-VISIT: The world’s largest biometric application. William Graves. ** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm *** http://www.business-standard.com/india/news/national-population-register-to-start-biometrics-data-collectiondec/399135/ **** http://www.findbiometrics.com/articles/i/5220/ ***** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx
17. Biometric Databases are a Big Data Problem. Large-scale operations: searching and storing 100 million to 1 billion identities. Multiple biometric templates and raw files are kept per identity for multimodal matching (fingerprints, faces, and irises). Typically, new raw files and templates are stored after each Verification and Identification operation because biometric readings change over time. Raw images: (500M identities x 16KB-300KB* x 10-20) = 1-2 PB. Biometric templates: (500M identities x 256b-3KB** x 10-20) = 2-27 TB.
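The storage estimate can be checked with a few lines of Java. This sketch assumes that "KB" means 1024 bytes and that "256b" means 256 bytes; neither is stated on the slide, and the class and method names are hypothetical. Under those assumptions the template range comes out at about 2.3 to 27.9 TiB for 20 records per identity, in line with the slide's 2-27 TB. The raw-image figure of 1-2 PB is closest to the 300 KB image size (roughly 1.4 to 2.7 PiB for 10 to 20 records per identity); 16 KB images would need far less.

```java
// Back-of-the-envelope check of the storage estimate.
// Assumptions (not stated on the slide): "KB" = 1024 bytes, "256b" = 256 bytes.
final class StorageEstimate {
    private StorageEstimate() {}

    /** Total bytes for a number of identities, a per-record size, and records per identity. */
    static long totalBytes(long identities, long bytesPerRecord, long recordsPerIdentity) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(Math.multiplyExact(identities, bytesPerRecord), recordsPerIdentity);
    }

    /** Converts a byte count to tebibytes (2^40 bytes). */
    static double toTiB(long bytes) {
        return bytes / Math.pow(1024, 4);
    }

    /** Converts a byte count to pebibytes (2^50 bytes). */
    static double toPiB(long bytes) {
        return bytes / Math.pow(1024, 5);
    }
}
```

For example, `StorageEstimate.toTiB(StorageEstimate.totalBytes(500_000_000L, 3L * 1024, 20))` evaluates the upper bound for templates.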
22. Fuzzy match searches are expensive, and typically a large number of objects must be searched to find a match.
24.
25. MapReduce can be used for improving feature selection by analyzing the entire database to select features that are most effective in distinguishing identities
26. Easy to test and deploy new algorithms against all data at scale
27. N-to-N matching search (a special type of Identification search) can be used to cleanse the database and to find people trying to circumvent the system (identity fraud, etc.)
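A minimal, single-machine Java sketch of the N-to-N idea: compare every template with every other template and report the pairs whose similarity meets a threshold, which are candidates for duplicate or fraudulent enrollments. The names are hypothetical and the talk does not show its implementation; at scale, the pair comparisons would be partitioned across a MapReduce job rather than run in one loop.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; all names are hypothetical, not from the talk.
final class AllPairsMatch {
    /** Any fuzzy matcher: a higher score means the two templates are more similar. */
    interface Matcher<T> {
        double match(T a, T b);
    }

    private AllPairsMatch() {}

    /** Returns index pairs {i, j} with i < j whose similarity is at least threshold. */
    static <T> List<int[]> suspectedDuplicates(List<T> templates,
                                               Matcher<T> matcher, double threshold) {
        List<int[]> pairs = new ArrayList<int[]>();
        // N * (N - 1) / 2 comparisons: quadratic in the number of templates.
        for (int i = 0; i < templates.size(); i++) {
            for (int j = i + 1; j < templates.size(); j++) {
                if (matcher.match(templates.get(i), templates.get(j)) >= threshold) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }
}
```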
28. MapReduce can be used for batched searching where latency doesn’t matter
32. This makes searching faster because only a small subset of the data must be processed
33. This concept is based on work done in academia: Efficient Search and Retrieval in Biometric Databases, by Amit Mhatre, Srinivas Palla, Sharat Chikkerur, and Venu Govindaraju; and Efficient fingerprint search based on database clustering, by Manhua Liu, Xudong Jiang, and Alex Chichung Kot.
34. Bulk Clustering and Real-time Classification. This makes searching for keys faster because only a small subset of the entire dataset needs to be processed using fuzzy matching. The classifier determines which bins need to be searched in order to find the most likely matching keys.
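One way to picture the real-time classification step is a nearest-centroid lookup: given cluster centroids produced offline (for example by the K-means clustering mentioned under Future Work), pick the few bins whose centroids lie closest to the query's feature vector. The Java sketch below is an assumption about how such a classifier could look, with hypothetical names; the talk does not describe the classifier's internals.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; all names are hypothetical, not from the talk.
final class BinClassifier {
    private final double[][] centroids; // one centroid per bin, e.g. from an offline clustering run

    BinClassifier(double[][] centroids) {
        this.centroids = centroids;
    }

    /** Returns the indices of the binsToSearch bins nearest to the probe, nearest first. */
    List<Integer> binsFor(double[] probe, int binsToSearch) {
        List<Integer> order = new ArrayList<Integer>();
        for (int i = 0; i < centroids.length; i++) {
            order.add(i);
        }
        // Sort bin indices by squared Euclidean distance from the probe to each centroid.
        order.sort(Comparator.comparingDouble((Integer i) -> squaredDistance(probe, centroids[i])));
        return new ArrayList<Integer>(order.subList(0, Math.min(binsToSearch, order.size())));
    }

    private static double squaredDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int d = 0; d < x.length; d++) {
            double delta = x[d] - y[d];
            sum += delta * delta;
        }
        return sum;
    }
}
```

Searching more bins raises the chance of finding the true match at the cost of more fuzzy comparisons, which is the speed/accuracy tradeoff in miniature.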
35. Fuzzy Table: Data Storage and Bins. Bins are represented as directories in HDFS containing one or more chunk files (stored as SequenceFiles): /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001. Chunk files contain many {Key, Value} pairs and are a small multiple of the HDFS block size. Chunk files are distributed uniformly and randomly across the Data Servers in the cluster. This ensures that the bins are striped across the cluster for optimal parallel searching. Chunk files are also replicated across the Data Servers using the replication mechanism in HDFS. Data Servers only search through chunk files that reside locally, and results are returned in real time as soon as a match is found.
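The directory layout can be sketched as a small Java path helper. The six-digit zero padding and the numbering of chunks from 1 are inferred from the single example path on the slide, so treat them as assumptions; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; padding width and numbering are inferred from one example path.
final class FuzzyTablePaths {
    private FuzzyTablePaths() {}

    /** Directory that holds all chunk files of one bin. */
    static String binPath(String table, int bin) {
        return String.format(Locale.ROOT, "/fuzzytable/_table_%s/_bin_%06d", table, bin);
    }

    /** Path of a single chunk file inside a bin directory. */
    static String chunkPath(String table, int bin, int chunk) {
        return String.format(Locale.ROOT, "%s/_chunk_%06d", binPath(table, bin), chunk);
    }

    /** Paths of chunks 1..chunkCount of a bin, in order. */
    static List<String> chunkPaths(String table, int bin, int chunkCount) {
        List<String> paths = new ArrayList<String>();
        for (int chunk = 1; chunk <= chunkCount; chunk++) {
            paths.add(chunkPath(table, bin, chunk));
        }
        return paths;
    }
}
```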
36. Fuzzy Table: Low Latency Fuzzy Matching Component. The low-latency component consists of three main parts. Client: submits queries for Keys and gets back {Key, Value} pairs. Master Server: serves metadata about which Data Servers host which bins. Data Servers: actually perform the fuzzy matching searches. Data Servers perform fuzzy matching against Keys in order to find {Key, Value} records: double score = fuzzyMatcher.match(key, storedRec.getKey()); if (score >= threshold) return storedRec; Fuzzy matching searches are performed in parallel across many Data Servers.
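The following Java sketch expands the slide's two-line snippet into a self-contained example, with threads standing in for Data Servers. Only the names fuzzyMatcher, storedRec, getKey, and threshold come from the slide; the FuzzyMatcher interface, the StoredRec class, and the search methods are assumptions made for illustration. It also simplifies one point: hits are merged after every partition finishes, whereas the talk describes returning results as soon as a match is found.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only; threads stand in for Data Servers.
final class ParallelFuzzySearch {
    interface FuzzyMatcher<K> {
        double match(K query, K stored);
    }

    static final class StoredRec<K, V> {
        private final K key;
        private final V value;
        StoredRec(K key, V value) { this.key = key; this.value = value; }
        K getKey() { return key; }
        V getValue() { return value; }
    }

    private ParallelFuzzySearch() {}

    /** Scans one server's local records using the slide's score >= threshold test. */
    static <K, V> List<StoredRec<K, V>> searchLocal(K query, List<StoredRec<K, V>> localRecords,
            FuzzyMatcher<K> fuzzyMatcher, double threshold) {
        List<StoredRec<K, V>> hits = new ArrayList<StoredRec<K, V>>();
        for (StoredRec<K, V> storedRec : localRecords) {
            double score = fuzzyMatcher.match(query, storedRec.getKey());
            if (score >= threshold) {
                hits.add(storedRec);
            }
        }
        return hits;
    }

    /** Searches each partition on its own thread and merges hits in partition order. */
    static <K, V> List<StoredRec<K, V>> search(K query, List<List<StoredRec<K, V>>> partitions,
            FuzzyMatcher<K> fuzzyMatcher, double threshold) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<List<StoredRec<K, V>>>> futures =
                    new ArrayList<Future<List<StoredRec<K, V>>>>();
            for (List<StoredRec<K, V>> partition : partitions) {
                Callable<List<StoredRec<K, V>>> task =
                        () -> searchLocal(query, partition, fuzzyMatcher, threshold);
                futures.add(pool.submit(task));
            }
            List<StoredRec<K, V>> merged = new ArrayList<StoredRec<K, V>>();
            for (Future<List<StoredRec<K, V>>> future : futures) {
                merged.addAll(future.get()); // blocks until that partition is done
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("partition search failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```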
45. Future Work. Fuzzy Table is still a research prototype, but we plan to keep building it out to support this biometrics work. Locality Sensitive Hashing instead of K-means clustering for binning and search space reduction. Distributed/replicated master servers (and ZooKeeper integration). Real-time ingest. Hopefully we will have performance/scalability metrics, as well as more features and example applications, to share within the next few months.
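For readers unfamiliar with Locality Sensitive Hashing, the Java sketch below shows one simple variant, bit sampling for bit-string templates: a bucket id is formed from a fixed random subset of bit positions, so templates that differ in only a few bits often land in the same bucket. This is a generic illustration under the assumption of fixed-length bit templates, with hypothetical names; it is not the design planned for Fuzzy Table, which the talk does not describe.

```java
import java.util.BitSet;
import java.util.Random;

// Generic bit-sampling LSH sketch; not the Fuzzy Table design.
final class BitSamplingLsh {
    private final int[] sampledPositions;

    /** Chooses bitsPerHash random bit positions (1 to 31) out of templateBits. */
    BitSamplingLsh(int templateBits, int bitsPerHash, long seed) {
        if (templateBits <= 0 || bitsPerHash < 1 || bitsPerHash > 31) {
            throw new IllegalArgumentException("need templateBits > 0 and 1 <= bitsPerHash <= 31");
        }
        Random random = new Random(seed); // fixed seed so every server picks the same positions
        sampledPositions = new int[bitsPerHash];
        for (int i = 0; i < bitsPerHash; i++) {
            sampledPositions[i] = random.nextInt(templateBits);
        }
    }

    /** Concatenates the sampled bits into a bucket id in [0, 2^bitsPerHash). */
    int bucketOf(BitSet template) {
        int bucket = 0;
        for (int position : sampledPositions) {
            bucket = (bucket << 1) | (template.get(position) ? 1 : 0);
        }
        return bucket;
    }
}
```

In practice several independent hash functions are used together, so that a near match missed by one bucket is likely to be caught by another.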
46. Conclusion. Searching large-scale biometric databases is a hard problem. Hadoop is a potential solution to this problem. We used MapReduce for bulk processing to enable distributed low-latency fuzzy matching over HDFS. Hadoop is a great platform for solving all sorts of Big Data and distributed computing problems, even low-latency searching.
47. Contributors. Cloud Computing Team: Jason Trost, Lalit Kapoor, Daniel Neuberger, Michael Beck, Edmund Kohlwey, Josh Sullivan. Identity Management/Biometrics Team: Abel Sussman, Eric Karlinsky, Deanna Walters, Joel Rader, Allen Wight.
49. Contact Information: Cloud Computing Team. Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, Maryland 20701. Joshua Sullivan, Senior Associate, (301) 543-4611, sullivan_joshua@bah.com. Lalit Kapoor, Senior Consultant, (301) 821-8000, kapoor_lalit@bah.com. Michael Beck, Senior Consultant. Daniel Neuberger, Senior Consultant. Jason Trost, Associate, (301) 543-4400, trost_jason@bah.com. Edmund Kohlwey, Consultant, (301) 617-3523, kohlwey_edmund@bah.com.
50. Contact Information: Identity Management Team. Booz Allen Hamilton Inc., 13200 Woodland Park Rd., Herndon, VA 20171. Joel Rader, Identity Analyst, (703) 984-0312, rader_joel@bah.com. Eric Karlinsky, Identity Analyst, (703) 984-3532, Karlinsky_eric@bah.com. Deanna Walters, Biometrics Analyst, (703) 984-1982, walters_deanna@bah.com. Allen Wight, Biometrics Analyst, (703) 984-1978, wight_allen@bah.com. Abel Sussman, Biometrics Subject Matter Expert, (703) 984-7663, sussman_abel@bah.com.