Federal, state, and local governments and the development community surrounding them are busy creating solutions that leverage the Apache Foundation Hadoop capabilities. This session will highlight the top five solutions selected by an all-star panel of judges. Who will take home the coveted Hadoop Award for Government Excellence (Haggie)? Nominations for Haggies are being accepted now at http://CTOlabs.com
Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley - Crucial Point LLC
1. Government Big Data Solutions Award
Bob Gourley
CTOlabs.com http://ctolabs.com Nov 2011
2. About This Presentation:
• How can we help accelerate public sector innovation?
• Top Federal Mission Needs for Big Data
• The State of Big Data Solutions in the Federal Space
• The Intent of the Government Big Data Solutions Award
• Criteria
• Judges
• Top Nominees for 2011
• How to Nominate for 2012
• The Judges Choice for 2011
CTOlabs.com 2
4. The Government Needs More Agility*
"High tech runs three times faster than normal businesses. And the
government runs three times slower than normal businesses.
So we have a nine-times gap."
– Andy Grove
The government can rapidly benefit from the lessons of high tech
by being a faster follower, especially when it comes to Big Data
constructs
Thesis: If the Big Data community understands more about federal
missions, challenges and successes, we can improve the speed
and effectiveness of federal solutions.
*Among other needs
5. Top Federal Mission Needs for Big Data
Financial fraud detection across large, rapidly changing data sets
Cyber Security: rapid real time analysis of all relevant data
Rapid return of geospatial data based on query
Location based push of data: Focused on emergency response
Real time return of relevant search: USA.gov is exemplar
Real time suggestion of topics: USA.gov is exemplar
Real time suggestion of correlations: DoD has many use cases
Bioinformatics: Human Genome
Bioinformatics: Patient location, treatment, outcomes
These needs must be met in an era of significant downward pressure on budgets.
Scalable systems with well thought out governance & extensive automation are key.
6. Most active fed solution areas:
Federal integrators: Spending internal research and development
funds to create prototypes and full solutions relevant to fed
missions
DoD and IC agencies: Using Big Data approaches to solve
"needle in the haystack" and "connect the dots" problems
National Labs: Bioinformatics solutions have been put in place by
federal researchers
OMB and GSA: Ensuring sharing of lessons and solutions. Key
exemplars around web search methods. Solutions inside
government agencies and on citizen facing properties
Big Data solutions are already making a difference in government service to
citizens. Highlighting some of this virtuous work is a goal of our Government
Big Data Solutions Award.
7. The Intent of the Government Big Data
Solutions Award
Established to help facilitate exchange of best practices, lessons
learned and creative ideas for solutions to hard data challenges
Special focus on solutions built around Apache Hadoop framework
Nominees and award winners to be written up in CTOlabs.com
technology reviews
Award meant to help generate exchange of lessons learned
We established a team of judges, asked them to consider mission impact as
primary criteria, and solicited award nominations via sites frequented by
government IT professionals and solution providers.
8. Judges
Doug Cutting: An advocate and creator of open source search
technologies (@cutting)
Chris Dorobek: Founder, editor, publisher of DorobekInsider.com
(@DorobekINSIDER)
Ed Granstedt: QinetiQ Strategic Solution Center
Ryan LaSalle: Accenture Technology Labs (@Labsguy)
Alan Wade: Experienced federal CIO
Judges are all experienced innovators known for mastery in their fields
9. Top Nominees for 2011
USA Search: Best in class hosted search services over more than
400 gov sites. Great use of CDH3.
GCE Federal: Cloud-based financial management solutions.
Apache Hadoop, HBase, Lucene for Dept of Labor.
PNNL Bioinformatics: Leading researcher Dr. Taylor of PNNL is
advancing understanding of health, biology, genetics and computing
using Apache Hadoop/MapReduce/HBase.
SherpaSurfing: Use of CDH as a cybersecurity solution. Ingest
packet capture in any format, analyze trends, find malware, alert.
US Department of State: Bureau of Consular Affairs. Large data
with important applications for citizen service and national security.
Each of these is making a difference for government missions right now.
11. How to Nominate for 2012
Click Here.
Fill In Form.
Hit “Submit”
• We expect (and hope for) a much more crowded field of contenders next year.
• Please let us know if you are working on things that feds should be aware of.
• You can also submit technologies for review on our site.
12. Special Mention
Department of State
Consular Consolidated Database
13. Department of State (DoS), Bureau of Consular
Affairs (CA) Consular Consolidated Database
(CCD)
CCD is critical to citizen support and important in facilitating lawful
visits to US
First line of defense against unlawful entry
Largest connected/replicating database structure in the government
Pre-screening visa applicants, helps adjudicators weed out fraud
Used by multiple agencies
Very smart use of current data approaches to solve hard problems
16. USA Search
Program of General Services Administration's (GSA) Office of
Citizen Services and Information Technologies.
Hosted search services for USA.gov and over 500 other
government websites.
Solves big data challenges with open source capabilities.
CDH3 since fall 2010. HDFS, Hadoop and Hive used in cost
effective, resilient, scalable solution.
Search Results. Search Suggestions. Trend analysis. Analytic
dashboards.
Bottom Line: USA Search brings the best of the open source community to
multiple government missions, including direct citizen support
21. Department of State (DoS), Bureau of Consular
Affairs (CA) Consular Consolidated Database
(CCD)
•Bureau of Consular Affairs issues travel documents to U.S. and foreign citizens. CA stores data collected from
consular posts abroad and domestic processing centers, as well as other government agencies in the Consular
Consolidated Database (CCD).
•CCD holds over 115 terabytes of data, growing by 6-8 terabytes each month. Over 170 software
applications collect this information and provide interfaces with the numerous partner agencies that share data
with CA.
•CCD is the "largest connected/replicating database structure in the government."
•Most of these applications use a 'case' (such as a visa or passport application), and not a person record, as the
basis of their data storage and retrieval. At the application level, it is extremely difficult to link person information
in one application to potentially-matching person information contained in another application. A person could
apply for a visa at one location, and then apply at another location under a different name, and an adjudicator
may not be able to establish the link between the cases. The CCD can leverage all available data elements from
all applications throughout the system in order to determine all of the potential identity matches of any given
person that CA has encountered.
•The CCD also contains unstructured data, such as free-form comments or case notes. The CCD must deal with
millions of large image files, such as applicant photos or scanned documents. The CCD's powerful, custom-built
analytical tools synthesize the complex data captured by CA with the equally-complex data received from other
agencies. The CCD thus gives its users the ability to make informed decisions, detect and prevent fraud, and
identify potential national security threats.
22. Department of State (DoS), Bureau of Consular
Affairs (CA) Consular Consolidated Database
(CCD)
•CCD is based on Oracle tools.
•The CCD can pre-screen a visa record before an adjudicator even looks at it. The CCD provides the means to
conduct vetting checks against various government databases.
•Due to the wide variety of resources used by the CCD, the system can establish links between two applicants
using completely different names. With each subsequent encounter, the CCD creates additional links, resulting in
a searchable, fully cross-referenced web of information that traces a person's activities across all of CA data. By
being able to see these links in a person-centric view, the adjudicators have a broader, more complete, and more
easily-accessible set of data with which to make better-informed decisions.
•The CCD automatically initiates biometric checks. The CCD automatically looks for fraud indicators. The CCD
captures all of the data entered during the process and automatically creates cross-references using the new
data.
•The CCD has transformed CA's mission delivery by breaking the paradigm of data isolated in independent
databases.
•The CCD allows staff to focus its time on better customer service, investigative activities, and analysis. CA's
technical achievement with the CCD has been to create a robust, economical, and analytically-powerful data
platform in an environment where fragmentation and inefficiency had been the norm.
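The slides do not say how the CCD builds its links, so the following is only a minimal sketch of the general idea: treat each case as a node and join any two cases that share an identifying value, so that cases filed under different names still end up in one person-centric group. The case IDs, field names and values are all hypothetical, and union-find is an assumed stand-in technique, not the CCD's actual method.

```python
from collections import defaultdict

# Hypothetical case records; real CCD fields and formats are not described in the source.
cases = {
    "case-1": {"name": "A. Example", "us_contact": "555-0100", "fingerprint": "fp-17"},
    "case-2": {"name": "B. Sample", "us_contact": "555-0100", "fingerprint": "fp-42"},
    "case-3": {"name": "C. Other", "us_contact": "555-0199", "fingerprint": "fp-42"},
    "case-4": {"name": "D. Nobody", "us_contact": "555-0111", "fingerprint": "fp-99"},
}

# Fields assumed to be usable as linking evidence (names are deliberately excluded).
LINK_FIELDS = ("us_contact", "fingerprint")


def link_cases(cases, link_fields):
    """Group cases that share any linking value, directly or transitively."""
    parent = {case_id: case_id for case_id in cases}

    def find(x):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # (field, value) -> first case that carried that value
    for case_id, record in cases.items():
        for field in link_fields:
            key = (field, record[field])
            if key in first_seen:
                union(case_id, first_seen[key])
            else:
                first_seen[key] = case_id

    groups = defaultdict(set)
    for case_id in cases:
        groups[find(case_id)].add(case_id)
    return sorted(sorted(group) for group in groups.values())


groups = link_cases(cases, LINK_FIELDS)
print(groups)
```

In this toy data, case-1 and case-2 share a U.S. contact and case-2 and case-3 share a fingerprint, so all three land in one group even though the names differ; case-4 stays on its own.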
23. USA Search: A Strategic Resource
• USASearch is a program of the General Services Administration's (GSA)
Office of Citizen Services and Information Technologies.
• GSA believes in building once and using many times. USASearch is no
exception. Since 2000, USASearch has provided hosted search services for
USA.gov and for more than 400 government websites—across all levels of
government—at no cost through its Affiliate Program.
• USASearch instituted many innovative changes in 2010—making it a model
for the Obama administration's effort to leverage open source technologies
and shared solutions to bring substantial cost savings for the government.
With its new open architecture model, the USASearch Program provides
viable and scalable shared search services.
• USASearch Solves Big Data Challenges
24. USA Search: A Strategic Resource
• USASearch began using Cloudera's Distribution including Apache Hadoop (CDH3) for
the first time in the fall of 2010, and since then has seen its usage grow every month—
not just in scale, but also in scope.
• All of the search traffic across USA.gov and the hundreds of affiliate sites comes through
a single search service, and this generates a lot of data. To continuously improve the
service, USASearch needs aggregated information on what searchers look for, how well
they find it, and emerging trends, among other information. Once searches are initiated,
USASearch also needs to know what results are shown and clicked on. This information
needs to be broken down by affiliate and by time, and also aggregated across all
affiliates.
• The initial system was fairly simple and did just enough to address the most pressing
data needs. As USASearch watched its data grow and the nightly batch jobs took longer
and longer, it became clear that it would soon exhaust its existing resources. USASearch
considered scaling up the hardware vertically and sharding the database horizontally, but
both options seemed to kick the can down the road. Larger database hardware is both
costly and eventually insufficient for USASearch's needs, and sharding promised to take
all the usual issues associated with a single database system and multiply them.
• USASearch determined it needed HDFS, Hadoop, and Apache Hive—a big data system
that could grow cost effectively and without downtime, be naturally resilient to failures,
and sensibly handle backups.
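The aggregation need described above (counts broken down by affiliate and by time, and also rolled up across all affiliates) can be illustrated with a small in-memory sketch. This is not USASearch's Hive job; the log layout, field order and sample rows are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical search-log rows: (affiliate, ISO date, query, whether a result was clicked).
events = [
    ("usa.gov", "2011-11-01", "passport", True),
    ("usa.gov", "2011-11-01", "grants", False),
    ("nps.gov", "2011-11-01", "grand canyon", True),
    ("nps.gov", "2011-11-02", "grand canyon", True),
]


def aggregate(events):
    """Count queries per (affiliate, day, query), across all affiliates, and count clicks."""
    per_affiliate_day = Counter()
    overall = Counter()
    clicks = Counter()
    for affiliate, day, query, clicked in events:
        per_affiliate_day[(affiliate, day, query)] += 1
        overall[query] += 1
        if clicked:
            clicks[query] += 1
    return per_affiliate_day, overall, clicks


per_affiliate_day, overall, clicks = aggregate(events)
print(overall.most_common(1))
```

At production scale the same grouping would be expressed as a batch query over files in HDFS rather than a Python loop; the sketch only shows which keys the counts are grouped by.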
25. USA Search: A Strategic Resource
• USASearch Makes Data Actionable: USASearch displays the results of its Hive
analyses in various analytics dashboards, but, more importantly, it also ensures
the results positively affect searchers' experience on government websites.
For example, USASearch uses Hadoop to generate contextually relevant and
timely search suggestions for each of its affiliated government websites.
Compare the different type-ahead suggestions for 'gran' on NPS.gov and
USA.gov. Both websites use the same USASearch backend system, but the
suggestions differ completely.
• USASearch Is a Success: The overhaul of USASearch's analytics is a dramatic
success story. In the space of a few months, USASearch went from having a
brittle and hard-to-scale RDBMS-based analytics platform to a much more agile
Hadoop-based system that is intrinsically designed to scale. USASearch
continues to see its Hadoop usage grow in scope with each new data source it
adds, and it is clear that USASearch will rely on it more and more as the suite of
tools and resources around Hadoop grows and matures in the future.
• By using a state-of-the-art open source technology, USASearch has created a
radically different search service that transforms the customer experience.
Having a government-owned and -controlled search service allows us to
constantly understand what's on the minds of Americans to drive enhancements
to other delivery channels. The public has a much improved experience when
interacting with the government due to USASearch.
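The per-affiliate type-ahead behaviour mentioned above (the same backend returning different suggestions for 'gran' on NPS.gov and USA.gov) can be sketched as a popularity-ranked prefix lookup keyed by affiliate. The sample queries and the ranking rule below are assumptions for illustration; the slides do not describe the actual suggestion algorithm.

```python
from collections import Counter, defaultdict


def build_index(events):
    """Count how often each query was searched on each affiliate site."""
    index = defaultdict(Counter)
    for affiliate, query in events:
        index[affiliate][query.lower()] += 1
    return index


def suggest(index, affiliate, prefix, limit=3):
    """Return the affiliate's most frequent past queries that start with the prefix."""
    prefix = prefix.lower()
    counts = index.get(affiliate, Counter())
    matches = [(query, n) for query, n in counts.items() if query.startswith(prefix)]
    # Most frequent first; ties broken alphabetically so the output is deterministic.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in matches[:limit]]


# Hypothetical (affiliate, query) pairs standing in for aggregated search logs.
events = [
    ("nps.gov", "grand canyon"),
    ("nps.gov", "grand canyon"),
    ("nps.gov", "grand teton"),
    ("usa.gov", "grants"),
    ("usa.gov", "grants"),
    ("usa.gov", "grandparents raising grandchildren"),
]
index = build_index(events)
print(suggest(index, "nps.gov", "gran"))
print(suggest(index, "usa.gov", "gran"))
```

Because the index is keyed by affiliate first, one shared service can serve every site while each site sees only suggestions drawn from its own visitors' searches.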
Editor's Notes
An important mission of the Department of State (DoS), Bureau of Consular Affairs (CA) is to issue travel documents to U.S. and foreign citizens. CA uses a suite of software applications at locations around the world to collect applicant data for the purpose of issuing immigrant visas, non-immigrant visas, and United States passports. CA stores data collected from consular posts abroad and domestic processing centers, as well as other government agencies, in the Consular Consolidated Database (CCD). Since its introduction, the CCD has proven to be a robust, economical, and analytically-powerful data platform in an environment where fragmentation and inefficiency had been the norm. Indeed, without the CCD and its capabilities, CA would not be able to make effective use of the massive amount of data it collects.
The Size and Complexities of Consular Data: CA stores over 115 terabytes of data in the CCD. On average, the CCD grows by 6-8 terabytes each month. Currently, over 170 software applications collect information for CA. CA uses these applications to process the many types of travel documents issued by the bureau. These applications also provide the interfaces with the numerous partner agencies that share data with CA. Most of these applications use a ‘case’ (such as a visa or passport application), and not a person record, as the basis of their data storage and retrieval. Each application collects different data, in a variety of formats, and with varying levels of detail. At the application level, it is extremely difficult to link person information in one application to potentially-matching person information contained in another application. A person could apply for a visa at one location, and then apply at another location under a different name, and an adjudicator may not be able to establish the link between the cases.
However, since all CA data is stored in one central repository (the CCD), the CCD can leverage all available data elements from all applications throughout the system in order to determine all of the potential identity matches of any given person that CA has encountered.
The CCD also contains unstructured data, such as free-form comments or case notes. The CCD must deal with millions of large image files, such as applicant photos or scanned documents. The CCD’s powerful, custom-built analytical tools synthesize the complex data captured by CA with the equally-complex data received from other agencies. The CCD thus gives its users the ability to make informed decisions, detect and prevent fraud, and identify potential national security threats.
Sharing Consular Data
The CCD is at the heart of information sharing between the government agencies involved in the national security of the United States. Over 34,000 national security officials in the Department of State and its partner agencies use the CCD. In fact, the CCD now serves more external users (23,000) than internal DoS users (11,600). The statistics below illustrate just how vital the CCD is to the entire national security apparatus:
• DHS: The CCD is the single most important and frequently-used source of data. DHS has over 17,000 users worldwide, averaging 7 million hits per month.
• FBI: 1,700 users of the CCD, averaging 420,000 hits per month
• DoD: 200 users, averaging 180,000 hits per month
Because of the CCD, information sharing between posts and security partners is no longer a cumbersome effort. Instead, it is automated, simplified and routine. For example, in November 2010, the average response time for the over 630,000 fingerprint checks submitted to DHS was 10.5 minutes. The average response time for the 588,000 fingerprint checks submitted to the FBI was 14.6 minutes.
This near real-time collection, distribution, and analysis of consular data is vital to those stakeholders who rely on consular data to make informed decisions.
An Improved Architecture
The CCD’s architecture is designed to be flexible and scalable. It uses the latest generation of technologies and methodologies to enable rapid capture, distribution, and analysis of the massive amount of data collected by CA. The CCD captures data from 270 posts around the world and replicates that data using Oracle Multimaster Replication to a centralized repository in near real-time. This CCD architecture replaces the stove-piped concept of the past with a web-enabled, directly-accessible database platform. The CCD connects users to their data via a single-platform design that is forward-looking and easily integrated with external systems.
Before the CCD, consular data resided on a decentralized global network of approximately 270 consular posts supported by independent, in-house systems. These systems contained all of the significant inefficiencies inherent when data resources are structurally isolated and widely distributed. Management reporting was inefficient. O&M costs were burdened with the necessity of delivering services individually to each post. The old architecture created formidable logistical and fiscal hurdles. Sharing data between posts and with partners was difficult and time-consuming. The inability to rapidly share information and to obtain early access to application data negatively impacted fraud detection and prevention. The new CCD architecture consolidated the individual data assets of each post into a design that incorporated advanced infrastructure components. This forward-looking model has the flexibility needed for system modifications, the easy integration of new stakeholders, and the ability to make use of future technologies.
Today, according to Oracle, the CCD is the “largest connected/replicating database structure in the government.” The CCD is economical, too. The CCD architecture has saved CA $1.4 million annually. The CCD architecture has also established an enviable green profile. The CCD made possible the elimination of an entire Data Share Group with a hardware reduction of 100 servers, an 80% reduction in passport database servers, and reduced support costs by eliminating entire storage networks.
Making Sense of Consular Data
The data contained in the CCD would mean little to the consular officer adjudicating a visa application, or to a Customs and Border Patrol agent at a border crossing, if not for the CCD’s ability to make sense of the enormity of the data (over 115 terabytes) it contains. The CCD can pre-screen a visa record before an adjudicator even looks at it. The CCD provides the means to conduct vetting checks against various government databases. The CCD contains powerful analytical tools and a set of custom-built services that allow users to do everything from sending a mass email to American citizens abroad to tracking fraud investigations. In short, the CCD is a one-stop shop for collecting, analyzing and making informed use of consular data. Consulate staffs and Customs and Border Patrol agents are under immense pressure to do thorough and accurate identity and background checks on both citizens and non-citizens. In this age of international terrorism, the success and accuracy of staff decisions has critical implications to the security of the United States. The CCD gives its users the tools and data to make informed decisions.
Before an adjudicator of a visa applicant looks at a record, the CCD has already done much of the pre-processing automatically. Rather than an adjudicator sorting through terabytes of data, the CCD has already sorted through the over 115 terabytes of data and made the connections that are simply impossible for an individual user to make.
At each encounter with an applicant for a visa or passport, the CCD automatically establishes links between all cases involving that applicant and other potentially-matching cases, enabling the detection of potential fraud or national security threats. For example, the CCD can base these links on the applicant using the same point of contact in the United States that was used on another case. The CCD can establish links based on the results of a biometric check, such as fingerprints or facial recognition. The CCD can even establish links using unstructured data by searching for certain text strings and linking records in which these strings appear. The CCD examines every conceivable combination of data elements when looking for potential matches. Due to the wide variety of resources used by the CCD, the system can establish links between two applicants using completely different names. With each subsequent encounter, the CCD creates additional links, resulting in a searchable, fully cross-referenced web of information that traces a person’s activities across all of CA data. By being able to see these links in a person-centric view, the adjudicators have a broader, more complete, and more easily-accessible set of data with which to make better-informed decisions.
The CCD automatically initiates biometric checks, including fingerprint checks and facial recognition checks. The CCD can also automatically look for possible fraud indicators in the data the applicant provided in his or her application. The CCD can then alert the adjudicator to look into these indicators, saving the adjudicator time. If the adjudicator finds a case of potential fraud, he or she can refer the case for fraud investigation right from the CCD. The fraud investigator can record the results of his or her investigation in the CCD and has access to all of the analytical tools and biometric checks available in the system.
The CCD captures all of the data entered during the process and automatically creates cross-references using the new data. The CCD completes the loop.
When a CCD user pulls up an applicant record in the CCD, he or she will see much more than just the applicant’s biographical data and the current status of the case. The user can see the results of all of the background checks that the adjudicator ran. The user can see all of the previous visa or passport records for that applicant. The user can see all of the applicant’s images, the applicant’s fingerprints, and even a list of other CCD records that are linked to the applicant in one way or another. The CCD makes all of the information related to a case accessible in a single, consolidated view.
In fact, the CCD is so easy to use that each month its users run 20 million reports, generate 120 million hits, process 1 million applicants, conduct 4 million facial recognition searches, submit 800,000 fingerprint check requests, and much more. Users add 6-8 terabytes of data to the CCD each month. Without the robust functionality built into the CCD, this workload would be unimaginable.
Conclusion: Before the CCD, Visa and U.S. Passport application data were located on independent databases, making data sharing within the Department of State and its national security partner agencies difficult. CA needed to maximize the accuracy and availability of consular data by creating a single, consolidated database. CA needed a state-of-the-art data archiving and data-sharing platform that provided rapid access to data and that enabled the fluid exchange of information, while reducing expenses and encouraging inter-agency collaboration. The CCD has transformed CA’s mission delivery by breaking the paradigm of data isolated in independent databases. The CCD is a single platform of common, trusted data. The CCD uses a simplified, robust, and innovative network architecture that has streamlined CA’s physical IT infrastructure.
The CCD today consolidates data from posts all over the world into a central repository that is over 115 terabytes in size and growing by 6-8 terabytes each month. In terms of both improved resource use and in enhanced national security through better data analysis, it is impossible to overstate the benefits that the CCD brings to CA. The CCD allows staff to focus its time on better customer service, investigative activities, and analysis. CA’s technical achievement with the CCD has been to create a robust, economical, and analytically-powerful data platform in an environment where fragmentation and inefficiency had been the norm.