
Lessons Learned from Modernizing USCIS Data Analytics Platform


U.S. Citizenship and Immigration Services (USCIS) is the government agency that oversees lawful immigration to the United States.


  1. A Journey To Modernization (Shawn Benjamin & Prabha Rajendran)
  2. Problems Faced
     • Brittle Informatica ETL pipeline
     • Lengthy Informatica ETL development cycle
     • Lengthy load-time workflows for data ingestion
     • No support for real-time or near-real-time data
     • No data science platform
  3. Legacy Architecture (diagram): Source Systems → Data Warehouse → Extracted Data → R/Python Code → Business Insight, serving Data Scientists, Statisticians, and Business Analysts
  4. Snapshot, January 2016 (diagram of the legacy data ecosystem): dozens of source systems (C3, eCIMS, C4, NFTS, ELIS, NASS, CPMS, Pay.gov, AR-11, RAPS, MFAS, VSS, ATS, PCTS, SNAP, RNACS, and others) feeding active and decommissioned ODSes, eCISCOR, direct connects, data marts (Benefits, Scheduler, Payment, Validation), SMART subject areas, SAS libraries, and BI tools; the legend tallies include 36 data sources, 66 direct connects, 2,354 SMART objects, and 28 ETL processes
  5. Successful Proof of Concept
     • Implemented a Databricks private-cloud VPC (26 nodes) in AWS
     • Connected the Databricks cluster to the Oracle database
     • Created all relevant tables in the Hive metastore pointing to the Oracle database
     • Copied relevant tables from Oracle to S3 using Scala code; data is stored in the Apache Parquet columnar format (for context, a 120-million-row, 83-column table can be dumped to S3 in just 10 minutes)
     • Identified an appropriate partition scheme; large tables were partitioned to optimize Spark query performance
     • Created multiple notebooks to perform data analysis and visualize the results, e.g. a histogram of case life-cycle duration
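The partitioning step above relies on Spark's Hive-style `key=value` directory layout, which lets queries skip whole partitions. A minimal sketch of that layout and of partition pruning, using the local filesystem as a stand-in for S3 (the `fiscal_year` column and file names are hypothetical, not from the deck):

```python
import os
import tempfile

def write_partitioned(base_dir, rows, partition_key):
    """Write rows into Hive-style key=value subdirectories,
    mirroring how Spark lays out a partitioned Parquet table."""
    for row in rows:
        part_dir = os.path.join(base_dir, f"{partition_key}={row[partition_key]}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-00000.txt"), "a") as f:
            f.write(repr(row) + "\n")

def read_partition(base_dir, partition_key, value):
    """Partition pruning: only the matching directory is scanned."""
    part_dir = os.path.join(base_dir, f"{partition_key}={value}")
    if not os.path.isdir(part_dir):
        return []
    out = []
    for name in os.listdir(part_dir):
        with open(os.path.join(part_dir, name)) as f:
            out.extend(eval(line) for line in f)
    return out

base = tempfile.mkdtemp()
cases = [
    {"case_id": 1, "fiscal_year": 2015},
    {"case_id": 2, "fiscal_year": 2016},
    {"case_id": 3, "fiscal_year": 2016},
]
write_partitioned(base, cases, "fiscal_year")
hits = read_partition(base, "fiscal_year", 2016)
print(len(hits))  # 2
```

In Spark the same effect comes from `df.write.partitionBy("fiscal_year").parquet(path)`; a filter on the partition column then reads only the matching directories.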
  6. Current Databricks Implementation (diagram): S3 Data Lake and Lakehouse delivering Business Insight to Data Scientists, Statisticians, and Business Analysts
  7. • 75 Data Sources
     • 35 Application Interfaces
     • 7 Data Marts
     • 4 BI Tools
     • 6,086 Tableau Dashboards
     • 118 SMART Subject Areas
     • 56 SAS Libraries
     • 6,233 Users
  8. Databricks Accomplishments
     • Implementation of Delta Lake
     • Easy integration with OBIEE, SAS, and Tableau via native connectors
     • Integration with GitHub for continuous integration & deployment
     • Automated account provisioning
     • Machine learning (MLflow) integration
  9. Change Data Capture Using Delta Lake
     Success factors:
     • Faster ingestion of CDC changes
     • Resiliency
     • Improved data quality, reporting availability, and runtime performance
     • Schema evolution: additional columns can be added without rewriting the entire table
     Lessons learned:
     • Storage requirements increased
     • Regular VACUUM and OPTIMIZE runs are mandatory to maintain performance
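The CDC ingestion above is done in Delta Lake with `MERGE INTO`, which upserts changed rows and deletes removed ones in a single pass. A minimal plain-Python sketch of those merge semantics against a keyed snapshot (the event shape and `status` field are hypothetical illustrations, not the deck's actual schema):

```python
def apply_cdc(snapshot, changes):
    """Apply a batch of CDC events to a keyed snapshot, mimicking
    Delta Lake MERGE INTO semantics: matched+delete removes the row,
    matched+update overwrites it, not-matched inserts it."""
    for event in changes:
        key = event["id"]
        if event["op"] == "D":
            snapshot.pop(key, None)        # matched -> delete
        else:                              # "I" insert or "U" update
            snapshot[key] = event["data"]  # upsert
    return snapshot

table = {1: {"status": "PENDING"}, 2: {"status": "APPROVED"}}
batch = [
    {"op": "U", "id": 1, "data": {"status": "APPROVED"}},
    {"op": "D", "id": 2, "data": None},
    {"op": "I", "id": 3, "data": {"status": "PENDING"}},
]
table = apply_cdc(table, batch)
print(table)  # {1: {'status': 'APPROVED'}, 3: {'status': 'PENDING'}}
```

At table scale, Delta rewrites the affected Parquet files rather than mutating rows in place, which is why storage grows and periodic VACUUM/OPTIMIZE runs are needed.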
  10. Unified Data Analytical Platform: Tableau
  11. Unified Data Analytical Platform: OBIEE/SAS
  12. Data Science Experiments Using ML (screenshots): ML graphs produced after running the models, and prediction model samples
  13. Text and Log Mining (chart): sentiment analysis scores on a 0–1 scale, from negative to positive sentiment
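The deck does not say which sentiment technique was used, so as a hedged illustration of how text can be mapped onto the chart's 0–1 negative-to-positive scale, here is a minimal lexicon-based scorer (the word lists and example phrases are invented for illustration):

```python
# Tiny hypothetical sentiment lexicons; real systems use large,
# weighted lexicons or trained models.
POSITIVE = {"helpful", "fast", "resolved", "clear"}
NEGATIVE = {"delay", "lost", "confusing", "error"}

def sentiment_score(text):
    """Score text on a 0..1 scale: 0 = fully negative,
    1 = fully positive, 0.5 = neutral / no lexicon hits."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.5
    return pos / (pos + neg)

print(sentiment_score("case resolved fast"))   # 1.0
print(sentiment_score("delay and lost mail"))  # 0.0
```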
  14. Time Series Models and H2O Integration: integrated H2O with Databricks and built a model predicting the count of no-shows for the N-400 using traditional time-series forecasting, to anticipate inefficiencies in normal day-to-day planning and operations
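The deck does not detail the forecasting model, so as a sketch of the simplest traditional baseline such a no-show forecast would be compared against, here is a moving-average predictor (the weekly counts are invented example data, not USCIS figures):

```python
def moving_average_forecast(history, window=4):
    """Forecast the next period as the mean of the last `window`
    observations: a standard baseline before fitting richer
    time-series models (ARIMA, exponential smoothing, etc.)."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# Hypothetical weekly N-400 interview no-show counts
weekly_no_shows = [38, 41, 35, 44, 40, 37, 42, 39]
print(moving_average_forecast(weekly_no_shows))  # 39.5
```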
  15. Enabling Security & Governance: access control (ACLs), credentials passthrough, and secrets management
     • Control users' access to data using the Databricks view-based access control model (table- and schema-level ACLs)
     • Control users' access to clusters that are not enabled for table access control
     • Enforced data-object privileges at the onboarding phase
     • Used the Databricks secrets manager to store credentials and reference them in notebooks
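In Databricks notebooks, stored credentials are read at runtime via `dbutils.secrets.get(scope, key)` rather than hard-coded. A minimal stand-in for that pattern using environment variables, runnable outside Databricks (the `eciscor`/`jdbc-password` names are hypothetical examples):

```python
import os

def get_secret(scope, key):
    """Stand-in for dbutils.secrets.get(scope, key): resolve a
    credential provisioned out of band, so it never appears
    in notebook source or version control."""
    env_name = f"{scope}_{key}".upper().replace("-", "_")
    value = os.environ.get(env_name)
    if value is None:
        raise KeyError(f"secret {scope}/{key} not provisioned")
    return value

# Provisioned out of band (e.g. by an admin), never in the notebook:
os.environ["ECISCOR_JDBC_PASSWORD"] = "example-only"
print(get_secret("eciscor", "jdbc-password"))  # example-only
```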
  16. Databricks Management API Usage
     • Cluster/job management → create, delete, and manage clusters, and get the execution status of daily scheduled jobs, which enabled automated monitoring
     • Library/secret management → easy upload of third-party libraries, and management of encrypted scopes/credentials for connecting to source and target endpoints
     • Integration and deployments → API integration with Git and Jenkins for continuous integration and continuous deployment
     • Enabled the MLflow Tracking API for our data science experiments
     • Integrated the Databricks Management API with Jenkins and other scripting tools to automate all our administration and management tasks
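The cluster-management calls above go through the Databricks REST API. A hedged sketch that only assembles the HTTP pieces for a `POST /api/2.0/clusters/create` request, with no network call (the workspace URL, token, runtime version, and instance type are placeholder examples):

```python
import json

def build_cluster_create_request(host, token, name, min_workers, max_workers):
    """Assemble URL, headers, and JSON body for the Databricks
    Clusters API create endpoint; a script like a Jenkins job
    would POST these with an HTTP client."""
    url = f"{host}/api/2.0/clusters/create"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = {
        "cluster_name": name,
        "spark_version": "7.3.x-scala2.12",  # example runtime label
        "node_type_id": "i3.xlarge",         # example instance type
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }
    return url, headers, json.dumps(body)

url, headers, payload = build_cluster_create_request(
    "https://example.cloud.databricks.com", "dapiXXXX", "etl-nightly", 2, 8)
print(url)
```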
  17. Lessons Learned Through This Journey: training plan, cloud-based experience, subject matter expertise, automation
  18. Success Strategy (success criteria → benefits)
     • Performance → auto-scalability leveraging on-demand and spot instances; efficient processing of large datasets, comparable to RDBMS systems; scalable read/write performance on S3
     • Support for a variety of statistical programming languages → data science platform (R, Python, Scala, and SQL); supports MLlib for machine learning and deep learning
     • Integration with existing tools → connects to industry-standard technologies via ODBC/JDBC and built-in connectors
     • Easy integration of new data sources → seamless integration with streaming technologies like Kafka/Kinesis via Spark Streaming, supporting both structured and unstructured data; leverages S3 extensively
     • Security → integrates with multiple single-sign-on platforms; native encryption/decryption (AES-256 and KMS); supports access control lists (ACLs); implemented in the USCIS private cloud
  19. Questions!
