Predictive Analysis of Financial Fraud Detection using Azure and Spark ML

This talk presents insights, performance results, and architecture for financial fraud detection on mobile money transaction activity using Azure ML and Spark ML. We classify each transaction as normal or fraudulent, using a small sample in Azure ML (a traditional system) and the full data set in Spark ML (a Big Data system). I will present predictive analysis with several classification models tested in both Azure ML and Spark ML. In addition, I will show how Spark ML scales for these models across Spark clusters with different numbers of nodes on Amazon AWS.

Published in: Data & Analytics

  1. Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
     • Jongwook Woo, PhD (jwoo5@calstatela.edu), Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat
     • Big Data AI Center (BigDAI / HiPIC), California State University Los Angeles
     • IDEAS SoCal Conf 2018, Oct 20 2018
  2. Contents
     • Myself
     • Introduction to Big Data
     • Introduction to Big Data Predictive Analytics
     • Fraud Detection Predictive Analytics
     • Summary
  3. Myself: Experience
     • Since 2002: Professor at California State University Los Angeles
       – PhD in 2001: Computer Science and Engineering at USC
     • Since 1998: R&D consulting in Hollywood
       – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
       – Information search and integration with FAST, Lucene/Solr, Sphinx
       – Implemented eBusiness applications using J2EE and middleware
     • Since 2007: Exposed to Big Data at CitySearch.com
     • 2012 - Present: Big Data academic partnerships, for Big Data research and training
       – Amazon AWS, Microsoft Azure, IBM Bluemix
       – Databricks, Hadoop vendors
  4. Myself: Partners for Services
  5. Experience in Big Data
     • Collaboration
       – Big Data Technical Advisor of Isaac Engineering for Smart * (Factory, Farms, …) in Korea
       – Council Member of IBM Spark Technology Center
       – City of Los Angeles for DSF, OpenHub and Open Data
       – Startup companies in Los Angeles
       – External Collaborator and Advisor in Big Data: IMSC of USC; Pennsylvania State University; The Big Link, Softzen, Wiken in Korea
     • Grants
       – Research and Education Grants from Oracle Cloud Big Data, IBM Bluemix, Microsoft Windows Azure, Amazon AWS
     • Partnership
       – Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
  6. Myself: Public Partners
  7. Myself: S/W Development Lead http://www.mobygames.com/game/windows/matrix-online/credits
  9. Data Issues
     • Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes)
       – Because of the web
       – Smart *: sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games, …
     • Cannot be handled with the legacy approach
       – Too big
       – Non-/semi-structured data
       – Too expensive
     • Need new systems
       – Inexpensive
  10. Two Cores in Big Data
     • How to store Big Data
     • How to compute Big Data
     • Google
       – How to store Big Data: GFS, distributed systems on inexpensive commodity computers
       – How to compute Big Data: MapReduce, parallel computing with inexpensive computers
       – Own supercomputers
       – Published papers in 2003 and 2004
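To make the "how to compute" half of the slide above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It is not Hadoop code: the function names and the three input records are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input; on a cluster each record could live on a different node.
records = ["big data", "Big compute", "data store data"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

Because each record is mapped independently and each key is reduced independently, both phases can in principle be spread across inexpensive machines, which is the point the slide makes.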
  11. What is Hadoop?
     • Hadoop founder: Doug Cutting
     • Apache committer: Lucene, Nutch, …
  12. Super Computer vs Hadoop (Parallel vs. Distributed file systems, by Michael Malak, updated by Jongwook Woo)
     • Figure labels: Cluster for Store, Cluster for Compute/Store, Cluster for Compute
  13. Hadoop Cluster: Logical Diagram
     • Figure labels: Web browser of cluster monitor (CM/Ambari), HTTP(S), Cluster Monitor, Hadoop agents, HDFS, Hive, ZooKeeper, Impala
  14. Hadoop Ecosystems http://dawn.dbsdataprojects.com/tag/hadoop/
  15. Definition: Big Data
     • Inexpensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2]
     • Hadoop
       – Inexpensive supercomputer
       – More publicly accessible than traditional supercomputers: you can store and process your applications in university labs, small companies, and research centers
     • Others
       – NoSQL DB (Cassandra, MongoDB, Redis, HBase)
       – ElasticSearch
  17. Alternative to Hadoop MapReduce
     • Limitations of MapReduce
       – Hard to program in Java
       – Batch processing: not interactive
       – Disk storage for intermediate data: performance issue
     • Spark by UC Berkeley AMP Lab
       – In-memory storage for intermediate data
       – 20 ~ 100 times faster than network and disk (MapReduce)
       – Good for machine learning: iterative algorithms
  18. Integrating Spark and Hadoop
     • Spark
       – File systems: Tachyon
       – Resource manager: Mesos
     • Dedicated Spark
       – Cassandra, Couchbase, …
     • Integrating Spark into a Hadoop cluster
       – As Hadoop has been in the market for over 10 years
       – Cloud computing: Oracle Cloud Big Data Compute, Amazon AWS, Azure HDInsight, IBM Bluemix, Google Cloud Platform (Object Storage, S3)
       – Hadoop vendors: HDP, CDH
     • Databricks: Spark on AWS & Azure
       – Not much of the Hadoop ecosystem
  19. Spark
     • Spark SQL: querying using SQL and HiveQL; Data Frame
     • Spark Streaming: DStream (RDD in streaming)
     • ML: machine learning on Data Frame; pipelining
     • MLlib: on RDD; sparse vector support, decision trees, linear/logistic regression, PCA, SVM, …
  21. Big Data Analysis and Prediction Flow
     • Data Collection
       – Batch API: Yelp, Google
       – Streaming: Twitter, Apache NiFi, Kafka, StreamSets, Storm
       – Open Data: Government
     • Data Storage: HDFS, S3, Object Storage, NoSQL DB (Couchbase), …
     • Data Filtering: Hive, Pig
     • Data Analysis and Science: Hive, Pig, Spark, BI tools (Qlik, Tableau, …)
     • Data Visualization: Qlik, Excel PowerMap, Tableau, Looker, …
     • Big Data Engineering, Big Data Analysis, Big Data Science, Data Visualization
  22. Terms
     • We know
       – Data Engineering: collect, clean, transform, filter data
       – Data Analysis: find insights from the existing data
       – Data Science (Predictive Analysis): predict the trend or pattern from the existing data
     • Do we know?
       – Big Data Analysis and Science: using Big Data for data analysis and science (Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch, …)
       – For massive data sets: how to store and compute?
  23. Big Data Science
     • Fraud Detection: accepted to the APJIS journal (indexed in SCOPUS) in 2018, by Jongwook Woo et al. (Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat)
     • Goal: analyzing transaction data and detecting fraud
       – For mobile money transactions, based on a sample of real transactions extracted from one month of financial logs from a mobile money service
       – Using Spark ML (Big Data) and Azure ML (Traditional)
  24. Financial Data Set
     • Data is always an issue
       – No publicly available datasets on financial services, because of the private nature of financial transactions
     • PaySim
       – URL: https://www.kaggle.com/ntnu-testimon/paysim1
       – Generates a synthetic dataset from the private dataset that resembles the normal operation of transactions
  25. Financial Data Set (Cont'd)
     • Size: 470 MB (=> 718 MB), 6,362,620 records
       – Not large-scale compared to data sets larger than a GB
       – But the architecture here is applicable to much bigger data sets, as it still adopts the Spark computing engine from Big Data and is linearly scalable
     • Attributes: 11
     • Target column to predict: 'isFraud'
  26. Experiment Environment: Traditional Systems and Big Data
  27. Experiment Environment
     • Azure ML: traditional, small data set
       – Implement fundamental prediction models using sample data: 80 MB (1/5 – 1/6 of the data set)
       – Select the best model among a number of classification models
  28. Experiment Environment (Cont'd)
     • Spark ML
       – Test with Databricks CE and IBM Cloud: 470 MB
       – AWS EMR: analyze all data, 470 MB (=> 718 MB)
       – Implement and evaluate prediction models: 3 different models, Spark clusters with 3 different numbers of nodes
  29. Hardware Specifications: Spark
     • IBM DSX Lite
       – Python 2, Spark 2.1
       – File system: Object Storage
       – 2 Spark executors, 16 GB memory
     • Databricks
       – Python 2, Spark 2.1 (auto-updating, Scala 2.10)
       – File system: Databricks File System
       – Single/unlimited cluster, memory: 6 GB
  30. Experiment Environment: AWS EMR
     • EMR 12.1
       – Spark 2.2.1 on Hadoop 2.8.3
       – YARN with Ganglia 3.7.2 and Zeppelin 0.7.3
     • m3.xlarge instance
       – Memory: 15.0 GiB
       – CPU: 4 vCPUs
       – Storage: 80 GiB (2 * 40 GiB SSD)
     • File system: S3
     • 3 different EMR clusters, by number of nodes (servers): 3, 6, 11 nodes
  31. PySpark on Databricks
  32. Work Flow in Azure ML
     • Relatively easy to build and test: drag-and-drop GUI
     • Work flow
       1. Data Engineering
          – Understanding data
          – Data preparation
          – Balancing data statistically
       2. Data Science: Machine Learning (ML)
          – Model building and validation (classification algorithms)
          – Model evaluation
          – Model interpretation
  33. Data Understanding
     • Numeric attributes: amount, oldbalanceOrg, newbalanceOrg, oldbalanceDest, newbalanceDest
     • Categorical attributes: step, type, isFraud, isFlaggedFraud
     • String attributes: nameOrig, nameDest
  34. Experiment in Azure ML
  35. Precision vs Recall
     • Positive: event occurs (fraud); Negative: event does not occur (non-fraud)
     • True Positive (TP): predicted fraud, and it is fraud
     • False Negative (FN): predicted not fraud, but it is fraud
     • False Positive (FP): predicted fraud, but it is not
     • Precision = TP / (TP + FP)
     • Recall = TP / (TP + FN)
     • Ref: https://en.wikipedia.org/wiki/Precision_and_recall
  36. Model Evaluation
     • Focus on Recall, to capture the most fraudulent transactions
     • Bad Recall: fatal
       – Many false negatives (FN) means predicting transactions as normal when they are actually fraud
       – Painful
     • Need to decrease FN, that is, to increase Recall
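The precision and recall definitions from the two slides above can be checked with a few lines of Python. This is only an illustrative sketch: the confusion counts are made up and are not results from the talk.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 90 frauds caught, 10 false alarms, 30 frauds missed.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)
r = recall(tp, fn)

# Same model, but with the missed frauds (FN) cut from 30 to 5.
r_fewer_misses = recall(tp, 5)

print(f"precision={p:.3f} recall={r:.3f} recall_with_fewer_FN={r_fewer_misses:.3f}")
```

Reducing FN while TP stays fixed raises recall and leaves precision untouched, which is why the deck treats missed frauds as the error to minimize.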
  37. Experimental Results in AzureML
     • Models compared on Accuracy, Precision, and Recall: Two Class Logistic Regression, Two Class Decision Forest, Two Class Decision Jungle (Precision 0.916, Recall 0.998)
  38. Experimental Results
     • Accuracy: Decision Jungle has the highest Recall, 0.998, with Precision 0.916
       – With a small sample data set (359 KB): takes 11 sec
     • Performance: time taken to build a model with the whole data set (470 MB + data tweaking): over a day
     • Good guide for adopting the 3 similar algorithms in Spark ML: Decision Tree, Random Forest, Logistic Regression
  39. Experiment with Spark ML
     1. Load the data source: 470 MB (=> 718 MB)
     2. Train and build the models, with data balanced statistically
     3. Evaluate
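The slides do not say how the data was balanced. One common option, assumed here purely for illustration, is random undersampling of the majority class so that fraud and normal rows appear in equal numbers. The `undersample` helper and the toy rows are hypothetical; only the `isFraud` column name comes from the deck.

```python
import random


def undersample(rows, label_key="isFraud", seed=42):
    """Return a shuffled list with equal numbers of fraud and normal rows."""
    rng = random.Random(seed)
    fraud = [row for row in rows if row[label_key] == 1]
    normal = [row for row in rows if row[label_key] == 0]
    # Keep every row of the smaller class, sample the same number from the larger one.
    minority, majority = (fraud, normal) if len(fraud) <= len(normal) else (normal, fraud)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Toy data: 1000 rows, 20 of them labelled as fraud (2%).
rows = [{"amount": float(i), "isFraud": 1 if i % 50 == 0 else 0} for i in range(1000)]
balanced = undersample(rows)
fraud_share = sum(row["isFraud"] for row in balanced) / len(balanced)
print(len(balanced), fraud_share)
```

Undersampling discards most normal transactions, so it trades data volume for class balance; oversampling or class weights are alternatives the deck may equally have used.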
  40. Define the pipeline
  41. Train the models
  42. Pipeline Classification with Spark ML
     • Diagram labels: Feature Transformer, ParamMap, Estimator, Classification Estimator, Classification Evaluator, Validation Estimator, Model Transformer
  43. Pipeline Classification with Spark ML (cont'd): feature is generated from input columns
  44. Pipeline Classification with Spark ML (cont'd): classifiers are Decision Tree, RandomForest, LogisticRegression
  45. Pipeline Classification with Spark ML (cont'd): combination of parameters, such as Max Bins, Max Depth, …
  46. Pipeline Classification with Spark ML (cont'd): validators are Cross Validator and Train Validation Split
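The parameter-combination and validator ideas in the pipeline slides above can be sketched without Spark. The code below is plain Python, not the Spark ML API: `param_grid`, `train_validation_split`, and `k_folds` are hypothetical helpers that mimic what a parameter grid, a train/validation split, and a cross validator do, and the parameter values are made up.

```python
import itertools
import random


def param_grid(**options):
    """Yield one dict per combination of the given parameter values."""
    names = list(options)
    for values in itertools.product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def train_validation_split(rows, train_ratio=0.8, seed=7):
    """Shuffle once and cut into a training part and a validation part."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def k_folds(rows, k=3):
    """Yield (training, validation) pairs; each row is validated exactly once."""
    rows = list(rows)
    for i in range(k):
        validation = rows[i::k]
        training = [row for j, row in enumerate(rows) if j % k != i]
        yield training, validation


# Hypothetical search space over two tree parameters named on the slides.
grid = list(param_grid(maxDepth=[5, 10], maxBins=[16, 32, 64]))
train, validation = train_validation_split(range(100))
folds = list(k_folds(range(100), k=3))
print(len(grid), len(train), len(validation), len(folds))
```

A train/validation split fits each parameter combination once, while k-fold cross validation fits it k times, so with this grid the cross validator would train 18 models instead of 6. That cost difference matters for the execution times reported later in the deck.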
  47. Results (table columns: Model, Area under ROC, Precision, Recall)
     • Models: DecisionTreeClassifier, RandomForestClassifier 0.909573, LogisticRegression
     • 3 models with different combinations of the parameters
     • Time taken (Spark cluster): 1 hour
     • In theory, with linear scalability: 2 minutes with 30 Spark clusters
     • Random Forest has the best recall score, compared to Decision Tree and Logistic Regression
  48. Experimental Results in AWS: execution times
     • 3 nodes: 40 – 70 min
     • 11 nodes: 10 – 20 min
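The linear-scalability arithmetic behind the two results slides above can be reproduced with a short sketch. The helper names are hypothetical. Two assumptions are made here that the slides do not state: the 1-hour run is treated as the baseline unit, and the low and high ends of the 3-node range are paired with the low and high ends of the 11-node range.

```python
def ideal_time(base_minutes, base_nodes, target_nodes):
    """Run time expected under perfectly linear scaling."""
    return base_minutes * base_nodes / target_nodes


def speedup(time_small_cluster, time_large_cluster):
    """How many times faster the larger cluster finished."""
    return time_small_cluster / time_large_cluster


def efficiency(time_small, nodes_small, time_large, nodes_large):
    """Observed speedup divided by the increase in node count (1.0 = linear)."""
    return speedup(time_small, time_large) / (nodes_large / nodes_small)


# Slide 47: 1 hour, scaled 30x under the linear assumption.
print(ideal_time(60, 1, 30))

# Slide 48: 3 nodes took 40-70 min, 11 nodes took 10-20 min.
for t3, t11 in [(40, 10), (70, 20)]:
    print(t3, t11, speedup(t3, t11), round(efficiency(t3, 3, t11, 11), 3))
```

Under these pairings the measured speedup sits close to the 11/3 increase in nodes, which is consistent with the deck's claim of roughly linear scaling.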
  49. Contents
     • Myself
     • Introduction to Big Data
     • Smart Factory with Big Data
     • Summary
  50. Summary
     • Introduction to Big Data
     • Introduction to Big Data Predictive Analytics
     • Experimental result of fraud detection
       – Recall: RandomForest in Spark ML, DecisionJungle in Azure ML
       – Performance: traditional systems are not good for large-scale data; Spark ML is linearly scalable and fast
  51. Questions?
  52. References
     1. Jongwook Woo and Yuhang Xu, "Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing", The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
     2. Jongwook Woo, DMKD-00150, "Market Basket Analysis Algorithms with MapReduce", Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
     3. Jongwook Woo, "Big Data Trend and Open Data", UKC 2016, Dallas, TX, Aug 12 2016
     4. How to choose algorithms for Microsoft Azure Machine Learning, https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
     5. Manik Katyal, Parag Chhadva, Shubhra Wahi and Jongwook Woo, "Big Data Analysis using Spark for Collision Rate Near CalStateLA", https://globaljournals.org/GJCST_Volume16/1-Big-Data-Analysis-using-Spark.pdf
     6. Spark Programming Guide, http://spark.apache.org/docs/latest/programming-guide.html
     7. Priyanka Purushu, Niklas Melcher, Bhagyashree Bhagwat, Jongwook Woo, "Predictive Analysis of Financial Fraud Detection using Azure and Spark ML", Asia Pacific Journal of Information Systems (accepted in Sept 2018)
  53. References (Cont'd)
     8. TensorFrames: Google Tensorflow on Apache Spark, https://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
     9. Deep learning and Apache Spark, https://www.slideshare.net/QuantUniversity/deep-learning-and-apache-spark
     10. Which Is Deeper - Comparison Of Deep Learning Frameworks On Spark, https://www.slideshare.net/SparkSummit/which-is-deeper-comparison-of-deep-learning-frameworks-on-spark
     11. Accelerating Machine Learning and Deep Learning At Scale with Apache Spark, https://www.slideshare.net/SparkSummit/accelerating-machine-learning-and-deep-learning-at-scalewith-apache-spark-keynote-by-ziya-ma
     12. Deep Learning with Apache Spark and TensorFlow, https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
     13. Tensor Flow Deep Learning Open SAP
     14. Overview of Smart Factory, https://www.slideshare.net/BrendanSheppard1/overview-of-smart-factory-solutions-68137094/6
     15. https://dzone.com/articles/sqoop-import-data-from-mysql-tohive
