Big Data Analysis in Hydrogen Station using Spark and Azure ML

A Decision Forest machine learning algorithm is used to identify the features that affect the temperature of the fueling valve and controller, and to predict that temperature.


  1. Hydrogen Gas Power Plant Data Analysis and Prediction Using Spark. Manvi Chandra (mchandr2@calstatela.edu) and Jongwook Woo, PhD (jwoo5@calstatela.edu), High-Performance Information Computing Center (HiPIC), California State University Los Angeles (CSULA).
  2. Contents: Myself; Introduction to Big Data; Machine Learning; Spark Cores; RDD; Spark SQL, Streaming, ML; Hydrogen Gas Power Plant Prediction Model.
  3. Myself
     • Name: Manvi Chandra.
     • 2012-2014: Programmer Analyst at Cognizant Technology Solutions.
     • 2015-2016 (present): Master's in Information Systems; exposed to Big Data analytics and pursuing research in Big Data analytics and machine learning.
     • 2007-2011: Bachelor of Technology in Electronics and Communication Engineering.
  5. Introduction to Big Data
  6. Data Issues
     • Large-scale data: terabytes (10^12 bytes) and petabytes (10^15 bytes), driven by the web, sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games, and more.
     • Legacy approaches cannot handle it: the data is too big, often unstructured or semi-structured, and too expensive to process.
     • New, inexpensive systems are needed.
  7. Two Cores in Big Data
     • The two core problems: how to store Big Data and how to compute on it.
     • Google's answer for storage: GFS, running on inexpensive commodity computers.
     • Google's answer for computation: MapReduce, parallel computing across many inexpensive computers that together act as Google's own supercomputer.
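
The MapReduce idea named on this slide can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases on a single machine, not Hadoop or Google code; the function names and the sample records are made up for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    # In a real cluster each record (or block of records) is mapped on a
    # different machine; here everything runs sequentially in one process.
    emitted = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


# Hypothetical sample input.
counts = word_count(["hydrogen station data", "station sensor data"])
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across many inexpensive machines.
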
  8. What is Hadoop? Hadoop founder: Doug Cutting, Chief Architect at Cloudera.
  9. Definition: Big Data
     • Inexpensive frameworks that can store large-scale data and process it faster in parallel.
     • Hadoop: an inexpensive "supercomputer" on which you can build and run your own applications.
  10. An Alternative to Hadoop MapReduce
      • Limitations of MapReduce: hard to program in Java; batch processing only, not interactive; intermediate data is stored on disk, which hurts performance.
      • Spark, from the UC Berkeley AMPLab: keeps intermediate data in memory, which is 10 to 100x faster than network and disk.
  12. Machine Learning
      • A subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.
      • It explores pattern recognition during data analysis, drawing on computer science and statistics.
      • Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, it allows computers to find hidden insights without being explicitly programmed where to look.
  13. Machine Learning Studio: Microsoft Azure Machine Learning Studio is a collaborative, drag-and-drop tool you can use to build, test, and deploy predictive analytics solutions on your data.
  15. Spark
      • In-memory data computing, faster than Hadoop MapReduce.
      • Integrates with Hadoop and its ecosystem: HDFS, HBase, Hive, and sequence files.
      • A new programming model with faster data sharing.
      • Well suited to complex multi-stage applications, such as iterative graph algorithms and machine learning.
      • Supports interactive queries.
  16. Spark
      • Spark Core: RDDs, transformations, and actions.
      • Spark Streaming (real-time): DStreams, which are streams of RDDs.
      • Spark SQL: SchemaRDDs and DataFrames.
      • MLlib (machine learning): RDD-based matrices.
      • GraphX (graph): RDD-based matrices.
      • SparkR: RDD-based matrices.
  17. Spark Drivers and Workers
      • Driver: the client program, which holds the SparkContext and creates RDDs.
      • Workers: run Spark executors, on cluster nodes in production or in local threads during development.
  18. Contents: Myself; Introduction to Big Data; Hive Examples; Spark Cores; RDD; Spark SQL, Streaming, ML; Hydrogen Gas Power Plant Prediction Model.
  19. RDD
      • Resilient Distributed Dataset (RDD): a distributed collection of objects that can be cached in memory.
      • Variants: RDD, DStream, SchemaRDD, PairRDD.
      • Immutable.
      • Lineage: the history of how each dataset was derived, which lets Spark automatically and efficiently recompute lost data.
  20. RDD Operations
      • Transformations define new RDDs from the current one. They are lazy, so nothing is computed immediately. Examples: map(), filter(), join().
      • Actions return values and trigger the computation. Examples: count(), collect(), take(), save().
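
The difference between lazy transformations and actions can be illustrated without a cluster. The class below is a toy stand-in written for this note, not Spark's RDD implementation: `LazyDataset`, `to_celsius`, and the sample readings are all hypothetical. Transformations only record a step in the lineage; actions replay that lineage over the untouched source data.

```python
from itertools import islice


class LazyDataset:
    """Toy RDD-like collection: immutable, with recorded lineage."""

    def __init__(self, source, steps=()):
        self._source = source        # original data, never modified
        self._steps = tuple(steps)   # recorded transformations (the lineage)

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        """Replay the lineage over the source, one element at a time."""
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []  # records when the mapped function actually executes


def to_celsius(fahrenheit):
    calls.append(fahrenheit)
    return (fahrenheit - 32) * 5 / 9


readings = LazyDataset([50, 68, 86, 104])
hot = readings.map(to_celsius).filter(lambda c: c > 15)
computed_before_action = len(calls)   # still 0: nothing has run yet
result = hot.collect()                # the action triggers the work
```

Since the source is never modified and every derived dataset carries its own list of steps, any result can be rebuilt from the source, which is the idea behind recomputing lost data from lineage.
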
  21. Programming in Spark
      • Scala: functional programming, where functions are the fundamental building block (inputs and outputs can themselves be functions), with no side effects and no state.
      • Python: a legacy language with large libraries.
      • Java.
  23. Spark
      • Spark SQL: turns an RDD into a relation that can be queried with SQL.
      • Spark Streaming: a DStream is a sequence of RDDs over streaming data; windows select a portion of a DStream from the stream.
      • MLlib: sparse vector support, decision trees, linear and logistic regression, SVD, and PCA.
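
The windowing idea mentioned for Spark Streaming can be sketched in plain Python. This is an illustration only: it counts in micro-batches rather than in time durations, and the function name, parameter names, and sample batches are assumptions made for the example.

```python
from collections import deque


def windowed_sums(batches, window_length, slide_interval):
    """Sum the values in the last `window_length` batches,
    emitting one result every `slide_interval` batches."""
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide_interval == 0:
            results.append(sum(sum(b) for b in window))
    return results


# Hypothetical stream: each inner list is one micro-batch of sensor values.
batches = [[1, 2], [3], [4, 5], [6]]
sums = windowed_sums(batches, window_length=3, slide_interval=2)
```

Each micro-batch plays the role of one RDD in the DStream, and the window selects the most recent few of them for each computation.
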
  24. Spark: Hydrogen gas power plant Spark model
      • Separate the label column.
      • Create the RDD.
      • Split the data into training and test sets.
      • Train the model with the decision forest regression algorithm.
      • Evaluate the result.
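
The steps on this slide can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the authors' Spark code: it skips RDD creation, uses synthetic rows instead of station data, and stands in a tiny bagged ensemble of depth-1 regression trees for the decision forest. All function names and parameters are hypothetical.

```python
import random
from statistics import mean


def split_label(rows, label_index):
    """Separate the label column from the feature columns."""
    features = [[v for i, v in enumerate(r) if i != label_index] for r in rows]
    labels = [r[label_index] for r in rows]
    return features, labels


def train_test_split(X, y, test_fraction, rng):
    """Shuffle row indices and hold out a fraction for testing."""
    ids = list(range(len(y)))
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test_ids, train_ids = ids[:n_test], ids[n_test:]
    take = lambda chosen: ([X[i] for i in chosen], [y[i] for i in chosen])
    return take(train_ids), take(test_ids)


def fit_stump(X, y):
    """Depth-1 regression tree: best single threshold by squared error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            err = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or err < best[0]:
                best = (err, f, t, ml, mr)
    if best is None:  # no usable split: fall back to the mean label
        m = mean(y)
        return lambda row: m
    _, f, t, ml, mr = best
    return lambda row: ml if row[f] <= t else mr


def fit_forest(X, y, n_trees, rng):
    """Train each tree on a bootstrap sample; predict the mean of the trees."""
    trees = []
    for _ in range(n_trees):
        ids = [rng.randrange(len(y)) for _ in y]
        trees.append(fit_stump([X[i] for i in ids], [y[i] for i in ids]))
    return lambda row: mean(tree(row) for tree in trees)


def rmse(model, X, y):
    """Root mean squared error of the model on held-out data."""
    return (sum((model(r) - t) ** 2 for r, t in zip(X, y)) / len(y)) ** 0.5


rng = random.Random(42)
# Synthetic rows (feature, feature, label); NOT real station measurements.
rows = [(float(t), float(t % 3), 100.0 + 5.0 * t) for t in range(20)]
X, y = split_label(rows, label_index=2)
(train_X, train_y), (test_X, test_y) = train_test_split(X, y, 0.2, rng)
forest = fit_forest(train_X, train_y, n_trees=25, rng=rng)
error = rmse(forest, test_X, test_y)
```

Holding out the test rows before training is what makes the final error an estimate of how the model behaves on data it has not seen.
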
  25. Spark: Hydrogen gas power plant Spark model
  27. Hydrogen Gas Power Plant Prediction Model: The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) was formally opened on May 7, 2014.
  28. Hydrogen Gas Power Plant Prediction Model: The station can produce hydrogen on site from renewable energy sources through electrolysis. The Cal State L.A. Hydrogen Research and Fueling Facility became the first station in the nation to sell hydrogen fuel by the kilogram to the public.
  29. Hydrogen Gas Power Plant Prediction Model: Workflow
  30. Hydrogen Gas Power Plant Prediction Model: Model
  31. Hydrogen Gas Power Plant Prediction Model: Results and observations
  32. Hydrogen Gas Power Plant Prediction Model: Results and observations. Our research shows that the model can predict Vehicle Pressure (the pressure of the hydrogen gas within the vehicle's Hydrogen Storage System). The algorithm used is decision forest regression. Decision forests are an ensemble learning method for classification, regression, and other tasks. They construct a multitude of decision trees at training time and output either the mode of the individual trees' classes (classification) or the mean of their predictions (regression).
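
The aggregation rule described on this slide is small enough to show directly. The sketch below covers only the final combining step, with made-up per-tree outputs; the function names are hypothetical and no trees are trained here.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(tree_predictions):
    """Regression forest output: the mean of the individual tree predictions."""
    return mean(tree_predictions)


def aggregate_classification(tree_votes):
    """Classification forest output: the mode (most common class) of the votes."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical outputs from three trees.
pressure_estimate = aggregate_regression([34.0, 36.0, 35.0])
pressure_class = aggregate_classification(["normal", "high", "normal"])
```

Averaging many trees tends to smooth out the errors of any single tree, which is the usual motivation for using a forest rather than one decision tree.
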
  33. Hydrogen Gas Power Plant Prediction Model: Results and observations. State of Charge (SOC) is the ratio of the hydrogen density within the vehicle storage system to the full-fill density. SOC is expressed as a percentage and is computed from the gas density using the formula shown on the slide. Our model predicts vehicle pressure, which in turn could be used to determine the state of charge.
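
The SOC definition given in words on this slide can be written as a formula. This is a reconstruction from that wording, not a copy of the slide's own formula. The symbols are assumed names: $\rho(P, T)$ is the hydrogen density in the vehicle storage system at its current pressure $P$ and temperature $T$, and $\rho_{\text{full}}$ is the full-fill density.

```latex
\[
  \mathrm{SOC} \;=\; \frac{\rho(P, T)}{\rho_{\text{full}}} \times 100\%
\]
```

Because gas density depends on pressure and temperature, a predicted vehicle pressure together with a temperature reading would be enough to estimate $\rho(P, T)$ and therefore the SOC.
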
  34. Questions?
