Data Science Training | Data Science For Beginners | Data Science With Python Course | Simplilearn

This Data Science presentation will help you understand what Data Science is, who a Data Scientist is, what a Data Scientist does, and how Python is used for Data Science. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge or insights from data in various forms, both structured and unstructured, similar to data mining. This Data Science tutorial will help you build your skills in analytical techniques using Python. With this Data Science video, you’ll learn the essential concepts of Data Science with Python programming and also understand how data acquisition, data preparation, data mining, model building and testing, and data visualization are done. This Data Science tutorial is ideal for beginners who aspire to become a Data Scientist.

This Data Science presentation will cover the following topics:

1. What is Data Science?
2. Who is a Data Scientist?
3. What does a Data Scientist do?

This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.

Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. A data scientist is the pinnacle rank in an analytics organization. Glassdoor ranked data scientist first in its 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist, you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.

You can gain in-depth knowledge of Data Science by taking our Data Science with Python certification training course. With Simplilearn’s Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques. Those who complete the course will be able to:

1. Gain an in-depth understanding of data science processes: data wrangling, data exploration, data visualization, hypothesis building, and testing, along with the basics of statistics
2. Install the required Python environment and other auxiliary tools and libraries
3. Understand the essential concepts of Python programming such as data types, tuples, lists, dicts, basic operators, and functions
4. Perform high-level mathematical computing using the NumPy package and its large library of mathematical functions

Learn more at: https://www.simplilearn.com

Data Science Training | Data Science For Beginners | Data Science With Python Course | Simplilearn

  1. Data Science Tutorial
  2. What’s in it for you? What is Data Science? Who is a Data Scientist? What are the skills of a Data Scientist? What does a Data Scientist do? Data Acquisition Data Preparation Data Mining with Tableau Model Building & Testing Model Maintenance Summary
  3. What is Data Science? Data Science is the area of study which involves extracting knowledge from all the data you can gather
  4. What is Data Science? Hence, translating a business problem into a research project, and then translating the results back into a practical solution, requires a deep understanding of the business domain and creativity
  5. What is Data Science? Fraudulent transactions are ramping up on the internet, and the results can be devastating
  6. What is Data Science? Data Science can help in fraud detection using advanced machine learning algorithms and prevent great monetary losses
  7. What is Data Science? And the people doing all this work are called Data Scientists
  8. What does a Data Scientist do? Data Acquisition Data Preparation Data Mining Data Modelling Model Maintenance
  9. What does a Data Scientist do? Data Acquisition Data Preparation Data Mining Data Modelling Model Maintenance
  10. Data Acquisition involves extracting data from multiple source systems, integrating and transforming the data into a homogeneous format, and loading the transformed data into a data warehouse. Multiple source systems (databases, flat files, etc.)
  11. Data Acquisition involves extracting data from multiple source systems, integrating and transforming the data into a homogeneous format, and loading the transformed data into a data warehouse. Transformation
  12. Data Acquisition involves extracting data from multiple source systems, integrating and transforming the data into a homogeneous format, and loading the transformed data into a data warehouse. Data Warehouse
  13. Data Acquisition: Multiple source systems (databases, flat files, etc.) → Transformation → Data warehouse
  14. Data Acquisition: This process is commonly referred to as ETL (Extract, Transform, Load)
  15. Data Acquisition: Various tools are available for the ETL process, like Talend Studio, DataStage, and Informatica
  16. What does a Data Scientist do? Data Acquisition Data Preparation Data Mining Model Building Model Maintenance
  17. Data Preparation: The most essential part of any Data Science project is Data Preparation. It consumes 60% of the time spent on the project. There are many things we can do to ensure data is used in the most productive and meaningful manner
  18. Data Preparation covers Data Cleaning, Data Transformation, Handling Outliers, Data Integration, and Data Reduction. DATA CLEANING: • Data cleaning is important because bad data may lead to a bad model • It handles missing values, which may be due to various reasons • NULL or unwanted values are also handled in this step • It improves business decisions and increases productivity
  19. Data Preparation – DATA TRANSFORMATION: • Data Transformation turns raw data formats into desired outputs • It involves normalization of data • Min-max normalization • Z-score normalization
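
A minimal sketch of the two normalization formulas on a toy column (the data and column name here are illustrative, not from the deck):

    import pandas as pd

    # A toy numeric column to illustrate both normalizations
    df = pd.DataFrame({'value': [10, 20, 30, 40, 100]})

    # Min-max normalization: rescale to the [0, 1] range
    col = df['value']
    df['value_minmax'] = (col - col.min()) / (col.max() - col.min())

    # Z-score normalization: zero mean, unit standard deviation
    df['value_z'] = (col - col.mean()) / col.std()

    print(df)
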
  20. Data Preparation – HANDLING OUTLIERS: • Outliers are observations which are distant from the rest of the data • They can be good or bad for the data • Univariate and multivariate methods are used for detection • Outliers can be used for fraud detection • Plots used: scatter plots, box plots
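
A common univariate detection method is the IQR rule that underlies box plots; a small illustrative sketch (toy data, not from the deck):

    import pandas as pd

    # Toy data with one obvious outlier
    s = pd.Series([12, 14, 13, 15, 14, 13, 90])

    # IQR rule: flag points beyond 1.5 * IQR from the quartiles
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
    print(outliers)  # the value 90 is flagged
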
  21. Data Preparation – DATA INTEGRITY: • Data integrity means that data is accurate and reliable • It is important wherever business decisions are regularly made on the basis of data
  22. Data Preparation – DATA REDUCTION: • This step reduces multitudinous amounts of data into meaningful parts • It increases effective storage capacity • It helps reduce costs • It involves data deduplication, which eliminates redundant data
  23. Data Preparation – Data Cleaning: Data cleaning is one of the most common tasks of data preparation, which involves ensuring that the data is: Valid, Consistent, Uniform, Accurate
  24. Data Preparation – Data Cleaning: Let’s have a look at a bank’s data with its customer details
  25. Data Preparation – Data Cleaning: We can import and read the data using the pandas library of Python
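
A minimal sketch of this step, assuming a hypothetical file name for the bank's CSV:

    import pandas as pd

    # File name is hypothetical; point this at the bank's CSV export
    data = pd.read_csv('bank_customers.csv')

    # Quick look at the rows and the missing-value counts per column
    print(data.head())
    print(data.isnull().sum())
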
  26. Data Preparation – Data Cleaning: Let’s look at the Geography column, which has some missing values. In this case, we might not want to assume the location of the customers
  27. Data Preparation – Data Cleaning: So, let’s fill the missing values with an empty string
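
In pandas, that might look like the following (assuming the `data` DataFrame loaded above):

    # Replace missing Geography entries with an empty string
    # rather than guessing a customer's location
    data['Geography'] = data['Geography'].fillna('')
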
  28. Data Preparation – Data Cleaning: In numerical columns, filling the missing values with the mean can help us even out the data set: data.CreditScore = data.CreditScore.fillna(data.CreditScore.mean())
  29. Data Preparation – Data Cleaning: We can remove all rows with unwanted values; however, this is a very aggressive step. Dropping all rows with any NA values is easy: data.dropna()
  30. Data Preparation – Data Cleaning: We can also drop rows that have all NA values: data.dropna(how='all')
  31. Data Preparation – Data Cleaning: We can also set a threshold; for example, each row needs to have at least 5 non-null values: data.dropna(thresh=5)
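
Note that dropna returns a new DataFrame rather than modifying `data` in place, so assign the result; a sketch combining the three variants from the slides:

    # dropna returns a copy; assign the result to keep it
    cleaned_any = data.dropna()             # drop rows with any NA value
    cleaned_all = data.dropna(how='all')    # drop rows where every value is NA
    cleaned_min5 = data.dropna(thresh=5)    # keep rows with >= 5 non-null values
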
  32. What does a Data Scientist do? Data Acquisition Data Preparation Data Mining Model Building Model Maintenance
  33. Data Mining: After acquiring the data, preparing and cleaning it, a Data Scientist discovers patterns and relationships in the data to make better business decisions
  34. Data Mining: It’s a discovery process to get hidden and useful knowledge. It is commonly referred to as ‘Data Mining’
  35. Data Mining: Let me show you how to do data mining using Tableau.
  36. Data Mining using Tableau: You can easily download Tableau software from https://www.tableau.com/
  37. Data Mining using Tableau: You can connect your csv file to Tableau Desktop to start mining the data! You can connect to various other sources as well
  38. Data Mining using Tableau: Tableau will automatically identify the fields as Dimensions (descriptive) and Measures (numeric fields)
  39. Data Mining using Tableau – Problem Statement: It is important to understand the problem statement first! In this problem, we have a dataset of a bank’s customers. We want to understand the exit behavior of the customers based on different variables, for instance: • Gender • Credit card holding • Geography
  40. Data Mining using Tableau: Now, we have a dataset of about 10,000 rows as shown below, and we need to find out which potential customers will exit first
  41. Data Mining using Tableau: Let’s first evaluate on the basis of gender, to understand if this variable will affect the model
  42. Data Mining using Tableau: We have moved the variable ‘exited’ from Measures to Dimensions
  43. Data Mining using Tableau: Here, Tableau recognizes the ‘exited’ column as a measure because of its numerical datatype. However, ‘exited’ is categorical for us, as it tells whether a person has left or not. So, we have placed ‘exited’ in Dimensions, as seen in the screenshot
  44. Data Mining using Tableau: We put ‘gender’ in columns and ‘exited’ on colors, where: 0 – people who did not exit, 1 – people who did exit
  45. Data Mining using Tableau: For better understanding, let’s prepare a bar chart which depicts the male/female customers who stayed/exited
  46. Data Mining using Tableau: You can give aliases describing that 0 means stayed and 1 means exited
  47. Data Mining using Tableau: To compare the exit rates, you can also add a reference line by right-clicking on the right axis and selecting ‘Add Reference Line’
  48. Data Mining using Tableau
  49. Data Mining using Tableau: So, we have added the reference line, and we can say that female customers exit more than average as compared to male customers!
  50. Data Mining using Tableau: We can also evaluate on the basis of whether or not customers have a credit card. Again, whether a person has a credit card is categorical, so we put it into Dimensions and then evaluate. As explained earlier, we have added the alias and a reference line.
  51. Data Mining using Tableau: Here, we can see that we cannot differentiate the exit behavior between people having a credit card and those who do not. Hence, a customer leaving the bank is not dependent on whether the person holds a credit card
  52. Data Mining using Tableau: Similarly, we can evaluate on the basis of geography. Here, we can see that more customers leave the bank in Germany than in France or Spain. So, we treat this as an anomaly, and it can impact the model
  53. Data Mining using Tableau: So, we can ignore the variable ‘has credit card’ because it will not impact the model
  54. Data Mining using Tableau: And focus on other variables like ‘gender’ and ‘geography’, which can actually have an impact
  55. Data Mining using Tableau: So, we can make business decisions based on these findings. Let’s see some advantages of data mining
  56. Advantages of Data Mining: • Predicts future trends • Signifies customer patterns • Helps decision making • Quick fraud detection • Choosing the right algorithm
  57. What does a Data Scientist do? Data Acquisition Data Preparation Data Mining Model Building Model Maintenance
  58. What is Model Building? The model is built by selecting a machine learning algorithm that suits the data, use case, and available resources
  59. What is Model Building? Many choose to build more than one model and go ahead only with the best fit
  60. What is Model Building? There are many machine learning algorithms, chosen based on the data and the problem at hand
  61. Machine Learning Algorithms used by Data Scientists (Categorical vs. Continuous, Supervised vs. Unsupervised): • Regression: Linear, Multiple • Decision Trees • Random Forest • Classification: KNN, Logistic Regression, SVM, Naïve Bayes • Clustering: K-Means, PCA • Association Analysis • Hidden Markov Model
  65. What is Model Building? Let me take you through a use case for better understanding
  66. Model Building - Multiple Linear Regression: Predicting World Happiness. Questions to ask: 1) How do we describe the data? 2) Can we make a predictive model to calculate the happiness score?
  67. Variables used in the model: Happiness Rank, Happiness Score, Country, Region, Economy, Family, Health, Freedom, Trust, Generosity, Dystopia Residual
  68. Model Building - Multiple Linear Regression, variables used in the data: Happiness Rank: determined by a country’s happiness score. Happiness Score: a score given to a country, computed by adding up the (normalized) ratings that its population has given to each category. Country: the country in question. Region: the region that the country belongs to (different from continent). Economy: GDP per capita of the country. Family: quality of family life, nuclear and joint family. Health: healthcare availability and average life expectancy in the country. Freedom: how much an individual is able to conduct themselves based on their free will. Trust: confidence that the government is not corrupt. Generosity: how much the country is involved in peacekeeping and global aid. Dystopia Residual: the Dystopia happiness score (1.85), i.e. the score of a hypothetical country that ranks lower than the lowest-ranking country in the report, plus each country’s residual value (a number left over from the normalization of the variables which cannot be explained).
  69. Model Building - Multiple Linear Regression: Importing the Python libraries
  70. Model Building - Multiple Linear Regression: Preparing and describing the data • Read the csv file using pandas’ read_csv and put it in a dataframe ‘happiness_2015’ • Similarly for 2016 and 2017, store the data in happiness_2016 and happiness_2017 respectively
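
A sketch of this step (the file names are hypothetical; match them to your copies of the World Happiness Report CSVs):

    import pandas as pd

    # Load each year's report into its own DataFrame
    happiness_2015 = pd.read_csv('2015.csv')
    happiness_2016 = pd.read_csv('2016.csv')
    happiness_2017 = pd.read_csv('2017.csv')
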
  71. Model Building - Multiple Linear Regression: Happiness_2016
  72. Model Building - Multiple Linear Regression: Happiness_2017
  73. Model Building - Multiple Linear Regression: We can concatenate the three data sets, or we can build a model separately for one csv. head() shows the top countries with the highest happiness score
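
A possible sketch of the concatenation (it assumes the yearly files share column names; in the public Kaggle versions the column names differ slightly across years and may need renaming first):

    # Stack the three years into a single DataFrame
    happiness = pd.concat([happiness_2015, happiness_2016, happiness_2017],
                          ignore_index=True)

    # The top rows of the score-sorted frame are the happiest countries
    print(happiness.sort_values('Happiness Score', ascending=False).head())
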
  74. Model Building - Multiple Linear Regression: We can create a visual which gives us a more appealing view of where each country is placed in the world ranking report
  75. Model Building - Multiple Linear Regression: The lighter-colored countries have a lower ranking. The darker-colored countries have a higher ranking in the report (i.e. are the “happiest”)
  76. Model Building - Multiple Linear Regression: We can find out the correlation between Happiness Score and Happiness Rank by plotting a scatterplot
  77. Model Building - Multiple Linear Regression: And the code is right here!
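
The slide's code is a screenshot; a minimal matplotlib equivalent, assuming the column names 'Happiness Score' and 'Happiness Rank', might look like:

    import matplotlib.pyplot as plt

    # Happiness Score vs. Happiness Rank; the downward-sloping cloud
    # shows their negative correlation
    plt.scatter(happiness['Happiness Score'], happiness['Happiness Rank'])
    plt.xlabel('Happiness Score')
    plt.ylabel('Happiness Rank')
    plt.show()
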
  78. Model Building - Multiple Linear Regression: Let’s dissect the graph: • Happiness score determines how the country is ranked, so happiness score is the predictor and happiness rank is the dependent variable • The higher the score, the lower the numerical rank, and the higher the happiness rating • Therefore, happiness score and rank are negatively correlated (as score increases, rank decreases)
  79. Model Building - Multiple Linear Regression: However, we see that rank is directly dependent on Happiness Score. So, we can drop rank from our data frame
  80. Model Building - Multiple Linear Regression: We can draw a heat map and see the correlation between the variables
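
A seaborn-based sketch of the heat map (seaborn is an assumption; the slide does not name the plotting library):

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Correlation matrix of the numeric columns, drawn as a heat map
    corr = happiness.select_dtypes('number').corr()
    sns.heatmap(corr, annot=True, cmap='RdBu_r')
    plt.show()
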
  81. Model Building - Multiple Linear Regression: We can see that the happiness score is strongly correlated with economy and health, followed by family and freedom.
  82. Model Building - Multiple Linear Regression: Thus, in our model, we should see strong correlation between these variables when finding the coefficients.
  83. Model Building - Multiple Linear Regression: Now, we can start to use sklearn to construct the model. We drop the categorical values and happiness rank because we will not explore them in this report
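
A sketch of that selection, with column names taken from the slides' variable list:

    # Target is the happiness score; predictors are the remaining
    # numeric columns
    y = happiness['Happiness Score']
    X = happiness.drop(columns=['Country', 'Region',
                                'Happiness Rank', 'Happiness Score'])
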
  84. Model Building - Multiple Linear Regression: Let’s divide the dataset into train and test sets for further model building and testing. The data is split in an 80:20 ratio between the train and test datasets respectively
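
With scikit-learn, the 80:20 split might look like this (the random_state value is an illustrative choice):

    from sklearn.model_selection import train_test_split

    # 80:20 split; random_state fixes the shuffle for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
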
  85. Model Building - Multiple Linear Regression: Then, we import sklearn’s linear regression and fit the regression to the training set
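
A minimal sketch of the fit:

    from sklearn.linear_model import LinearRegression

    # Fit ordinary least squares regression on the training data
    model = LinearRegression()
    model.fit(X_train, y_train)
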
  86. Model Building - Multiple Linear Regression: Evaluate the predicted values and calculate the difference
  87. Model Building - Multiple Linear Regression: We can print the intercept and coefficients for the train dataset
  88. Model Building - Multiple Linear Regression: Trick: right after you have trained your model, you know the order of the coefficients, so you can print each coefficient with its corresponding feature. Tip: you can also use a dictionary to reuse the coefficients later
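
A sketch of that trick, relying on the fact that model.coef_ follows the column order of X_train:

    # Pair each coefficient with its feature name
    print('Intercept:', model.intercept_)
    for feature, coef in zip(X_train.columns, model.coef_):
        print(f'{feature}: {coef:.6f}')

    # A dict keeps the coefficients handy for later reuse
    coefficients = dict(zip(X_train.columns, model.coef_))
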
  89. Model Building - Multiple Linear Regression: For testing the model, let’s find the mean error
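
A sketch using scikit-learn's metrics (the slide does not say which error metric it uses, so both common ones are shown):

    from sklearn.metrics import mean_absolute_error, mean_squared_error

    # Score the model on the held-out test set
    y_pred = model.predict(X_test)
    print('Mean absolute error:', mean_absolute_error(y_test, y_pred))
    print('Mean squared error:', mean_squared_error(y_test, y_pred))
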
  90. Model Building - Multiple Linear Regression: Using sklearn’s predict method, we can use this model to predict the happiness scores for the first 100 countries in our model. How do these predictions compare to the actual values in our data?
  91. Model Building - Multiple Linear Regression: Let’s create a plot of our actual happiness score versus the predicted happiness score.
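
A minimal sketch of that plot, comparing the held-out actual scores with the model's predictions:

    import matplotlib.pyplot as plt

    # Points close to the diagonal mean the prediction is close
    # to the actual score
    plt.scatter(y_test, y_pred)
    plt.xlabel('Actual Happiness Score')
    plt.ylabel('Predicted Happiness Score')
    plt.show()
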
  92. Model Building - Multiple Linear Regression: Hence, we can say that our model is a pretty good indicator of the actual happiness score, as there is a strong positive correlation between the actual and the predicted values
  93. Model Building - Multiple Linear Regression: The Multiple Linear Regression model for Happiness Score: HappinessScore = 0.0001289 + (1.000048 * Economy) + (0.999997 * Family) + (0.999865 * Health) + (0.999920 * Freedom) + (0.999993 * Trust) + (1.000029 * Generosity) + (0.999971 * DystopiaResidual)
  94. Communicate Results
  95. Communicate Results: But the battle is not over yet!! Before you present your data model and insights, first understand: What is your goal? Who is your target audience?
  96. Communicate Results: Before you present your data model and insights, first understand: What is your goal? Who is your target audience?
  97. Communicate Results: After you are clear about your objective, think about: • What’s the question? • What’s your methodology? • What’s the answer?
  98. Communicate Results: Then, you’re finally ready to communicate your results to the business teams so that the work easily moves into the execution phase
  99. What does a Data Scientist do? Data Acquisition Data Preparation Data Mining Model Building Model Maintenance
  100. Model Maintenance: After we have deployed the model, it is also important to prevent the model from deteriorating. How can we do that?
  101. Model Maintenance: ASSESS: Once in a while, we have to run a fresh sample of data through the model to assess it. RETRAIN: After assessment, if you are not happy with the performance of the model, then you retrain it. REBUILD: If retraining fails too, then you have to rebuild, finding all the variables again and building a new model
  102. Summary: What is data science? What does a data scientist do? Data acquisition, Data preparation, Data mining using Tableau, Model building, Model maintenance
  103. Model Building Using ARIMA: Plot the data. We can observe that the mean and variance are changing with time; hence, this is a non-stationary series
  104. Model Building Using ARIMA: Autocorrelation Function (ACF)
  105. Model Building Using ARIMA: Partial Autocorrelation Function (PACF)
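
The deck does not include the ARIMA code; a sketch using statsmodels on a synthetic random-walk series shows how the ACF and PACF plots are typically produced:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # Synthetic non-stationary series (random walk) as a stand-in
    # for the data plotted on the slide
    rng = np.random.default_rng(0)
    series = pd.Series(rng.normal(size=200).cumsum())

    # Difference once to move toward stationarity, then inspect the
    # ACF and PACF to choose the ARIMA orders q and p
    diff = series.diff().dropna()
    plot_acf(diff, lags=30)
    plot_pacf(diff, lags=30)
    plt.show()
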

Editor’s notes

  • Remove title case
  • The important deciding variable seems to be score (as it determines rank).
  • The darker red the square, the stronger the positive correlation, and obviously, variables will have a correlation of 1 with each other.
  • RETRAIN: The variables remain the same, but you train your model with a fresh sample of data, so the coefficients will change
  • Natural language processing to enable it to communicate successfully in English (or some other human language).
    Knowledge representation to store information provided before or during the interrogation.
    Automated reasoning to use the stored information to answer questions and to draw new conclusions.
    Machine learning to adapt to new circumstances and to detect and extrapolate patterns.
