Data Analysis Course
Descriptive Statistics(Version-1)
Venkat Reddy
Data Analysis Course
• Data analysis design document
• Introduction to statistical data analysis
• Descriptive statistics
• Data exploration, validation & sanitization
• Probability distributions examples and applications
• Simple correlation and regression analysis
• Multiple linear regression analysis
• Logistic regression analysis
• Testing of hypothesis
• Clustering and decision trees
• Time series analysis and forecasting
• Credit Risk Model building-1
• Credit Risk Model building-2
Note
• This presentation is just class notes. I wrote these course notes for the
  Data Analysis Training as an aid for myself.
• The best way to treat this is as a high-level summary; the
  actual session went more in depth and contained other
  information.
• Most of this material was written as informal notes, not
  intended for publication.
• Please send questions/comments/corrections to
  venkat@trendwiseanalytics.com or 21.venkat@gmail.com
• Please check my website for the latest version of this document.
                                         -Venkat Reddy
Contents
• What are Descriptive statistics
• Frequency tables and graphs, Histograms
• Central Tendency
• Mean, Median, Mode
• Dispersion
• Range, variance, standard deviation
• Quartiles, Percentiles
• Box Plots
• Bivariate Descriptive Statistics
    • Contingency Tables
    • Correlation
    • Regression
Why Descriptive statistics?
• Who is a better ODI batsman: Sachin or Muralidharan?
   • Batting average
• Who is more reliable: Dhoni or Afridi?
   • Score variance
• A triangular series among Aus, Eng & New Zealand: who will win?
   • Most number of wins (mode)
• I am going to buy shoes. Which brand has more variety: Power or Adidas?
   • Price range (range)

• We used average, variance, mode, and range to make some inferences.
  These are nothing but descriptive statistics.
• Descriptive statistics tell us what happened in the past.
• Descriptive statistics avoid inferences, but they help us get a feel
  for the data.
• Sometimes they are good enough to make an inference.
Descriptive Statistics
• A statistic or a measure that describes the data
  • Average salary of employees
• Describing data with tables and graphs (quantitative or
  categorical variables)
• Numerical descriptions
  • Center – Give some example measures of center of the data
  • Variability – Give some example measures of variability of the
    data
• Bivariate descriptions (in practice, most studies have several
  variables)
  • Dependency measures (correlation)
Simple Descriptive Statistics
• N
• Sum
• Min
• Max
• Average
• Frequency of each level
• Variance
• Standard deviation

These simple descriptive statistics will be used in inferential
statistics later.
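
As a rough illustration of the list above, here is a minimal Python sketch that computes each of these statistics. The `salaries` and `grades` data are made up for the example, and the variance and standard deviation divide by n (the population form), which is an assumption about which form is wanted.

```python
import statistics
from collections import Counter

# Hypothetical sample: salaries (in thousands) and a categorical grade per employee.
salaries = [32, 45, 38, 45, 60, 41, 45, 52]
grades = ["A", "B", "A", "C", "B", "A", "A", "B"]

summary = {
    "n": len(salaries),
    "sum": sum(salaries),
    "min": min(salaries),
    "max": max(salaries),
    "average": statistics.mean(salaries),
    # Population forms: both divide by n rather than n - 1.
    "variance": statistics.pvariance(salaries),
    "std_dev": statistics.pstdev(salaries),
}

# Frequency of each level of the categorical variable.
frequency = Counter(grades)

for name, value in summary.items():
    print(f"{name:>8}: {value}")
print("frequency:", dict(frequency))
```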
Frequency tables & Histograms
• Frequency distribution: Lists possible values of variable and
  number of times each occurs
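
A frequency distribution can be sketched in a few lines of Python. The `cars_owned` data below is hypothetical, and the row of asterisks is only a crude text stand-in for a histogram.

```python
from collections import Counter

# Hypothetical data: number of cars owned by 12 households.
cars_owned = [0, 1, 1, 2, 1, 0, 3, 1, 2, 1, 0, 2]

# Count how many times each value occurs.
freq = Counter(cars_owned)

# Print the frequency table in order of value, with one '*' per occurrence.
print("value | count | bar")
for value in sorted(freq):
    print(f"{value:>5} | {freq[value]:>5} | {'*' * freq[value]}")
```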
Shapes of histograms
• Bell-shaped (IQ, SAT, political ideology in all U.S.)
• Skewed right
  • Example: annual income
  • No. of times arrested
• Skewed left
  • Score on easy exam
  • Daily level of excitement in office
• Bimodal
  • Hardworking days in a year (peaks near mid-year & year-end
    appraisal)
Lab : Histogram
• Create a histogram on variable ‘actual’ in prdsale data
  • How many modes?
  • What is the skewness?
  • What is its kurtosis?
• Create a histogram on variable ‘msrp’ in cars data
  • How many modes?
  • What is the skewness?
  • What is its kurtosis?
• Create a histogram on variable ‘weight’ in cars data
  • How many modes?
  • What is the skewness?
  • What is its kurtosis?
• Compare the above three histograms.
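
These notes do not give formulas for skewness and kurtosis, so the following is only a sketch using the plain moment-based definitions (third and fourth central moments divided by powers of the second). Statistical packages often report bias-adjusted versions, so their numbers may differ slightly from these. The two small data sets are made up.

```python
def shape_measures(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skewness, excess_kurtosis

# Hypothetical data: one symmetric set, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(shape_measures(symmetric))
print(shape_measures(right_skewed))
```

A positive skewness points to a longer right tail and a negative one to a longer left tail.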
Central tendency
• What is the flight fare from Bangalore to Delhi? 3500: exact or
  average?
• What is central tendency? - Average
• Three types of averages
  • Mean
  • Median
  • Mode
Mean
• Center of gravity
• Evenly partitions the sum of all measurements among all cases;
  average of all measures

  $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

• Crucial for inferential statistics
• Mean is not very resistant to outliers (see Median)
Median
• What is the mean of [0.1, 0.8, 0.4, 0.3, 0.1, 0.4, 9.0, 0.1, 0.9,
  0.3, 1.0, 0.3, 0.1]?
• Guess without calculation: around 0.5?
• Now calculate the mean
• Median is exactly in the middle. Isn’t the mean exactly in the
  middle?
• Order the observations in ascending or descending order and
  pick the middle observation
• Less useful for inferential purposes
• More resistant to effects of outliers
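
The question above can be checked with a short Python sketch on the same 13 values. Dropping the 9.0 to form `trimmed` is an extra step added here only to show how far one large value pulls the mean.

```python
import statistics

values = [0.1, 0.8, 0.4, 0.3, 0.1, 0.4, 9.0, 0.1, 0.9, 0.3, 1.0, 0.3, 0.1]

mean_value = statistics.mean(values)
median_value = statistics.median(values)

# Remove the single large observation to see its effect on the mean.
trimmed = [v for v in values if v != 9.0]
mean_without_outlier = statistics.mean(trimmed)

print(f"mean              = {mean_value:.3f}")
print(f"median            = {median_value:.3f}")
print(f"mean without 9.0  = {mean_without_outlier:.3f}")
```

The mean comes out a little above 1, well over the 0.5 guess, while the median stays at 0.3.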
Calculation of Median
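        rim diameter (cm)

              unit 1   unit 2
                 9.7      9.0
                11.5     11.2
                11.6     11.3
                12.1     11.7
                12.4     12.2
                12.6     12.5
                         13.2
                13.1     13.8
                13.5     14.0
                13.6     15.5
                14.8     15.6
                16.3     16.2
                26.9     16.4

      Median    12.9     13.2

• Unit 1 has 12 observations, so the median is the average of the two
  middle values: (12.6 + 13.1) / 2 = 12.85, shown as 12.9
• Unit 2 has 13 observations, so the median is the 7th value: 13.2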
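
The two cases in the table (even and odd number of observations) can be written as one small function. This is a minimal sketch; the function name `median` and the list names are chosen here for illustration.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Rim diameters (cm) from the table.
unit1 = [9.7, 11.5, 11.6, 12.1, 12.4, 12.6, 13.1, 13.5, 13.6, 14.8, 16.3, 26.9]
unit2 = [9.0, 11.2, 11.3, 11.7, 12.2, 12.5, 13.2, 13.8, 14.0, 15.5, 15.6, 16.2, 16.4]

print(f"unit 1 median: {median(unit1):.2f}")
print(f"unit 2 median: {median(unit2):.2f}")
```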
Mode
• How do you express the average size of shoes?
  • 6.567 or 6?

• Mode is the most numerous category
• Can be more or less created by the grouping procedure
• For theoretical distributions: simply the location of the peak
  on the frequency distribution
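
A minimal sketch of the shoe-size point, using made-up sales data: the mean need not be a size anyone can buy, while the mode always is.

```python
import statistics

# Hypothetical shoe sizes sold in one day.
shoe_sizes = [6, 7, 6, 8, 6, 7, 9, 6, 7, 5]

mean_size = statistics.mean(shoe_sizes)        # may not be a real shoe size
modal_size = statistics.mode(shoe_sizes)       # most frequent size
all_modes = statistics.multimode(shoe_sizes)   # every size tied for most frequent

print("mean size:", mean_size)
print("mode     :", modal_size)
print("all modes:", all_modes)
```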
Lab
• Run Proc means on the product data
• What is the mean of ‘msrp’ in cars data?
• Does it reflect the average value of price?
• What is the median of ‘msrp’ in cars data?
• Does it reflect the average value of price?
• Run Proc Univariate on the weight variable in cars data. Find
  mean, median & mode.
Dispersion
Person 1: What is the average depth of this river? 5 feet.
Person 2: I am 5.5 feet tall, I can easily cross it (and starts crossing it).
Person 2: Help... help.
Person 1: Sometimes just knowing the central tendency is not
sufficient.

• Measures of dispersion summarize the degree of
  clustering/spread of cases, esp. with respect to central
  tendency
  • range
  • variance
  • standard deviation
Range
• Max – Min
• R: range(x)

              unit 1   unit 2
                 9.7      9.0
                11.5     11.2
                11.6     11.3
                12.1     11.7
                12.4     12.2
                12.6     12.5
                13.1     13.2
                13.5     13.8
                13.6     14.0
                14.8     15.5
                16.3     15.6
                26.9     16.2
                         16.4
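
For the two columns above, the range can be computed with a short Python sketch. The helper name `value_range` is chosen here for illustration.

```python
def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

# Rim diameters (cm) from the table.
unit1 = [9.7, 11.5, 11.6, 12.1, 12.4, 12.6, 13.1, 13.5, 13.6, 14.8, 16.3, 26.9]
unit2 = [9.0, 11.2, 11.3, 11.7, 12.2, 12.5, 13.2, 13.8, 14.0, 15.5, 15.6, 16.2, 16.4]

print(f"unit 1 range: {value_range(unit1):.1f} cm")
print(f"unit 2 range: {value_range(unit2):.1f} cm")
```

Unit 1's range is more than twice unit 2's, and that difference comes almost entirely from the single 26.9 reading, which shows how sensitive the range is to one extreme value.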
Variance
 • Take the deviations from the mean: they cancel out, so their sum
   (and average) is zero
 • Hence square each deviation from the mean, then take the average of
   the squares
 • The average squared distance from the mean is the variance

   $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

• Units of variance are squared, which makes variance hard to interpret
• E.g., mean length = 22.6 mm, variance = 38 mm²
• What does this mean? I don’t know
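
The steps above can be followed one at a time in a small Python sketch. The `lengths_mm` data is hypothetical, and the function divides by n as in the formula on this slide.

```python
import statistics

def population_variance(values):
    """Average squared deviation from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    squared = [d ** 2 for d in deviations]
    return sum(squared) / n

# Hypothetical lengths in mm.
lengths_mm = [2, 4, 4, 4, 5, 5, 7, 9]

mean_length = sum(lengths_mm) / len(lengths_mm)
# The raw deviations cancel out, which is why they are squared first.
sum_of_deviations = sum(x - mean_length for x in lengths_mm)

print("sum of deviations:", sum_of_deviations)
print("variance (mm^2)  :", population_variance(lengths_mm))
print("stdlib pvariance :", statistics.pvariance(lengths_mm))
```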
Standard Deviation
• Square root of variance

  $$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

• Units are the same as the base measurements
• Mean = 22.6 mm, standard deviation = 6.2 mm
• Mean +/- sd (16.4 to 28.8 mm)
  • should give at least some intuitive sense of
    where most of the cases lie, barring major
    effects of outliers
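
Here is a minimal sketch of the "mean +/- sd" idea on made-up data: compute the mean and standard deviation (dividing by n, as in the formula), then count how many observations fall inside the interval.

```python
import math

def mean_and_sd(values):
    """Return (mean, population standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return mean, math.sqrt(variance)

# Hypothetical lengths in mm.
lengths_mm = [2, 4, 4, 4, 5, 5, 7, 9]

mean, sd = mean_and_sd(lengths_mm)
low, high = mean - sd, mean + sd
inside = [x for x in lengths_mm if low <= x <= high]

print(f"mean = {mean} mm, sd = {sd} mm")
print(f"mean +/- sd = ({low}, {high}) mm")
print(f"{len(inside)} of {len(lengths_mm)} observations fall inside")
```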
Quartiles & Percentiles
• pth percentile: p percent of observations below it, (100 - p)%
  above it.
• E.g., a 95th percentile score in CAT means 5% are above & 95% are
  below
• 1,2,3,4,5,6,7,8,9,10 - What is the 25th percentile?
• 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 - What is the
  25th percentile? What is the 80th percentile?

  • p = 50: median
  • p = 25: lower quartile (LQ)
  • p = 75: upper quartile (UQ)

• Interquartile range IQR = UQ - LQ
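
Percentiles for small data sets depend on the interpolation convention, and different packages pick different ones. The sketch below uses Python's `statistics.quantiles` with `method="inclusive"`, which is just one such convention and an assumption made here; other tools may give slightly different answers for the two exercises above. The helper name `percentile` is also chosen for illustration.

```python
import statistics

def percentile(values, p):
    """p-th percentile (1 <= p <= 99), linear interpolation, 'inclusive' convention."""
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    return cut_points[p - 1]

ten = list(range(1, 11))      # 1..10
twenty = list(range(1, 21))   # 1..20

p25_ten = percentile(ten, 25)
p25_twenty = percentile(twenty, 25)
p80_twenty = percentile(twenty, 80)

# Quartiles and interquartile range for 1..20.
lq, med, uq = statistics.quantiles(twenty, n=4, method="inclusive")
iqr = uq - lq

print("25th percentile of 1..10:", p25_ten)
print("25th percentile of 1..20:", p25_twenty)
print("80th percentile of 1..20:", p80_twenty)
print("LQ, median, UQ, IQR     :", lq, med, uq, iqr)
```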
Box Plots
• Quartiles portrayed graphically by box plots
  [Figure: box plot]
Box Plots
• Example: weekly TV watching for n=60, 3 outliers
  [Figure: box plot of weekly TV watching]
Box Plots Interpretation
• Box plots have a box from LQ to UQ, with the median marked. They
  portray a five-number summary of the data: Minimum, LQ,
  Median, UQ, Maximum
• Except for outliers, which are identified separately
• Outlier = observation falling
          below LQ – 1.5(IQR) or above UQ + 1.5(IQR)
• Ex. If LQ = 2, UQ = 10, then IQR = 8 and outliers are above 10 +
  1.5(8) = 22
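
The outlier rule above translates directly into code. This is a minimal sketch; the function names and the `weekly_hours` data are made up, while LQ = 2 and UQ = 10 are taken from the example.

```python
def outlier_fences(lq, uq, k=1.5):
    """Return (lower fence, upper fence) = (LQ - k*IQR, UQ + k*IQR)."""
    iqr = uq - lq
    return lq - k * iqr, uq + k * iqr

def find_outliers(values, lq, uq):
    """Observations falling below the lower fence or above the upper fence."""
    low, high = outlier_fences(lq, uq)
    return [x for x in values if x < low or x > high]

# LQ and UQ from the example; the observations are hypothetical.
low_fence, high_fence = outlier_fences(2, 10)
weekly_hours = [0, 5, 12, 22, 25, 40]
outliers = find_outliers(weekly_hours, 2, 10)

print("fences  :", low_fence, high_fence)
print("outliers:", outliers)
```

An observation exactly on a fence (22 here) is not flagged, because the rule says "below" and "above".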
Lab
• Run proc univariate on a variable from sample data in the SAS
  default library (prdsale / cars)
• Run proc means on actual & predicted variables from product
  sales data
• What are the values of range, variance, SD?
• What are the 1st, 2nd, 3rd & 4th quartile values?
• What is the 95th percentile?
• Use “all” option to display the box plots
Contingency Tables
• Cross classifications of categorical variables in which rows (typically)
  represent categories of the explanatory variable and columns represent
  categories of the response variable.
• Counts in “cells” of the table give the numbers of individuals at the
  corresponding combination of levels of the two variables

   Example: Happiness and Family Income of 1993 families (GSS 2008 data:
   “happy,” “finrela”)

                            Happiness
    Income          Very   Pretty   Not too   Total
    ------------------------------------------------
    Above Aver.      164      233        26     423
    Average          293      473       117     883
    Below Aver.      132      383       172     687
    ------------------------------------------------
    Total            589     1089       315    1993
Contingency tables
• Example: Percentage “very happy” is
  • 39% for above average income (164/423 = 0.39)
  • 33% for average income (293/883 = 0.33)
  • What percent for below average income?

                            Happiness
    Income     Very        Pretty      Not too      Total
    ------------------------------------------------------
    Above      164 (39%)   233 (55%)    26 (6%)      423
    Average    293 (33%)   473 (54%)   117 (13%)     883
    Below      132 (19%)   383 (56%)   172 (25%)     687
    ------------------------------------------------------
• What can we conclude? Does happiness depend on income, or is
  happiness independent of income?
• Inference questions for later chapters?
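
The row percentages in the table can be reproduced with a short Python sketch. The counts are the ones from the table; the dictionary layout and the function name `row_percentages` are choices made here for illustration.

```python
# Counts from the happiness-by-income table.
counts = {
    "Above average": {"Very": 164, "Pretty": 233, "Not too": 26},
    "Average":       {"Very": 293, "Pretty": 473, "Not too": 117},
    "Below average": {"Very": 132, "Pretty": 383, "Not too": 172},
}

def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: round(100 * n / total) for col, n in cells.items()}
    return result

percentages = row_percentages(counts)
for row, cells in percentages.items():
    print(f"{row:<14}", cells)
```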
Correlation
• Correlation describes strength of association between
  two variables
• Falls between -1 and +1, with sign indicating direction of
  association (formula & other details later)
• The larger the correlation in absolute value, the stronger
  the association (in terms of a straight line trend)
• Examples: (positive or negative, how strong?)
  • Mental impairment and life events, correlation =
  • GDP and fertility, correlation =
  • GDP and percent using Internet, correlation =
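
The formula is left for a later session in these notes, so the following is only a sketch using the standard Pearson correlation coefficient, which may or may not be the exact form presented later. The data sets are made up to show the two extremes, +1 and -1.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)

# Hypothetical data: y rises with x in one case and falls in the other.
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 6, 8, 10]
y_falling = [10, 8, 6, 4, 2]

print("rising :", pearson_r(x, y_rising))
print("falling:", pearson_r(x, y_falling))
```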
Strength of Association
• Correlation 0 No linear association
• Correlation 0 to 0.25 Negligible positive
  association
• Correlation 0.25-0.5  Weak positive
  association
• Correlation 0.5-0.75 Moderate positive




                                                     Venkat Reddy
                                               Data Analysis Course
  association
• Correlation >0.75 Very Strong positive
  association
• What are the limits for negative
  correlation




                                                   29
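
The bands above can be turned into a small lookup function. Two things in this sketch are assumptions rather than something the slide states: a value exactly on a boundary (such as 0.25) is put in the lower band, and negative correlations are labelled by mirroring the positive bands.

```python
def describe_correlation(r):
    """Label a correlation using the bands on this slide (boundaries are assumptions)."""
    if not -1 <= r <= 1:
        raise ValueError("correlation must lie between -1 and +1")
    size = abs(r)
    if size == 0:
        return "no linear association"
    if size <= 0.25:
        label = "negligible"
    elif size <= 0.5:
        label = "weak"
    elif size <= 0.75:
        label = "moderate"
    else:
        label = "very strong"
    # Assumption: negative values mirror the positive bands.
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} association"

for r in (0, 0.1, 0.4, 0.6, 0.9, -0.9):
    print(f"{r:>5}: {describe_correlation(r)}")
```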
Regression
• Regression analysis gives a line predicting y using
  x (algorithm & other details later)

• y = college GPA, x = high school GPA
• Predicted y = 0.234 + 1.002(x)
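
Using the fitted line from this slide, a prediction is just a substitution. A minimal sketch follows; the function name and the input value 3.0 are chosen here for illustration.

```python
# Intercept and slope of the fitted line from the slide.
INTERCEPT = 0.234
SLOPE = 1.002

def predict_college_gpa(high_school_gpa):
    """Predicted y = 0.234 + 1.002 * x."""
    return INTERCEPT + SLOPE * high_school_gpa

predicted = predict_college_gpa(3.0)
print(f"high school GPA 3.0 -> predicted college GPA {predicted:.2f}")
```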
Lab
• Create a contingency table for product sales data
• Find contingency tables for
  • Region by product type
  • Division by product type
• Find the correlation between actual sales and predicted sales.
• Find the correlation between weight & msrp in cars data
Venkat Reddy Konasani
Manager at Trendwise Analytics
venkat@TrendwiseAnalytics.com
21.venkat@gmail.com
+91 9886 768879
www.TrendwiseAnalytics.com/venkat

Más contenido relacionado

La actualidad más candente

Inferential statistics.ppt
Inferential statistics.pptInferential statistics.ppt
Inferential statistics.ppt
Nursing Path
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
Aiden Yeh
 
Univariate & bivariate analysis
Univariate & bivariate analysisUnivariate & bivariate analysis
Univariate & bivariate analysis
sristi1992
 

La actualidad más candente (20)

Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive Statistics
 
Statistical Data Analysis | Data Analysis | Statistics Services | Data Collec...
Statistical Data Analysis | Data Analysis | Statistics Services | Data Collec...Statistical Data Analysis | Data Analysis | Statistics Services | Data Collec...
Statistical Data Analysis | Data Analysis | Statistics Services | Data Collec...
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Type of data
Type of dataType of data
Type of data
 
1.2 types of data
1.2 types of data1.2 types of data
1.2 types of data
 
Descriptive statistics ii
Descriptive statistics iiDescriptive statistics ii
Descriptive statistics ii
 
Data Analysis and Statistics
Data Analysis and StatisticsData Analysis and Statistics
Data Analysis and Statistics
 
Inferential statistics.ppt
Inferential statistics.pptInferential statistics.ppt
Inferential statistics.ppt
 
Introduction to Descriptive Statistics
Introduction to Descriptive StatisticsIntroduction to Descriptive Statistics
Introduction to Descriptive Statistics
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Descriptive and Inferential Statistics
Descriptive and Inferential StatisticsDescriptive and Inferential Statistics
Descriptive and Inferential Statistics
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Measures of central tendency ppt
Measures of central tendency pptMeasures of central tendency ppt
Measures of central tendency ppt
 
Multinomial logisticregression basicrelationships
Multinomial logisticregression basicrelationshipsMultinomial logisticregression basicrelationships
Multinomial logisticregression basicrelationships
 
Univariate & bivariate analysis
Univariate & bivariate analysisUnivariate & bivariate analysis
Univariate & bivariate analysis
 
Statistics "Descriptive & Inferential"
Statistics "Descriptive & Inferential"Statistics "Descriptive & Inferential"
Statistics "Descriptive & Inferential"
 
Introduction to Statistics
Introduction to StatisticsIntroduction to Statistics
Introduction to Statistics
 
Data Display and Summary
Data Display and SummaryData Display and Summary
Data Display and Summary
 
Data
DataData
Data
 
DATA Types
DATA TypesDATA Types
DATA Types
 

Similar a Descriptive statistics

LESSON 4_UNGROUPED.pptx.pdf
LESSON  4_UNGROUPED.pptx.pdfLESSON  4_UNGROUPED.pptx.pdf
LESSON 4_UNGROUPED.pptx.pdf
nnzuliyana2
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
Turi, Inc.
 

Similar a Descriptive statistics (20)

Data analysis Design Document
Data analysis Design DocumentData analysis Design Document
Data analysis Design Document
 
Timeseries forecasting
Timeseries forecastingTimeseries forecasting
Timeseries forecasting
 
Multiple regression
Multiple regressionMultiple regression
Multiple regression
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Testing of hypothesis
Testing of hypothesisTesting of hypothesis
Testing of hypothesis
 
Analysing & interpreting data.ppt
Analysing & interpreting data.pptAnalysing & interpreting data.ppt
Analysing & interpreting data.ppt
 
LESSON 4_UNGROUPED.pptx.pdf
LESSON  4_UNGROUPED.pptx.pdfLESSON  4_UNGROUPED.pptx.pdf
LESSON 4_UNGROUPED.pptx.pdf
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modelling
 
Be a pro in statistics
Be a pro in statisticsBe a pro in statistics
Be a pro in statistics
 
Data Analysis, Intepretation
Data Analysis, IntepretationData Analysis, Intepretation
Data Analysis, Intepretation
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
 
Data analysis
Data analysisData analysis
Data analysis
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
 
T7 data analysis
T7 data analysisT7 data analysis
T7 data analysis
 
Kevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data MiningKevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data Mining
 
Introduction to Data Analytics with R
Introduction to Data Analytics with RIntroduction to Data Analytics with R
Introduction to Data Analytics with R
 
Calibration of weights in surveys with nonresponse and frame imperfections
Calibration of weights in surveys with nonresponse and frame imperfectionsCalibration of weights in surveys with nonresponse and frame imperfections
Calibration of weights in surveys with nonresponse and frame imperfections
 

Más de Venkata Reddy Konasani

Más de Venkata Reddy Konasani (20)

Transformers 101
Transformers 101 Transformers 101
Transformers 101
 
Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
 
GBM theory code and parameters
GBM theory code and parametersGBM theory code and parameters
GBM theory code and parameters
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
 
Decision tree
Decision treeDecision tree
Decision tree
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
Credit Risk Model Building Steps
Credit Risk Model Building StepsCredit Risk Model Building Steps
Credit Risk Model Building Steps
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS
 
SAS basics Step by step learning
SAS basics Step by step learningSAS basics Step by step learning
SAS basics Step by step learning
 
Testing of hypothesis case study
Testing of hypothesis case study Testing of hypothesis case study
Testing of hypothesis case study
 
L101 predictive modeling case_study
L101 predictive modeling case_studyL101 predictive modeling case_study
L101 predictive modeling case_study
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
 
Online data sources for analaysis
Online data sources for analaysis Online data sources for analaysis
Online data sources for analaysis
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 

Último

The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
MateoGardella
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 

Último (20)

PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 

Descriptive statistics

  • 1. Data Analysis Course: Descriptive Statistics (Version-1), Venkat Reddy
  • 2. Data Analysis Course
      • Data analysis design document
      • Introduction to statistical data analysis
      • Descriptive statistics
      • Data exploration, validation & sanitization
      • Probability distributions examples and applications
      • Simple correlation and regression analysis
      • Multiple linear regression analysis
      • Logistic regression analysis
      • Testing of hypothesis
      • Clustering and decision trees
      • Time series analysis and forecasting
      • Credit Risk Model building-1
      • Credit Risk Model building-2
  • 3. Note
      • This presentation is just class notes. The course notes for Data Analysis Training were written by me, as an aid for myself.
      • The best way to treat this is as a high-level summary; the actual session went more in depth and contained other information.
      • Most of this material was written as informal notes, not intended for publication.
      • Please send questions/comments/corrections to venkat@trenwiseanalytics.com or 21.venkat@gmail.com
      • Please check my website for the latest version of this document.
      - Venkat Reddy
  • 4. Contents
      • What are Descriptive statistics
      • Frequency tables and graphs, Histograms
      • Central Tendency
      • Mean, Median, Mode
      • Dispersion
      • Range, variance, standard deviation
      • Quartiles, Percentiles
      • Box Plots
      • Bivariate Descriptive Statistics
      • Contingency Tables
      • Correlation
      • Regression
  • 5. Why Descriptive statistics?
      • Who is the better ODI batsman, Sachin or Muralidharan?
          • Batting average?
      • Who is more reliable, Dhoni or Afridi?
          • Score variance
      • A triangular series among Aus, Eng & New Zealand: who will win?
          • Most number of wins: Mode
      • I am going to buy shoes. Which brand has more variety, Power or Adidas?
          • Price range: Range
      • We used Average, Variance, Mode and Range to make some inferences. These are nothing but descriptive statistics.
      • Descriptive statistics tell us what happened in the past.
      • Descriptive statistics avoid inferences, but they help us get a feel for the data.
      • Sometimes they are good enough to make an inference.
  • 6. Descriptive Statistics
      • A statistic or a measure that describes the data
          • Average salary of employees
      • Describing data with tables and graphs (quantitative or categorical variables)
      • Numerical descriptions
          • Center: Give some example measures of center of the data
          • Variability: Give some example measures of variability of the data
      • Bivariate descriptions (in practice, most studies have several variables)
          • Dependency measures (Correlation)
  • 7. Simple Descriptive Statistics
      • N
      • Sum
      • Min
      • Max
      • Average
      • Frequency of each level
      • Variance
      • Standard deviation
      These simple descriptive statistics will be used in inferential statistics later.
  • 8. Frequency tables & Histograms
      • Frequency distribution: lists the possible values of a variable and the number of times each occurs
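
A frequency distribution like the one described on this slide can be tallied in a few lines. The following is a minimal sketch in Python, assuming a small made-up list of shoe sizes; the data and the names `sizes` and `freq` are illustrative and not from the course material.

```python
from collections import Counter

# Hypothetical observations (not from the course data).
sizes = [6, 7, 7, 8, 6, 7, 9, 8, 7, 6]

# Count how many times each distinct value occurs.
freq = Counter(sizes)

# Print the frequency table in ascending order of value,
# with a crude text histogram next to each count.
for value in sorted(freq):
    print(f"{value:>3} {freq[value]:>3} {'#' * freq[value]}")
```

The row of `#` characters is a rough text stand-in for a histogram bar; a plotting library would normally be used for a real chart.
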
  • 9. Shapes of histograms
      • Bell-shaped (IQ, SAT, political ideology in all U.S.)
      • Skewed right
          • Example: Annual income
          • No. of times arrested
      • Skewed left
          • Score on an easy exam
          • Daily level of excitement in office
      • Bimodal
          • Hardworking days in a year (peaks near mid-year & year-end appraisal)
  • 10. Lab: Histogram
      • Create a histogram on variable ‘actual’ in prdsale data
          • How many modes?
          • What is the skewness?
          • What is its kurtosis?
      • Create a histogram on variable ‘msrp’ in cars data
          • How many modes?
          • What is the skewness?
          • What is its kurtosis?
      • Create a histogram on variable ‘weight’ in cars data
          • How many modes?
          • What is the skewness?
          • What is its kurtosis?
      Compare the above three histograms.
  • 11. Central tendency
      • What is the flight fare from Bangalore to Delhi? 3500. Is that exact or an average?
      • What is central tendency? The average
      • Three types of averages
          • Mean
          • Median
          • Mode
  • 12. Mean
      • Center of gravity
      • Evenly partitions the sum of all measurements among all cases; the average of all measures
          $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
      • Crucial for inferential statistics
      • The mean is not very resistant to outliers (see the Median slide)
  • 13. Median
      • What is the mean of [0.1 0.8 0.4 0.3 0.1 0.4 9.0 0.1 0.9 0.3 1.0 0.3 0.1]?
          • Guess without calculation: around 0.5?
          • Now calculate the mean
      • The median is exactly in the middle. Isn't the mean exactly in the middle?
      • Order the observations in ascending or descending order and pick the middle observation
      • Less useful for inferential purposes
      • More resistant to the effects of outliers
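
The exercise on this slide can be checked directly. A minimal sketch using Python's standard `statistics` module, with the thirteen values taken from the slide; the variable names are illustrative.

```python
from statistics import mean, median

# The thirteen observations from the slide; 9.0 is the outlier.
data = [0.1, 0.8, 0.4, 0.3, 0.1, 0.4, 9.0, 0.1, 0.9, 0.3, 1.0, 0.3, 0.1]

avg = mean(data)    # pulled upward by the single large value
mid = median(data)  # middle of the sorted observations

# Drop the outlier to see how much it alone moves the mean.
without_outlier = [x for x in data if x != 9.0]
avg_trimmed = mean(without_outlier)

print(round(avg, 2), mid, round(avg_trimmed, 2))
```

The mean comes out a little above 1, more than twice the "around 0.5" guess, while the median stays at 0.3 with or without the outlier.
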
  • 14. Calculation of Median
      rim diameter (cm)
      unit 1   unit 2
        9.7      9.0
       11.5     11.2
       11.6     11.3
       12.1     11.7
       12.4     12.2
       12.6     12.5
       13.1     13.2  <--
       13.5     13.8
       13.6     14.0
       14.8     15.5
       16.3     15.6
       26.9     16.2
                16.4
      Median: unit 1 = 12.9 (between 12.6 and 13.1), unit 2 = 13.2 (the middle value)
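
The two columns on this slide illustrate the two cases of the median rule: an even count takes the midpoint of the two central values, an odd count takes the single central value. Below is a minimal sketch, assuming the flattened rim-diameter values split into 12 readings for unit 1 and 13 for unit 2 (that split is an assumption based on the Range slide, which lists the same numbers); the function name `median_sorted` is illustrative.

```python
def median_sorted(values):
    """Return the median: middle value, or midpoint of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle observation exists.
        return ordered[middle]
    # Even count: average the two observations around the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Rim diameters (cm), column split assumed as described above.
unit1 = [9.7, 11.5, 11.6, 12.1, 12.4, 12.6, 13.1, 13.5, 13.6, 14.8, 16.3, 26.9]
unit2 = [9.0, 11.2, 11.3, 11.7, 12.2, 12.5, 13.2, 13.8, 14.0, 15.5, 15.6, 16.2, 16.4]

print(round(median_sorted(unit1), 2))  # midpoint of 12.6 and 13.1
print(median_sorted(unit2))            # the 7th of 13 values
```

Note that the 26.9 reading in unit 1 has no effect on its median, which is the point of the previous slide.
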
  • 15. Mode
      • How do you express the average size of shoes?
          • 6.567 or 6?
      • Mode is the most numerous category
      • Can be more or less created by the grouping procedure
      • For theoretical distributions: simply the location of the peak on the frequency distribution
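
The shoe-size question can be made concrete: a mean such as 6.7 is not a size anyone can buy, while the mode is. A minimal sketch with Python's `statistics` module, using made-up shoe sizes (the data are hypothetical, not from the course).

```python
from statistics import mean, multimode

# Hypothetical shoe sizes sold in a day (illustrative data only).
sizes = [6, 7, 6, 8, 6, 7, 9, 6, 7, 5]

avg = mean(sizes)          # a fractional value that is not a real shoe size
modes = multimode(sizes)   # list of the most frequent value(s)

print(avg, modes)
```

`multimode` returns a list because a data set can have more than one most-frequent value, which is the bimodal case mentioned on the histogram slide.
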
  • 16. Lab
      • Run Proc means on the product data
      • What is the mean of ‘msrp’ in cars data?
          • Is it reflecting the average value of price?
      • What is the median of ‘msrp’ in cars data?
          • Is it reflecting the average value of price?
      • Run Proc Univariate on the weight variable in cars data. Find the mean, median & mode.
  • 17. Dispersion
      Person 1: What is the average depth of this river? 5 feet
      Person 2: I am 5.5, I can easily cross it (and starts crossing it)
      Person 2: Help... help.
      Person 1: Sometimes just knowing the central tendency is not sufficient
      • Measures of dispersion summarize the degree of clustering/spread of cases, esp. with respect to central tendency
          • range
          • variance
          • standard deviation
  • 18. Range
      • Max – Min
      • R: range(x)
      unit 1   unit 2
        9.7      9.0
       11.5     11.2
       11.6     11.3
       12.1     11.7
       12.4     12.2
       12.6     12.5
       13.1     13.2
       13.5     13.8
       13.6     14.0
       14.8     15.5
       16.3     15.6
       26.9     16.2
                16.4
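
The range is simply max minus min, so one extreme value determines it. A minimal sketch on the values from this slide, assuming they split into 12 readings for unit 1 and 13 for unit 2 (the column split is an assumption about the flattened table); `value_range` is an illustrative helper name.

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


# Values from the slide, column split assumed as described above.
unit1 = [9.7, 11.5, 11.6, 12.1, 12.4, 12.6, 13.1, 13.5, 13.6, 14.8, 16.3, 26.9]
unit2 = [9.0, 11.2, 11.3, 11.7, 12.2, 12.5, 13.2, 13.8, 14.0, 15.5, 15.6, 16.2, 16.4]

print(round(value_range(unit1), 1))  # dominated by the single 26.9 reading
print(round(value_range(unit2), 1))
```

Unit 1's range is more than double unit 2's even though the rest of the two columns look similar, which shows how sensitive the range is to a single outlier.
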
  • 19. Variance
      • Take the deviations from the mean: positive and negative deviations cancel, so their sum is zero
      • Hence take the square of each deviation from the mean, then take the average of those squares
      • The average squared distance from the mean is the variance
          $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
      • Units of variance are squared; this makes variance hard to interpret
      • E.g.: mean length = 22.6 mm, variance = 38 mm²
      • What does this mean? I don't know
  • 20. Standard Deviation
      • Square root of variance
          $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
      • Units are the same as the base measurements
      • Mean = 22.6 mm, standard deviation = 6.2 mm
      • Mean +/- sd (16.4 to 28.8 mm)
          • should give at least some intuitive sense of where most of the cases lie, barring major effects of outliers
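
The variance and standard deviation slides both divide by n, which is the population form. A minimal sketch of that calculation in Python follows; the eight data values are made up for illustration, and the function names are not from the course.

```python
from math import sqrt
from statistics import pstdev, pvariance


def population_variance(values):
    """Average squared deviation from the mean (divides by n, as on the slides)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n


def population_sd(values):
    """Square root of the variance, in the same units as the data."""
    return sqrt(population_variance(values))


# Hypothetical measurements (illustrative only); their mean is 5.
lengths = [2, 4, 4, 4, 5, 5, 7, 9]

mean_length = sum(lengths) / len(lengths)
deviations = [x - mean_length for x in lengths]

print(sum(deviations))               # deviations cancel out
print(population_variance(lengths))  # squared units
print(population_sd(lengths))        # original units
```

Many packages default to the sample form, which divides by n - 1; in Python's `statistics` module that is `variance` and `stdev`, while `pvariance` and `pstdev` divide by n as the slides do.
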
  • 21. Quartiles & Percentiles
      • pth percentile: p percent of observations fall below it, (100 - p)% above it.
          • E.g., a 95th percentile in CAT means 5% are above & 95% are below
      • 1,2,3,4,5,6,7,8,9,10 - What is the 25th percentile?
      • 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 - What is the 25th percentile? What is the 80th percentile?
      • p = 50: median
      • p = 25: lower quartile (LQ)
      • p = 75: upper quartile (UQ)
      • Interquartile range IQR = UQ - LQ
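
The "what is the 25th percentile of 1 to 10?" question has more than one accepted answer, because software packages interpolate between observations in different ways. The sketch below uses `statistics.quantiles` from the Python standard library (Python 3.8+) to show two common conventions; which convention the course intended is not stated on the slide.

```python
from statistics import quantiles

data = list(range(1, 11))  # 1, 2, ..., 10 as on the slide

# n=4 asks for the three cut points that split the data into quarters:
# lower quartile (25th percentile), median (50th), upper quartile (75th).
lq_inc, med_inc, uq_inc = quantiles(data, n=4, method="inclusive")
lq_exc, med_exc, uq_exc = quantiles(data, n=4, method="exclusive")

iqr_inc = uq_inc - lq_inc  # interquartile range, IQR = UQ - LQ
iqr_exc = uq_exc - lq_exc

print("inclusive:", lq_inc, med_inc, uq_inc, "IQR", iqr_inc)
print("exclusive:", lq_exc, med_exc, uq_exc, "IQR", iqr_exc)
```

Both methods agree on the median but differ on the quartiles, so it is worth checking which definition a tool uses before comparing results across tools.
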
  • 22. Box Plots • Quartiles portrayed graphically by box plots
  • 23. Box Plots • Example: weekly TV watching for n=60, 3 outliers
  • 24. Box Plots Interpretation
      • Box plots have a box from LQ to UQ, with the median marked. They portray a five-number summary of the data: Minimum, LQ, Median, UQ, Maximum
          • Except for outliers, which are identified separately
      • Outlier = observation falling below LQ – 1.5(IQR) or above UQ + 1.5(IQR)
      • Ex. If LQ = 2, UQ = 10, then IQR = 8 and outliers are above 10 + 1.5(8) = 22
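
The 1.5 × IQR rule from this slide is easy to express as a small function. A minimal sketch in Python using the slide's LQ = 2 and UQ = 10; the function names and the list of weekly hours are hypothetical.

```python
def outlier_fences(lq, uq, k=1.5):
    """Return (lower, upper) fences: LQ - k*IQR and UQ + k*IQR."""
    iqr = uq - lq
    return lq - k * iqr, uq + k * iqr


def find_outliers(values, lq, uq):
    """Observations falling strictly outside the fences."""
    low, high = outlier_fences(lq, uq)
    return [v for v in values if v < low or v > high]


# The slide's example: LQ = 2, UQ = 10, so IQR = 8.
low, high = outlier_fences(2, 10)
print(low, high)

# Hypothetical weekly TV hours (illustrative only).
hours = [0, 1, 2, 5, 7, 10, 14, 21, 25, 40]
print(find_outliers(hours, 2, 10))
```

The lower fence here is negative, so for a non-negative quantity such as hours watched only the upper fence can flag anything.
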
  • 25. Lab
      • Run proc univariate on a variable from sample data in the SAS default library (prdsale / cars)
      • Run proc means on the actual & predicted variables from the product sales data
          • What are the values of Range, Variance, SD?
          • What are the 1st, 2nd, 3rd & 4th quartile values?
          • What is the 95th percentile?
      • Use the “all” option to display the box plots
  • 26. Contingency Tables
      • Cross classifications of categorical variables in which rows (typically) represent categories of the explanatory variable and columns represent categories of the response variable.
      • Counts in “cells” of the table give the numbers of individuals at the corresponding combination of levels of the two variables
      Example: Happiness and Family Income of 1993 families (GSS 2008 data: “happy,” “finrela”)

                          Happiness
      Income         Very   Pretty   Not too   Total
      ----------------------------------------------
      Above Aver.     164      233        26     423
      Average         293      473       117     883
      Below Aver.     132      383       172     687
      ----------------------------------------------
      Total           589     1089       315    1993
  • 27. Contingency tables
      • Example: Percentage “very happy” is
          • 39% for above average income (164/423 = 0.39)
          • 33% for average income (293/883 = 0.33)
          • What percent for below average income?

                          Happiness
      Income      Very         Pretty       Not too      Total
      --------------------------------------------------------
      Above       164 (39%)    233 (55%)     26 (6%)       423
      Average     293 (33%)    473 (54%)    117 (13%)      883
      Below       132 (19%)    383 (56%)    172 (25%)      687
      --------------------------------------------------------
      • What can we conclude? Does happiness depend on income, or is happiness independent of income?
      • Inference questions for later chapters?
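
The percentages on this slide are row percentages: each cell count divided by its row total. A minimal sketch in Python that reproduces them from the counts given on the slide; the dictionary layout and the name `row_percentages` are illustrative choices.

```python
# Cell counts from the slide: columns are Very, Pretty, Not too happy.
counts = {
    "Above average": [164, 233, 26],
    "Average": [293, 473, 117],
    "Below average": [132, 383, 172],
}


def row_percentages(row):
    """Each cell as a whole-number percentage of its row total."""
    total = sum(row)
    return [round(100 * cell / total) for cell in row]


percentages = {income: row_percentages(row) for income, row in counts.items()}

for income, row in counts.items():
    print(f"{income:<14} {percentages[income]} of {sum(row)}")
```

Conditioning on the explanatory variable (income) in this way makes the rows comparable even though the three income groups have different sizes.
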
  • 28. Correlation
      • Correlation describes the strength of association between two variables
      • Falls between -1 and +1, with the sign indicating the direction of association (formula & other details later)
      • The larger the correlation in absolute value, the stronger the association (in terms of a straight-line trend)
      • Examples: (positive or negative, how strong?)
          • Mental impairment and life events, correlation =
          • GDP and fertility, correlation =
          • GDP and percent using Internet, correlation =
  • 29. Strength of Association
      • Correlation 0: No linear association
      • Correlation 0 to 0.25: Negligible positive association
      • Correlation 0.25-0.5: Weak positive association
      • Correlation 0.5-0.75: Moderate positive association
      • Correlation >0.75: Very strong positive association
      • What are the limits for negative correlation?
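
The slides defer the correlation formula to a later session, so the following is only a sketch under stated assumptions: it uses the standard Pearson correlation coefficient, and it labels negative values by mirroring the positive bands from this slide, which is an assumed answer to the slide's closing question. The data, function names and boundary handling (each band includes its upper limit) are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled to lie in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def strength(r):
    """Label r using the slide's bands; negative bands mirrored (an assumption)."""
    size = abs(r)
    if size == 0:
        return "no linear association"
    direction = "positive" if r > 0 else "negative"
    if size <= 0.25:
        label = "negligible"
    elif size <= 0.5:
        label = "weak"
    elif size <= 0.75:
        label = "moderate"
    else:
        label = "very strong"
    return f"{label} {direction} association"


# Hypothetical data lying exactly on a straight line (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

r = pearson_r(x, y)
print(r, strength(r))
print(strength(-0.3))
```
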
  • 30. Regression
      • Regression analysis gives a line predicting y using x (algorithm & other details later)
      • y = college GPA, x = high school GPA
      • Predicted y = 0.234 + 1.002(x)
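
The fitted line on this slide can be used directly for prediction. The sketch below does that, and also shows one common way such a line is obtained, ordinary least squares; the slides leave the algorithm for later, so the fitting function is an assumption about the method, and the three data points are made up.

```python
def predict_college_gpa(high_school_gpa):
    """Prediction from the slide's line: y = 0.234 + 1.002 * x."""
    return 0.234 + 1.002 * high_school_gpa


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Using the slide's coefficients for a high school GPA of 3.0.
print(round(predict_college_gpa(3.0), 2))

# Hypothetical points lying exactly on y = 1 + 2x (illustrative only).
intercept, slope = fit_line([1, 2, 3], [3, 5, 7])
print(intercept, slope)
```

With a slope close to 1 and a small positive intercept, the slide's line predicts a college GPA slightly above the high school GPA.
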
  • 31. Lab
      • Create a contingency table for the product sales data
      • Find contingency tables for
          • Region by product type
          • Division by product type
      • Find the correlation between actual sales and predicted sales.
      • Find the correlation between weight & msrp in cars data
  • 32. Venkat Reddy Konasani, Manager at Trendwise Analytics • venkat@TrendwiseAnalytics.com • 21.venkat@gmail.com • +91 9886 768879 • www.TrendwiseAnalytics.com/venkat