Basic Analysis using Python


  1. Basic Analysis Using Python
  2. SECTION 1: Descriptive Statistics. Summarising your data.
  3. Data Snapshot
     Data description for basic_salary_P3. The data has 41 rows and 7 columns.
       Column       Description
       First_Name   First Name
       Last_Name    Last Name
       Grade        Grade
       Location     Location
       Function     Department
       ba           Basic Allowance
       ms           Management Supplements
  4. Describing Variables
     #Importing Data
         import pandas as pd
         salary = pd.read_csv('basic_salary_P3.csv')
     #Checking the variable features using the describe() function
         salary.describe()
     Output:
                          ba            ms
         count     39.000000     37.000000
         mean   17209.743590  11939.054054
         std     4159.515241   3223.018305
         min    10940.000000   2700.000000
         25%    13785.000000  10450.000000
         50%    16230.000000  12420.000000
         75%    19305.000000  14200.000000
         max    29080.000000  16970.000000
     describe() gives descriptive measures for the numeric variables.
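The counts of 39 and 37 against 41 rows suggest that describe() skips missing values column by column. A minimal sketch on a made-up frame (the names `salary_demo`, `summary` and `missing`, and all values, are hypothetical and not taken from basic_salary_P3.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 4 rows, with one missing value in each numeric column.
salary_demo = pd.DataFrame({
    "First_Name": ["Ann", "Bob", "Cy", "Di"],
    "ba": [12000.0, 15000.0, np.nan, 18000.0],
    "ms": [9000.0, np.nan, 11000.0, 13000.0],
})

# describe() summarises numeric columns only by default,
# and "count" is the number of non-missing values in each column.
summary = salary_demo.describe()

# Number of missing values per numeric column, for comparison with "count".
missing = salary_demo[["ba", "ms"]].isna().sum()

print(summary.loc[["count", "mean"]])
print(missing)
```

Comparing `count` with the number of rows is a quick way to see how many values are missing before computing further statistics.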
  5. Measures of Central Tendency
     # Mean
         print(salary.ba.mean())      # 17209.74
     mean() gives the mean of the variable.
     # Median
         print(salary.ba.median())    # 16230
     median() gives the median of the variable.
     # Trimmed mean
         from scipy import stats
         BasicAll = salary.ba.dropna(axis=0)
         trimmed_mean = stats.trim_mean(BasicAll, 0.1)
         trimmed_mean                 # 16879
     Import stats from scipy. Missing values are removed from ba using dropna(). Here, trim_mean() excludes 10% of the observations from each end of the sorted data before computing the mean.
     # Mode
         print(salary.ba.mode())      # NA
     mode() gives the mode of the variable.
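To make the trimming step concrete, here is a minimal pure-Python sketch of a trimmed mean. The helper `trimmed_mean` and the data are hypothetical; it is meant to mirror what `stats.trim_mean` does (sort, drop the same number of observations from each end, average the rest), not to replace it.

```python
import math
from statistics import mean, median, multimode


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # observations cut per tail, rounded down
    return mean(ordered[k:len(ordered) - k])


# Hypothetical allowances with one missing value (NaN) and one extreme value.
raw = [12000.0, 13500.0, 14000.0, 15000.0, float("nan"), 15500.0,
       16000.0, 16500.0, 17000.0, 18000.0, 60000.0]

# Drop missing values first; this plays the role of dropna().
clean = [x for x in raw if not math.isnan(x)]

plain_mean = mean(clean)            # pulled upwards by the extreme value
trimmed = trimmed_mean(clean, 0.1)  # drops the smallest and the largest value
middle = median(clean)
modes = multimode(clean)            # every value, because none repeats

print(plain_mean, trimmed, middle, len(modes))
```

With ten observations and a proportion of 0.1, exactly one value is removed from each tail, so the single extreme value no longer affects the result.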
  6. Measures of Variation
     Import the statistics library to use its functions for standard deviation and variance. Use the BasicAll object created previously to calculate the standard deviation, variance and coefficient of variation.
     # Standard Deviation
         import statistics
         statistics.stdev(BasicAll)       # 4159.515
     stdev() gives the standard deviation of the variable.
     # Variance
         statistics.variance(BasicAll)    # 17301567
     variance() gives the variance of the variable.
     # Coefficient of Variation
         stats.variation(BasicAll)        # 0.23857
     variation() from scipy.stats gives the coefficient of variation.
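One detail worth knowing when comparing these numbers: `statistics.stdev()` and `statistics.variance()` use the sample formulas (divisor n - 1), while `scipy.stats.variation()` by default uses the population standard deviation (divisor n). The figures on the slide are consistent with this: 4159.515 / 17209.74 is about 0.2417, whereas 0.23857 matches the population standard deviation divided by the mean. A small sketch with hypothetical values, using only the standard library:

```python
from statistics import mean, pstdev, stdev, variance

# Hypothetical values, chosen so the population standard deviation is exactly 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_var = variance(values)     # divisor n - 1
sample_sd = stdev(values)         # square root of the sample variance
population_sd = pstdev(values)    # divisor n

# Coefficient of variation = standard deviation / mean.
cv_population = population_sd / mean(values)  # convention of scipy's default
cv_sample = sample_sd / mean(values)          # slightly larger for the same data

print(sample_var, sample_sd, population_sd, cv_population, cv_sample)
```

If the sample-based coefficient is wanted from scipy, passing `ddof=1` to `stats.variation` should give it.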
  7. Skewness and Kurtosis
     Use the scipy package to calculate skewness and kurtosis.
         from scipy import stats
     # Skewness
         stats.skew(BasicAll, bias=False)        # 0.9033507
     skew() gives the skewness of the variable. bias=False corrects the calculation for statistical bias.
     # Kurtosis
         stats.kurtosis(BasicAll, bias=False)    # 0.4996513
     kurtosis() gives the kurtosis of the variable.
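As a rough illustration of what the bias correction does, the sketch below implements the usual adjusted sample skewness and adjusted excess kurtosis in plain Python. The helper names and the data are hypothetical, and the formulas are the standard textbook adjustments, which is my understanding of what `bias=False` applies. Note that `stats.kurtosis` reports excess kurtosis by default, so a normal distribution scores 0.

```python
from math import sqrt


def central_moment(values, order):
    """Population central moment of the given order."""
    m = sum(values) / len(values)
    return sum((v - m) ** order for v in values) / len(values)


def sample_skewness(values):
    """Adjusted skewness: biased estimate g1 scaled by sqrt(n(n-1)) / (n-2)."""
    n = len(values)
    g1 = central_moment(values, 3) / central_moment(values, 2) ** 1.5
    return sqrt(n * (n - 1)) / (n - 2) * g1


def sample_excess_kurtosis(values):
    """Adjusted excess kurtosis built from the biased excess estimate g2."""
    n = len(values)
    g2 = central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical, no skew
right_tailed = [1.0, 2.0, 2.0, 3.0, 10.0]    # hypothetical, long right tail

print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(right_tailed))
```

A positive skewness, as in the slide's 0.90 for ba, indicates a longer right tail.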
  8. SECTION 2: Bivariate Analysis
  9. Data Snapshot
     Data description for job_proficiency_P3. The data has 25 rows and 6 columns.
       Column     Description
       empno      Employee Number
       aptitude   Aptitude Score of the Employee
       testofen   Test of English
       tech_      Technical Score
       g_k_       General Knowledge Score
       job_prof   Job Proficiency Score
  10. Scatter Plot
     # Plotting Scatter plot
         import pandas as pd
         import matplotlib as mlt
         import matplotlib.pyplot as plt
         job = pd.read_csv('job_proficiency_P3')
         plt.scatter(job.aptitude, job.job_prof)
     scatter() gives a scatter plot of the two variables mentioned.
     c= or color=   Argument to set the colour of the points.
  11. Pearson Correlation Coefficient
     # Pearson correlation
         import numpy as np
         np.corrcoef(job.aptitude, job.job_prof)
     Output:
         array([[ 1.        ,  0.51441069],
                [ 0.51441069,  1.        ]])
     corrcoef() returns the correlation matrix of the two variables; the off-diagonal entry is the Pearson correlation coefficient.
     Pearson Correlation Coefficient: 0.5144. There is a positive relation between aptitude and job proficiency, but it is only of moderate degree.
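To show where the off-diagonal entry comes from, the sketch below computes the Pearson coefficient by hand and compares it with `np.corrcoef`. The scores are hypothetical, not the job_proficiency_P3 data.

```python
import numpy as np

# Hypothetical scores for five employees.
aptitude = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
job_prof = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = sum of cross-deviations / sqrt(product of squared deviations).
dx = aptitude - aptitude.mean()
dy = job_prof - job_prof.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# corrcoef returns a symmetric 2 x 2 matrix with ones on the diagonal.
matrix = np.corrcoef(aptitude, job_prof)
r = matrix[0, 1]

print(round(r_manual, 4), round(r, 4))
```

Indexing with `[0, 1]` is a convenient way to pull out the single coefficient when only two variables are passed.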
  12. Scatter Plot with Regression Line
     #Importing Library Seaborn
         import seaborn as sns
     #Scatter plot of job proficiency against aptitude with regression line
         sns.lmplot(x='aptitude', y='job_prof', data=job); plt.xlabel('Aptitude'); plt.ylabel('Job Proficiency')
     sns.lmplot   Draws a scatter plot with a fitted regression line. Recent seaborn versions require x and y to be passed by keyword.
     plt.xlabel   Defines the label on the X axis.
     plt.ylabel   Defines the label on the Y axis.
  13. Scatter Plot with Regression Line: output [3] (figure not reproduced)
  14. Scatter Plot Matrix using the seaborn package
     #Scatter Plot Matrix
         sns.pairplot(job)
  15. SECTION 3: Data Visualisation. Graphs in Python.
  16. Data Snapshot
     Data description for telecom_P3. The data has 1000 rows and 10 columns.
       Column      Description
       CustID      Customer ID
       Age         Age
       Gender      Gender
       PinCode     PinCode
       Active      Whether the customer was active in the past 24 weeks or not
       Calls       Number of calls made
       Minutes     Number of minutes spoken
       Amt         Amount charged
       AvgTime     Mean time per call
       Age_Group   Age group of the customer
  17. Data Visualization
     Data visualization in Python is built largely on matplotlib, a multiplatform plotting library built on top of NumPy that works alongside the rest of the SciPy stack. It gives the user complete control over the graph and comes with two interfaces: an object-oriented style and a MATLAB-style interface (pyplot).
     matplotlib is fairly low level and can be cumbersome to use by itself, which is why higher-level libraries such as seaborn and the pandas plotting methods wrap its API. (Altair and Bokeh are alternative plotting libraries that do not build on matplotlib.)
     We will use the pandas wrapper as a quick tool for visualizing our data and learn about seaborn as we move on to higher-level visualizations. In both cases we will essentially be working with matplotlib.
  18. Diagrams
     #Importing the Libraries
         import pandas as pd
         import matplotlib as mlt
         import matplotlib.pyplot as plt
         import seaborn as sns
     #Importing Data
         telecom_data = pd.read_csv('telecom_P3.csv')
     #Aggregate & Merge Data
         working = telecom_data.groupby('Age_Group')['CustID'].count()
     Aggregates the CustID data by age group, giving the number of customers in each group.
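The aggregation step can be tried on a tiny made-up frame to see what `working` looks like before it is plotted. The frame `telecom_demo` and its values are hypothetical; only the column names follow the slide.

```python
import pandas as pd

# Hypothetical customers; only the two columns used by the aggregation.
telecom_demo = pd.DataFrame({
    "CustID": [101, 102, 103, 104, 105, 106],
    "Age_Group": ["18-30", "30-45", "18-30", ">45", "30-45", "18-30"],
})

# Same pattern as on the slide: one count of CustID per age group.
# The result is a Series indexed by Age_Group, sorted by group label.
working_demo = telecom_demo.groupby("Age_Group")["CustID"].count()

# An equivalent count when no CustID is missing.
alt = telecom_demo["Age_Group"].value_counts().sort_index()

print(working_demo)
```

Because the result is a Series indexed by age group, its index supplies the bar labels when it is passed to the pandas plotting methods.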
  19. Simple Bar Chart
     #Create a basic bar chart using the plot function
         working.plot.bar(title='Simple Bar Chart')
     plot()   A convenience method to plot all columns with labels.
     bar()    Plots a bar chart. Can also be called by passing the argument kind='bar' to plot().
     title    A string argument to give the plot a title.
  20. Simple Bar Chart: output [7] (figure not reproduced)
  21. Simple Bar Chart
     #Customizing your chart using additional arguments (both versions give the same result)
         plt.figure(); working.plot.bar(title='Simple Bar Chart', color='red'); plt.xlabel('Age Groups'); plt.ylabel('No. of Customers')
     OR
         plt.figure(); ax = working.plot.bar(title='Simple Bar Chart', color='red'); ax.set_xlabel('Age Groups'); ax.set_ylabel('No. of Customers')
     plt.figure()                Creates a new figure to draw on.
     ax                          Matplotlib axes object containing the actual plot (with data points).
     color                       An argument to specify the plot colour. Accepts colour names, hex strings and colour codes.
     plt.xlabel, ax.set_xlabel   Function/method to specify the x label.
     plt.ylabel, ax.set_ylabel   Function/method to specify the y label.
     The y label reads 'No. of Customers' because working holds a count of CustID per age group.
  22. Simple Bar Chart: output [8] (figure not reproduced)
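The two labelling styles on the previous slide correspond to matplotlib's two interfaces. The sketch below runs both on a small hypothetical Series and checks that they produce the same labels. The `Agg` backend is selected only so the code runs without a display; it is an assumption of this sketch, not something the slides use.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical customer counts per age group.
counts = pd.Series([3, 2, 1], index=["18-30", "30-45", ">45"])

# pyplot (MATLAB-style) interface: functions act on the "current" axes.
plt.figure()
counts.plot.bar(title="Simple Bar Chart", color="red")
plt.xlabel("Age Groups")
plt.ylabel("No. of Customers")
ax1 = plt.gca()  # grab the current axes to inspect what was set

# Object-oriented interface: keep the axes object and call its methods.
fig2, ax2 = plt.subplots()
counts.plot.bar(ax=ax2, title="Simple Bar Chart", color="red")
ax2.set_xlabel("Age Groups")
ax2.set_ylabel("No. of Customers")

print(ax1.get_xlabel(), "|", ax2.get_xlabel())
```

The object-oriented form tends to be easier to manage once a script draws more than one chart, because each call names the axes it changes.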
  23. Stacked Bar Chart
     #Stacked Bar Chart
         working2 = pd.pivot_table(telecom_data, index=['Age_Group'], columns=['Gender'], values=['CustID'], aggfunc='count')
         plt.figure(); working2.plot.bar(title='Stacked Bar Chart', stacked=True); plt.xlabel('Age Groups'); plt.ylabel('No. of Customers')
     pivot_table   Reshapes the data and aggregates according to the function specified. Here, we are counting customers (CustID) by gender and age group.
     index         The column or array to group by on the rows of the pivot table (the x axis of the chart).
     columns       The column or array to group by on the columns of the pivot table (one bar segment per value).
     values        Column to aggregate.
     aggfunc       Function to aggregate by.
     stacked       Returns a stacked chart. Default is False.
  24. Stacked Bar Chart: output [11] (figure not reproduced)
  25. Percentage Bar Chart
     #Percentage Bar Chart
         working3 = working2.div(working2.sum(1).astype(float), axis=0)
         plt.figure(); working3.plot.bar(title='Percentage Bar Chart', stacked=True); plt.xlabel('Age Groups'); plt.ylabel('Proportion of Customers')
     div() with sum(1) converts the counts to proportions by dividing each row (age group) by its row total, so every stacked bar has height 1.
     ax                          Matplotlib axes object containing the actual plot (with data points).
     color                       An argument to specify the plot colour. Accepts colour names, hex strings and colour codes.
     plt.xlabel, ax.set_xlabel   Function/method to specify the x label.
     plt.ylabel, ax.set_ylabel   Function/method to specify the y label.
  26. Percentage Bar Chart: output [13] (figure not reproduced)
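The pivot-then-normalise step behind the stacked and percentage charts can be checked on a few hypothetical rows. In this sketch `values` is passed as a plain string rather than a list, which is a simplification to keep the column labels flat, and `fill_value=0` is added so empty combinations count as zero; both choices are mine, not the slides'.

```python
import pandas as pd

# Hypothetical customers with the three columns used by the pivot table.
telecom_demo = pd.DataFrame({
    "CustID": range(1, 10),
    "Age_Group": ["18-30", "18-30", "18-30", "30-45", "30-45",
                  ">45", ">45", ">45", ">45"],
    "Gender": ["F", "M", "M", "F", "F", "F", "M", "M", "M"],
})

# Rows are age groups, columns are genders, cells are customer counts.
counts = pd.pivot_table(telecom_demo, index="Age_Group", columns="Gender",
                        values="CustID", aggfunc="count", fill_value=0)

# sum(axis=1) gives one total per row; div(..., axis=0) aligns those totals
# with the rows, so each age group is divided by its own total.
shares = counts.div(counts.sum(axis=1).astype(float), axis=0)

print(counts)
print(shares.round(2))
```

Each row of `shares` adds up to 1, which is what makes every bar in the percentage chart the same height.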
  27. Multiple Bar Chart
     #Multiple Bar Chart
         plt.figure(); working2.plot.bar(title='Multiple Bar Chart'); plt.xlabel('Age Groups'); plt.ylabel('No. of Customers')
     Leaving out stacked=True draws the bars for each gender side by side. working2 is the pivot table created for the stacked bar chart:
     pivot_table   Reshapes the data and aggregates according to the function specified.
     index         The column or array to group by on the rows of the pivot table (the x axis of the chart).
     columns       The column or array to group by on the columns of the pivot table (one bar per value).
     values        Column to aggregate.
     aggfunc       Function to aggregate by.
  28. Multiple Bar Chart: output [14] (figure not reproduced)
  29. Pie Chart
     #Pie Chart
         working.plot.pie(label=('Age Groups'), colormap='brg')
     pie()      Creates a pie chart.
     label      Specifies the label to be used.
     colormap   String argument that specifies which colormap to pick the colours from.
  30. Pie Chart: output [15] (figure not reproduced)
  31. Box Plot
     #Box Plot
         telecom_data.Calls.plot.box(label='No. Of Calls')
     box()   Draws a box plot in pandas.
     Calls   Specifies the vector (column) for which the box plot is drawn.
     label   Provides a user-defined label for the variable.
     color   Can be used to choose the colour of the boxes.
  32. Box Plot: output [17] (figure not reproduced)
  33. Box Plot
     #Box plot using multiple variables. Here, we are plotting the number of calls by age group.
         telecom_data.boxplot(column='Calls', by='Age_Group', grid=False)
     boxplot()   Draws box plots in pandas. It is another way of writing plot.box().
     column      Specifies the vector (variable) for which the box plot is drawn.
     by          Specifies the vector (column) by which the distribution is split.
     label       Provides a user-defined label for the variable on the Y axis.
     color       Can be used to choose the colour of the boxes.
     grid        Can be used to remove the background grid seen in each plot.
  34. Box Plot: output [18] (figure not reproduced)
  35. Histogram
     #Histogram
         telecom_data.Calls.hist(bins=12, grid=False)
     hist()   The pandas hist() method draws a histogram.
     bins     Specifies the number of bins (bars); the bin width follows from the range of the data.
     label    Provides a user-defined label for the variable.
     color    Can be used to choose the colour of the bars.
  36. Histogram: output [18] (figure not reproduced)
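The role of `bins` is easy to verify numerically. The sketch below uses `numpy.histogram`, the binning routine that matplotlib's histogram is built on, with hypothetical call counts: an integer `bins` sets how many equal-width bins are created.

```python
import numpy as np

# Hypothetical numbers of calls for twelve customers.
calls = np.array([3, 7, 8, 12, 15, 15, 21, 26, 30, 34, 41, 48])

# An integer `bins` asks for that many equal-width bins over [min, max].
counts, edges = np.histogram(calls, bins=12)

# The width is derived from the data range: (48 - 3) / 12.
width = edges[1] - edges[0]

print(len(counts), len(edges), width)
```

To control the width directly, a list of bin edges can be passed as `bins` instead of an integer.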
  37. Stem and Leaf Plot
     #Stem plot using matplotlib
         plt.stem(telecom_data.Calls)
     stem()               Draws a stem plot in matplotlib: a vertical line from the baseline to each value, topped with a marker. This is not the same as a textual stem-and-leaf display.
     telecom_data.Calls   Specifies the vector (variable) for which the stem plot is drawn.
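Since `plt.stem` does not produce the classic stem-and-leaf display, here is a minimal text version for comparison. The helper `stem_and_leaf` and the data are hypothetical, and the sketch assumes non-negative integers with the last digit used as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display (leaf = last digit)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)  # e.g. 23 -> stem 2, leaf 3
        groups[stem].append(leaf)
    lines = []
    # Walk every stem in range so that empty stems still get a row.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines


# Hypothetical numbers of calls.
calls = [12, 15, 21, 23, 23, 28, 34, 47, 41]
lines = stem_and_leaf(calls)
print("\n".join(lines))
```

Each row lists one stem followed by the sorted leaves, so the shape of the distribution is visible while the individual values remain readable.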
  38. Heat Map
     #Importing data and aggregating calls by gender and age group
         agg = pd.pivot_table(telecom_data, index=['Age_Group'], columns=['Gender'], values=['Calls'], aggfunc='sum')
     # Heat Map
         ax = sns.heatmap(agg); ax.set(xlabel='Gender', ylabel='Age Group', title='Heatmap for Number of Calls by Age & Gender'); plt.show()
     ax           Axes object returned by seaborn.
     heatmap()    Seaborn function for creating a heatmap.
     ax.set       Sets the text elements (labels and title) of the graph.
     linewidths   Adds lines between the cells. Default is zero.
     plt.show()   Displays the figure. It must be called with parentheses and after the plot has been drawn.
  39. Heat Map: output [8] (figure not reproduced)
  40. THANK YOU!
