Sampling: An often overlooked art in exploratory data analysis


The document discusses sampling in exploratory data analysis. It notes that sampling is an often overlooked art and provides some key considerations for sampling, including what data is being sampled, any prior assumptions that can be made, and what operations or analyses will be performed on the sampled data. The document advocates for designing a sampling plan and executing it rather than simply hitting a "big red button" to analyze all available data, in order to gain insights faster through an iterative process of exploring, hypothesizing, and modeling on sampled data.


Sampling
An often overlooked art in exploratory
data analysis
Eli Bressert
@astrobiased
Stitch Fix

exploratory
data analysis
what to
optimize
1
2

1. obtain data
2. explore
3. do research/create data product
4. fine tune project and release
5. rinse and repeat

basic statistics
simple graphics
formulate hypotheses
assess best models & approaches

[Figure: grid of small plots, one per metric, labelled Metric 00 to Metric 38]

[Figure: chart of metric00 to metric05 against metric 01 to metric 06, with a scale running from −0.4 to 0.4]

      I            II           III           IV
  x     y      x     y      x     y      x     y
 10   8.04    10   9.14    10   7.46     8   6.58
  8   6.95     8   8.14     8   6.77     8   5.76
 13   7.58    13   8.74    13  12.74     8   7.71
  9   8.81     9   8.77     9   7.11     8   8.84
 11   8.33    11   9.26    11   7.81     8   8.47
 14   9.96    14   8.1     14   8.84     8   7.04
  6   7.24     6   6.13     6   6.08     8   5.25
  4   4.26     4   3.1      4   5.39    19  12.5
 12  10.84    12   9.13    12   8.15     8   5.56
  7   4.82     7   7.26     7   6.42     8   7.91
  5   5.68     5   4.74     5   5.73     8   6.89

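import seaborn as sns
from scipy.optimize import curve_fit

def func(x, a, b):
    # straight line with intercept a and slope b
    return a + b * x

df = sns.load_dataset("anscombe")  # columns: dataset, x, y
# compute the same summary for each of the four datasets (I to IV)
for name, tmp in df.groupby("dataset"):
    print(name)
    print(tmp.x.mean())
    print(tmp.y.mean())
    print(tmp.x.var())
    print(tmp.y.var())
    print(tmp.x.corr(tmp.y))
    popt, pcov = curve_fit(func, tmp.x, tmp.y)
    print(popt)  # fitted [a, b]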

Mean x: 9.0
Mean y: 7.5
Variance x: 11.00
Variance y: 4.13
Correlation between x and y: 0.816
Linear regression coefficients: y = 3.00 + 0.50x
http://goo.gl/Zuw4Qe
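
The summary above can be checked without downloading anything. The following is a minimal sketch, assuming only the Python standard library and the x, y values listed on the data slide; the names `quartet`, `summarize` and `summaries` are illustrative and not from the talk. It uses the closed-form least-squares line instead of `curve_fit`, which gives the same straight-line fit.

```python
from statistics import mean, variance

# Anscombe's quartet as listed on the data slide: (x values, y values) per dataset.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return means, sample variances, Pearson r and the least-squares line."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)
    # sample covariance, using the same n - 1 denominator as variance()
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / vx
    return {
        "mean_x": mx, "mean_y": my, "var_x": vx, "var_y": vy,
        "corr": cov / (vx * vy) ** 0.5,
        "intercept": my - slope * mx, "slope": slope,
    }

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

All four datasets print nearly the same numbers, which is why the plots matter: the summary alone cannot tell them apart.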

[Figure: scatter plots of y against x for datasets I, II, III and IV]

design your data sample plan and execute
hit the big red button and wait for the process to finish

hit red button
design and sample
explore, hypothesize, model
explore, hypothesize, model
time
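
The trade-off in this timeline can be sketched in a few lines. This is a rough illustration and not code from the talk: the `population` list is a hypothetical stand-in for one column of a large table, and `sample_estimate` is an illustrative helper. It compares the mean computed over every row with estimates from simple random samples of increasing size, each reported with its standard error.

```python
import random
from statistics import fmean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for a large, skewed metric column.
population = [random.lognormvariate(0.0, 1.0) for _ in range(200_000)]

def sample_estimate(values, n):
    """Mean of a simple random sample of size n, with its standard error."""
    sample = random.sample(values, n)  # sampling without replacement
    return fmean(sample), stdev(sample) / n ** 0.5

full_mean = fmean(population)  # the "big red button": touch every row
for n in (100, 1_000, 10_000):
    est, se = sample_estimate(population, n)
    print(f"n={n:>6}: mean={est:.3f} +/- {se:.3f} (full data: {full_mean:.3f})")
```

The standard error shrinks roughly with the square root of the sample size, so a modest sample is often enough to decide what to explore next before committing to a full run.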

what you’re sampling
priors that you can assume
what operations you will run
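
One way these considerations can play out is shown below. It is a minimal sketch under stated assumptions, not the speaker's method: the prior is that rows fall into segments of very different sizes, and the planned operation is a per-segment comparison, so the sample is stratified to keep the rare segment represented. The `rows` data and the `stratified_sample` helper are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, frac, seed=0):
    """Draw the same fraction from every stratum so small groups are not lost."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(frac * len(group)))  # keep at least one row per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: one segment is rare (0.5% of rows).
rows = [{"id": i, "segment": "rare" if i % 200 == 0 else "common"}
        for i in range(20_000)]

sample = stratified_sample(rows, key=lambda r: r["segment"], frac=0.05)

counts = defaultdict(int)
for r in sample:
    counts[r["segment"]] += 1
print(dict(counts))
```

A plain random sample of the same size would also contain about five rare rows on average, but the count would vary from draw to draw and could fall to zero; stratifying fixes it in advance.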


1. Sampling An often overlooked art in exploratory data analysis Eli Bressert @astrobiased Stitch Fix

2. exploratory data analysis what to optimize 1 2

3. What we [data scientists] do

4. 1. obtain data 2. explore 3. do research/create data product 4. fine tune project and release 5. rinse and repeat

5. 1. obtain data 2. explore 3. do research/create data product 4. fine tune project and release 5. rinse and repeat

6. basic statistics simple graphics formulate hypotheses assess best models & approaches

7. graphic simplicity

8. [Figure: grid of small plots, one per metric, labelled Metric 00 to Metric 38]

9. [Figure: chart of metric00 to metric05 against metric 01 to metric 06, with a scale running from −0.4 to 0.4]

10. [Figure: plot with axis ticks running from about −4 to 4]

11. Anscombe’s Quartet

12. [Table: Anscombe's quartet, eleven (x, y) pairs for each of datasets I, II, III and IV]

13. [Code: load the anscombe dataset with seaborn, then compute the mean and variance of x and y, their correlation, and a linear fit with scipy's curve_fit]

14. Mean x: 9.0 Mean y: 7.5 Variance x: 11.00 Variance y: 4.13 Correlation between x and y: 0.816 Linear regression coefficients: y = 3.00 + 0.50x http://goo.gl/Zuw4Qe

15. [Figure: scatter plots of y against x for datasets I, II, III and IV]

16. EDA results will affect all that follows

17. processing speed

18. faster technology

19. bigger data

20. Pushing Boundaries

21. You have two options

22. design your data sample plan and execute; hit the big red button and wait for the process to finish

23.

24. attention span

25. ?

26. time cost

27. hit red button design and sample explore, hypothesize, model explore, hypothesize, model time

28. hit red button design and sample explore, hypothesize, model explore, hypothesize, model time

29. fail frequently learn fast

30. tried and true models and methods

31. sampling considerations

32. what you’re sampling; priors that you can assume; what operations you will run

33.

34. ?
