The document discusses sampling in exploratory data analysis. It notes that sampling is an often overlooked art and provides some key considerations for sampling, including what data is being sampled, any prior assumptions that can be made, and what operations or analyses will be performed on the sampled data. The document advocates for designing a sampling plan and executing it rather than simply hitting a "big red button" to analyze all available data, in order to gain insights faster through an iterative process of exploring, hypothesizing, and modeling on sampled data.
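13. import seaborn as sns
from scipy.optimize import curve_fit

def func(x, a, b):
    # Straight line with intercept a and slope b
    return a + b * x

df = sns.load_dataset("anscombe")
# tmp is not defined on the slide; assumed here to be a single dataset of the quartet
tmp = df[df.dataset == "I"]
tmp.x.mean()
tmp.y.mean()
tmp.x.var()
tmp.y.var()
tmp.x.corr(tmp.y)
# popt holds the fitted (a, b); pcov is their covariance matrix
popt, pcov = curve_fit(func, tmp.x, tmp.y)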
14. Mean x: 9.0
Mean y: 7.5
Variance x: 11.00
Variance y: 4.13
Correlation between x and y: 0.816
Linear regression coefficients: y = 3.00 + 0.50x
http://goo.gl/Zuw4Qe
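The summary above can be checked against each of the four Anscombe datasets in turn. The following is a minimal sketch, not the code from the slides: it types in the published quartet values so it runs without downloading anything, uses only the Python standard library, and replaces curve_fit with the closed-form least-squares line. The names QUARTET, summarize, and summaries are hypothetical helpers introduced for illustration.

```python
from statistics import mean, variance

# Anscombe's quartet, typed in so the sketch runs offline.
# Datasets I to III share the same x values.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                  7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                   6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                    6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean, variance, correlation and least-squares line for x, y."""
    mx, my = mean(x), mean(y)
    # Sample variance (divides by n - 1), the same default as pandas .var()
    vx, vy = variance(x), variance(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / vx  # closed-form ordinary least squares for y = a + b * x
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": vx,
        "var_y": vy,
        "corr": cov / (vx * vy) ** 0.5,
        "intercept": my - slope * mx,
        "slope": slope,
    }


summaries = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

All four datasets should print nearly the same numbers, which is why summary statistics alone cannot tell them apart.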
15. [Figure: four scatter plots of y against x, one panel each for dataset I, dataset II, dataset III, and dataset IV; x axis from 2 to 20, y axis from 2 to 14; legend: dataset I, II, III, IV]