Welcome to CS310!

  1. Welcome to CMPSC-310! Introduction to Data Science
  2. What Is Data Science?
     Extraction of knowledge from data (also known as knowledge discovery and data mining, KDD).
     Data science := Computer science (for data structures, algorithms, visualization, big data support, general programming) + Statistics (for regressions and inference) + Domain knowledge (for asking questions and interpreting results).
  3. Data, Information, Knowledge, etc. (by David Somerville @smrvl)
  4. Data Science and Other Disciplines: BI
     Business Intelligence (BI) engineers traditionally build tools that others use to analyze data; they do not analyze the data themselves. Data scientists both build the tools and use them to analyze data. If you are a software engineer, you need to learn statistical modeling and how to communicate results. You will need to work with datasets and use them to make decisions.
  5. Data Science and Other Disciplines: STATS
     Statisticians are traditionally content with the assumption that all their data fit in main memory at the same time. They traditionally used existing math, or created new math, to squeeze as much information as possible from a small number of observations or features. Data scientists recognize the need to use and create math for analyses in data-poor environments, but they also use and create software engineering tools to handle very large datasets, and they recognize that some of the models are the same in both cases. To be a data scientist, you need to learn to deal with data that does not fit in memory, because it is no longer safe to assume that it does.
  6. Data Science and Other Disciplines: DB
     Database programmers and administrators bring useful skills to data science, but they are traditionally focused on one data model: relational. Handling graph nodes and edges (e.g., PageRank), images, video, and text, as well as SQL when appropriate, is closer to data science. To be a data scientist, you need to deal with unstructured data.
  7. Data Science and Other Disciplines: Visualization
     Visualization experts and business analysts bring skills but are traditionally not concerned with massive scale, such as hundreds or thousands of machines. If you are a business analyst, you need to learn about algorithms and tradeoffs at large scale. With cloud computing and with algorithms, you may get an answer, but it may cost more or less than it did 5 years ago. It is no longer safe to throw your trust over the wall to some algorithm, or to your staff running some algorithm. You will need to internalize the tradeoffs of choosing one model or another yourself.
  8. Data Science and Other Disciplines: ML
     Machine learning is similar to data science, but it is only a small fraction of it. Getting data, cleaning it, exploring it, and making interactive visualizations and data products for yourself and for others to use (e.g., data-driven language translators and spellcheckers), in addition to doing ML: these are closer to data science.
  9. Topics
     ● Numeric data analysis
     ● Signal processing
     ● Text data analysis (information/document/text retrieval, natural language processing)
     ● Statistical inference
     ● Databases (information integration)
     ● Complex network analysis
     ● Data visualization
  10. Define the Question of Study
      ● Descriptive: Describe a set of data.
      ● Exploratory: Find new relationships.
      ● Inferential: Use a small data sample to describe a bigger population. Based on statistics.
      ● Predictive: Use data on some objects to predict values for another object.
      ● Causal: Does one variable affect another variable? Based on statistics. Correlation != Causation.
      ● Mechanistic: Exactly how does one variable affect another variable? Based on deep domain knowledge.
  11. Get and Clean Data
      1. Define the ideal data set. Determine what data you can access.
      2. Obtain the data. Raw data vs. processed data. Always use raw data, but process it once; record all processing steps.
      3. Clean the data.
  12. Explore Data
      ● Exploratory data analysis
      ● Model data and predict
      ● Interpret results
      ● Challenge results
      ● Present results to the data sponsor
  13. Create Reproducible Code
      ● Don't do things by hand: teach the computer! Everything done by hand must be precisely documented.
      ● Don't use interactive GUI tools (no history!)
      ● Use version control software (Git/GitHub)
      ● Avoid intermediate files, unless they are hard to build (in which case cache them)
  14. Report Structure
      ● Project report
         ○ Abstract: A brief description of the project.
         ○ Introduction.
         ○ Methods.
         ○ Results.
         ○ Conclusion.
      ● Code
         ○ Well-commented scripts that can be executed without any command line parameters or interaction.
  15. Suggested Directory Structure
      ● data – for the input data, if needed
      ● cache – for the previously downloaded data
      ● results – for numerical results
      ● code – for the Python script(s)
      ● doc – for the report and figures
  16. Data Acquisition Pipeline
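
The directory layout and caching advice from the slides could be sketched in Python roughly as follows. This is only an illustrative sketch, not code from the course: the helper names `ensure_layout` and `cached_json` are hypothetical, and storing the cache as JSON is an assumption made for simplicity.

```python
import json
from pathlib import Path

# Directory names taken from the "Suggested Directory Structure" slide.
SUBDIRS = ("data", "cache", "results", "code", "doc")


def ensure_layout(root):
    """Create the suggested subdirectories under root (hypothetical helper).

    Returns a dict mapping each directory name to its Path.
    """
    root = Path(root)
    paths = {name: root / name for name in SUBDIRS}
    for path in paths.values():
        # exist_ok=True makes the script safe to re-run without interaction.
        path.mkdir(parents=True, exist_ok=True)
    return paths


def cached_json(cache_file, build):
    """Return the cached value if cache_file exists (hypothetical helper).

    Otherwise call build(), store its result as JSON (an assumed format),
    and return it. This follows the advice to cache only what is hard to build.
    """
    cache_file = Path(cache_file)
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    value = build()
    cache_file.write_text(json.dumps(value), encoding="utf-8")
    return value
```

A script in `code` might call `ensure_layout` once at startup and wrap each slow download in `cached_json(paths["cache"] / "some_name.json", download_function)`, so that repeated runs need no command line parameters and do not fetch the same data twice.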
