1. 1
Up your data game: How to use R to wrangle,
analyze, and visualize data faster and better
MERL Tech DC 2018
Charles Guedenet, MEL Technical Advisor, IREX
Jonathan Seiden, Learning Research Specialist, Save the Children
Please: Plug in your laptop and check its Wi-Fi
Download your dataset and R files here: http://bit.ly/merl-r
2. 2
Objectives
By the end of this session, you should:
1. Have R and RStudio set up on your computer
2. Have a better understanding of what R programming is
and what it can do for you
3. Learn about useful R tools (functions and packages)
that you can use in your own work
4. Feel intrigued (and excited?) enough about R to pursue
further learning
3. 3
What is R?
A programming language for data manipulation
Command-line driven, not point-and-click
Who uses it?
• Academics, journalists, statisticians, open data enthusiasts
6. 6
Pro #2: Top-notch data visualization!
https://www.r-graph-gallery.com/
7. 7
Pro #3: Flexible & Comprehensive
Work with data across the data life cycle
Get data → Cleaning → Analysis → Data viz → Reporting
8. 8
Pro #4: Large & active community
Tutorials, blogs, websites
R-bloggers.com – news and tutorials by 750+ bloggers
Stats.stackexchange.com
So much free code! Copy + paste
Kaggle.com
Github.com
A package for everything: 13k+ packages
www.r-pkg.org
9. 9
Pro #5: It’s free!
Compare with (annual cost):
SPSS: $1,200 (statistics only) + $$ for add-ons
SAS: $8,700 first year (basic Analytics Pro)
STATA: $595 - $1,500
11. 11
Pros and Cons
Pros
• Programming language = reproducible work & huge efficiency gains
• Top-notch data visualization capabilities
• Flexible & comprehensive
• Active R community
• Free and open source
Cons
• A steep learning curve for programming newbies
• Colleagues/friends may still prefer STATA or SPSS
12. 12
Getting set up
1. Install R on your computer
http://lib.stat.cmu.edu/R/CRAN/
Windows or MacOS - choose one of the precompiled
binary distributions (i.e., ready-to-run applications)
linked at the top of the R Project’s webpage.
2. Install RStudio
https://www.rstudio.com/products/rstudio/download/
16. 16
Introduction to RStudio: Source tabs
This is a built-in text editor
Open an empty script: File > New File > R Script
You can write a script and then execute it in the console using Ctrl + Shift + Enter
17. 17
Global Environment & History
Environment tab - where you can see the values and functions that you’ve
created or imported
History tab – where you can see a list of the commands you’ve entered in the console
18. 18
Files, plots, packages, help
Files – navigate your computer’s files
Packages – find and install packages
Help – search and find help with functions and packages
19. 19
Functions, packages, and the library
A function is a set of statements (or instructions) organized to
perform a specific task.
e.g. sqrt(), sd(), mean()
Packages are collections of R functions, data, and compiled
code in a well-defined format.
The library is where packages are stored on your computer.
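A minimal sketch of these three ideas in R. The vector `scores` is made up for illustration, and `stats` is used as the example package only because it ships with R, so nothing needs to be downloaded. The commented-out `install.packages()` line names ggplot2 purely as an example of a package you might add.

```r
# Hypothetical data for illustration only
scores <- c(4, 9, 16, 25)

# Functions take their input inside the parentheses and return a result
root_scores <- sqrt(scores)   # square root of each value
avg_score   <- mean(scores)   # arithmetic mean
sd_score    <- sd(scores)     # sample standard deviation

print(root_scores)
print(avg_score)
print(sd_score)

# install.packages("ggplot2")  # downloads a package into your library (needs internet; run once)
library(stats)                 # loads an installed package for the current session
print(.libPaths())             # the folder(s) where your library lives on this computer
```

Installing a package is a one-time step per computer, while `library()` has to be called in every new R session that uses the package.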
20. 20
Basic tips
1. To run a command/function, click the “Run” button or press Ctrl +
Enter
2. R is case sensitive. Make sure your spelling and capitalization are
correct.
3. The $ symbol is used to select a particular column within a table
(e.g., table$column).
4. The # symbol: Any text that you do not want R to act on (such as
comments, notes, or instructions) needs to be preceded by
the # symbol (a.k.a. hash-tag, comment, pound, or number symbol). R
ignores the remainder of the script line following #.
From <http://ncss-tech.github.io/stats_for_soil_survey/chapters/1_introduction/1_introduction.html>
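Tips 2 to 4 can be tried out with a few lines in the console. This is only a sketch: the `survey` table and its values are invented for illustration.

```r
# A small made-up table (data frame) for illustration
survey <- data.frame(
  name = c("Ana", "Ben", "Chloe"),
  age  = c(34, 28, 41)
)

# Tip 3: $ selects one column from a table
ages <- survey$age
mean_age <- mean(ages)
print(mean_age)

# Tip 2: R is case sensitive, so Survey is not the same object as survey
has_lower <- exists("survey")  # TRUE: we created it above
has_upper <- exists("Survey")  # FALSE, assuming nothing named Survey was created
print(c(has_lower, has_upper))

# Tip 4: everything after # on a line is ignored
total <- 1 + 1  # + 100 is part of the comment and is never evaluated
print(total)
```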
Editor's Notes
Intro:
R is one of the most popular programming languages for data science, used by companies and universities around the globe in many fields.
*The name “R” comes from the initials of the two men who first developed the language at the University of Auckland, Robert Gentleman and Ross Ihaka
Image from https://www.quora.com/Who-even-uses-R-language (James McInnes)
Who uses it and to do what?
Why learn R and how does it compare to other programs like Excel, STATA, SPSS, and others?
Ford uses R to improve the design of its vehicles.
Twitter uses R to monitor user experience.
The US National Weather Service uses R to predict severe flooding.
The Human Rights Data Analysis Group uses R to quantify the impact of war.
The New York Times uses R to create infographics.
Google uses R to calculate the ROI of advertising campaigns.
1. R is an actual programming language, with a command-line interface for executing code versus point-and-click
Efficiency gains – useful if you need to reformat every graph or chart at once, or if your data changes after you have already made your calculations
Reproducible - Just like creating a recipe for a meal, a recipe with code makes it possible for anyone to reproduce your work (including your future self). Every step of your analysis is recorded. (It's also useful for validating with other researchers).
https://www.r-graph-gallery.com/
Obtains and reads virtually any type of data – Excel, web scraping, databases (MySQL), foreign formats (SPSS, SAS)
Cleaning and wrangling
Analysis – Quant, Qualitative, network analysis, Machine learning, Text mining
Data visualization
Reporting – publish static results in Word or PDF, or make them interactive with the rmarkdown and Shiny packages
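The five stages above can be sketched in a single short script. This is an illustrative example rather than the workshop's own code: it uses only base R and the built-in `mtcars` dataset, and the output file names are hypothetical. Files are written to a temporary folder so nothing on your computer is overwritten.

```r
# 1. Get data: mtcars is an example dataset that ships with R
cars <- mtcars

# 2. Clean: keep three columns and turn the cylinder count into a category
cars <- cars[, c("mpg", "cyl", "wt")]
cars$cyl <- factor(cars$cyl)

# 3. Analyze: average miles per gallon for each cylinder group
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars, FUN = mean)
print(mpg_by_cyl)

# 4. Visualize: save a bar chart to a PDF file (hypothetical file name)
plot_path <- file.path(tempdir(), "mpg_by_cyl.pdf")
pdf(plot_path)
barplot(mpg_by_cyl$mpg,
        names.arg = as.character(mpg_by_cyl$cyl),
        xlab = "Cylinders", ylab = "Average mpg")
invisible(dev.off())  # close the file so the chart is written to disk

# 5. Report: write the summary table to a CSV file that can be shared
csv_path <- file.path(tempdir(), "mpg_by_cyl.csv")
write.csv(mpg_by_cyl, csv_path, row.names = FALSE)
```

Because every step is recorded in the script, rerunning it after the data changes regenerates the table, the chart, and the report file in one go.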
https://www.r-bloggers.com/
https://cran.r-project.org/web/packages/available_packages_by_name.html
3. Active community
It's easy to create so-called packages, which are extensions to R. R's very active community has created thousands of these packages for many different fields.
Github.com and stats.stackexchange.com are full of freely available programming code
PROS
1. R is an actual programming language, with a command-line interface for executing code versus point-and-click
Efficiency gains – useful if you need to reformat every graph or chart at once, or if your data changes after you have already made your calculations
2. Data visualization
3. Flexible & comprehensive
CONS
Steep learning curve
Resistance from colleagues/friends
Setup
If you are running Windows or MacOS, you should choose one of the precompiled binary distributions (i.e., ready-to-run applications) linked at the top of the R Project’s webpage.
RStudio
RStudio is an “Integrated Development Environment”, or IDE. This means it is a front-end for R that makes it much easier to work with. RStudio is also free, and available for Windows, Mac, and Linux platforms.
RStudio is a particularly user-friendly interface for working with R. It has drop-down menus and many customization options.
The first time you open RStudio, you’ll see three windows. By default, a fourth window is hidden. Open it by clicking on the File drop-down menu, then New File, and R Script.
The Console is where you can type code and execute it. It’s also where warnings and error messages appear to help you debug your code.
The benefit of writing code in a script is that your work is saved and can be changed later if, for example, your data changes, you want to make edits to your charts, or you want to share your work with others.
Almost everything in R is done through functions.
A function is a set of statements organized together to perform a specific task. R has a large number of built-in functions, and users can create their own. Functions may take arguments that control how the task is carried out (for example, an argument telling the function not to include NA values).
Numeric functions
sqrt()
Character functions
e.g. strsplit(data, split)
From <http://ncss-tech.github.io/stats_for_soil_survey/chapters/1_introduction/1_introduction.html>
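A brief sketch of the points in these notes: a function argument that excludes NA values, one numeric and one character function, and a user-defined function. The data values and the function name `pct_change` are invented for illustration.

```r
# Made-up values, including one missing value (NA)
incomes <- c(100, 250, NA, 400)

# Without extra arguments, the missing value makes the result NA
mean_with_na <- mean(incomes)
print(mean_with_na)

# The na.rm argument tells mean() not to include NA values
mean_no_na <- mean(incomes, na.rm = TRUE)
print(mean_no_na)

# Numeric function
root <- sqrt(16)
print(root)

# Character function: split a string wherever the separator appears
parts <- strsplit("clean,analyze,visualize", split = ",")
print(parts[[1]])  # strsplit() returns a list; [[1]] pulls out the pieces

# A user-defined function (hypothetical name): percent change between two values
pct_change <- function(old, new) {
  (new - old) / old * 100
}
growth <- pct_change(old = 200, new = 250)
print(growth)
```

Typing `?mean` or `?strsplit` in the console opens the help page that lists each function's arguments.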