Nuts and Bolts
Matthew Gentzkow
Jesse M. Shapiro
Chicago Booth and NBER
Introduction
We have focused on the statistical / econometric issues that arise with
big data
In the time that remains, we want to spend a little time on the
practical issues...
E.g., where do you actually put a 2 TB dataset?
Goal: Sketch some basic computing ideas relevant to working with
large datasets.
Caveat: We are all amateurs
The Good News
Much of what we've talked about here you can do on your laptop
Your OS knows how to do parallel computing (multiple processors,
multiple cores)
Many big datasets are ≤ 5 GB
Save the data to local disk, fire up Stata or R, and off you go...
How Big is Big?
Congressional record text (1870-2010) ≈50 GB
Congressional record pdfs (1870-2010) ≈500 GB
Nielsen scanner data (34k stores, 2004-2010) ≈5 TB
Wikipedia (2013) ≈6 TB
20% Medicare claims data (1997-2009) ≈10 TB
Facebook (2013) ≈100,000 TB
All data in the world ≈2.7 billion TB
Outline
Software engineering for economists
Databases
Cluster computing
Scenarios
Software Engineering for Economists
Motivation
A lot of the time spent in empirical research is writing, reading, and
debugging code.
Common situations...
Broken Code
Incoherent Data
Rampant Duplication
Replication Impossible
Tons of Versions
This Talk
We are not software engineers or computer scientists.
But we have learned that most common problems in the social sciences
have analogues in these fields, and there are standard solutions.
Goal is to highlight a few of these that we think are especially valuable
to researchers.
Focus on incremental changes: one step away from common practice.
Automation
Raw Data
Data from original source...
Manual Approach
Open spreadsheet
Output to text files
Open Stata
Load data, merge files
Compute log(chip sales)
Run regression
Copy results to MS Word and save
Manual Approach
Two main problems with this approach
Replication: how can we be sure we'll find our way back to the exact
same numbers?
Efficiency: what happens if we change our mind about the right
specification?
Semi-automated Approach
Problems
Which file does what?
In what order?
Fully Automated Approach
File: rundirectory.bat
stattransfer export_to_csv.stc
statase -b mergefiles.do
statase -b cleandata.do
statase -b regressions.do
statase -b figures.do
pdflatex tv_potato.tex
All steps controlled by a shell script
Order of steps unambiguous
Easy to call commands from different packages
Make
Framework to go from source to target
Tracks dependencies and revisions
Avoids rebuilding components that are up to date
Used to build executable files
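Make's core rule is to rebuild a target only when it is missing or older than one of its sources. A minimal Python sketch of that timestamp logic (file names hypothetical):

```python
import os

def needs_rebuild(target, sources):
    """A target must be rebuilt if it is missing or older than any source."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(s) > target_time for s in sources)
```

A Makefile encodes this per target, together with the command that rebuilds it, so only out-of-date pieces of the directory are rerun.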
Version Control
After Some Editing
Dates demarcate versions, initials demarcate authors
Why do this?
Facilitates comparison
Facilitates undo
What's Wrong with the Approach?
Why not do this?
It's a pain: always have to remember to tag every new file
It's confusing:
Which log file came from regressions_022713_mg.do?
Which version of cleandata.do makes the data used by
regressions_022413.do?
It fails the market test: no software firm does it this way
Version Control
Software that sits on top of your filesystem
Keeps track of multiple versions of the same file
Records date, authorship
Manages conflicts
Benefits
Single authoritative version of the directory
Edit without fear: an undo command for everything
Life After Version Control
Aside: If you always run rundirectory.bat before you commit, you
guarantee replicability.
Directories
One Directory Does Everything
Pros: Self-contained, simple
Cons:
Have to rerun everything for every change
Hard to figure out dependencies
Functional Directories
Dependencies Obvious
One Resource, Many Projects
Keys
Research Assistant Output
county state cnty_pop state_pop region
36037 NY 3817735 43320903 1
36038 NY 422999 43320903 1
36039 NY 324920 . 1
36040 . 143432 43320903 1
. NY . 43320903 1
37001 VA 3228290 7173000 3
37002 VA 449499 7173000 3
37003 VA 383888 7173000 4
37004 VA 483829 7173000 3
Causes for Concern
(Same table as above: note the missing state population for county 36039, the missing state for 36040, the row with no county code, and the region code for VA county 37003 that disagrees with the other VA rows.)
Relational Databases
county state population
36037 NY 3817735
36038 NY 422999
36039 NY 324920
36040 NY 143432
37001 VA 3228290
37002 VA 449499
37003 VA 383888
37004 VA 483829
state population region
NY 43320903 1
VA 7173000 3
Each variable is an attribute of an element of the table
Each table has a key
Tables are connected by foreign keys (the state field in the county table)
Steps
Store data in normalized format as above
Can use flat files; doesn't have to be fancy relational database software
Construct a second set of les with key transformations
e.g., log population
Merge data together and run analysis
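The three steps can be sketched with Python's built-in sqlite3, using the county/state tables from the slides (a flat-file workflow in Stata or R would follow the same shape):

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: store data in normalized tables (each fact appears exactly once)
conn.executescript("""
    CREATE TABLE state  (state TEXT PRIMARY KEY, population INTEGER, region INTEGER);
    CREATE TABLE county (county TEXT PRIMARY KEY,
                         state TEXT REFERENCES state(state),  -- foreign key
                         population INTEGER);
""")
conn.executemany("INSERT INTO state VALUES (?, ?, ?)",
                 [("NY", 43320903, 1), ("VA", 7173000, 3)])
conn.executemany("INSERT INTO county VALUES (?, ?, ?)",
                 [("36037", "NY", 3817735), ("37001", "VA", 3228290)])

# Step 2: key transformation (log population)
# Step 3: merge tables on the foreign key and analyze
log_pop = {
    county: math.log(pop)
    for county, pop, region in conn.execute(
        "SELECT c.county, c.population, s.region "
        "FROM county c JOIN state s ON c.state = s.state")
}
```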
To Come
What to do with enormous databases?
Abstraction
Rampant Duplication
Abstracted
Three Leave-Out Means
Copy and Paste Errors
Abstracted
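The "three leave-out means" pattern replaces three near-identical copy-pasted blocks with a single function, so a fix happens in exactly one place. A minimal Python sketch (the sales variable is illustrative):

```python
def leave_out_mean(values, i):
    """Mean of values excluding the observation at index i."""
    others = values[:i] + values[i + 1:]
    return sum(others) / len(others)

# One function, called three times, instead of three pasted blocks
sales = [10.0, 20.0, 30.0]
loo = [leave_out_mean(sales, i) for i in range(len(sales))]
```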
Documentation
Too Much Documentation
Unclear Code
Self-Documenting Code
Management
A Friendly Chat
Task Management
Parting Thoughts
Code and Data
Data are getting larger
Research is getting more collaborative
Need to manage code and data responsibly for collaboration and
replicability
Learn from the pros, not from us
Databases
What is a Database?
Database Theory
Principles for how to store / organize / retrieve data efficiently
(normalization, indexing, optimization, etc.)
Database Software
Manages storage / organization / retrieval of data (SQL, Oracle,
Access, etc.)
Economists rarely use this software because we typically store data in
flat files and interact with them using statistical programs
When we receive extracts from large datasets (the census, Medicare
claims, etc.) someone else often interacts with the database on the
back end
Normalization
Database normalization is the process of organizing the fields and
tables of a relational database to minimize redundancy and
dependency. Normalization usually involves dividing large tables into
smaller (and less redundant) tables and defining relationships between
them.
Benefits of Normalization
Efficient storage
Efficient modification
Guarantees coherence
Makes logical structure of data clear
Indexing
Medicare claims data for 1997-2010 are roughly 10 TB
These data are stored at NBER in thousands of zipped SAS files
To extract, say, all claims for heart disease patients aged 55-65, you
would need to read every line of every one of those les
THIS IS SLOW!!!
Indexing
The obvious solution, long understood for books, libraries, economics
journals, and so forth, is to build an index
Database software handles this automatically
Allows you to specify fields that will often be used for lookups,
subsetting, etc. to be indexed
For the Medicare data, we could index age, gender, type of treatment,
etc. to allow much faster extraction
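A toy illustration with Python's built-in sqlite3 (the claims table and fields are made up to echo the Medicare example; real claims data would never fit in memory like this):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (age INTEGER, gender TEXT, treatment TEXT)")
conn.executemany("INSERT INTO claims VALUES (?, ?, ?)",
                 [(64, "M", "heart"), (65, "M", "heart"), (65, "F", "hip")])

# Without an index this query must scan every row; with one it seeks directly
conn.execute("CREATE INDEX idx_age_gender ON claims (age, gender)")
rows = conn.execute(
    "SELECT * FROM claims WHERE age = 65 AND gender = 'M'").fetchall()

# The query planner confirms the index is used instead of a full scan
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM claims "
                    "WHERE age = 65 AND gender = 'M'").fetchone()
```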
Indexing
Benefits
Fast lookups
Easy to police data constraints
Costs
Storage
Time
Database optimization is the art of tuning database structure and
indexing for a specific set of needs
Data Warehouses
Traditional databases are optimized for operational environments
Bank transactions
Airline reservations
etc.
Characteristics
Many small reads and writes
Many users accessing simultaneously
Premium on low latency
Only care about current state
Data Warehouses
In analytic / research environments, however, the requirements are
different
Frequent large reads, infrequent writes
Relatively little simultaneous access
Value throughput relative to latency
May care about history as well as current state
Need to create and re-use many custom extracts
Database systems tuned to these requirements are commonly called
data warehouses
Distributed Computing
Definition: computation shared among many independent processors
Terminology
Distributed vs. Parallel (latter usually refers to systems with shared
memory)
Cluster vs. Grid (latter usually more decentralized / heterogeneous)
On Your Local Machine
Your OS can run multiple processors each with multiple cores
Your video card has hundreds of cores
Stata, R, Matlab, etc. can all exploit these resources to do parallel
computing
Stata
Buy appropriate MP version of Stata
Software does the rest
R / Matlab
Install appropriate add-ins (parallel package in R, parallel computing
toolbox in Matlab)
Include parallel commands in code (e.g., parfor in place of for in
Matlab)
On Cluster / Grid
Resources abound
University / department computing clusters
Non-commercial scientific computing grids (e.g., XSEDE)
Commercial grids (e.g., Amazon EC2)
Most of these run Linux w/ distribution handled by a batch scheduler
Write code using your favorite application, then send it to scheduler
with a bash script
MapReduce
MapReduce is a programming model that facilitates distributed
computing
Developed by Google around 2004, though ideas predate that
Most algorithms for distributed data processing can be represented in
two steps
Map: Process individual chunk of data to generate an intermediate
summary
Reduce: Combine summaries from different chunks to produce a
single output file
If you structure your code this way, MapReduce software will handle
all the details of distribution:
Partitioning data
Scheduling execution across nodes
Managing communication between machines
Handling errors / machine failures
MapReduce: Examples
Count words in a large collection of documents
Map: Document i → Set of (word, count) pairs Ci
Reduce: Collapse {Ci }, summing count within word
Extract medical claims for 65-year old males
Map: Record set i → Subset of i that are 65-year old males Hi
Reduce: Append elements of {Hi }
Compute marginal regressions for text analysis (e.g., Gentzkow &
Shapiro 2010)
Map: Counts xij of phrase j → Parameters α̂j, β̂j from
E(xij | yi) = αj + βj yi
Reduce: Append α̂j, β̂j
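The word-count example maps directly to code. A serial Python sketch of the two steps (real MapReduce software would distribute the map calls and the grouping across machines):

```python
from collections import Counter

def map_doc(document):
    """Map: one document -> intermediate (word, count) pairs."""
    return Counter(document.split())

def reduce_counts(intermediate_counts):
    """Reduce: collapse the per-document summaries, summing within word."""
    total = Counter()
    for counts in intermediate_counts:
        total.update(counts)
    return total

docs = ["big data big ideas", "big nuts and bolts"]
word_counts = reduce_counts(map_doc(d) for d in docs)
```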
MapReduce: Model
[Figure 1: Execution overview — the user program forks a master and workers; the master assigns map tasks and reduce tasks; map workers read the input splits and write intermediate files to local disk; reduce workers read those intermediate files remotely and write the output files.]
MapReduce: Implementation
MapReduce is the original software developed by Google
Hadoop is the open-source version most people use (developed by
Apache)
Amazon has a hosted implementation (Amazon EMR)
MapReduce: Implementation
MapReduce is the original software developed by Google
Hadoop is the open-source version most people use (developed by
Apache)
Amazon has a hosted implementation (Amazon EMR)
How does it work?
Write your code as two functions called map and reduce
Send code  data to scheduler using bash script
Distributed File Systems
Data transfer is the main bottleneck in distributed systems
For big data, it makes sense to distribute data as well as computation
Data broken up into chunks, each of which lives on a separate node
File system keeps track of where the pieces are and allocates jobs so
computation happens close to data whenever possible
Tight coupling between MapReduce software and associated file
systems
MapReduce → Google File System (GFS)
Hadoop → Hadoop Distributed File System (HDFS)
Amazon EMR → Amazon S3
Distributed File Systems
[Figure 1: GFS Architecture — a GFS client, the GFS master (file namespace and chunk metadata), and GFS chunkservers storing chunks on local Linux file systems. The client sends (file name, chunk index) to the master and gets back (chunk handle, chunk locations); it then requests (chunk handle, byte range) directly from a chunkserver, which returns the chunk data. Keeping the master out of the data path prevents it from becoming a bottleneck.]
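A toy sketch of that lookup, with hypothetical server names: the client turns a byte offset into a chunk index, then consults the master's metadata table for the chunk's locations:

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB, the GFS default chunk size

# The master's metadata: (file name, chunk index) -> chunkservers holding it
chunk_locations = {
    ("/foo/bar", 0): ["server-a", "server-c"],
    ("/foo/bar", 1): ["server-b", "server-c"],
}

def locate(filename, byte_offset):
    """Translate a byte offset into a chunk index, then look up which
    chunkservers hold that chunk; the client reads from them directly."""
    index = byte_offset // CHUNK_SIZE
    return index, chunk_locations[(filename, index)]
```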
Scenarios
Scenario 1: Not-So-Big Data
My data is 100 GB or less
Advice
Store data locally in flat files (csv, Stata, R, etc.)
Organize data in normalized tables for robustness and clarity
Run code serially or (if computation is slow) in parallel
Scenario 2: Big Data, Small Analysis
My raw data is > 100 GB, but the extracts I actually use for analysis
are < 100 GB
Example
Medicare claims data → analyze heart attack spending by patient by
year
Nielsen scanner data → analyze average price by store by month
Advice
Store data in a relational database optimized to produce analysis extracts
efficiently
Store extracts locally in flat files (csv, Stata, R, etc.)
Organize extracts in normalized tables for robustness and clarity
Run code serially or (if computation is slow) in parallel
Note: Gains to a database increase for more structured data. For
completely unstructured data, you may be better off using a distributed
file system + MapReduce to create extracts.
Scenario 3: Big Data, Big Analysis
My data is > 100 GB and my analysis code needs to touch all of the
data
Example
2 TB of SEC filing text → run variable selection using all data
Advice
Store data in a distributed file system
Use MapReduce or other distributed algorithms for analysis

Business in the United States Who Owns it and How Much Tax They Pay
 
Redistribution through Minimum Wage Regulation: An Analysis of Program Linkag...
Redistribution through Minimum Wage Regulation: An Analysis of Program Linkag...Redistribution through Minimum Wage Regulation: An Analysis of Program Linkag...
Redistribution through Minimum Wage Regulation: An Analysis of Program Linkag...
 
The Distributional E ffects of U.S. Clean Energy Tax Credits
The Distributional Effects of U.S. Clean Energy Tax CreditsThe Distributional Effects of U.S. Clean Energy Tax Credits
The Distributional E ffects of U.S. Clean Energy Tax Credits
 
An Experimental Evaluation of Strategies to Increase Property Tax Compliance:...
An Experimental Evaluation of Strategies to Increase Property Tax Compliance:...An Experimental Evaluation of Strategies to Increase Property Tax Compliance:...
An Experimental Evaluation of Strategies to Increase Property Tax Compliance:...
 
Nbe rcausalpredictionv111 lecture2
Nbe rcausalpredictionv111 lecture2Nbe rcausalpredictionv111 lecture2
Nbe rcausalpredictionv111 lecture2
 
Nber slides11 lecture2
Nber slides11 lecture2Nber slides11 lecture2
Nber slides11 lecture2
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
 
Jackson nber-slides2014 lecture3
Jackson nber-slides2014 lecture3Jackson nber-slides2014 lecture3
Jackson nber-slides2014 lecture3
 
Jackson nber-slides2014 lecture1
Jackson nber-slides2014 lecture1Jackson nber-slides2014 lecture1
Jackson nber-slides2014 lecture1
 
Acemoglu lecture2
Acemoglu lecture2Acemoglu lecture2
Acemoglu lecture2
 
Acemoglu lecture4
Acemoglu lecture4Acemoglu lecture4
Acemoglu lecture4
 
The NBER Working Paper Series at 20,000 - Joshua Gans
The NBER Working Paper Series at 20,000 - Joshua GansThe NBER Working Paper Series at 20,000 - Joshua Gans
The NBER Working Paper Series at 20,000 - Joshua Gans
 
The NBER Working Paper Series at 20,000 - Claudia Goldin
The NBER Working Paper Series at 20,000 - Claudia GoldinThe NBER Working Paper Series at 20,000 - Claudia Goldin
The NBER Working Paper Series at 20,000 - Claudia Goldin
 
The NBER Working Paper Series at 20,000 - James Poterba
The NBER Working Paper Series at 20,000 - James PoterbaThe NBER Working Paper Series at 20,000 - James Poterba
The NBER Working Paper Series at 20,000 - James Poterba
 
The NBER Working Paper Series at 20,000 - Scott Stern
The NBER Working Paper Series at 20,000 - Scott SternThe NBER Working Paper Series at 20,000 - Scott Stern
The NBER Working Paper Series at 20,000 - Scott Stern
 
The NBER Working Paper Series at 20,000 - Glenn Ellison
The NBER Working Paper Series at 20,000 - Glenn EllisonThe NBER Working Paper Series at 20,000 - Glenn Ellison
The NBER Working Paper Series at 20,000 - Glenn Ellison
 
L3 1b
L3 1bL3 1b
L3 1b
 
Econometrics of High-Dimensional Sparse Models
Econometrics of High-Dimensional Sparse ModelsEconometrics of High-Dimensional Sparse Models
Econometrics of High-Dimensional Sparse Models
 
High-Dimensional Methods: Examples for Inference on Structural Effects
High-Dimensional Methods: Examples for Inference on Structural EffectsHigh-Dimensional Methods: Examples for Inference on Structural Effects
High-Dimensional Methods: Examples for Inference on Structural Effects
 

Último

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 

Último (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

Nuts and bolts

  • 1. Nuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER
  • 2. Introduction We have focused on the statistical / econometric issues that arise with big data In the time that remains, we want to spend a little time on the practical issues...
  • 3. Introduction We have focused on the statistical / econometric issues that arise with big data In the time that remains, we want to spend a little time on the practical issues... E.g., where do you actually put a 2 TB dataset?
  • 4. Introduction We have focused on the statistical / econometric issues that arise with big data In the time that remains, we want to spend a little time on the practical issues... E.g., where do you actually put a 2 TB dataset? Goal: Sketch some basic computing ideas relevant to working with large datasets.
  • 5. Introduction We have focused on the statistical / econometric issues that arise with big data In the time that remains, we want to spend a little time on the practical issues... E.g., where do you actually put a 2 TB dataset? Goal: Sketch some basic computing ideas relevant to working with large datasets. Caveat: We are all amateurs
  • 6. The Good News Much of what we've talked about here you can do on your laptop Your OS knows how to do parallel computing (multiple processors, multiple cores) Many big datasets are ≤5 GB Save the data to local disk, fire up Stata or R, and off you go...
  • 7. How Big is Big? Congressional record text (1870-2010) ≈50 GB Congressional record pdfs (1870-2010) ≈500 GB Nielsen scanner data (34k stores, 2004-2010) ≈5 TB Wikipedia (2013) ≈6 TB 20% Medicare claims data (1997-2009) ≈10 TB Facebook (2013) ≈100,000 TB All data in the world ≈2.7 billion TB
  • 8. Outline Software engineering for economists Databases Cluster computing Scenarios
  • 10. Motivation A lot of the time spent in empirical research is writing, reading, and debugging code. Common situations...
  • 16. This Talk We are not software engineers or computer scientists. But we have learned that most common problems in social sciences have analogues in these fields, and there are standard solutions. Goal is to highlight a few of these that we think are especially valuable to researchers. Focus on incremental changes: one step away from common practice.
  • 18. Raw Data Data from original source...
  • 19.
  • 20. Manual Approach Open spreadsheet Output to text files Open Stata Load data, merge files Compute log(chip sales) Run regression Copy results to MS Word and save
  • 21. Manual Approach Two main problems with this approach Replication: how can we be sure we'll find our way back to the exact same numbers? Efficiency: what happens if we change our mind about the right specification?
  • 22. Semi-automated Approach Problems Which file does what? In what order?
  • 23. Fully Automated Approach File: rundirectory.bat stattransfer export_to_csv.stc statase -b mergefiles.do statase -b cleandata.do statase -b regressions.do statase -b figures.do pdflatex tv_potato.tex All steps controlled by a shell script Order of steps unambiguous Easy to call commands from different packages
  • 24. Make Framework to go from source to target Tracks dependencies and revisions Avoids rebuilding components that are up to date Used to build executable files
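Make's core rule — rebuild a target only when it is missing or older than one of its sources — can be sketched in a few lines of Python. This is a toy illustration of the idea, not a substitute for Make itself; the function names are made up here:

```python
import os

def needs_rebuild(target, sources):
    """True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)

def build(target, sources, recipe):
    """Run the recipe only when the target is out of date, as Make would."""
    if needs_rebuild(target, sources):
        recipe()
```

In a real project the recipe would be a shell command such as `statase -b cleandata.do`; Make adds dependency chaining, parallel builds, and pattern rules on top of this timestamp comparison.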
  • 26. After Some Editing Dates demarcate versions, initials demarcate authors Why do this? Facilitates comparison Facilitates undo
  • 27. What's Wrong with the Approach? Why not do this? It's a pain: always have to remember to tag every new file It's confusing: Which log file came from regressions_022713_mg.do? Which version of cleandata.do makes the data used by regressions_022413.do? It fails the market test: No software firm does it this way
  • 28. Version Control Software that sits on top of your filesystem Keeps track of multiple versions of the same file Records date, authorship Manages conflicts Benefits Single authoritative version of the directory Edit without fear: an undo command for everything
  • 35. Life After Version Control Aside: If you always run rundirectory.bat before you commit, you guarantee replicability.
  • 37. One Directory Does Everything Pros: Self-contained, simple Cons: Have to rerun everything for every change Hard to figure out dependencies
  • 40. One Resource, Many Projects
  • 41. Keys
  • 42. Research Assistant Output
    county  state  cnty_pop  state_pop  region
    36037   NY     3817735   43320903   1
    36038   NY     422999    43320903   1
    36039   NY     324920    .          1
    36040   .      143432    43320903   1
    .       NY     .         43320903   1
    37001   VA     3228290   7173000    3
    37002   VA     449499    7173000    3
    37003   VA     383888    7173000    4
    37004   VA     483829    7173000    3
  • 43. Causes for Concern
    county  state  cnty_pop  state_pop  region
    36037   NY     3817735   43320903   1
    36038   NY     422999    43320903   1
    36039   NY     324920    .          1
    36040   .      143432    43320903   1
    .       NY     .         43320903   1
    37001   VA     3228290   7173000    3
    37002   VA     449499    7173000    3
    37003   VA     383888    7173000    4
    37004   VA     483829    7173000    3
    Same table as above: note the missing county and state codes, the missing populations, and the row whose region code disagrees with the rest of its state.
  • 44. Relational Databases
    county  state  population
    36037   NY     3817735
    36038   NY     422999
    36039   NY     324920
    36040   NY     143432
    37001   VA     3228290
    37002   VA     449499
    37003   VA     383888
    37004   VA     483829

    state  population  region
    NY     43320903    1
    VA     7173000     3
    Each variable is an attribute of an element of the table Each table has a key Tables are connected by foreign keys (state field in the county table)
  • 45. Steps Store data in normalized format as above Can use flat files, doesn't have to be fancy relational database software Construct a second set of files with key transformations, e.g., log population Merge data together and run analysis
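These steps can be sketched with plain Python dictionaries standing in for the flat files (values taken from the county/state tables above; in practice each table would live in its own csv):

```python
import math

# Normalized tables: each fact stored once, linked by a foreign key (state).
county = [
    {"county": "36037", "state": "NY", "population": 3817735},
    {"county": "37001", "state": "VA", "population": 3228290},
]
state = {
    "NY": {"population": 43320903, "region": 1},
    "VA": {"population": 7173000, "region": 3},
}

# Second set of files: key transformations, e.g. log population.
for row in county:
    row["log_pop"] = math.log(row["population"])

# Final step: merge the tables together for analysis.
merged = [
    {**row,
     "state_pop": state[row["state"]]["population"],
     "region": state[row["state"]]["region"]}
    for row in county
]
```

Because each state-level fact lives in exactly one place, a correction to NY's population propagates to every merged row automatically — the coherence guarantee that normalization buys.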
  • 46. To Come What to do with enormous databases?
  • 51. Copy and Paste Errors
  • 70. Code and Data Data are getting larger Research is getting more collaborative Need to manage code and data responsibly for collaboration and replicability Learn from the pros, not from us
  • 72. What is a Database? Database Theory Principles for how to store / organize / retrieve data efficiently (normalization, indexing, optimization, etc.) Database Software Manages storage / organization / retrieval of data (SQL, Oracle, Access, etc.) Economists rarely use this software because we typically store data in flat files and interact with them using statistical programs When we receive extracts from large datasets (the census, Medicare claims, etc.) someone else often interacts with the database on the back end
  • 73. Normalization Database Normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them.
  • 74. Benefits of Normalization Efficient storage Efficient modification Guarantees coherence Makes logical structure of data clear
  • 75. Indexing Medicare claims data for 1997-2010 are roughly 10 TB These data are stored at NBER in thousands of zipped SAS files
  • 76. Indexing Medicare claims data for 1997-2010 are roughly 10 TB These data are stored at NBER in thousands of zipped SAS files To extract, say, all claims for heart disease patients aged 55-65, you would need to read every line of every one of those files THIS IS SLOW!!!
  • 77. Indexing The obvious solution, long understood for books, libraries, economics journals, and so forth, is to build an index Database software handles this automatically Allows you to specify fields that will be often used for lookups, subsetting, etc. to be indexed For the Medicare data, we could index age, gender, type of treatment, etc. to allow much faster extraction
  • 78. Indexing Benefits Fast lookups Easy to police data constraints Costs Storage Time Database optimization is the art of tuning database structure and indexing for a specific set of needs
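Python's built-in sqlite3 module is enough to see the mechanics: declare an index on the fields you will often subset on, and the database maintains and uses it automatically. The claims schema and rows below are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE claims (patient_id INTEGER, age INTEGER, "
            "gender TEXT, diagnosis TEXT, amount REAL)")

# Index the fields that will often be used for lookups and subsetting.
con.execute("CREATE INDEX idx_age_diag ON claims (age, diagnosis)")

con.executemany("INSERT INTO claims VALUES (?, ?, ?, ?, ?)",
                [(1, 60, "M", "heart", 5000.0),
                 (2, 70, "F", "hip",   3000.0),
                 (3, 58, "M", "heart", 4200.0)])

# With the index, this extraction reads only the matching rows
# rather than scanning every line of the table.
rows = con.execute("SELECT patient_id FROM claims "
                   "WHERE age BETWEEN 55 AND 65 AND diagnosis = 'heart'"
                   ).fetchall()
```

You can confirm the index is actually used by prefixing the query with `EXPLAIN QUERY PLAN`; the cost is the extra storage for the index and the time to update it on every insert — exactly the benefits/costs trade-off on the slide.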
  • 79. Data Warehouses Traditional databases are optimized for operational environments Bank transactions Airline reservations etc.
  • 80. Data Warehouses Traditional databases are optimized for operational environments Bank transactions Airline reservations etc. Characteristics Many small reads and writes Many users accessing simultaneously Premium on low latency Only care about current state
  • 81. Data Warehouses In analytic / research environments, however, the requirements are different Frequent large reads, infrequent writes Relatively little simultaneous access Value throughput relative to latency May care about history as well as current state Need to create and re-use many custom extracts
  • 82. Data Warehouses In analytic / research environments, however, the requirements are different Frequent large reads, infrequent writes Relatively little simultaneous access Value throughput relative to latency May care about history as well as current state Need to create and re-use many custom extracts Database systems tuned to these requirements are commonly called data warehouses
  • 84. Distributed Computing Definition: Computation shared among many independent processors
  • 85. Distributed Computing Definition: Computation shared among many independent processors Terminology Distributed vs. Parallel (latter usually refers to systems with shared memory) Cluster vs. Grid (latter usually more decentralized / heterogeneous)
  • 86. On Your Local Machine Your OS can run multiple processors each with multiple cores Your video card has hundreds of cores Stata, R, Matlab, etc. can all exploit these resources to do parallel computing
  • 87. On Your Local Machine Your OS can run multiple processors each with multiple cores Your video card has hundreds of cores Stata, R, Matlab, etc. can all exploit these resources to do parallel computing Stata Buy appropriate MP version of Stata Software does the rest
  • 88. On Your Local Machine Your OS can run multiple processors each with multiple cores Your video card has hundreds of cores Stata, R, Matlab, etc. can all exploit these resources to do parallel computing Stata Buy appropriate MP version of Stata Software does the rest R / Matlab Install appropriate add-ins (parallel package in R, parallel computing toolbox in Matlab) Include parallel commands in code (e.g., parfor in place of for in Matlab)
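The same pattern — swap a serial loop for a parallel one, as `parfor` does in Matlab — looks like this with Python's standard multiprocessing module (the squaring function is a stand-in for a slow per-observation computation):

```python
from multiprocessing import Pool

def slow_task(x):
    # Stand-in for an expensive per-observation computation.
    return x * x

def run_parallel(inputs, workers=4):
    # Serial equivalent: [slow_task(x) for x in inputs]
    # The pool splits the loop across worker processes (cores).
    with Pool(processes=workers) as pool:
        return pool.map(slow_task, inputs)
```

As with `parfor`, this only pays off when iterations are independent and each one is slow relative to the cost of shipping its inputs and outputs between processes.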
  • 89. On Cluster / Grid Resources abound University / department computing clusters Non-commercial scientic computing grids (e.g., XSEDE) Commercial grids (e.g., Amazon EC2)
  • 90. On Cluster / Grid Resources abound University / department computing clusters Non-commercial scientic computing grids (e.g., XSEDE) Commercial grids (e.g., Amazon EC2) Most of these run Linux w/ distribution handled by a batch scheduler Write code using your favorite application, then send it to scheduler with a bash script
  • 91. MapReduce MapReduce is a programming model that facilitates distributed computing Developed by Google around 2004, though ideas predate that
  • 92. MapReduce MapReduce is a programming model that facilitates distributed computing Developed by Google around 2004, though ideas predate that Most algorithms for distributed data processing can be represented in two steps Map: Process individual chunk of data to generate an intermediate summary Reduce: Combine summaries from different chunks to produce a single output file
  • 93. MapReduce MapReduce is a programming model that facilitates distributed computing Developed by Google around 2004, though ideas predate that Most algorithms for distributed data processing can be represented in two steps Map: Process individual chunk of data to generate an intermediate summary Reduce: Combine summaries from different chunks to produce a single output file If you structure your code this way, MapReduce software will handle all the details of distribution: Partitioning data Scheduling execution across nodes Managing communication between machines Handling errors / machine failures
  • 94. MapReduce: Examples Count words in a large collection of documents Map: Document i → Set of (word, count) pairs Ci Reduce: Collapse {Ci }, summing count within word
  • 95. MapReduce: Examples Count words in a large collection of documents Map: Document i → Set of (word, count) pairs Ci Reduce: Collapse {Ci }, summing count within word Extract medical claims for 65-year old males Map: Record set i → Subset of i that are 65-year old males Hi Reduce: Append elements of {Hi }
  • 96. MapReduce: Examples Count words in a large collection of documents Map: Document i → Set of (word, count) pairs Ci Reduce: Collapse {Ci}, summing count within word Extract medical claims for 65-year old males Map: Record set i → Subset of i that are 65-year old males Hi Reduce: Append elements of {Hi} Compute marginal regression for text analysis (e.g., Gentzkow and Shapiro 2010) Map: Counts xij of phrase j → Parameters ˆαj, ˆβj from E(xij | yi) = αj + βj yi Reduce: Append ˆαj, ˆβj
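The word-count example can be written as two literal functions. This is plain Python; a real MapReduce system would run map_chunk on separate machines and handle the shuffle and combination itself:

```python
from collections import Counter
from functools import reduce

def map_chunk(document):
    # Map: one document -> intermediate (word, count) summary Ci.
    return Counter(document.split())

def reduce_counts(c1, c2):
    # Reduce: combine two intermediate summaries, summing within word.
    return c1 + c2

docs = ["big data big computers", "big ideas"]
intermediate = [map_chunk(d) for d in docs]   # runs in parallel on workers
total = reduce(reduce_counts, intermediate)   # collapsed into one output
```

The medical-claims extraction fits the same template: map_chunk would filter each record set down to 65-year old males and reduce_counts would append the filtered subsets.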
  • 97. MapReduce: Model [Figure: execution overview from the MapReduce paper — the master forks map and reduce workers, assigns map tasks over the input splits, map workers write intermediate files to local disk, and reduce workers read those files remotely to write the final output files]
  • 98. MapReduce: Implementation MapReduce is the original software developed by Google Hadoop is the open-source version most people use (developed by Apache) Amazon has a hosted implementation (Amazon EMR)
  • 99. MapReduce: Implementation MapReduce is the original software developed by Google Hadoop is the open-source version most people use (developed by Apache) Amazon has a hosted implementation (Amazon EMR) How does it work? Write your code as two functions called map and reduce Send code and data to the scheduler using a bash script
  • 100. Distributed File Systems Data transfer is the main bottleneck in distributed systems For big data, it makes sense to distribute data as well as computation Data broken up into chunks, each of which lives on a separate node File system keeps track of where the pieces are and allocates jobs so computation happens close to data whenever possible
  • 101. Distributed File Systems Data transfer is the main bottleneck in distributed systems For big data, it makes sense to distribute data as well as computation Data broken up into chunks, each of which lives on a separate node File system keeps track of where the pieces are and allocates jobs so computation happens close to data whenever possible Tight coupling between MapReduce software and associated file systems MapReduce → Google File System (GFS) Hadoop → Hadoop Distributed File System (HDFS) Amazon EMR → Amazon S3
  • 102. Distributed File Systems [Figure: GFS architecture — a client asks the single GFS master for chunk locations by file name and chunk index, then reads and writes chunk data directly with the chunkservers, which store the chunks as ordinary Linux files]
  • 104. Scenario 1: Not-So-Big Data My data is 100 GB or less
  • 105. Scenario 1: Not-So-Big Data My data is 100 GB or less Advice Store data locally in flat files (csv, Stata, R, etc.) Organize data in normalized tables for robustness and clarity Run code serially or (if computation is slow) in parallel
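"Normalized tables in flat files" can be sketched as follows. The stores/prices schema and numbers are made up for illustration: store-level attributes live in one CSV and price observations in another, linked by store_id, rather than repeating each store's attributes on every price row; the join happens at analysis time.

```python
# Minimal sketch of normalized flat files (hypothetical schema).
# io.StringIO stands in for csv files on local disk.
import csv
import io

stores_csv = io.StringIO("store_id,state\n1,IL\n2,NY\n")
prices_csv = io.StringIO(
    "store_id,upc,price\n1,111,2.50\n1,112,3.00\n2,111,2.75\n"
)

# Load the small dimension table into a lookup keyed by store_id
stores = {row["store_id"]: row for row in csv.DictReader(stores_csv)}

# Join at analysis time: average price by state
totals, counts = {}, {}
for row in csv.DictReader(prices_csv):
    state = stores[row["store_id"]]["state"]
    totals[state] = totals.get(state, 0.0) + float(row["price"])
    counts[state] = counts.get(state, 0) + 1

avg_price = {state: totals[state] / counts[state] for state in totals}
```

If a store's state needs correcting, it changes in one row of one table instead of in every price record, which is the robustness-and-clarity point of normalization.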
  • 106. Scenario 2: Big Data, Small Analysis My raw data is >100 GB, but the extracts I actually use for analysis are <100 GB
  • 107. Scenario 2: Big Data, Small Analysis My raw data is >100 GB, but the extracts I actually use for analysis are <100 GB Example Medicare claims data → analyze heart attack spending by patient by year Nielsen scanner data → analyze average price by store by month
  • 108. Scenario 2: Big Data, Small Analysis My raw data is >100 GB, but the extracts I actually use for analysis are <100 GB Example Medicare claims data → analyze heart attack spending by patient by year Nielsen scanner data → analyze average price by store by month Advice Store data in relational database optimized to produce analysis extracts efficiently Store extracts locally in flat files (csv, Stata, R, etc.) Organize extracts in normalized tables for robustness and clarity Run code serially or (if computation is slow) in parallel
  • 109. Scenario 2: Big Data, Small Analysis My raw data is >100 GB, but the extracts I actually use for analysis are <100 GB Example Medicare claims data → analyze heart attack spending by patient by year Nielsen scanner data → analyze average price by store by month Advice Store data in relational database optimized to produce analysis extracts efficiently Store extracts locally in flat files (csv, Stata, R, etc.) Organize extracts in normalized tables for robustness and clarity Run code serially or (if computation is slow) in parallel Note: Gains to database increase for more structured data. For completely unstructured data, you may be better off using distributed file system + MapReduce to create extracts.
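The "database produces the extract" workflow can be sketched with an in-memory SQLite database standing in for a real server (PostgreSQL, etc.). The schema and numbers are invented for illustration; the point is that the database does the heavy aggregation and only the small store-by-month extract comes back to the analysis environment.

```python
# Sketch of database-produced extracts (hypothetical schema/values),
# using SQLite as a stand-in for a production relational database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE prices (store_id INT, month TEXT, upc INT, price REAL)"
)
con.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?)",
    [
        (1, "2024-01", 111, 2.50),
        (1, "2024-01", 112, 3.50),
        (1, "2024-02", 111, 2.60),
        (2, "2024-01", 111, 2.75),
    ],
)

# The aggregation runs inside the database; the extract is one row
# per store-month, ready to save as a flat file for analysis.
extract = con.execute(
    """SELECT store_id, month, AVG(price) AS avg_price
       FROM prices
       GROUP BY store_id, month
       ORDER BY store_id, month"""
).fetchall()
```

With the raw price table indexed on (store_id, month), the same query stays fast even when the underlying table has billions of rows.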
  • 110. Scenario 3: Big Data, Big Analysis My data is >100 GB and my analysis code needs to touch all of the data
  • 111. Scenario 3: Big Data, Big Analysis My data is >100 GB and my analysis code needs to touch all of the data Example 2 TB of SEC filing text → run variable selection using all data
  • 112. Scenario 3: Big Data, Big Analysis My data is >100 GB and my analysis code needs to touch all of the data Example 2 TB of SEC filing text → run variable selection using all data Advice Store data in distributed file system Use MapReduce or other distributed algorithms for analysis
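The MapReduce pattern itself can be sketched serially with the classic word-count example. On a real cluster the framework runs the same map and reduce functions in parallel across chunks of the distributed file system and handles the shuffle between them; here each stage runs in-process to show the logic.

```python
# Serial sketch of MapReduce (word count). On Hadoop/EMR the framework
# parallelizes map and reduce across nodes and performs the shuffle.
from collections import defaultdict

def map_phase(doc):
    """Map: emit one (key, value) pair, ("word", 1), per token."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)

docs = ["big data big computation", "big data small analysis"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because map is applied independently per document and reduce independently per key, both phases parallelize trivially, which is why the pattern scales to jobs like variable selection over terabytes of filing text.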