Querying 1.8 billion reddit comments with python
Daniel Rodriguez / PyData NYC / Nov 11, 2015
github.com/danielfrg/pydata-nyc-2015 (https://github.com/danielfrg/pydata-nyc-2015)
About me
Daniel Rodriguez
Data Scientist and Software Developer at Continuum Analytics
Anaconda Cluster
Bogotá, Colombia [1]
now in Austin, TX
[1] Yes, it's spelled like that, with two 'o's. Columbia with a 'u' is a university in New York City. It's not that hard: ColOmbia is a cOuntry, ColUmbia is a University.
twitter.com/danielfrg (twitter.com/danielfrg)
github.com/danielfrg (github.com/danielfrg)
danielfrg.com (danielfrg.com)
What is this talk about?
Querying 1.8 billion reddit comments with python
And how you can do it
Data science "friendly" clusters
ETL plus a little bit on data formats
Parquet
Querying data with some new python libraries that target remote engines
Impala
Assumes some basic knowledge of big data tools like HDFS, MapReduce, Spark, or similar
More info at: http://blaze.pydata.org/blog/2015/09/16/reddit-impala/
(http://blaze.pydata.org/blog/2015/09/16/reddit-impala/)
Data
Data
The frontpage of the Internet
Data
I have every publicly available Reddit comment for research. ~ 1.7 billion comments @ 250 GB
compressed. Any interest in this?
(https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/)
Data
Is available on S3: s3://blaze-data/reddit/json
New monthly dumps at: http://pan.whatbox.ca:36975/reddit/comments/monthly/
(http://pan.whatbox.ca:36975/reddit/comments/monthly/)
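To peek at the dump before launching a cluster, you can list the bucket straight from Python. A minimal sketch with boto3, assuming the blaze-data bucket allows anonymous reads:
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client (assumption: the bucket is publicly listable)
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket='blaze-data', Prefix='reddit/json/', MaxKeys=10)
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])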
Clusters
Clusters
Big Data technologies (Hadoop zoo) are here to stay
Really good management tools for IT/DevOps people from Cloudera and Hortonworks
Automated deployment and configuration
Customizable monitoring and reporting
Effortless, robust troubleshooting
Zero downtime maintenance: rolling upgrades and rollbacks
Security: Kerberos, LDAP
Clusters
Data science "friendly" clusters
Some of the features before plus:
Data analysis packages and environment management
Interactive access to the cluster (Jupyter Notebook)
Short-lived clusters (?)
CLI instead of UI (?)
More freedom (?)
Still requires knowing what you are doing: AWS, key pairs, security groups, SSH.
No need to hide stuff. No magic.
Clusters: Anaconda Cluster
Resource management tool that allows users to easily create, provision, and manage
bare-metal or cloud-based clusters.
It enables management of Conda environments on clusters
Provides integration, configuration, and setup management for Hadoop
Supported platforms include Amazon Web Services, physical machines, or even a
collection of virtual machines.
http://docs.continuum.io/anaconda-cluster (http://docs.continuum.io/anaconda-cluster)
$ conda install anaconda-client
$ anaconda login
$ conda install anaconda-cluster -c anaconda-cluster
Not open source: 4 free nodes
Soon: 16 free nodes in the cloud, 4 in-house
Clusters: Anaconda Cluster
Provider:
aws:
  cloud_provider: ec2
  keyname: {{ keyname in aws }}
  location: us-east-1
  private_key: ~/.ssh/{{ keyname in aws }}.pem
  secret_id: {{ aws key }}
  secret_key: {{ aws secret }}
Profile:
name: impala-profile
provider: aws
user: ubuntu
num_nodes: 10
node_id: ami-08faa660  # Ubuntu 12.04
node_type: m3.2xlarge
root_size: 1000
plugins:
- hdfs:
    namenode_dirs:
    - /data/dfs/nn
    datanode_dirs:
    - /data/dfs/dn
- hive
- impala
- notebook
Launch cluster:
$ acluster create impala-cluster -p impala-profile
Clusters: Anaconda Cluster
acluster
$ acluster create name -p profile
$ acluster destroy
$ acluster ssh
$ acluster cmd 'date'
$ acluster cmd 'apt-get install build-essential' --sudo
$ acluster conda install numpy
$ acluster conda install my_pkg -c channel
$ acluster submit script.py
$ acluster put script.py /tmp/script.py
$ acluster get /tmp/script.py script.py
Clusters: Anaconda Cluster
acluster install
$ acluster install hdfs
$ acluster install hive
$ acluster install impala
$ acluster install elasticsearch
$ acluster install kibana
$ acluster install logstash
$ acluster install notebook
$ acluster install spark-standalone
$ acluster install spark-yarn
$ acluster install storm
$ acluster install ganglia
Clusters: Anaconda Cluster
$ acluster conda install -c r r-essentials
Installing packages on cluster "impala": r-essentials
Node "ip-172-31-0-186.ec2.internal":
Successful actions: 1/1
Node "ip-172-31-0-190.ec2.internal":
Successful actions: 1/1
Node "ip-172-31-0-182.ec2.internal":
Successful actions: 1/1
Node "ip-172-31-0-189.ec2.internal":
Successful actions: 1/1
Node "ip-172-31-0-191.ec2.internal":
Successful actions: 1/1
Node "ip-172-31-0-183.ec2.internal":
Successful actions: 1/1
Node "ip-172-31-0-184.ec2.internal":
Successful actions: 1/1
Node "ip-172-31-0-187.ec2.internal":
Successful actions: 1/1
Node "ip-172-31-0-185.ec2.internal":
Successful actions: 1/1
Node "ip-172-31-0-188.ec2.internal":
Successful actions: 1/1
Clusters: DataScienceBox
Pre-Anaconda Cluster: a command line utility to create instances in the cloud ready for data
science. Includes conda package management plus some Big Data frameworks (Spark).
Code at: https://github.com/danielfrg/datasciencebox (https://github.com/danielfrg/datasciencebox)
$ pip install datasciencebox
The CLI is then available under two names:
$ datasciencebox
$ dsb
$ dsb up
$ dsb install miniconda
$ dsb install conda numpy
$ dsb install notebook
$ dsb install hdfs
$ dsb install spark
$ dsb install impala
Clusters: Under the Hood
Both DSB and AC use a very similar approach: SSH for the basics, then Salt
Salt:
100% free and open source
Fast: ZMQ instead of SSH
Secure
Scalable to thousands of nodes
Declarative YAML language instead of bash scripts
A lot of free formulas online
https://github.com/saltstack/salt (https://github.com/saltstack/salt)
numpy-install:
  conda.installed:
    - name: numpy
    - require:
      - file: config-file
Data
Data: Moving the data
Move data from S3 to our HDFS cluster
hadoop distcp -Dfs.s3n.awsAccessKeyId={{ }} -Dfs.s3n.awsSecretAccessKey={{ }} s3n://blaze-data/reddit/json/*/*.json /user/ubuntu
Data: Parquet
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model or programming language.
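The columnar layout is what makes the scans later in the talk fast: a query that only touches ups or subreddit never has to read the large body column. A minimal local sketch of the idea with pandas (modern tooling, not part of the talk; the talk itself writes Parquet through Hive below):
import pandas as pd

df = pd.DataFrame({
    'subreddit': ['soccer', 'IAmA', 'soccer'],
    'ups': [10, 250, 3],
    'body': ['...', '...', '...'],
})
df.to_parquet('comments.parquet')  # needs pyarrow or fastparquet installed

# Columnar payoff: read back only the columns the query touches,
# skipping the body column entirely.
pd.read_parquet('comments.parquet', columns=['subreddit', 'ups'])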
Data: Parquet
[Figure: row-oriented vs. columnar storage layout, from http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html]
Data: Load
hive > CREATE TABLE reddit_json (
archived boolean,
author string,
author_flair_css_class string,
author_flair_text string,
body string,
controversiality int,
created_utc string,
distinguished string,
downs int,
edited boolean,
gilded int,
id string,
link_id string,
name string,
parent_id string,
removal_reason string,
retrieved_on timestamp,
score int,
score_hidden boolean,
subreddit string,
subreddit_id string,
ups int
)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ('paths'='archived,author,author_flair_css_class,author_flair_text,body,controversiality,created_utc,distinguished,downs,edited,gilded,id,link_id,name,parent_id,removal_reason,retrieved_on,score,score_hidden,subreddit,subreddit_id,ups');
hive > LOAD DATA INPATH '/user/ubuntu/*.json' INTO TABLE reddit_json;
Data: Transform
hive > CREATE TABLE reddit_parquet(
archived boolean,
author string,
author_flair_css_class string,
author_flair_text string,
body string,
controversiality int,
created_utc string,
distinguished string,
downs int,
edited boolean,
gilded int,
id string,
link_id string,
name string,
parent_id string,
removal_reason string,
retrieved_on timestamp,
score int,
score_hidden boolean,
subreddit string,
subreddit_id string,
ups int,
created_utc_t timestamp
)
STORED AS PARQUET;
hive > SET dfs.block.size=1g;
hive > INSERT OVERWRITE TABLE reddit_parquet SELECT *, cast(cast(created_utc as double) as timestamp) AS created_utc_t FROM reddit_json;
Querying
Querying
Using the regular hive/impala shell
impala > invalidate metadata;
impala > SELECT count(*) FROM reddit_parquet;
Query: select count(*) FROM reddit_parquet
+------------+
| count(*) |
+------------+
| 1830807828 |
+------------+
Fetched 1 row(s) in 4.88s
Querying with python
NumPy- and Pandas-like APIs that target not local files but other engines
SQL:
Postgres
Impala
Hive
Spark SQL
NoSQL:
Mongo DB
Still local files
Projects:
Blaze: github.com/blaze/blaze (https://github.com/blaze/blaze)
Ibis: github.com/cloudera/ibis (https://github.com/cloudera/ibis)
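The pitch in one snippet: the same expression works against a local file or against Impala. A sketch (the CSV file is hypothetical; the Impala URI is the one used below):
import blaze as bz

local = bz.Data('comments.csv')  # local file backend (hypothetical file)
remote = bz.Data('impala://54.209.0.148/default::reddit_parquet')  # Impala backend

# Identical expression; Blaze compiles it per backend:
local.ups.sum()
remote.ups.sum()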
Blaze
An interface to query data on different storage systems
Code at: https://github.com/blaze/blaze (https://github.com/blaze/blaze)
Blaze ecosystem: http://blaze.pydata.org (http://blaze.pydata.org)
In [1]: import blaze as bz
        import pandas as pd
In [2]: data = bz.Data('impala://54.209.0.148/default::reddit_parquet')
Blaze
Number of comments
In [4]: data.id.count()
Out[4]: 1830807828
In [5]: print(bz.compute(data.id.count()))
SELECT count(reddit_parquet.id) AS id_count
FROM reddit_parquet
Blaze
Total number of up votes
In [6]: n_up_votes = data.ups.sum()
In [7]: print(bz.compute(n_up_votes))
SELECT sum(reddit_parquet.ups) AS ups_sum
FROM reddit_parquet
In [8]: %time int(n_up_votes)
Out[8]:
CPU times: user 22.4 ms, sys: 6.84 ms, total: 29.2 ms
Wall time: 3.69 s
9696701385
Blaze
Counting the total number of posts in the /r/soccer subreddit
In [9]: n_posts_in_r_soccer = data[data.subreddit == 'soccer'].id.count()
In [10]: print(bz.compute(n_posts_in_r_soccer))
SELECT count(alias_1.id) AS id_count
FROM (SELECT reddit_parquet.id AS id
FROM reddit_parquet
WHERE reddit_parquet.subreddit = %(subreddit_1)s) AS alias_1
In [11]: %time int(n_posts_in_r_soccer)
Out[11]:
CPU times: user 28.6 ms, sys: 8.61 ms, total: 37.2 ms
Wall time: 5 s
13078620
Blaze
Counting the number of comments before a specific hour
In [12]: before_1pm = data.id[bz.hour(data.created_utc_t) < 13].count()
In [13]: print(bz.compute(before_1pm))
SELECT count(alias_3.id) AS id_count
FROM (SELECT reddit_parquet.id AS id
FROM reddit_parquet
WHERE EXTRACT(hour FROM reddit_parquet.created_utc_t) < %(param_1)s) AS alias_3
In [14]: %time int(before_1pm)
Out[14]:
CPU times: user 32.7 ms, sys: 9.88 ms, total: 42.6 ms
Wall time: 5.54 s
812870494
Blaze
Plotting the daily frequency of comments in the /r/IAmA subreddit
In [15]: iama = data[(data.subreddit == 'IAmA')]
In [16]: days = (bz.year(iama.created_utc_t) - 2007) * 365 + (bz.month(iama.created_utc_t) - 1) * 31 + bz.day(iama.created_utc_t)
In [17]: iama_with_day = bz.transform(iama, day=days)
In [18]: by_day = bz.by(iama_with_day.day, posts=iama_with_day.created_utc_t.count())
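Note the day expression is not a true day count: (year - 2007) * 365 + (month - 1) * 31 + day only needs to grow monotonically with time, so rows sharing a key share a calendar day, which is all the group-by needs. If the rows were local, the same bucketing would be one line of pandas (a sketch over a hypothetical local DataFrame df with the same columns):
daily = (
    df[df.subreddit == 'IAmA']
    .set_index('created_utc_t')
    .resample('D')   # true calendar-day buckets
    .size()
)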
Blaze
Plotting the daily frequency of comments in the /r/IAmA subreddit
Pandas
In [19]: by_day_result = bz.odo(by_day, pd.DataFrame)  # Actually triggers the computation
In [20]: by_day_result.head()
Out[20]: day posts
0 2405 16202
1 2978 2361
2 1418 5444
3 1874 8833
4 1257 4480
In [21]: by_day_result = by_day_result.sort_values(by=['day'])
In [22]: rng = pd.date_range('5/28/2009', periods=len(by_day_result), freq='D')
         by_day_result.index = rng
Blaze
Plotting the daily frequency of comments in the /r/IAmA subreddit
In [23]: from bokeh._legacy_charts import TimeSeries, output_notebook, show
In [24]: output_notebook()
BokehJS successfully loaded. (http://bokeh.pydata.org)
[Figure: time series of daily comment counts in /r/IAmA]
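bokeh._legacy_charts was the 2015-era API. Roughly the same plot with today's bokeh.plotting interface would be (a sketch, not the code from the talk):
from bokeh.plotting import figure, show

p = figure(x_axis_type='datetime', title='Daily comments in /r/IAmA')
p.line(by_day_result.index, by_day_result['posts'])
show(p)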
Ibis
Ibis is a new Python data analysis framework with the goal of enabling data scientists and data
engineers to be as productive working with big data as they are working with small and medium
data today. In doing so, we will enable Python to become a true first-class language for Apache
Hadoop, without compromises in functionality, usability, or performance.
Code at: github.com/cloudera/ibis (https://github.com/cloudera/ibis)
More info: http://www.ibis-project.org (http://www.ibis-project.org)
Ibis
Number of posts with more than 1k up votes
In [1]: import ibis
        from ibis.impala.compiler import to_sql
        import pandas as pd
In [2]: ibis.options.interactive = True
        ibis.options.sql.default_limit = 20000
In [3]: hdfs = ibis.hdfs_connect(host='52.91.39.64')
        con = ibis.impala.connect(host='54.208.255.126', hdfs_client=hdfs)
In [4]: data = con.table('reddit_parquet')
Ibis
In [5]: data.schema()
Out[5]: ibis.Schema {
archived boolean
author string
author_flair_css_class string
author_flair_text string
body string
controversiality int32
created_utc string
distinguished string
downs int32
edited boolean
gilded int32
id string
link_id string
name string
parent_id string
removal_reason string
retrieved_on timestamp
score int32
score_hidden boolean
subreddit string
subreddit_id string
ups int32
created_utc_t timestamp
}
Ibis
Number of posts with more than 1k up votes
In [6]: more_than_1k = data[data.ups >= 1000]
In [7]: month = (more_than_1k.created_utc_t.year() - 2007) * 12 + more_than_1k.created_utc_t.month()
        month = month.name('month')
In [8]: with_month = more_than_1k['id', month]
In [9]: posts = with_month.count()
        groups = with_month.aggregate([posts], by='month')
In [10]: month_df = groups.execute()
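The to_sql imported in In [1] never appears on the slides; presumably it is there to inspect the SQL Ibis generates before execute() ships it to Impala. A sketch (output shape depends on the Ibis version):
print(to_sql(groups))  # show the Impala SQL compiled for the aggregation above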
Ibis
Number of posts with more than 1k up votes
Pandas
In [11]: month_df = month_df.set_index('month')
         month_df.sort_index(inplace=True)
         rng = pd.date_range('10/01/2007', periods=len(month_df), freq='M')
         month_df.index = rng
In [12]: from bokeh._legacy_charts import TimeSeries, output_notebook, show
         output_notebook()
BokehJS successfully loaded. (http://bokeh.pydata.org)
[Figure: monthly counts of posts with more than 1k up votes]
Spark?
Data is on HDFS in Parquet, Spark is happy with that
In [ ]: from pyspark.sql import SQLContext
        sqlContext = SQLContext(sc)
In [ ]: # Read in the Parquet file
        parquetFile = sqlContext.read.parquet("people.parquet")
        # Parquet files can also be registered as tables and then used in SQL statements
        parquetFile.registerTempTable("parquetFile")
        teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
        # These are Spark DataFrames so you can do other stuff like map
        teenNames = teenagers.map(lambda p: "Name: " + p.name)
        for teenName in teenNames.collect():
            print(teenName)
UDFs
Taken from: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
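Adapting the documentation example to the talk's table would look roughly like this (a sketch; the warehouse path is an assumption about where Hive stored reddit_parquet):
In [ ]: # Hypothetical location: Hive's default warehouse directory for the table
        reddit = sqlContext.read.parquet('/user/hive/warehouse/reddit_parquet')
        reddit.registerTempTable('reddit')
        top = sqlContext.sql("SELECT subreddit, count(*) AS n FROM reddit "
                             "GROUP BY subreddit ORDER BY n DESC LIMIT 10")
        top.show()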
Wrap up
We can make data scientists' access to big data tools easier
Data scientists still need to understand the underlying big data and DevOps tools to some
degree
Some of these tools are very useful for SQL-type queries (BI, Tableau), but for more
advanced analytics or ML other tools are needed
Download Anaconda Cluster (or DataScienceBox) and try it for yourself
Ideas:
Thanks