SlideShare una empresa de Scribd logo
1 de 50
Descargar para leer sin conexión
1
2
About myself
▪ DigitalPebble Ltd, Bristol
▪ Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Machine Learning
– Search
▪ Strong focus on Open Source & Apache ecosystem
– StormCrawler, Apache Nutch, Crawler-Commons
– GATE, Apache UIMA
– Behemoth
– Apache SOLR, Elasticsearch
Julien Nioche
julien@digitalpebble.com
@digitalpebble
@stormcrawlerapi
3
https://www.jasondavies.com/wordcloud/
4
Wish list for (yet) another web crawler
● Scalable
● Low latency
● Efficient yet polite
● Nice to use and extend
● Versatile : use for archiving, search or scraping
● Java
… and not reinvent a whole framework for distributed processing
=> Stream Processing Frameworks
5
6
History
Late 2010 : started by Nathan Marz
September 2011 : open sourced by Twitter
2014 : graduated from incubation and became Apache top-level project
http://storm.apache.org/
0.10.x
1.x => Distributed Cache , Nimbus HA and Pacemaker, new package names
2.x => Clojure to Java
Current stable version: 1.0.2
7
Main concepts
Topologies
Streams
Tuples
Spouts
Bolts
8
Architecture
http://storm.apache.org/releases/1.0.2/Daemon-Fault-Tolerance.html
9
Parallelism
Workers (JVM managed by supervisor) != Executors (threads) != Tasks (instances)
# executors <= #tasks
Rebalance live topology => change # workers or # executors
Number of tasks in a topology is static
http://storm.apache.org/releases/1.0.2/Understanding-the-parallelism-of-a-Storm-topology.html
10
Worker tasks
http://storm.apache.org/releases/1.0.2/Understanding-the-parallelism-of-a-Storm-topology.html
11
Stream Grouping
Define how to streams connect bolts and how tuples are partitioned among tasks
Built-in stream groupings in Storm (implement CustomStreamGrouping) :
1. Shuffle grouping
2. Fields grouping
3. Partial Key grouping
4. All grouping
5. Global grouping
6. None grouping
7. Direct grouping
8. Local or shuffle grouping
12
In code
From http://storm.apache.org/releases/current/Tutorial.html
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new RandomSentenceSpout(), 2);
builder.setBolt("split", new SplitSentence(), 4).setNumTasks(8)
.shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 12)
.fieldsGrouping("split", new Fields("word"));
13
Run a topology
● Build topology class using TopologyBuilder
● Build über-jar with code and resources
● storm script to interact with Nimbus
○ storm jar topology-jar-path class …
● Uses config in ~/.storm, can also pass individual config elements on command line
● Local mode? Depends how you coded it in topo class
● Will run until kill with storm kill
● Easier ways of doing as we’ll see later
14
Guaranteed message processing
▪ Spout/Bolt => Anchoring and acking
outputCollector.emit(tuple, new Values(order));
outputCollector.ack(tuple);
▪ Spouts have ack / fail methods
– manually failed in bolt (explicitly or implicitly e.g. extensions of BaseBasicBolt)
– triggered by timeout (Config.TOPOLOGY _ MESSAGE _ TIMEOUT _ SECS - default 30s)
▪ Replay logic in spout
– depends on datasource and use case
▪ Optional
– Config.TOPOLOGY_ACKERS set to 0 (default 1 per worker)
See http://storm.apache.org/releases/current/Guaranteeing-message-processing.html
15
Metrics
Nice in-built framework for metrics
Pluggable => default file based one (IMetricsConsumer)
org.apache.storm.metric.LoggingMetricsConsumer
Define metrics in code
this.eventCounter = context.registerMetric("SolrIndexerBolt",
new MultiCountMetric(), 10);
...
context.registerMetric("queue_size", new IMetric() {
@Override
public Object getValueAndReset() {
return queue.size();
}
}, 10);
16
UI
17
Logs
● One log file per worker/machine
● Change log levels via config or UI
● Activate log service to view from UI
○ Regex for search
● Metrics file there as well (if activated)
18
Things I have not mentioned
● Trident : Micro batching, comparable to Spark Streaming
● Windowing
● State Management
● DRPC
● Loads more ...
-> find it all on http://storm.apache.org/
19
20
Storm Crawler
http://stormcrawler.net/
● Released 0.1 in Sept 2014
● Version : 1.1.1
● Release every 2 months on average
● Apache License v2
● Artefacts available from Maven Central
● Built with Apache Maven
21
Collection of resources
● Core
● External
○ Cloudsearch
○ Elasticsearch
○ SOLR
○ SQL
○ Tika
● Completely external i.e. not published as part of SC
○ WARC => https://github.com/DigitalPebble/sc-warc
○ MicroData parser =>
https://github.com/PopSugar/storm-crawler-extensions/tree/master/microdata-par
ser
○ ...
22
23
Step #1 : Fetching
FetcherBolt
Input : <String, Metadata>
Output : <String, byte[], Metadata>
TopologyBuilder builder = new TopologyBuilder();
…
// we have a spout generating String, Metadata tuples!
builder.setBolt("fetch", new FetcherBolt(),3)
.shuffleGrouping("spout");
S
24
Problem #1 : politeness
● Respects robots.txt directives (http://www.robotstxt.org/)
○ Can I fetch this URL? How frequently can I send requests to the server
(crawl-delay)?
● Get and cache the Robots directives
● Then throttle next call if needed
○ fetchInterval.default
Politeness sorted?
No - still have a problem : spout shuffles URLs from same host to 3 different tasks !?!?
25
URLPartitionerBolt
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("url", "key", "metadata"));
}
Partitions URLs by host / domain or IP
partition.url.mode: “byHost” | “byDomain” | “byIP”
builder.setBolt("partitioner", new URLPartitionerBolt())
.shuffleGrouping("spout");
builder.setBolt("fetch", new FetcherBolt(),3)
.fieldsGrouping("partitioner", new Fields("key"));
S P
F
F
F
key
key
key
26
Fetcher Bolts
SimpleFetcherBolt
● Fetch within execute method
● Waits if not enough time since previous call to same host / domain / IP
● Incoming tuples kept in Storm queues i.e. outside bolt instance
FetcherBolt
● Puts incoming tuples into internal queues
● Handles pool of threads : poll from queues
○ fetcher.threads.number
● If not enough time since previous call : thread moves to another queue
27
Next : Parsing
● Extract text and metadata (e.g. for indexing)
○ Calls ParseFilter(s) on document
○ Enrich content of Metadata
● ParseFilter abstract class
public abstract void filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse);
● Out-of-the-box:
○ ContentFilter
○ DebugParseFilter
○ LinkParseFilter : //IMG[@src]
○ XPathFilter
● Configured via JSON file
28
ParseFilter Config => resources/parsefilter.json
{
"com.digitalpebble.stormcrawler.parse.ParseFilters": [
{
"class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
"name": "XPathFilter",
"params": {
"canonical": "//*[@rel="canonical"]/@href",
"parse.description": [
"//*[@name="description"]/@content",
"//*[@name="Description"]/@content"
],
"parse.title": "//META[@name="title"]/@content",
"parse.keywords": "//META[@name="keywords"]/@content"
}
},
{
"class": "com.digitalpebble.stormcrawler.parse.filter.ContentFilter",
"name": "ContentFilter",
"params": {
"pattern": "//DIV[@id="maincontent"]",
"pattern2": "//DIV[@itemprop="articleBody"]"
}
}
]
}
29
Basic topology in code
builder.setBolt("partitioner", new URLPartitionerBolt())
.shuffleGrouping("spout");
builder.setBolt("fetch", new FetcherBolt(),3)
.fieldsGrouping("partitioner", new Fields("key"));
builder.setBolt("parse", new JSoupParserBolt())
.localOrShuffleGrouping("fetch");
builder.setBolt("index", new StdOutIndexer())
.localOrShuffleGrouping("parse");
localOrShuffleGrouping : stay within same worker process => faster de/serialization + less traffic
across nodes (byte[] content is heavy!)
30
Summary : a basic topology
31
Frontier expansion
32
Outlinks
● Done by ParserBolt
● Calls URLFilter(s) on outlinks to normalize and / or blacklists URLs
● URLFilter interface
public String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter);
● Configured via JSON file
● Loads of filters available out of the box
○ Depth, metadata, regex, host, robots ...
33
Status Stream
34
Status Stream
// can also use MemoryStatusUpdater for simple recursive crawls
builder.setBolt("status", new StdOutStatusUpdater())
.localOrShuffleGrouping("fetch", Constants.StatusStreamName)
.localOrShuffleGrouping("parse", Constants.StatusStreamName)
.localOrShuffleGrouping("index", Constants.StatusStreamName);
Implementations should extend AbstractStatusUpdaterBolt.java
● Provides an internal cache (for DISCOVERED)
● Calls code to determine which metadata should be persisted to storage
● Calls code to handle scheduling
● Focus on implementing store(String url, Status status, Metadata metadata, Date nextFetch)
public enum Status {
DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;
}
35
▪ Depends on your scenario :
– Don’t follow outlinks?
– Recursive crawls? i.e can I get to the same URL in more than one
way?
– Refetch URLs? When?
▪ queues, queues + cache, RDB, key value stores, search
▪ Depends on rest of your architecture
– SC does not force you into using one particular tool
Which backend?
36
Make your life easier #1 : ConfigurableTopology
● Takes YAML config for your topology (no more ~/.storm)
● Local or deployed via -local arg (no more coding)
● Auto-kill with -ttl arg (e.g. injection of seeds in status backend)
● Loads crawler-default.yaml into active conf
○ Just need to override with your YAML
● Registers the custom serialization for Metadata class
37
public class CrawlTopology extends ConfigurableTopology {
public static void main(String[] args) throws Exception {
ConfigurableTopology.start(new CrawlTopology(), args);
}
@Override
protected int run(String[] args) {
TopologyBuilder builder = new TopologyBuilder();
// Just declare your spouts and bolts as usual
return submit("crawl", conf, builder);
}
}
Build then run it with
storm jar target/${artifactId}-${version}.jar ${package}.CrawlTopology -conf crawler-conf.yaml -local
ConfigurableTopology
38
Make your life easier #2 : Archetype
Got a working topology class but :
● Still have to write a pom file with the dependencies on Storm and SC
● Still need to include a basic set of resources
○ URLFilters, ParseFilters
● As well as a simple config file
Just use the archetype instead!
mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler
-DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.1.1
And modify the bits you need
39
40
Make your life easier #3 : Flux
http://storm.apache.org/releases/1.0.2/flux.html
Define topology and config via a YAML file
storm jar mytopology.jar org.apache.storm.flux.Flux --local config.yaml
Overlaps partly with ConfigurableTopology
Archetype provides a Flux equivalent to the example Topology
Share the same config file
For more complex cases probably easier to use ConfigurableTopology
41
External modules and repositories
▪ Loads of useful things in core
=> sitemap, feed parser bolt, …
▪ External
– Cloudsearch (indexer)
– Elasticsearch (status + metrics + indexer)
– SOLR (status + metrics + indexer)
– SQL (status)
– Tika (parser)
42
▪ Spout(s) / StatusUpdater bolt
– Info about URLs in ‘status’ index
– Shards by domain/host/IP
– Throttle by domain/host/IP
▪ Indexing Bolt
– Docs fetched sent to ‘index’ index for search
▪ Metrics
– DataPoints sent to ‘metrics’ index for monitoring the crawl
– Use with Kibana for monitoring
43
Use cases
44
● Streaming URLs
● Queue in front of topology
● Content stored in HBase
45
“Find images taken with camera X since Y”
● URLs of images (with source page)
● Various sources : browser plugin, custom logic for specific sites
● Status + metrics with ES
● EXIF and other fields in bespoke index
● Content of images cached on AWS S3 (>35TB)
● Custom bolts for variations of topology : e.g. image tagging with Python bolt
○ using Tensorflow
46
47
News crawl
7000 RSS feeds <= 1K news sites
50K articles per day
Content saved as WARC files on Amazon S3
Code and data publicly available
Elasticsearch status + metrics
WARC module => https://github.com/DigitalPebble/sc-warc
48
Next?
● Selenium-based protocol implementation
○ Ajax pages?
● Move WARC resources to main repo?
49
http://storm.apache.org
Storm Applied (Manning)
http://www.manning.com/sallen/
http://stormcrawler.net
https://github.com/DigitalPebble/storm-crawler/wiki
References
50

Más contenido relacionado

La actualidad más candente

Nutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitNutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitabial
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSSaumitra Srivastav
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Chris Mattmann
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaKnoldus Inc.
 
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Ben Busby
 
Profile hadoop apps
Profile hadoop appsProfile hadoop apps
Profile hadoop appsBasant Verma
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst AgainVarun Thacker
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiSatoshi Nagayasu
 
Elasticsearch presentation 1
Elasticsearch presentation 1Elasticsearch presentation 1
Elasticsearch presentation 1Maruf Hassan
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Oleksiy Panchenko
 
Search-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, EtsySearch-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, EtsyLucidworks
 
Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek
Cloudera - Using morphlines for on the-fly ETL by Wolfgang HoschekCloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek
Cloudera - Using morphlines for on the-fly ETL by Wolfgang HoschekHakka Labs
 
Introducing JDBC for SPARQL
Introducing JDBC for SPARQLIntroducing JDBC for SPARQL
Introducing JDBC for SPARQLRob Vesse
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
Logstash family introduction
Logstash family introductionLogstash family introduction
Logstash family introductionOwen Wu
 
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.Airat Khisamov
 

La actualidad más candente (20)

Nutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitNutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkit
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFS
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
 
Profile hadoop apps
Profile hadoop appsProfile hadoop apps
Profile hadoop apps
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst Again
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
 
Elasticsearch presentation 1
Elasticsearch presentation 1Elasticsearch presentation 1
Elasticsearch presentation 1
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
 
Search-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, EtsySearch-time Parallelism: Presented by Shikhar Bhushan, Etsy
Search-time Parallelism: Presented by Shikhar Bhushan, Etsy
 
Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek
Cloudera - Using morphlines for on the-fly ETL by Wolfgang HoschekCloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek
Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek
 
Introducing JDBC for SPARQL
Introducing JDBC for SPARQLIntroducing JDBC for SPARQL
Introducing JDBC for SPARQL
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Logstash family introduction
Logstash family introductionLogstash family introduction
Logstash family introduction
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.
 

Destacado

ярош 2 варіант
ярош 2 варіантярош 2 варіант
ярош 2 варіантyarosalyona
 
Chicago Daily Law Bulletin - Two years of continuous employment rule not as
Chicago Daily Law Bulletin - Two years of continuous employment rule not as Chicago Daily Law Bulletin - Two years of continuous employment rule not as
Chicago Daily Law Bulletin - Two years of continuous employment rule not as Paul Porvaznik
 
Intimate Partner Violence Prevention_Program Plan
Intimate Partner Violence Prevention_Program PlanIntimate Partner Violence Prevention_Program Plan
Intimate Partner Violence Prevention_Program PlanTara DeMaderios
 
Social engine development company
Social engine development companySocial engine development company
Social engine development companySocialengine India
 
Google Analytics for Admissions
Google Analytics for AdmissionsGoogle Analytics for Admissions
Google Analytics for AdmissionsJustina Gaddy
 
Baby & Kids Volume 1 - Vector Graphic Artworks
Baby & Kids Volume 1 - Vector Graphic ArtworksBaby & Kids Volume 1 - Vector Graphic Artworks
Baby & Kids Volume 1 - Vector Graphic ArtworksTZipp
 
Commercial%20Banking,%20Collections,%20and%20Bankruptcy%20December%202013
Commercial%20Banking,%20Collections,%20and%20Bankruptcy%20December%202013Commercial%20Banking,%20Collections,%20and%20Bankruptcy%20December%202013
Commercial%20Banking,%20Collections,%20and%20Bankruptcy%20December%202013Paul Porvaznik
 
Ali CV Formate 1
Ali CV Formate 1Ali CV Formate 1
Ali CV Formate 1Ali Nawaz
 
Will CSK and RR return in 2018?
Will CSK and RR return in 2018?Will CSK and RR return in 2018?
Will CSK and RR return in 2018?Bhoomi Patel
 
JournalofPrecisionMedicine_May_June2016
JournalofPrecisionMedicine_May_June2016JournalofPrecisionMedicine_May_June2016
JournalofPrecisionMedicine_May_June2016Franziska Moeckel, MBA
 
Prokochuk_Irina_architectural magazine Sporuda_аdvertising campaign
Prokochuk_Irina_architectural magazine Sporuda_аdvertising campaignProkochuk_Irina_architectural magazine Sporuda_аdvertising campaign
Prokochuk_Irina_architectural magazine Sporuda_аdvertising campaignIra Prokopchuk
 
USI 2009 - Du RIA pour SI
USI 2009 - Du RIA pour SIUSI 2009 - Du RIA pour SI
USI 2009 - Du RIA pour SIDjamel Zouaoui
 
Microsoft Tech days 2007 - Industrialisation des développements : Retours d'e...
Microsoft Tech days 2007 - Industrialisation des développements : Retours d'e...Microsoft Tech days 2007 - Industrialisation des développements : Retours d'e...
Microsoft Tech days 2007 - Industrialisation des développements : Retours d'e...Djamel Zouaoui
 
Usi 2013 - NoSql les defis à relever
Usi 2013 -  NoSql les defis à releverUsi 2013 -  NoSql les defis à relever
Usi 2013 - NoSql les defis à releverDjamel Zouaoui
 
USI Casablanca 2010 - Industrialisation et intégration continue
USI Casablanca 2010 - Industrialisation et intégration continueUSI Casablanca 2010 - Industrialisation et intégration continue
USI Casablanca 2010 - Industrialisation et intégration continueDjamel Zouaoui
 

Destacado (16)

ярош 2 варіант
ярош 2 варіантярош 2 варіант
ярош 2 варіант
 
Chicago Daily Law Bulletin - Two years of continuous employment rule not as
Chicago Daily Law Bulletin - Two years of continuous employment rule not as Chicago Daily Law Bulletin - Two years of continuous employment rule not as
Chicago Daily Law Bulletin - Two years of continuous employment rule not as
 
Intimate Partner Violence Prevention_Program Plan
Intimate Partner Violence Prevention_Program PlanIntimate Partner Violence Prevention_Program Plan
Intimate Partner Violence Prevention_Program Plan
 
eGrove Systems - "SOLR" An Apache Product
eGrove Systems - "SOLR" An Apache ProducteGrove Systems - "SOLR" An Apache Product
eGrove Systems - "SOLR" An Apache Product
 
Social engine development company
Social engine development companySocial engine development company
Social engine development company
 
Google Analytics for Admissions
Google Analytics for AdmissionsGoogle Analytics for Admissions
Google Analytics for Admissions
 
Baby & Kids Volume 1 - Vector Graphic Artworks
Baby & Kids Volume 1 - Vector Graphic ArtworksBaby & Kids Volume 1 - Vector Graphic Artworks
Baby & Kids Volume 1 - Vector Graphic Artworks
 
Commercial%20Banking,%20Collections,%20and%20Bankruptcy%20December%202013
Commercial%20Banking,%20Collections,%20and%20Bankruptcy%20December%202013Commercial%20Banking,%20Collections,%20and%20Bankruptcy%20December%202013
Commercial%20Banking,%20Collections,%20and%20Bankruptcy%20December%202013
 
Ali CV Formate 1
Ali CV Formate 1Ali CV Formate 1
Ali CV Formate 1
 
Will CSK and RR return in 2018?
Will CSK and RR return in 2018?Will CSK and RR return in 2018?
Will CSK and RR return in 2018?
 
JournalofPrecisionMedicine_May_June2016
JournalofPrecisionMedicine_May_June2016JournalofPrecisionMedicine_May_June2016
JournalofPrecisionMedicine_May_June2016
 
Prokochuk_Irina_architectural magazine Sporuda_аdvertising campaign
Prokochuk_Irina_architectural magazine Sporuda_аdvertising campaignProkochuk_Irina_architectural magazine Sporuda_аdvertising campaign
Prokochuk_Irina_architectural magazine Sporuda_аdvertising campaign
 
USI 2009 - Du RIA pour SI
USI 2009 - Du RIA pour SIUSI 2009 - Du RIA pour SI
USI 2009 - Du RIA pour SI
 
Microsoft Tech days 2007 - Industrialisation des développements : Retours d'e...
Microsoft Tech days 2007 - Industrialisation des développements : Retours d'e...Microsoft Tech days 2007 - Industrialisation des développements : Retours d'e...
Microsoft Tech days 2007 - Industrialisation des développements : Retours d'e...
 
Usi 2013 - NoSql les defis à relever
Usi 2013 -  NoSql les defis à releverUsi 2013 -  NoSql les defis à relever
Usi 2013 - NoSql les defis à relever
 
USI Casablanca 2010 - Industrialisation et intégration continue
USI Casablanca 2010 - Industrialisation et intégration continueUSI Casablanca 2010 - Industrialisation et intégration continue
USI Casablanca 2010 - Industrialisation et intégration continue
 

Similar a StormCrawler at Bristech

Attack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and KibanaAttack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and KibanaPrajal Kulkarni
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormJulien Nioche
 
Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012mumrah
 
Clug 2012 March web server optimisation
Clug 2012 March   web server optimisationClug 2012 March   web server optimisation
Clug 2012 March web server optimisationgrooverdan
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheusBob Cotton
 
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki LinnakangasPG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangaspgdayrussia
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014Bryan Bende
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsthelabdude
 
Load testing in Zonky with Gatling
Load testing in Zonky with GatlingLoad testing in Zonky with Gatling
Load testing in Zonky with GatlingPetr Vlček
 
introduction to node.js
introduction to node.jsintroduction to node.js
introduction to node.jsorkaplan
 
cache2k, Java Caching, Turbo Charged, FOSDEM 2015
cache2k, Java Caching, Turbo Charged, FOSDEM 2015cache2k, Java Caching, Turbo Charged, FOSDEM 2015
cache2k, Java Caching, Turbo Charged, FOSDEM 2015cruftex
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachAlexandre Rafalovitch
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Lucidworks
 
Introduction to Apache Mesos
Introduction to Apache MesosIntroduction to Apache Mesos
Introduction to Apache MesosJoe Stein
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
"Swoole: double troubles in c", Alexandr Vronskiy
"Swoole: double troubles in c", Alexandr Vronskiy"Swoole: double troubles in c", Alexandr Vronskiy
"Swoole: double troubles in c", Alexandr VronskiyFwdays
 

Similar a StormCrawler at Bristech (20)

The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Attack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and KibanaAttack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and Kibana
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012
 
Clug 2012 March web server optimisation
Clug 2012 March   web server optimisationClug 2012 March   web server optimisation
Clug 2012 March web server optimisation
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheus
 
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki LinnakangasPG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Load testing in Zonky with Gatling
Load testing in Zonky with GatlingLoad testing in Zonky with Gatling
Load testing in Zonky with Gatling
 
introduction to node.js
introduction to node.jsintroduction to node.js
introduction to node.js
 
cache2k, Java Caching, Turbo Charged, FOSDEM 2015
cache2k, Java Caching, Turbo Charged, FOSDEM 2015cache2k, Java Caching, Turbo Charged, FOSDEM 2015
cache2k, Java Caching, Turbo Charged, FOSDEM 2015
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
 
Introduction to Apache Mesos
Introduction to Apache MesosIntroduction to Apache Mesos
Introduction to Apache Mesos
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
"Swoole: double troubles in c", Alexandr Vronskiy
"Swoole: double troubles in c", Alexandr Vronskiy"Swoole: double troubles in c", Alexandr Vronskiy
"Swoole: double troubles in c", Alexandr Vronskiy
 

Último

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfayushiqss
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfryanfarris8
 

Último (20)

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 

StormCrawler at Bristech

  • 1. 1
  • 2. 2 About myself ▪ DigitalPebble Ltd, Bristol ▪ Specialised in Text Engineering – Web Crawling – Natural Language Processing – Machine Learning – Search ▪ Strong focus on Open Source & Apache ecosystem – StormCrawler, Apache Nutch, Crawler-Commons – GATE, Apache UIMA – Behemoth – Apache SOLR, Elasticsearch Julien Nioche julien@digitalpebble.com @digitalpebble @stormcrawlerapi
  • 4. 4 Wish list for (yet) another web crawler ● Scalable ● Low latency ● Efficient yet polite ● Nice to use and extend ● Versatile : use for archiving, search or scraping ● Java … and not reinvent a whole framework for distributed processing => Stream Processing Frameworks
  • 5. 5
  • 6. 6 History Late 2010 : started by Nathan Marz September 2011 : open sourced by Twitter 2014 : graduated from incubation and became Apache top-level project http://storm.apache.org/ 0.10.x 1.x => Distributed Cache , Nimbus HA and Pacemaker, new package names 2.x => Clojure to Java Current stable version: 1.0.2
  • 9. 9 Parallelism Workers (JVM managed by supervisor) != Executors (threads) != Tasks (instances) # executors <= #tasks Rebalance live topology => change # workers or # executors Number of tasks in a topology is static http://storm.apache.org/releases/1.0.2/Understanding-the-parallelism-of-a-Storm-topology.html
  • 11. 11 Stream Grouping Define how to streams connect bolts and how tuples are partitioned among tasks Built-in stream groupings in Storm (implement CustomStreamGrouping) : 1. Shuffle grouping 2. Fields grouping 3. Partial Key grouping 4. All grouping 5. Global grouping 6. None grouping 7. Direct grouping 8. Local or shuffle grouping
  • 12. 12 In code From http://storm.apache.org/releases/current/Tutorial.html TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 2); builder.setBolt("split", new SplitSentence(), 4).setNumTasks(8) .shuffleGrouping("sentences"); builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));
  • 13. 13 Run a topology ● Build topology class using TopologyBuilder ● Build über-jar with code and resources ● storm script to interact with Nimbus ○ storm jar topology-jar-path class … ● Uses config in ~/.storm, can also pass individual config elements on command line ● Local mode? Depends how you coded it in topo class ● Will run until kill with storm kill ● Easier ways of doing as we’ll see later
  • 14. 14 Guaranteed message processing ▪ Spout/Bolt => Anchoring and acking outputCollector.emit(tuple, new Values(order)); outputCollector.ack(tuple); ▪ Spouts have ack / fail methods – manually failed in bolt (explicitly or implicitly e.g. extensions of BaseBasicBolt) – triggered by timeout (Config.TOPOLOGY _ MESSAGE _ TIMEOUT _ SECS - default 30s) ▪ Replay logic in spout – depends on datasource and use case ▪ Optional – Config.TOPOLOGY_ACKERS set to 0 (default 1 per worker) See http://storm.apache.org/releases/current/Guaranteeing-message-processing.html
  • 15. 15 Metrics Nice in-built framework for metrics Pluggable => default file based one (IMetricsConsumer) org.apache.storm.metric.LoggingMetricsConsumer Define metrics in code this.eventCounter = context.registerMetric("SolrIndexerBolt", new MultiCountMetric(), 10); ... context.registerMetric("queue_size", new IMetric() { @Override public Object getValueAndReset() { return queue.size(); } }, 10);
  • 16. 16 UI
  • 17. 17 Logs ● One log file per worker/machine ● Change log levels via config or UI ● Activate log service to view from UI ○ Regex for search ● Metrics file there as well (if activated)
  • 18. 18 Things I have not mentioned ● Trident : Micro batching, comparable to Spark Streaming ● Windowing ● State Management ● DRPC ● Loads more ... -> find it all on http://storm.apache.org/
  • 19. 19
  • 20. 20 Storm Crawler http://stormcrawler.net/ ● Released 0.1 in Sept 2014 ● Version : 1.1.1 ● Release every 2 months on average ● Apache License v2 ● Artefacts available from Maven Central ● Built with Apache Maven
  • 21. 21 Collection of resources ● Core ● External ○ Cloudsearch ○ Elasticsearch ○ SOLR ○ SQL ○ Tika ● Completely external i.e. not published as part of SC ○ WARC => https://github.com/DigitalPebble/sc-warc ○ MicroData parser => https://github.com/PopSugar/storm-crawler-extensions/tree/master/microdata-par ser ○ ...
  • 22. 22
  • 23. 23 Step #1 : Fetching FetcherBolt Input : <String, Metadata> Output : <String, byte[], Metadata> TopologyBuilder builder = new TopologyBuilder(); … // we have a spout generating String, Metadata tuples! builder.setBolt("fetch", new FetcherBolt(),3) .shuffleGrouping("spout"); S
  • 24. 24 Problem #1 : politeness ● Respects robots.txt directives (http://www.robotstxt.org/) ○ Can I fetch this URL? How frequently can I send requests to the server (crawl-delay)? ● Get and cache the Robots directives ● Then throttle next call if needed ○ fetchInterval.default Politeness sorted? No - still have a problem : spout shuffles URLs from same host to 3 different tasks !?!?
  • 25. 25 URLPartitionerBolt public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("url", "key", "metadata")); } Partitions URLs by host / domain or IP partition.url.mode: “byHost” | “byDomain” | “byIP” builder.setBolt("partitioner", new URLPartitionerBolt()) .shuffleGrouping("spout"); builder.setBolt("fetch", new FetcherBolt(),3) .fieldsGrouping("partitioner", new Fields("key")); S P F F F key key key
  • 26. 26 Fetcher Bolts SimpleFetcherBolt ● Fetch within execute method ● Waits if not enough time since previous call to same host / domain / IP ● Incoming tuples kept in Storm queues i.e. outside bolt instance FetcherBolt ● Puts incoming tuples into internal queues ● Handles pool of threads : poll from queues ○ fetcher.threads.number ● If not enough time since previous call : thread moves to another queue
  • 27. 27 Next : Parsing ● Extract text and metadata (e.g. for indexing) ○ Calls ParseFilter(s) on document ○ Enrich content of Metadata ● ParseFilter abstract class public abstract void filter(String URL, byte[] content, DocumentFragment doc, ParseResult parse); ● Out-of-the-box: ○ ContentFilter ○ DebugParseFilter ○ LinkParseFilter : //IMG[@src] ○ XPathFilter ● Configured via JSON file
  • 28. 28 ParseFilter Config => resources/parsefilter.json { "com.digitalpebble.stormcrawler.parse.ParseFilters": [ { "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter", "name": "XPathFilter", "params": { "canonical": "//*[@rel="canonical"]/@href", "parse.description": [ "//*[@name="description"]/@content", "//*[@name="Description"]/@content" ], "parse.title": "//META[@name="title"]/@content", "parse.keywords": "//META[@name="keywords"]/@content" } }, { "class": "com.digitalpebble.stormcrawler.parse.filter.ContentFilter", "name": "ContentFilter", "params": { "pattern": "//DIV[@id="maincontent"]", "pattern2": "//DIV[@itemprop="articleBody"]" } } ] }
  • 29. 29 Basic topology in code builder.setBolt("partitioner", new URLPartitionerBolt()) .shuffleGrouping("spout"); builder.setBolt("fetch", new FetcherBolt(),3) .fieldsGrouping("partitioner", new Fields("key")); builder.setBolt("parse", new JSoupParserBolt()) .localOrShuffleGrouping("fetch"); builder.setBolt("index", new StdOutIndexer()) .localOrShuffleGrouping("parse"); localOrShuffleGrouping : stay within same worker process => faster de/serialization + less traffic across nodes (byte[] content is heavy!)
  • 30. 30 Summary : a basic topology
  • 32. 32 Outlinks ● Done by ParserBolt ● Calls URLFilter(s) on outlinks to normalize and / or blacklists URLs ● URLFilter interface public String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter); ● Configured via JSON file ● Loads of filters available out of the box ○ Depth, metadata, regex, host, robots ...
  • 34. 34 Status Stream // can also use MemoryStatusUpdater for simple recursive crawls builder.setBolt("status", new StdOutStatusUpdater()) .localOrShuffleGrouping("fetch", Constants.StatusStreamName) .localOrShuffleGrouping("parse", Constants.StatusStreamName) .localOrShuffleGrouping("index", Constants.StatusStreamName); Implementations should extend AbstractStatusUpdaterBolt.java ● Provides an internal cache (for DISCOVERED) ● Calls code to determine which metadata should be persisted to storage ● Calls code to handle scheduling ● Focus on implementing store(String url, Status status, Metadata metadata, Date nextFetch) public enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR; }
  • 35. 35 ▪ Depends on your scenario : – Don’t follow outlinks? – Recursive crawls? i.e can I get to the same URL in more than one way? – Refetch URLs? When? ▪ queues, queues + cache, RDB, key value stores, search ▪ Depends on rest of your architecture – SC does not force you into using one particular tool Which backend?
  • 36. 36 Make your life easier #1 : ConfigurableTopology ● Takes YAML config for your topology (no more ~/.storm) ● Local or deployed via -local arg (no more coding) ● Auto-kill with -ttl arg (e.g. injection of seeds in status backend) ● Loads crawler-default.yaml into active conf ○ Just need to override with your YAML ● Registers the custom serialization for Metadata class
  • 37. 37 public class CrawlTopology extends ConfigurableTopology { public static void main(String[] args) throws Exception { ConfigurableTopology.start(new CrawlTopology(), args); } @Override protected int run(String[] args) { TopologyBuilder builder = new TopologyBuilder(); // Just declare your spouts and bolts as usual return submit("crawl", conf, builder); } } Build then run it with storm jar target/${artifactId}-${version}.jar ${package}.CrawlTopology -conf crawler-conf.yaml -local ConfigurableTopology
  • 38. 38 Make your life easier #2 : Archetype Got a working topology class but : ● Still have to write a pom file with the dependencies on Storm and SC ● Still need to include a basic set of resources ○ URLFilters, ParseFilters ● As well as a simple config file Just use the archetype instead! mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.1.1 And modify the bits you need
  • 39. 39
  • 40. 40 Make your life easier #3 : Flux http://storm.apache.org/releases/1.0.2/flux.html Define topology and config via a YAML file storm jar mytopology.jar org.apache.storm.flux.Flux --local config.yaml Overlaps partly with ConfigurableTopology Archetype provides a Flux equivalent to the example Topology Share the same config file For more complex cases probably easier to use ConfigurableTopology
  • 41. 41 External modules and repositories ▪ Loads of useful things in core => sitemap, feed parser bolt, … ▪ External – Cloudsearch (indexer) – Elasticsearch (status + metrics + indexer) – SOLR (status + metrics + indexer) – SQL (status) – Tika (parser)
  • 42. 42 ▪ Spout(s) / StatusUpdater bolt – Info about URLs in ‘status’ index – Shards by domain/host/IP – Throttle by domain/host/IP ▪ Indexing Bolt – Docs fetched sent to ‘index’ index for search ▪ Metrics – DataPoints sent to ‘metrics’ index for monitoring the crawl – Use with Kibana for monitoring
  • 44. 44 ● Streaming URLs ● Queue in front of topology ● Content stored in HBase
  • 45. 45 “Find images taken with camera X since Y” ● URLs of images (with source page) ● Various sources : browser plugin, custom logic for specific sites ● Status + metrics with ES ● EXIF and other fields in bespoke index ● Content of images cached on AWS S3 (>35TB) ● Custom bolts for variations of topology : e.g. image tagging with Python bolt ○ using Tensorflow
  • 46. 46
  • 47. 47 News crawl 7000 RSS feeds <= 1K news sites 50K articles per day Content saved as WARC files on Amazon S3 Code and data publicly available Elasticsearch status + metrics WARC module => https://github.com/DigitalPebble/sc-warc
  • 48. 48 Next? ● Selenium-based protocol implementation ○ Ajax pages? ● Move WARC resources to main repo?
  • 50. 50