3. EBAY MARKETPLACE AT A GLANCE
$19.6B GMV
in Q1 2016
9.5M
New listings added via
mobile per week
300M
Searches each day
63%
Transactions that ship
for free
(in US, UK, DE)
79%
Items sold as new
Q1 2016 data
~900M
Live listings
One of the world’s largest and most vibrant
marketplaces
4. Most Powerful
Selling Platform
For business sellers:
the potential to drive
profitable sales and
build a brand
For consumer sellers:
an easy way to
declutter, sell and
make money
A partnership, not a
competition
Best Choice
Providing the
greatest selection of
inventory for our
buyers
From new, everyday
items to rare and
unique goods
And incredible deals
only found on eBay
Most Relevance
A shopping experience
that is simple, data-driven
and personalized
Enabling buyers to
easily find, compare
and purchase items
they need and want
Highlighting the unique
value that eBay brings
OUR
STRATEGY
7. +200 Petabytes
of Consumer
Data and
growing…
Consumers
on 6
Continents
Millions of
Transactions
1000’s of
Product
Categories
Multiple
cookies across
dozens of
businesses
Actionable
search
insights
+ 9M payments
every day
+ 6K
Total
Payment
Volume per
second
Loyalty
Click
behavior
and
patterns
Device
IDs
100’s of millions
of
Email addresses
Bank
accounts
POS
Autos
Products
IP Address
8. +200 Petabytes
of Consumer
Data and
growing…
Consumers
on 6
Continents
Credit cards
1000’s of
Product
Categories
Multiple
cookies across
dozens of
businesses
Actionable
search
insights
+ 9M payments
every day
+ 6K
Total
Payment
Volume per
second
Pair of
shoes sold
every 2
seconds
Loyalty
Cell phone
sold every 4
seconds
Click
behavior
and
patterns
Device
IDs
100’s of millions
of
Email addresses
Bank
accounts
POS
A ladies’
handbag is
bought via
mobile every
12 seconds
Auto
Products
IP Address
COLLECT, ANALYZE, PREDICT
9. Big Data @ eBay
*Q3 2015 data
7
Hadoop Clusters*
800M
HDFS operations
(single cluster)*
120 PB
Data*
Hadoop @ eBay
11. Who is accessing the data?
What data are they accessing?
Is someone trying to access data that they don’t have access to?
Are there any anomalous access patterns?
Is there a security threat?
How do we monitor and get notified during, or even before, an anomalous event?
Motivation for Eagle
12. Apache Eagle
Apache Eagle: Monitor Hadoop in Real Time
Apache Eagle is an open-source monitoring platform for the Hadoop ecosystem,
which started with monitoring data activities in Hadoop. It can instantly
identify access to sensitive data, recognize attacks and malicious activities, and
block access in real time.
In conjunction with components such as Ranger, Sentry, Knox,
DgSecure and Splunk, Eagle provides a comprehensive solution for
securing sensitive data stored in Hadoop.
16. Data Classification - HDFS
•Browse HDFS file system
•Batch import sensitivity metadata through Eagle API
•Manually mark sensitivity in Eagle UI
17. Data Classification - Hive
•Browse Hive databases/tables/columns
•Batch import sensitivity metadata through Eagle API
•Manually mark sensitivity in Eagle UI
18. Define policy in UI and API
curl -u ${EAGLE_SERVICE_USER}:${EAGLE_SERVICE_PASSWD} -X POST -H 'Content-Type: application/json' \
"http://${EAGLE_SERVICE_HOST}:${EAGLE_SERVICE_PORT}/eagle-service/rest/entities?serviceName=AlertDefinitionService" \
-d '
[
  {
    "prefix": "alertdef",
    "tags": {
      "site": "sandbox",
      "application": "hadoopJmxMetricDataSource",
      "policyId": "capacityUsedPolicy",
      "alertExecutorId": "hadoopJmxMetricAlertExecutor",
      "policyType": "siddhiCEPEngine"
    },
    "description": "jmx metric",
    "policyDef": "{\"expression\":\"from hadoopJmxMetricEventStream[metric == \\\"hadoop.namenode.fsnamesystemstate.capacityused\\\" and convert(value, \\\"long\\\") > 0] select metric, host, value, timestamp, component, site insert into tmp;\",\"type\":\"siddhiCEPEngine\"}",
    "enabled": true,
    "dedupeDef": "{\"alertDedupIntervalMin\":10,\"emailDedupIntervalMin\":10}",
    "notificationDef": "[{\"sender\":\"eagle@apache.org\",\"recipients\":\"eagle@apache.org\",\"subject\":\"missing block found.\",\"flavor\":\"email\",\"id\":\"email_1\",\"tplFileName\":\"\"}]"
  }
]
'
1 Create policy using API 2 Create policy using UI
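The awkward quote escaping in the curl call comes from the Siddhi expression being embedded as a JSON string inside another JSON string. A minimal Python sketch of assembling the same AlertDefinitionService payload (field names taken from the curl example above; nothing is sent here, and the helper name is our own, not an Eagle API):

```python
import json

def build_policy(site, app, policy_id, executor, expression):
    """Assemble an AlertDefinitionService entity like the curl example.

    The Siddhi expression lives inside policyDef as a JSON *string*,
    which is exactly what forces the heavy escaping in raw curl;
    json.dumps handles it here.
    """
    policy_def = {"expression": expression, "type": "siddhiCEPEngine"}
    return [{
        "prefix": "alertdef",
        "tags": {
            "site": site,
            "application": app,
            "policyId": policy_id,
            "alertExecutorId": executor,
            "policyType": "siddhiCEPEngine",
        },
        "policyDef": json.dumps(policy_def),  # nested JSON-as-string
        "enabled": True,
    }]

payload = build_policy(
    "sandbox", "hadoopJmxMetricDataSource", "capacityUsedPolicy",
    "hadoopJmxMetricAlertExecutor",
    'from hadoopJmxMetricEventStream[convert(value, "long") > 0] '
    'select metric, host, value insert into tmp;')
# POST json.dumps(payload) to .../rest/entities?serviceName=AlertDefinitionService
body = json.dumps(payload)
```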
20. Policy Capabilities
1 Single event evaluation
• threshold checks with various conditions
2 Event window based evaluation
• various window semantics (time/length, sliding/batch windows)
• comprehensive aggregation support
3 Correlation across multiple event streams
• SQL-like joins
4 Pattern match and sequence
• a happens, followed by b
Powered by Siddhi 3.0.5; Eagle adds dynamic capabilities and an intuitive API/UI
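To make the window semantics concrete, here is a pure-Python sketch of what a time-sliding-window threshold policy evaluates (illustrative only; in Eagle this is expressed as a Siddhi query, not hand-written code):

```python
from collections import deque

def sliding_window_alerts(event_times, window_sec, threshold):
    """Alert whenever more than `threshold` events fall inside a sliding
    time window, roughly what a Siddhi policy such as
      from stream#window.time(60 sec) ... having count > N
    would evaluate. `event_times` is a sorted list of timestamps (sec)."""
    window = deque()
    alerts = []
    for ts in event_times:
        window.append(ts)
        # drop events that have fallen out of the window
        while window and window[0] <= ts - window_sec:
            window.popleft()
        if len(window) > threshold:
            alerts.append(ts)
    return alerts

# Four events within 5 seconds trips a threshold of 3:
print(sliding_window_alerts([0, 1, 2, 3, 10], window_sec=5, threshold=3))
```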
24. Eagle Service
As of 0.3.0, Eagle stores metadata and statistics in HBase, and supports Druid as a metric store.
1 Data to be stored
Metadata
• Policy
• Event schema
• Site/Application/UI features
Statistics
• # of events evaluated per second
• audit trail for policy changes
Raw data
• Druid for metrics
• HBase for M/R job/task data etc.
• ES for logs (future)
2 Storage
HBase
• Store metrics
• Store M/R job/task data
• Rowkey design for time-series data
• HBase Coprocessor
Druid
• Consume data from Kafka
3 API/UI
HBase
• filter, group-by, sort, top
Druid
• Druid query API
• Dashboard in Eagle
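The "rowkey design for time-series data" point refers to a common HBase pattern, sketched below as an assumption for illustration rather than Eagle's exact scheme: a fixed-width hash prefix spreads writes across regions, and a reversed timestamp makes the newest points of a series sort first in a scan.

```python
import hashlib
import struct

LONG_MAX = 2**63 - 1

def metric_rowkey(metric, host, ts_ms):
    """Illustrative HBase rowkey for a time-series point.

    8-byte hash prefix of the series identity avoids region hotspotting;
    big-endian (LONG_MAX - timestamp) orders rows newest-first, so a scan
    from the prefix returns recent data immediately.
    """
    prefix = hashlib.md5(f"{metric}/{host}".encode()).digest()[:8]
    return prefix + struct.pack(">q", LONG_MAX - ts_ms)
```

With this layout, all points of one series share an 8-byte prefix, and within the series byte order equals reverse time order.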
25. Highlights
1. Ease of use: after installation, users simply define rules
2. Comprehensive rules on high volumes of data: Eagle solves problems unique to Hadoop
3. Hot-deploy rules: rather than shipping a lot of charts, Eagle lets users write ad-hoc rules and hot-deploy them
4. Metadata driven: metadata includes policies, event schemas, UI components, etc.
5. Monolithic Storm topology: application pre-processing runs together with the alert engine
6. Extensibility: Eagle can’t succeed alone; it has to integrate with other systems, e.g. for data classification and policy enforcement
26. Alert Engine Limitations in Eagle 0.3
1 High cost of integration
• Coding required to onboard a new data source: even for a trivial source, you have to write and deploy a Storm topology
• Monolithic topology for pre-processing and alerting
2 Not multi-tenant
• Alert engine is embedded into the application
• Many separate Storm topologies (one topology even for one trivial data source)
3 Policy capability restricted by event partitioning
• Can’t do ad-hoc group-by policy expressions, for example switching from group-by user to group-by cmd: if traffic is partitioned by user, a policy can only express user-based group-by
4 Correlation is not declarative
• Coding required to correlate existing data sources; correlations over multiple metrics can’t be declared
5 Stateful policy evaluation
• Fail-over when a bolt is down: how do you replay a week of history data when a node goes down?
28. Extensibility
Sentry/Ranger
• As remediation engine
• As generic data source
DgSecure
• Source of truth for data classification
Splunk
• Syslog-format output
• Eagle alert output is the first abstraction of analytics; Splunk is the second
29. USER PROFILE ALGORITHMS…
Eigenvalue Decomposition (EVD)
• Compute mean and variance
• Compute eigenvectors and determine principal components
• Normal data points lie near the first few principal components
• Abnormal data points lie farther from the first few principal components and closer to the later ones
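The EVD steps above can be sketched with NumPy; this is a minimal illustration of the idea, not Eagle's implementation:

```python
import numpy as np

def fit_profile(X, k):
    """Learn a user profile: the mean plus the top-k principal components
    of the user's activity matrix X (rows = observations)."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)       # variance structure
    vals, vecs = np.linalg.eigh(cov)         # eigenvalue decomposition
    order = np.argsort(vals)[::-1]           # largest eigenvalues first
    return mu, vecs[:, order[:k]]

def anomaly_score(x, mu, components):
    """Distance from the subspace of the first k principal components.

    Normal points project almost entirely onto those components, so the
    residual is small; abnormal points leave a large residual.
    """
    centered = x - mu
    projected = components @ (components.T @ centered)
    return float(np.linalg.norm(centered - projected))
```

A profile fitted on activity that mostly varies along one direction will score points off that direction much higher than points on it, which is the policy signal derived from the profile.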
31. Eagle Next Releases
Eagle 0.4
• Improve user experience
• Remote start of Storm topologies
• Metadata stored in RDBMS
Eagle 0.5
• Alert Engine as a platform
• No monolithic topology
• Declarative data source onboarding
• Easy correlation
• Support policies with group-by on any field
• Elastic capacity management
Today’s eBay isn’t what it used to be. Many people still think of us only as an auction site, but that perception hasn’t kept up with reality.
The reality is that 79% of what is sold on eBay is new merchandise, available for purchase immediately.
We have more than 900 million items listed for sale and 162 million active buyers, effectively making us the world’s biggest shopping destination.
Our vision for commerce is one that is enabled by people, powered by technology, and open to everyone.
Our strategy is to drive the best choice, have the most relevance, and deliver the most powerful selling platform.
Consumers are overwhelmed by the number of choices they face day-to-day. Smart brands are using data to surface inventory to their consumers in ways that feel relevant, helpful and familiar.
At eBay, we are curating and simplifying content in ways that align to users’ stated (and sometimes unstated) preferences, serving up content in new, simplified interfaces that surprise and delight them. We are also experimenting with machine learning to help bridge the gap between intent and understanding.
Storage – HBase and MySQL
Archived logs – HDFS
Eagle’s own storage is only for small amounts of data: metadata, policies, etc.
External stores for metrics
Raw metrics trend – Druid, which also handles visualization
Apache Eagle includes applications and an alert engine. Today, applications connect to the alert engine through a Java API; in the future, the alert engine will be a separate component and applications will send data into it.
Policies are stored in HBase – in fact all metadata is stored in HBase – and we support both HBase and MySQL
All logs also go to HDFS for historical auditing
Setup – one single Eagle instance can manage multiple sites
Some policies – day-over-day or week-over-week comparisons – are not supported by CEP
So far this is policy-based alerting, but there are certain patterns that can’t be caught by policies
Machine learning:
Observe a user over a period of time
Learn their typical/normal behaviour
Create a user profile – which in turn becomes a policy
EVD – eigenvalue decomposition
Density estimation
Mentors – Julian, Owen, Henry, Taylor, Amreshwari
Champion – Henry