Putting Buyers and Sellers in the Best Light: How Etsy Leverages Big Data for eCommerce

Transcript of a sponsored discussion on how Etsy uses data science to improve their buyers’ and sellers’ experience as well as their own corporate destiny.
Listen to the podcast. Find it on iTunes. Get the mobile app. Sponsor: Hewlett
Packard Enterprise.
Dana Gardner: Hello, and welcome to the next edition of the Hewlett Packard Enterprise (HPE)
innovator podcast series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host
and moderator for this ongoing discussion on IT innovation -- and how it’s
making an impact on people’s lives.
Our next big-data case study discussion explores how Etsy, a global e-commerce
site focused on handmade and vintage items, uses data science to improve
buyers and sellers’ discovery and shopping experiences. We'll learn how mining
big data helps Etsy define and distribute top trends, and allows those with
specific interests to find items that will best appeal to them.
To learn more about leveraging big data in the e-commerce space, please join me in welcoming
Chris Bohn aka “CB,” a Senior Data Engineer at Etsy, based in Brooklyn, New York. Welcome,
CB.
CB: Thank you.
Gardner: Tell us about Etsy for those who aren’t familiar with it. I've heard it described as being
like going through your grandmother's basement. Is that fair?
CB: Well, I hope it’s not as musty and dusty as my grandmother’s basement. The best way to
describe it is that Etsy is a marketplace. We create a marketplace for sellers of
handcrafted goods and the people who want to buy those goods.
We've been around for 10 years. We're the leader in this space, and we went
public in 2015. Some quick metrics: the total value of the
merchandise sold on Etsy in 2014 was about $1.93 billion. We have about 1.5
million sellers and about 22 million buyers.
Gardner: That's an awful lot of stuff that’s being moved around. What does the
big data and analytics role bring to the table?
CB: It’s all about understanding more about our customers, both buyers and sellers. We want to
know more about them and make the buying experience easier for them. We want them to be
able to find products easier. Too much choice sometimes is no choice. You want to get them to
the product they want to buy as quickly as possible.
We also want to know how people are different in their shopping habits across the geography of
the world. There are some people in different countries that transact differently than we do here
in the States, and big data lets us get some insight into that.
Gardner: Is this insight derived primarily from what they do via their clickstreams, what they're
doing online? Or are there other ways that you can determine insights that then you can share
among yourself and also back to your users?
Data architecture
CB: I'll describe our data architecture a little bit. When Etsy started out, we had a monolithic
Postgres database and we threw everything in there. We had listings, users, sellers, buyers,
conversations, and forums. It was all in there, but we outgrew that really quickly, and so the
solution to that was to shard horizontally.
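Horizontal sharding like this is usually driven by a deterministic routing function that maps a record's key to one shard. The discussion doesn't describe Etsy's actual scheme, so the following is only a minimal hash-routing sketch with invented host names:

```python
import hashlib

# Hypothetical illustration of horizontal sharding: route a record
# to one of N MySQL shards by hashing its key. Etsy's real routing
# scheme is not described here; hosts and shard count are made up.
NUM_SHARDS = 4
SHARD_HOSTS = [f"mysql-shard-{i}.example.internal" for i in range(NUM_SHARDS)]

def shard_for(user_id: int) -> str:
    """Deterministically pick a shard host from a user ID."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARD_HOSTS[int(digest, 16) % NUM_SHARDS]
```

The key property is that the same ID always lands on the same shard, so reads and writes for one user never fan out across servers.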
Now we have many hundreds of horizontally sharded MySQL servers. Then we decided that we
needed to do some analytics on this stuff. So we scratched our heads. This was about
five years ago. So we said, "Let’s just set up a Postgres server and we'll
copy all the data from these shards into the Postgres server that we call
BI server." And we got that done.
Then, we kind of scratched our heads and said, "Wait a minute. We just came full circle. We
started with a monolithic database, then we went sharded, and now all the data is back
monolithic."
It didn't perform well, because it's hard to get the volume of big data into that database. A
relational database like Postgres just isn’t designed to do analytic-type queries. Those are big
aggregations, and Postgres, even though it is a great relational database, is really tailored for
single-record lookup.
So we decided to get something else going on. About three-and-a-half years ago, we set about
searching for the replacement to our monolithic business-intelligence (BI) database and looked at
what the landscape was. There were a number of very worthy products out there, but we
eventually settled on HPE Vertica for a number of reasons.
One of those is that it derives, in large part, from Postgres. Postgres has a Berkeley license, so
companies can take it private. They can take that code and they don’t have to republish it to
the community, unlike other types of open-source licenses.
So we found out that the parser was right out of Postgres and all the date handling and
typecasting stuff that is usually different from database to database was exactly spot-on the same
between Vertica and Postgres. Also, data ingestion via the copy command is the best way to
bulk-load data, exactly the same in both, and it’s the same format.
We said, "This looks good, because we can get the data in quickly, and queries will probably not
have to be edited much." So that's where we went. We experimented with it and we found
exactly that. Queries would run unchanged, except they ran a lot faster and we were able to get
the data in easily.
We built some data replication tools to get data from the shards and also some legacy Postgres
databases that we had lying around for billing, and got all that data into HPE Vertica.
Then, we built some tools that allowed our analysts to bring over custom tables they had created
on that old BI machine. We were able to get up to speed really quickly with Vertica, and boom,
we had an analytics database and we were able to hit the ground running with it.
Gardner: And is the challenge for you about the variety of that data? Is it about the velocity that
you need to move it in and out? Is it about simply volume that you just have so much of it, or a
little of some of those?
All of the above
CB: It’s really all of those problems. Velocity-wise, we want our replication system to be
eventually consistent, and we want it to be as near real-time as possible. There is a challenge in
that, because you really start to get into micro-batching data in.
This is where we ended up having to pay off some technical debt, because years ago, disk storage
was fairly pricey, and databases were designed to minimize storage. Practices grew up around
that fact. So data would get deleted and updated. That's the policy that the early originators of
Etsy followed when they designed the first database for it.
What we have now is lossy data. If someone changes the description or the tags
that are associated with a listing, the old ones go away. They are lost forever. And that's too bad,
because if we kept those, we could do analytics on a product that wasn’t selling for a long time
and all of a sudden it started selling. What changed? We would love to do analytics on that, but
we can't do it because of the loss of data. That's one thing that we learned in this whole process.
But getting back to your question here about velocity and then also the volume of data, we have
a lot of data from our production databases. We need to get it all into Vertica. We also have a lot
of clickstream data. Etsy is a top 50 website, I believe, for traffic, and that generates a lot of
clicks and that all gets put into Vertica.
We run big batch jobs every night to load that. It's important that we have that, because one of
the biggest things our analysts like to do is correlate clickstream data with our production
data. Clickstream data doesn't have a lot of information about the user who is doing those clicks.
It’s just information about their path through the site at that time.
To really get a value-add on that, you want to be able to join on your user-details tables, so that
you can know where this person lives, how old they are, or their past buying history. You
need to be able to join those, too, and we do that in HPE Vertica.
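The kind of join described here can be sketched with a toy in-memory database. The table and column names below are invented for illustration; the real work happens at much larger scale in HPE Vertica:

```python
import sqlite3

# Toy sketch of enriching clickstream events with user details via a
# join. Schemas and sample rows are invented; in production this is a
# SQL join over much larger tables in the analytics warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT, age INTEGER);
    CREATE TABLE clicks (user_id INTEGER, page TEXT, ts TEXT);
    INSERT INTO users VALUES (1, 'US', 34), (2, 'FR', 52);
    INSERT INTO clicks VALUES (1, '/listing/99', '2015-07-01'),
                              (2, '/search?q=vintage', '2015-07-01');
""")
rows = conn.execute("""
    SELECT c.page, u.country, u.age
    FROM clicks c
    JOIN users u ON u.user_id = c.user_id
""").fetchall()
```

Each raw click now carries the user attributes needed to slice behavior by geography, age, or purchase history.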
Gardner: CB, give us a sense about the paybacks, when you do this well, when you've
architected, and when you've paid your technical debts, as you put it. How are your analysts able
to leverage this in order to make your business better and make the experience of your users
better?
CB: When we first installed Vertica, it was just a small group of analysts that were using it. Our
analytics program was fairly new, but it just exploded. Everybody started to jump in on it,
because all of a sudden, there was a database with which you could write good SQL, with a rich
SQL engine, and get fantastic results quickly.
The results weren’t that different from what we were getting in the past, but they were just
coming to us so fast, the cycle of getting information was greatly shortened. Getting result sets
was so much better that it was like a whole different world. It’s like the Pony Express versus
email. That’s the kind of difference it was. So everybody started jumping in on it.
More dashboards
Engineers who were adding new facets of the product wanted to have dashboards, more or less
real time, so they could monitor what the thing was doing. For example, we added postage to
Etsy, so that our sellers can have preprinted labels. We'd like to monitor that in real time to see
how it's going. Is it going well or what?
That was something that took a long time to analyze before we got into big-data analytics. All of
a sudden, we had Vertica and we could do that for them, and that pattern has repeated with other
groups in the company.
We're doing different aspects of the site. All of a sudden, you have your marketing people, your
finance people, saying, "Wow, I can run these financial reports that used to take days in literally
seconds." There was a lot of demand. Etsy has about 750 employees and we have way more than
200 Vertica accounts. That shows you how popular it is.
One anecdotal story. I've been wanting to update Vertica for the past couple of months. The
woman who runs our analytics team said, "Don't you dare. I have to run Q2 numbers. Everybody
is working on this stuff. You have to wait until this certain week to be able to do that." It’s not
just HPE Vertica, but big data is now relied on for so many things in the company.
Gardner: So the technology led to the culture. Many times we think it's the other way around,
but having that ability to do those easy SQL queries and get information opened up people's
imagination, but it sounds like it has gone beyond that. You have a data-driven company now.
CB: That's an astute observation. You're right. This is technology that has driven the culture. It's
really changed the way people do their job at Etsy. And I hear that elsewhere also, just talking to
other companies and stuff. It really has been impactful.
Gardner: Just for the sake of those of our readers who are on the operations side, how do you
support your data infrastructure? Are you thinking about cloud? Are you on-prem? Are you split
between different data centers? How does that work?
CB: I have some interesting data points there for you. Five-plus years ago, we started doing
Hadoop stuff, and we started out spinning up Hadoop in Amazon Web Service (AWS).
We would run nightly jobs. We collected all of the search terms that were used and buying
patterns, and we fed these into MapReduce jobs. The output from that went into MATLAB,
and we would get a set of rules that would then drive our search engine, basically
improving search.
Commodity hardware
We did that for a while and then realized we were spending a lot of money in AWS. It was
many thousands of dollars a month. We said, "Wait a minute. This is crazy. We could actually
buy our own servers. This is commodity hardware that this can run on, and we can run this in our
own data center. We will get the data in faster, because there are bigger pipes." So that's what we
did.
We created what we call Etsydoop, which has 200+ nodes, and we actually save a lot of
money doing it that way. That's how we got into it.
We really have a bifurcated data analytics, big-data system. On the one hand, we have Vertica for
doing ad hoc queries, because the analysts and the people out there understand SQL and they
demand it. But for batch jobs, Hadoop rocks, and it's really, really good for that.
But the tradeoff is that those are hard jobs to write. Even a good engineer is not going to get it
right every time, and for most analysts, it's probably a little bit beyond their reach to get down,
roll up their sleeves, and get into actual coding and that kind of stuff.
But they're great at SQL, and we want to encourage exploration and discovering new things.
We've discovered things about our business just by some of these analysts wildcatting in the
database, finding interesting stuff, and then exploring it, and we want to encourage that. That's
really important.
Gardner: CB, in getting to understand Etsy a little bit more, I saw that you have something
called Top Trends and Etsy Finds, ways that you can help people with affinity for a product or a
craft or some interest to pursue that. Did that come about as a result of these technologies that
you have put in place, or did they have a set of requirements that they wanted to be able to do
this and then went after you to try to accommodate it? How do you pull off that Etsy Finds
capability?
CB: A lot of that is cross-architecture. Some of our production data is used to find that. Then, a
lot of the hard crunching is done in Vertica to find that. Some of it is MapReduce. There's a
whole mix of things that go into that.
I couldn't claim for Etsy Finds, for example, that it’s all big data. There are other things that go in
there, but definitely HPE Vertica plays a role in that stuff.
I'll give you another example, fraud. We fingerprint a lot of our users digitally, because we have
problems with resellers. These are people who are selling resold mass-produced stuff on Etsy. It's
not huge, but it's an annoyance. Those products compete against really quality handmade
products that our regular sellers sell in their shops.
Sometimes it’s like a game of Whack-a-Mole. You knock one of these guys down -- sometimes
they're from the Far East or other parts of the world -- and as soon as you knock one down,
another one pops up. Being able to capture them quickly is really important, and we use Vertica
for that. We have a team that works just on that problem.
What's next?
Gardner: Thinking about the future, with this great architecture, with your ability to do things
like fraud detection and affinity correlations, what's next? What can you do that will help make
Etsy more impactful in its market and make your users more engaged?
CB: The whole idea behind databases and computing in general is just making things faster.
When the first punch-card machines came out in the 1930s or whatever, the phone companies
could do faster billing, because billing was just getting out of control. That’s where the roots of
IBM lie.
As time went by, punch cards were slow and they wanted to go faster. So they developed
magnetic tape, and then spinning rust disks. Now, we're into SSDs, the flash drives. And it’s the
same way with databases and getting answers. You always want to get answers faster.
We do a lot of A/B testing. We have the ability to set the site so that maybe a small percentage of
users get an A path through the site, and the others a B path, and there's control stuff on that. We
analyze those results. This is how we test to see if this kind of button works better than this other
one. Is the placement right? If we just skip this page, is it easier for someone to buy something?
So we do A/B testing. In the past, we've done it where we had to run the test, gather the data, and
then comb through it manually. But now with Vertica, the turnaround time to iterate over each
cycle of an A/B test has shrunk dramatically. We get our data from the clickstreams, which go
into Vertica, and then the next day, we can run the A/B test results on that.
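The analysis step can be sketched as a simple tally of conversions per variant. The event records and field names below are made up for illustration; in practice this would be a SQL aggregation over the clickstream tables in Vertica:

```python
# Minimal sketch of summarizing A/B test results: count how often
# each variant led to a purchase. Events and field names are invented
# for illustration only.
events = [
    {"variant": "A", "purchased": True},
    {"variant": "A", "purchased": False},
    {"variant": "B", "purchased": True},
    {"variant": "B", "purchased": True},
]

def conversion_rates(events):
    """Return {variant: purchases / total events} for each variant."""
    totals, wins = {}, {}
    for e in events:
        v = e["variant"]
        totals[v] = totals.get(v, 0) + 1
        wins[v] = wins.get(v, 0) + (1 if e["purchased"] else 0)
    return {v: wins[v] / totals[v] for v in totals}

rates = conversion_rates(events)
```

Shrinking the loop means rerunning a tally like this against fresh clickstream loads daily, or eventually by the minute, instead of after a manual combing pass.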
The next step is shrinking that even more. One of the themes that’s out there at the various big
data conferences is streaming analytics. That's a really big thing. There is a new database out
there called PipelineDB, a fork of Postgres. It allows you to create an event stream into Postgres.
You can then create a view and a window on top of that stream. Then you can pump your event
data, like your clickstream data, and you can join the data in that window to your regular
Postgres tables, which is really great, because we could get A/B information in real time. You get
a one-minute turnaround as opposed to one day. I think that’s where a lot of things are going.
If you just look at the history of big data, MapReduce started about 10 years ago at Google, and
that was batch jobs, overnight runs. Then, we started getting into the columnar stores to make
databases like Vertica possible, and it’s really great for aggregation. That kicked it up to the next
level.
Another thing is real-time analytics. It’s not going to replace any of these things, just like Vertica
didn't replace Hadoop. They're complementary. Real-time streaming analytics will be
complementary. So we're continuing to add these tools to our big data toolbox.
Gardner: It has compressed those feedback loops. If you provide that capability to an innovative,
creative organization, the technology might drive the culture, and who knows what sort of
benefits they will derive from that.
All plugged in
CB: That's very true. You touched earlier on how we do our infrastructure. I'm in data
engineering, and we're responsible for making sure that our big databases are healthy and
running right. But we also have our operations department. They're working on the actual pipes
and hardware and making sure it’s all plugged in. It's tough to get all this stuff working right, but
if you have the right people, it can happen.
I mentioned earlier about AWS. The reason we were able to move off of that and save money is
because we have the people who can do it. When you start using AWS extensively, what you're
doing is paying for a very high-priced but good IT staff at Amazon. If you have a
good IT staff of your own, you're probably going to be able to realize some efficiencies there,
and that's really why we moved over. We do it all ourselves.
Gardner: Having it as a core competency might be an important thing moving forward.
CB: Absolutely. You have to stay on top of all this stuff. A lot is made of the word disruption,
and you don't go knocking on disruption’s door; it usually knocks on yours. And you had better
be agile enough to respond to it.
I'll give you an example that ties back into big data. One of the most disruptive things that has
happened to Etsy is the rise of the smartphone. When Etsy started back in 2005, the iPhone
wasn't around yet; it was still two years out. Then, it came on the scene, and people realized that
this was a suitable device for commerce.
It’s very easy to just be complacent and oblivious to new technologies sneaking up on you. But
we started seeing that there was more and more commerce being done on smartphones. We
actually fell a little bit behind, as a lot of companies did five years ago. But our management
made decisions to invest in mobile, and now 60 percent of our traffic is on mobile. That's turned
around in the past two years and it has been pretty amazing.
Big data helps us with that, because we do a lot of crunching of what these mobile devices are
doing. Mobile is not the best device maybe for buying stuff because of the form factor, but it is a
really good device for managing your store, paying your Etsy bill, and doing that kind of stuff.
So we analyzed all that and crunched it in big data.
Gardner: And big data allowed you to know when to make that strategic move and then take
advantage of it?
CB: Exactly. There are all sorts of crossover points that happen with technology, and you have to
monitor it. You have to understand your business really well to see when certain vectors are
happening. If you can pick up on those, you're going to be okay.
Gardner: I'm afraid we'll have to leave it there. We've been exploring how Etsy, a global e-
commerce site focused on handmade and vintage items, uses data science to improve their
buyers' and sellers’ experience as well as their own corporate destiny.
I'd like to thank our guest, CB, Senior Data Engineer at Etsy in Brooklyn, New York. Thanks,
CB.
CB: Thank you very much, Dana.
Gardner: And I would also like to thank our audience for joining us for this Hewlett Packard
Enterprise big data innovation case study discussion. I'm Dana Gardner, Principal Analyst at
Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions.
Copyright Interarbor Solutions, LLC, 2005-2016. All rights reserved.
You may also be interested in:
• IoT plus big data analytics translate into better services management at Auckland
Transport
• How HPE’s internal DevOps paved the way for speed in global software delivery
• Extreme Apps approach to analysis makes on-site retail experience king again
• How New York Genome Center Manages the Massive Data Generated from DNA
Sequencing
• The UNIX evolution: A history of innovation reaches an unprecedented 20-year
milestone
• Redmonk analysts on best navigating the tricky path to DevOps adoption
• DevOps by design--A practical guide to effectively ushering DevOps into any
organization
• Need for Fast Analytics in Healthcare Spurs Sogeti Converged Solutions Partnership
Model
• HPE's composable infrastructure sets stage for hybrid market brokering role
• Nottingham Trent University Elevates Big Data's role to Improving Student Retention in
Higher Education
• Forrester analyst Kurt Bittner on the inevitability of DevOps
• Agile on fire: IT enters the new era of 'continuous' everything