BigDoor is an online marketing platform that partners with brands to offer loyalty programs where users earn virtual currency for actions and can exchange the currency for rewards. BigDoor aims to increase registration, engagement, and loyalty for its partners. It faces challenges in aggregating and analyzing user data from different sources like transactional databases and log files to evaluate its goals and meet partner metrics.
BigDoor Data Goals
Prove that we are meeting Partner goals
Registration: Are people registering?
Registration rate of control and exposed groups
Engagement: Are participants more engaged?
Actions per user in control and exposed groups
Loyalty: Do participants return?
Daily unique users vs. monthly unique users
Data Challenges
Peak: ~800 requests per second
Business data -> Transactional SQL DB
Optimized for write speed and flexibility
Unregistered user requests -> Apache logs
Flat text files
Need all data in one place
Fast queries
Easy to slice and dice
Drop us a line any time!
Contact: eva@bigdoor.com
Speaker notes
Thank you for having me! My name is Eva Monsen and I am a business intelligence developer at BigDoor. I am responsible for making sure BigDoor’s data can answer questions. I’ll be giving a brief real-world example of “big data”. I’ll give an introduction to BigDoor, its customers, and its product, and then I’ll talk about the BigDoor data pipeline, which is the technology path that data takes to get from its raw form to reports and visualizations.
What is BigDoor? BigDoor helps large companies do marketing through loyalty programs. These partner companies already have an online presence and are looking to grow their online user base by acquiring new users and engaging and retaining them long term. BigDoor adds to the partner website a program where users earn virtual currency and exchange that currency for rewards.
Here’s an example. One of our partner companies is PacSun, a fairly large clothing retailer with both brick-and-mortar stores and a website. PacSun wants to increase online sales and create relationships with its online customers. So they have added BigDoor’s product to their website. BigDoor is a white-label product, which means it appears to be integrated with the rest of the website, but under the covers these widgets are run by BigDoor and make web API requests to BigDoor servers. Those web requests form the basis for the raw data I will be talking about. The highlighted areas are BigDoor JavaScript widgets. On the top is a user profile picture, their currency balance (which PacSun has chosen to call “points”), and some links. On the bottom is what we call the “task bar” or “dock”, which shows some actions that the user can take to earn points.
Here is PacSun’s rewards page. Users can exchange the virtual currency they have earned for items appearing on this page. Rewards include sweepstakes entries, coupons, and physical merchandise. The rewards list is also served by BigDoor servers.
BigDoor receives web API requests whenever a user sees our widgets, registers, logs in, logs out, or takes an action that affects their currency balance, such as completing a task or redeeming a reward. These are some example questions we ask of the data from those requests, and some metrics that we use to answer them. One way we can measure the answers is to show some users the BigDoor UI and not others. Those shown the BigDoor UI are the “exposed” group and those not shown it are the “control” group. We can prove whether BigDoor is effective at driving registrations, for example, by looking at the difference in registration rates between the control and exposed groups. If BigDoor is doing its job right, the registration rate should be higher in the exposed group. We also use control and exposed groups to measure user engagement, by looking at the number of actions a user takes while logged in. Whether users return is currently measured by comparing the number of unique visitors per day to the number of unique visitors for the trailing 30 days. Equal numbers would usually mean 100% of users are returning to the site daily.
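As a toy illustration of the control/exposed comparison, here is a minimal Python sketch that computes a per-group registration rate. The event records and field names are invented for illustration; the real numbers come from queries against our warehouse, not from in-memory lists like this.

```python
# Hypothetical sketch of the control vs. exposed comparison.
# Event records and field names are invented for illustration.

def registration_rate(events, group):
    """Fraction of distinct visitors in `group` who registered."""
    visitors = {e["user"] for e in events if e["group"] == group}
    registered = {e["user"] for e in events
                  if e["group"] == group and e["type"] == "register"}
    return len(registered) / len(visitors) if visitors else 0.0

events = [
    {"user": "u1", "group": "exposed", "type": "visit"},
    {"user": "u1", "group": "exposed", "type": "register"},
    {"user": "u2", "group": "exposed", "type": "register"},
    {"user": "u3", "group": "exposed", "type": "visit"},
    {"user": "u4", "group": "control", "type": "register"},
    {"user": "u5", "group": "control", "type": "visit"},
    {"user": "u6", "group": "control", "type": "visit"},
]

print(registration_rate(events, "exposed"))  # 2 of 3 visitors registered
print(registration_rate(events, "control"))  # 1 of 3 visitors registered
```

In this made-up sample the exposed group registers at twice the control rate, which is the kind of difference that would suggest the widgets are doing their job.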
We answer these questions in the form of reports built using Tableau Software. This is just one example of such a report that shows the number of unique users per hour, per day, and per trailing month.
We face many challenges with the BigDoor data pipeline. One million requests per hour is actually a fairly small number in the big data world, but it is enough that we need to constantly load data so that our reporting can stay up to date. Most API requests by registered users result in updates or inserts to the transactional database, which is a MySQL database like you may have seen in your coursework. It keeps track of registered users’ profiles, currency balances, badges, reward redemptions, and so on. Requests by unregistered users only end up in our Apache logs, flat files with raw request data such as the query string. We want to combine all of the information in the Apache logs and the transactional database into one place, where it is easy and fast to query, and slice and dice by partner, date, user group, and so on.
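To make the flat-file side concrete, here is a small Python sketch of pulling fields of interest out of an Apache access-log line. The log format and the query parameters (partner_id, user_id, action) are assumptions for illustration, not BigDoor’s actual request schema.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed Apache "common log"-style line; real field names differ.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

def parse_request(line):
    """Extract timestamp and hypothetical query params from one log line."""
    m = LOG_RE.match(line)
    if not m:
        return None  # malformed line; real pipelines count these
    qs = parse_qs(urlparse(m.group("path")).query)
    return {
        "timestamp": m.group("ts"),
        "partner_id": qs.get("partner_id", [None])[0],
        "user_id": qs.get("user_id", [None])[0],
        "action": qs.get("action", [None])[0],
    }

line = ('203.0.113.9 - - [01/Apr/2012:10:15:32 -0700] '
        '"GET /api/track?partner_id=42&user_id=u7&action=login HTTP/1.1" '
        '200 512')
print(parse_request(line))
```

Every request, registered or not, leaves a line like this, which is why the logs are the only record of unregistered users.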
Finally, the guts of the system. This is the pipeline: how data is written to our system and ultimately read out into reports.

First, all web requests go through our load balancer, which dispatches those requests to a number of identical hosts. (I’ve labeled these “app hosts” because that is the term we use internally.) I’ve shown three here, but we usually have many more than that. The app hosts write data to the transactional database, and they also send their Apache logs (the flat files) to a log processing server every two minutes.

The log processing server does some interesting work. Using multiple parallel processes, it parses every request in every Apache log and extracts some information of interest, such as the request timestamp, the partner id, the user id, and the type of action the user took. It produces output files to be consumed by the next step in the pipeline. This type of work is ideally suited to a distributed processing system like Hadoop, which is what Adam will be talking about next. Ours is a custom-built system, written in Python.

ETL stands for “Extract, Transform, Load”, which is what this box does to the data. In this case, it extracts data from the transactional database and from the log processing server, transforms that data through a series of steps, and loads it into a data warehouse. You can look at the data warehouse as essentially a record of all of the partner configuration, user information, and every action taken by every user. There are many existing ETL products out there; our ETL system is custom and written in Ruby.

Finally, ETL summarizes all of that data into a series of tables in what I am calling the aggregation database. These summary tables are very small in comparison to those in the data warehouse and are queried directly by Tableau to generate summary reports.
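The parallel parse-then-aggregate shape of the log processing step can be sketched like this in Python. The tab-separated line format, field order, and function names here are stand-ins for illustration, not the real system:

```python
from multiprocessing import Pool
from collections import Counter

# Toy version of the log processing step: fan lines out to worker
# processes, parse each one, then aggregate the results.

def extract_action(line):
    """Assumed line format: <timestamp>\t<partner_id>\t<user_id>\t<action>."""
    ts, partner, user, action = line.split("\t")
    return (partner, action)

def summarize(lines, workers=2):
    """Parse lines in parallel and count events per (partner, action)."""
    with Pool(workers) as pool:
        pairs = pool.map(extract_action, lines)
    return Counter(pairs)

if __name__ == "__main__":
    lines = [
        "2012-04-01T10:00:00\tpacsun\tu1\tlogin",
        "2012-04-01T10:00:02\tpacsun\tu2\tredeem",
        "2012-04-01T10:00:05\tpacsun\tu1\tlogin",
    ]
    print(summarize(lines))
```

A system like Hadoop generalizes exactly this map-then-reduce shape across many machines, which is why the notes call it a natural fit for this workload.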
I know I’ve gone through a lot of information very quickly, but I hope that you now have some idea of what happens to data in the real world. I’ll take a few questions now, and I am always checking email and would love to go into depth about any of this, or general software engineering questions, with you later. Thanks!