This document summarizes an architecture for collecting and analyzing search analytics data using Flume and HBase. It describes:
1) Collecting search log data from applications using Flume agents and sending it to a Flume collector.
2) The Flume collector processes the log messages and writes them to a "raw logs" table in HBase via a Flume HBase sink.
3) The data in HBase is later processed by MapReduce jobs to generate search analytics reports and metrics that are displayed on a reporting web application.
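To make step 2 concrete, here is a hypothetical sketch of how one raw search-log event could be keyed and laid out for the "raw logs" table. The schema (row-key layout, column family, and qualifier names) is an illustrative assumption, not the actual design from this architecture:

```python
def to_hbase_put(event):
    """Turn a parsed search-log event into (row_key, {column: value}).

    Hypothetical schema: row key is user id plus a reversed timestamp,
    so a user's most recent searches sort first within their key range.
    """
    reversed_ts = 2**63 - 1 - event["timestamp_ms"]
    row_key = "%s:%019d" % (event["user_id"], reversed_ts)
    # Single "log" column family; qualifiers are assumptions.
    columns = {
        "log:query": event["query"],
        "log:result_count": str(event["result_count"]),
        "log:timestamp": str(event["timestamp_ms"]),
    }
    return row_key, columns

event = {"user_id": "u42", "timestamp_ms": 1300000000000,
         "query": "hbase flume", "result_count": 17}
row_key, cols = to_hbase_put(event)
print(row_key)
```

A reversed-timestamp key like this is a common HBase pattern for "latest first" scans; the later MapReduce jobs would scan these rows to build the analytics reports.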
In the first scenario, Flume is used simply to collect logs from multiple agents to a central place (HDFS). At the end we still have plain log files that a separate process (a raw log importer) needs to parse and load. HBase is not involved with Flume directly here, and no HBase sink is used.
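As a sketch, this log-file scenario could be wired up with the classic (pre-NG) Flume flow syntax. The node names, file path, port, and HDFS URL below are illustrative assumptions, and exact source/sink signatures vary by Flume version:

```
# Hypothetical agent node: tail the application's search log and
# forward events to the collector (best-effort sink).
agent : tail("/var/log/app/search.log") | agentBESink("collector-host", 35853) ;

# Collector node: receive events and roll them into files on HDFS,
# which the raw log importer then processes.
collector : collectorSource(35853) | collectorSink("hdfs://namenode/flume/search/%Y%m%d/", "raw") ;
```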
The second scenario makes use of Flume's ability to plug in different sinks: instead of just collecting data into a log file on HDFS, we hook the FLUME-247 HBase sink into Flume and have it write directly to HBase.
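With the pluggable-sink approach, only the collector's sink changes. A sketch, assuming the FLUME-247 sink is exposed under a name like attr2hbase (the actual sink name, parameters, and required event attributes depend on the patch version):

```
# Collector writes event attributes straight to an HBase table instead of
# rolling files on HDFS. "search_raw_logs" is an illustrative table name.
collector : collectorSource(35853) | attr2hbase("search_raw_logs") ;
```

This removes the intermediate log files and the separate raw log importer from the pipeline.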
Storage-size experiment: a 2-hour run at 2K actions/min on 1 system (240K actions, 43 MB of input data) produced the following HBase footprints:
  no prune, no compress:              1193 MB
  prune sort index only, no compress:  624 MB
  prune, no compress:                  408 MB
  no prune, compress:                  196 MB
  prune sort index only, compress:     106 MB
  prune, compress:                      64 MB
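The impact of each option can be sanity-checked with a little arithmetic: relative to the 1193 MB baseline, pruning plus compression shrinks storage by roughly 18.6x.

```python
# Sizes in MB from the measurements above (2h run, 240K actions, 43 MB input).
baseline = 1193  # no prune, no compress
variants = {
    "prune sort index only, no compress": 624,
    "prune, no compress": 408,
    "no prune, compress": 196,
    "prune sort index only, compress": 106,
    "prune, compress": 64,
}
for name, size_mb in variants.items():
    # Reduction factor versus the unpruned, uncompressed baseline.
    print("%-35s %4d MB  %.1fx smaller" % (name, size_mb, baseline / size_mb))
```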