Real-Time User Activity Tracking w/ Divolte Collector and Kafka

This article is aimed at data engineering professionals and other readers with previous experience in software development and IT. If you have no prior knowledge of IT and Big Data in general, this content may be hard to grasp.

Table Of Contents:

  1. Idea
  2. Technologies to be used
  3. Building and Running Divolte Collector
  4. Installing and Running Kafka
  5. Consuming Records from Kafka

Idea

We are going to track user activities, also called “clickstream” data. Of course there are many third-party solutions to this problem, but instead of using something off the shelf, I am going to show you two key technologies: a JavaScript tag and a message broker. We will then combine the two to get the job done.

At the end of this work, we will be able to collect and store any kind of user activity happening on a particular website in real time, without paying a bill to any third party.

Technologies to be used

We are going to use two different processes in this project. Divolte Collector is going to help us collect users’ interactions with our website.

And we will use the Kafka message broker to consume and store the activity data generated by Divolte Collector.

Divolte Collector is a server for collecting clickstream data into Kafka topics. It has two components: a JavaScript tag on the client side and a lightweight server that writes the clickstream data to Kafka topics for near real-time processing.

A small piece of JavaScript has to live in your web page’s HTML body. Divolte catches user activities and pushes the activity information to a Kafka topic in a structured, classified form. It also lets you decide which kinds of activities you want to track; there is no fixed configuration.

You don’t need to write any extra line of code to fetch data from the Divolte server. It is enough to create a topic in your existing Kafka broker and specify the topic’s name in the Divolte configuration.

If you want to get more familiar with the Kafka message broker, please browse its documentation page. You can also jump to my previous article about Kafka through this link.

Building and Running Divolte Collector

We need to download the release files first. You can run the commands below to download them under /opt.

cd /opt
wget http://divolte-releases.s3-website-eu-west-1.amazonaws.com/divolte-collector/0.9.0/distributions/divolte-collector-0.9.0.tar.gz

extract all the files,

tar -xzf divolte-collector-*.tar.gz

navigate to the main directory of Divolte,

cd divolte-collector-*

create a configuration file under the conf/ directory before starting up Divolte Collector.

touch conf/divolte-collector.conf

You can either create your own configuration or paste the lines below into your conf file. This configuration tells the Divolte server how you want to store clickstream events.

With the configuration below, you will be using the Kafka message broker to receive events from the website. It is also possible to use HDFS as storage, but in this example we will use Kafka in order to process incoming events in near real time.

Do not forget to check/edit your topic name and Kafka connection.

divolte {
  global {
    kafka {
      // Enable Kafka flushing
      enabled = true

      // The properties under the producer key in this
      // configuration are used to create a Properties object
      // which is passed to Kafka as is. At the very least,
      // configure the broker list here. For more options
      // that can be passed to a Kafka producer, see this link:
      // https://kafka.apache.org/documentation.html#producerconfigs
      producer = {
        bootstrap.servers = "10.200.8.55:9092,10.200.8.53:9092,10.200.8.54:9092"
      }
    }
  }

  sinks {
    // The name of the sink. (It's referred to by the mapping.)
    kafka {
      type = kafka

      // This is the name of the topic that data will be produced on
      topic = click-stream
    }
  }
}

Once you have finished all the steps above, you will be ready to run Divolte Collector. It will serve a sample website on port 8290 to catch click events as a proof of concept.

Navigate to the Divolte directory under /opt and make sure that the JAVA_HOME variable is set (echo $JAVA_HOME). If it is not, Oracle’s guide “Installing the JDK Software and Setting JAVA_HOME” (docs.oracle.com) can help.
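If you just need a quick check and fix, the lines below will do; note that the JDK path here is only an example and depends on your own installation.

echo $JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # example path; point this at your own JDK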

If everything is OK, you can start Divolte Collector using the command below.

./bin/divolte-collector

It’s possible to use Divolte with your existing website deployment. In order to do that, you have to put the Divolte JavaScript tag in your HTML body. The rest of the operations remain the same.
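As a minimal sketch, the tag looks like the snippet below; the collector serves the script itself (as divolte.js by default), and the host and port here are assumptions that should point at your own Divolte instance.

<script src="//localhost:8290/divolte.js" defer async></script>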

For anything beyond the proof of concept, like creating custom events or editing the configuration according to your business model, you can dive into the Getting Started section of the Divolte v0.9.0 User Guide (divolte-releases.s3-website-eu-west-1.amazonaws.com).

Installing and Running Kafka

You can easily download, run and start using Kafka through the Apache Kafka website (kafka.apache.org).
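As a rough sketch, downloading and extracting a release looks like the commands below; the version number here is only an assumption, so pick the current release from the downloads page.

cd /opt
wget https://archive.apache.org/dist/kafka/2.4.0/kafka_2.12-2.4.0.tgz
tar -xzf kafka_2.12-2.4.0.tgz
cd kafka_2.12-2.4.0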

Use the commands below to run the Kafka server: first ZooKeeper, then the Kafka server itself.

nohup bin/zookeeper-server-start.sh config/zookeeper.properties > zookeeper.log 2>&1 &
nohup bin/kafka-server-start.sh config/server.properties > kafka.log 2>&1 &

After you have your Kafka server up and running, it’s time to create a topic using the command below.

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic click-stream

The topic you create has to use the same name as the one you specified in the Divolte configuration.
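To double-check that the topic exists and its name matches, you can list the topics on the broker:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092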

Consuming Records from Kafka

Let’s run the Kafka console consumer to read data from the given topic. Switch to the Kafka server’s directory first, then run the command below (careful about the topic name).

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic click-stream --from-beginning

Now you can generate some events through localhost:8290 in your browser. Then check back on your console consumer to see the data stream.

Do not worry about the unreadable characters in the output; they are just a result of Avro’s binary serialization format. In the next article we will parse all the events in real time with the help of Python.
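As a small preview, a minimal Python consumer sketch might look like the following; it assumes the kafka-python package (pip install kafka-python) and only prints the raw Avro bytes instead of decoding them.

from kafka import KafkaConsumer

# Read raw Divolte events from the click-stream topic.
consumer = KafkaConsumer(
    'click-stream',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
)

for message in consumer:
    # message.value holds Avro-serialized bytes; decoding comes in the next article.
    print(message.value[:80])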