Driven Administrator’s Guide

version 1.1.1

Troubleshooting Driven

This troubleshooting section lists known issues that have occurred during deployment. As a living document, it is updated periodically. Feel free to post your questions to the Driven Forums or email us at support@concurrentinc.com.

Cascading application cannot send data to the Driven server

This happens when your client application (Cascading application) cannot communicate with the Driven Server.

Verify that your Driven application is up by logging in from a browser
Ensure that the Driven URL location is reachable from your Hadoop cluster. You can find the configured location in $HADOOP_CONF/cascading-service.properties
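
The reachability check above can be scripted. The following is a minimal sketch, not part of Driven: it reads the text of a properties file, picks out any values that look like HTTP(S) URLs, and attempts a TCP connection to each. It assumes simple key=value lines and does not assume any particular property name, because the key that holds the Driven URL is not stated here. All function names are illustrative.

```python
import socket
from urllib.parse import urlparse


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def find_urls(props):
    """Return (key, url) pairs whose value parses as an http(s) URL with a host."""
    found = []
    for key, value in props.items():
        parsed = urlparse(value)
        if parsed.scheme in ("http", "https") and parsed.hostname:
            found.append((key, value))
    return found


def is_reachable(url, timeout=5.0):
    """Attempt a TCP connection to the URL's host and port; True on success."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it on a cluster node against the contents of $HADOOP_CONF/cascading-service.properties. A False result from is_reachable suggests that the server is down or that the network path from the cluster is blocked.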

Driven Plugin runs out of memory

Recent releases include bug fixes for memory leaks.

Make sure that you are on the latest version of Driven
Ensure that you are running your Cascading application with appropriate memory. Run your application with the following settings: -Xms4096m -Xmx4096m

Driven Plugin is making the application slow

The Cascading application can run slowly for several reasons: the Driven plugin may be taking extra time to collect slice data from the Hadoop NameNode, the application may lack memory resources, or the cluster may have lost connectivity to the Driven server.

Historically, we have found that Hadoop NameNodes have been underprovisioned. The Driven plugin collects job execution data associated with each slice from the NameNode.
The cluster has lost connectivity with the Driven server.
The Cascading application is complex and processes large volumes of data on a large cluster (these parameters influence the scale of data being collected and transmitted). In this case, you can reduce the number of events that the Driven plugin polls from the Hadoop NameNode and reduce the volume of telemetry data that is sent to the Driven server.

To begin, you can suppress transmission of the slice data by setting the following property in the cascading.properties file:

driven.protocol.slice.suppress=true
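
If you manage cascading.properties with scripts, the change can be applied idempotently. The sketch below is a hypothetical helper (set_property is not part of Driven or Cascading) that assumes simple key=value lines: it replaces an existing entry for the key or appends one if none exists.

```python
def set_property(text, key, value):
    """Return properties text with key set to value.

    Replaces an existing non-comment entry for key, or appends a new line.
    """
    lines = text.splitlines()
    new_line = f"{key}={value}"
    replaced = False
    for i, raw in enumerate(lines):
        stripped = raw.strip()
        if stripped.startswith(("#", "!")):
            continue  # leave comment lines untouched
        name, sep, _ = stripped.partition("=")
        if sep and name.strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

Applying set_property(text, "driven.protocol.slice.suppress", "true") a second time to its own output leaves the text unchanged, so the helper is safe to run repeatedly.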

It is important to note that the Driven plugin collects and sends telemetry data independently of the actual execution of your application on the cluster. Although your client application does not complete until both the application processing and the Driven plugin's transmission have finished, no SLAs are compromised by any additional latency introduced by data transmission.

Server runs out of disk space (or data appears corrupted)

The Driven application is valuable because of the unique insights that developer and IT operations organizations receive from its data. The volume of data collected per application run is proportional to the complexity of the application, the size of the data, and the size of the cluster. For complex applications, large data sets, or large clusters, do not plan to use a single-node install of the Driven application.

Perform the following steps to fix the disk space shortage:

Step 1: Validate that you are running out of disk space by inspecting the admin console

Figure 1: Driven Admin Console

Step 2: Add a new node to the persistence cluster

Refer to the Elasticsearch documentation to learn how to add new nodes to the cluster
Add the new host to the list of hosts that participate in the Elasticsearch cluster (leave the list blank for single-node systems):

    driven.storage.cluster.discovery.unicast.hosts=

Step 3: Recreate from your last snapshot of Driven

Use the latest snapshot of the Driven server. See Implementing a backup strategy for your Driven Application for information about backups and the latest snapshot of the Driven server.

Capture telemetry data sent by a particular application to Driven

You can run the Driven plugin in archive mode. When archive mode is enabled, all records that are sent to the Driven Server are also written to disk, even if the server is unreachable. This is useful when the Driven server is unavailable, because the archive can be replayed later when the server is reachable again. Sending data to the server is idempotent, so replaying data does not corrupt data that has already been recorded.

You can set the parameter in cascading-service.properties:

cascading.management.document.service.archive.dir=/path/to/archive/directory

or set the equivalent environment variable:

$ export DRIVEN_ARCHIVE_DIR=/path/to/archive/directory
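
To see why replaying an archive is safe when writes are idempotent, consider the following illustration. It is not Driven's storage code: the in-memory store, the record shape, and the "id" field are assumptions chosen only to show the property. Each record is written under its identifier, so applying the same records a second time overwrites entries with identical content and changes nothing.

```python
def apply_records(store, records):
    """Upsert each record into store, keyed by its (hypothetical) 'id' field."""
    for record in records:
        store[record["id"]] = record
    return store


def replay_is_noop(store, records):
    """Return True if applying records again leaves store unchanged."""
    before = dict(store)
    apply_records(store, records)
    return store == before
```

A non-idempotent design, such as appending every received record to a list, would duplicate data on replay. Keyed writes avoid that.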

Stopping Unresponsive Applications

There are times when your application stops sending metrics and becomes unresponsive for some unknown reason. Driven no longer receives updates, but the application's status still shows as Running. If left unattended, unresponsive applications can render the Status timeline graph useless.

Filtering for Unresponsive Applications

To quickly find all unresponsive applications, set the Search filter to the Running status. The All Application view then displays the results. From the application list, select the desired unresponsive application.

Figure 2: The search results are displayed in the application list

Select the desired application by clicking on its link. The Performance view will appear showing the application’s flow and status state. In this example, the tpcds_q40 application is selected.

Figure 3: Use the Mark as stopped button to stop an unresponsive application with a status state of Running

The application’s status state is displayed in the top right corner of the page. In this example, the status is Running and is accompanied by a Warning icon. This combination indicates that the application is unresponsive while the Driven Server is still waiting for processing updates. Once you have determined that this is the specific application you want to stop, click Mark as stopped.

Before the application is actually stopped, a confirmation message appears, asking you to either confirm or cancel the stop.

Click Confirm to stop the application. The status state of the application is displayed as Stopped.

Figure 4: The status state is Stopped for the tpcds_q40 application

Note

The application's stop timestamp is set to the time when the Driven Server last received an update from the Driven plugin that monitors application processes in the Hadoop infrastructure, not the time when you clicked Mark as stopped. For example, if you stop an application now, Driven records the stop time as the last update timestamp, which might have occurred three months earlier.
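
The filtering rule and the stop-time rule described in this section can be summarized in a short sketch. This is an illustration only and does not use any Driven API: the record fields ("status", "last_update", "stop_time") and the one-hour silence threshold are hypothetical.

```python
from datetime import datetime, timedelta


def find_unresponsive(apps, now: datetime,
                      max_silence: timedelta = timedelta(hours=1)):
    """Return apps still marked Running whose last update is older than max_silence."""
    return [
        app for app in apps
        if app["status"] == "Running" and now - app["last_update"] > max_silence
    ]


def mark_as_stopped(app):
    """Return a copy of app marked Stopped, with stop_time set to its last update."""
    stopped = dict(app)
    stopped["status"] = "Stopped"
    # The stop time is the last update received, not the current time.
    stopped["stop_time"] = app["last_update"]
    return stopped
```

With this rule, an application that was last heard from three months ago and is stopped today gets a stop time three months in the past, which keeps its duration on a timeline from stretching to the present.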
