Driven Administrator’s Guide version 1.1.1
This troubleshooting section lists known issues that have occurred during deployment. As a living document, it is updated periodically. Feel free to post your questions to the Driven Forums or email us at firstname.lastname@example.org.
Cascading application cannot send data to the Driven server
This happens when your client application (the Cascading application) cannot communicate with the Driven server.
Verify that your Driven application is up by logging in from a browser.
Ensure that the Driven URL is reachable from your Hadoop cluster. The configured location is stored in $HADOOP_CONF/cascading-service.properties.
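The reachability check above can be sketched as a small script run from a cluster node. This is only an illustration under stated assumptions: the property key that holds the Driven URL is not named in this section, so the sketch scans the file for any value that looks like an http(s) URL, and it only tests whether a TCP connection can be opened, not whether the Driven server answers correctly. The helper names find_urls and can_connect are hypothetical.

```python
import os
import re
import socket
from urllib.parse import urlparse

# Assumption: the Driven location is stored as an http(s) URL somewhere in the file.
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(properties_text):
    """Return every http(s) URL found on non-comment lines of a .properties file."""
    urls = []
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines and the two comment styles used in .properties files.
        if not line or line.startswith(("#", "!")):
            continue
        # .properties files may escape ':' and '=' with a backslash.
        line = line.replace("\\:", ":").replace("\\=", "=")
        urls.extend(URL_PATTERN.findall(line))
    return urls


def can_connect(url, timeout=5.0):
    """Return True if a TCP connection to the URL's host and port can be opened."""
    parsed = urlparse(url)
    if not parsed.hostname:
        return False
    try:
        port = parsed.port or (443 if parsed.scheme == "https" else 80)
    except ValueError:
        # The port part of the URL is not a valid number.
        return False
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    conf_dir = os.environ.get("HADOOP_CONF", ".")
    path = os.path.join(conf_dir, "cascading-service.properties")
    if not os.path.isfile(path):
        print(f"{path} not found")
    else:
        with open(path, encoding="utf-8") as handle:
            for url in find_urls(handle.read()):
                status = "reachable" if can_connect(url) else "NOT reachable"
                print(f"{url}: {status}")
```

A failed connection from a cluster node while the browser login still works usually points to a network or firewall issue between the cluster and the server rather than to the server itself.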
Driven Plugin runs out of memory
Recent releases include bug fixes for memory leaks.
Make sure that you are on the latest version of Driven
Ensure that you are running your Cascading application with sufficient memory. Run your application with the following JVM settings: -Xms4096m -Xmx4096m
Driven Plugin is making the application slow
A Cascading application can run slowly for several reasons: the Driven plugin may be taking extra time to collect slice data from the Hadoop NameNode, memory resources may be insufficient, or connectivity to the Driven server may have been lost.
Historically, we have found that Hadoop NameNodes are often underprovisioned. The Driven plugin collects the job execution data associated with each slice from the NameNode.
The cluster has lost connectivity with the Driven server.
The Cascading application is complex and processes large volumes of data on a large cluster (these factors determine how much data is collected and transmitted). In this case, you can reduce the number of events that the Driven plugin polls from the Hadoop NameNode and the volume of telemetry data that is sent to the Driven server.
To begin, you can suppress transmission of the slice data by setting the following property in the file cascading.properties
Note that the Driven plugin's collection and transmission of telemetry data is decoupled from the actual execution of your application on the cluster. Although your client application does not finish until both the application processing and the plugin's transmission are complete, no SLAs are compromised by the additional latency that data transmission may introduce.
Server runs out of disk space (or data appears corrupted)
The Driven application is useful because of the unique insights that developer and IT operations organizations receive from its data. The volume of data collected per application run is proportional to the complexity of the application, the size of the data, and the size of the cluster. A single-node install of the Driven application is not recommended for large, complex workloads.
Perform the following steps to fix the disk space shortage:
Step 1: Validate that you are running out of disk space by inspecting the admin console
Figure 1: Driven Admin Console
Step 2: Add a new node to the persistence cluster
Refer to the Elasticsearch documentation to learn how to add new nodes to the cluster.
Add the new host to the list of hosts that participate in the Elasticsearch cluster (leave the list blank for single-node systems):
Step 3: Recreate from your last snapshot of Driven
Use the latest snapshot of the Driven server. See Implementing a backup strategy for your Driven Application for information about backups and the latest snapshot of the Driven server.
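As a complement to the admin console check in Step 1, disk usage can also be measured directly on the server host. The following is a minimal sketch, not part of Driven: the data directory is a placeholder that you must replace with the location where your installation stores its data, and the 85 percent threshold is an arbitrary assumption rather than a documented limit.

```python
import shutil


def disk_usage_percent(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report a size of zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def is_low_on_space(path, threshold_percent=85.0):
    """Return True if usage is at or above the threshold (an assumed value)."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    # Placeholder: replace with the directory that holds your Driven data.
    data_dir = "."
    percent = disk_usage_percent(data_dir)
    print(f"{data_dir}: {percent:.1f}% used")
    if is_low_on_space(data_dir):
        print("Disk space is running low; consider adding a node (Step 2).")
```

Running this periodically, for example from a scheduled job, gives an early warning before the server actually runs out of space.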
Capture telemetry data sent by a particular application to Driven
You can run the Driven plugin in archive mode. If archive mode is enabled, all records that are sent to the Driven server are also written to disk, even if the server is unreachable. This is useful when the Driven server is unavailable, because the archive can be replayed later when the server is reachable again. Sending data to the server is idempotent, so replaying data does not corrupt data that has already been recorded.
You can set the parameter in cascading-service.properties, or export the archive directory as an environment variable:
$ export DRIVEN_ARCHIVE_DIR=/path/to/archive/directory
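Before launching the application, it can help to confirm that the archive directory is usable. The sketch below is an illustration only: it reads the DRIVEN_ARCHIVE_DIR variable shown above and checks that it points to an existing, writable directory. Whether the plugin creates a missing directory on its own is not stated here, so the sketch simply reports that case. The function name archive_dir_problem is hypothetical.

```python
import os


def archive_dir_problem(path):
    """Return a description of what is wrong with `path`, or None if it looks usable."""
    if not path:
        return "DRIVEN_ARCHIVE_DIR is not set"
    if not os.path.isdir(path):
        return f"{path} is not an existing directory"
    # Writing files into a directory needs both write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        return f"{path} is not writable by the current user"
    return None


if __name__ == "__main__":
    problem = archive_dir_problem(os.environ.get("DRIVEN_ARCHIVE_DIR", ""))
    print(problem or "archive directory looks usable")
```

Run the check as the same user that launches the Cascading application, because the permission test applies to the current user only.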
Stopping Unresponsive Applications
There are times when your application stops sending metrics and becomes unresponsive for an unknown reason. Driven no longer receives updates; however, the application's status still shows as Running. If left unattended, unresponsive applications can render the Status timeline graph useless.
Filtering for Unresponsive Applications
To quickly find all unresponsive applications, set the Search filter to the Running status; the All Application view then displays the results. From the application list, select the desired unresponsive application.
Figure 2: The search results are displayed in the application list
Select the desired application by clicking on its link. The Performance view will appear showing the application’s flow and status state. In this example, the tpcds_q40 application is selected.
Figure 3: Use the Mark as stopped button to stop an unresponsive application with a status state of Running
The application’s status state is displayed in the top right corner of the page. In this example, the status is Running and associated with a Warning icon. This combination indicates that the application is unresponsive while the Driven Server is still waiting for processing updates. Once you have determined that this is the specific application you want to stop, click Mark as stopped.
Before the application is actually stopped, a confirmation message appears, asking you to either confirm or cancel the stop.
Click Confirm to stop the application. The status state of the application is displayed as Stopped.