Difference between revisions of "Fault Tolerance"

From MTConnect® User's Portal
Jump to: navigation, search
m
Line 9: Line 9:
 
The application failure scenario is easy to manage if the application persists the next sequence number after it processes each response. The MTConnect® protocol provides a simple recovery strategy that only involves reissuing the previous request with the recovered next sequence number.
 
The application failure scenario is easy to manage if the application persists the next sequence number after it processes each response. The MTConnect® protocol provides a simple recovery strategy that only involves reissuing the previous request with the recovered next sequence number.
  
There is the risk of missing some Events, Samples, and Condition if the time between requests exceeds the capacity of the Agent’s buffer. In this case, there is no record of the missing information and it is lost. If the application automatically restarts after failure, the intervening data can be quickly recovered.
+
There is the risk of missing some Events, Samples, and Condition if the time between requests exceeds the capacity of the Agent’s buffer. In this case, there is no record of the missing information and it is lost. If the application automatically restarts after failure, the intervening data can be quickly recovered. If this cannot be done, the current state of the device can be retrieved and the application can continue from that point onward. See Figure 1.
  
[[File:Test.png|xxx]]
+
[[File:Figure1.PNG|400px|none|thumb|Figure 1: Application Failure and Recovery]]
 +
 +
=== Agent Failure ===
 +
 
 +
Agent failure is the more complex scenario and requires the use of the instanceId. The instanceId was created to facilitate recovery when the Agent fails and the application is unaware. Since HTTP is a connectionless protocol, there is no way for the application to easily detect that the Agent has restarted, the buffer has been lost, and the sequence number has been reset to 1. It should also be noted that all values will be reinitialized to UNAVAILABLE upon agent restart except for data items that are constrained to single values. See Unavailability of Data for a full explanation.
 +
 
 +
In Figure 2, the instanceId is increased from 1 to 2 indicating that there was a discontinuity in the sequence numbers and all values for the data items are reset to UNAVAILABLE. When the application detects the change in instanceId, it MUST reset its next sequence number and retry its request from sequence number 1. The next request will retrieve all data starting from the first available event or sample.

Revision as of 15:08, 24 July 2013

Fault Tolerance and Recovery

MTConnect® does not provide a guaranteed delivery mechanism. The protocol places the responsibility for recovery on the application.

//we could elaborate on this more

Application Failure

The application failure scenario is easy to manage if the application persists the next sequence number after it processes each response. The MTConnect® protocol provides a simple recovery strategy that only involves reissuing the previous request with the recovered next sequence number.

There is the risk of missing some Events, Samples, and Condition if the time between requests exceeds the capacity of the Agent’s buffer. In this case, there is no record of the missing information and it is lost. If the application automatically restarts after failure, the intervening data can be quickly recovered. If this cannot be done, the current state of the device can be retrieved and the application can continue from that point onward. See Figure 1.

Figure 1: Application Failure and Recovery

Agent Failure

Agent failure is the more complex scenario and requires the use of the instanceId. The instanceId was created to facilitate recovery when the Agent fails and the application is unaware. Since HTTP is a connectionless protocol, there is no way for the application to easily detect that the Agent has restarted, the buffer has been lost, and the sequence number has been reset to 1. It should also be noted that all values will be reinitialized to UNAVAILABLE upon agent restart except for data items that are constrained to single values. See Unavailability of Data for a full explanation.

In Figure 2, the instanceId is increased from 1 to 2 indicating that there was a discontinuity in the sequence numbers and all values for the data items are reset to UNAVAILABLE. When the application detects the change in instanceId, it MUST reset its next sequence number and retry its request from sequence number 1. The next request will retrieve all data starting from the first available event or sample.