Fault Tolerance

From MTConnect® User's Portal
Revision as of 10:58, 25 July 2013 by Tjones25 (Talk | contribs)


Fault Tolerance and Recovery

MTConnect® does not provide a guaranteed delivery mechanism. The protocol places the responsibility for recovery on the application.

Application Failure

The application failure scenario is easy to manage if the application persists the next sequence number after it processes each response. The MTConnect® protocol provides a simple recovery strategy that only involves reissuing the previous request with the recovered next sequence number.
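The persistence step described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the JSON state file, the helper names, and the `sample?from=...&count=...` request form are assumptions made for the example and are not defined on this page.

```python
import json
import os
import tempfile


def save_next_sequence(path, next_sequence):
    """Persist the next sequence number after a response has been processed.

    The value is written to a temporary file and then renamed, so a crash
    in the middle of the write cannot leave a half-written state file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"nextSequence": next_sequence}, handle)
    os.replace(tmp_path, path)


def load_next_sequence(path, default=1):
    """Recover the persisted next sequence number, or `default` if none exists."""
    try:
        with open(path) as handle:
            return int(json.load(handle)["nextSequence"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        return default


def build_sample_url(agent_url, next_sequence, count=100):
    """Build the request to reissue after a restart (URL form is an assumption)."""
    return f"{agent_url.rstrip('/')}/sample?from={next_sequence}&count={count}"
```

On restart, the application would call `load_next_sequence` and reissue its request with the URL returned by `build_sample_url`.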

There is a risk of missing some Events, Samples, and Condition if more data arrives between requests than the Agent’s buffer can hold. In this case, the Agent keeps no record of the overwritten information and it is lost. If the application restarts automatically after a failure, the intervening data can usually be recovered quickly because it is still in the buffer. If the data is no longer available, the application can retrieve the current state of the device and continue from that point onward. See Figure 1.

Figure 1: Application Failure and Recovery
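The choice between resuming from the saved position and falling back to the current state can be sketched as below. The sketch assumes the Agent reports the sequence number of the oldest entry still in its buffer (called `first_sequence` here); that name and the `"sample"`/`"current"` action labels are illustrative assumptions, not definitions from this page.

```python
def choose_recovery(saved_next_sequence, first_sequence):
    """Decide how a restarted application should resume.

    Returns a tuple (action, from_sequence):
      ("sample", n)     resume streaming from sequence number n
      ("current", None) data was lost; fetch the current state and continue
    """
    if saved_next_sequence < first_sequence:
        # The entries the application still needed have already been
        # overwritten in the Agent's buffer, so they cannot be recovered.
        return ("current", None)
    # The saved position is still inside the buffer: nothing was lost.
    return ("sample", saved_next_sequence)
```

This check covers application failure only. A restarted Agent is handled separately through the instanceId.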

Agent Failure

Agent failure is the more complex scenario and requires the use of the instanceId. The instanceId was created to facilitate recovery when the Agent fails and the application is unaware of the failure. Because each HTTP request is independent and no session is kept between requests, the application cannot easily detect that the Agent has restarted, the buffer has been lost, and the sequence number has been reset to 1. Note also that all values are reinitialized to UNAVAILABLE when the Agent restarts, except for data items that are constrained to a single value. See Unavailability of Data for a full explanation.

In Figure 2, the instanceId is increased from 1 to 2 indicating that there was a discontinuity in the sequence numbers and all values for the data items are reset to UNAVAILABLE. When the application detects the change in instanceId, it MUST reset its next sequence number and retry its request from sequence number 1. The next request will retrieve all data starting from the first available event or sample.

Figure 2: Agent Failure and Recovery
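The application-side check shown in Figure 2 can be sketched as follows. It is a minimal illustration that assumes the application has already extracted the instanceId and the next sequence number from each response; the `ClientState` class and `apply_header` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientState:
    instance_id: Optional[int] = None  # instanceId seen in the last response
    next_sequence: int = 1             # sequence number to request next


def apply_header(state, instance_id, next_sequence):
    """Update the client state from one response.

    Returns True if the response can be processed normally, or False if the
    Agent restarted and the request must be retried from sequence number 1.
    """
    if state.instance_id is not None and instance_id != state.instance_id:
        # The instanceId changed: the sequence numbers are discontinuous,
        # so the saved position is meaningless and MUST be reset.
        state.instance_id = instance_id
        state.next_sequence = 1
        return False
    state.instance_id = instance_id
    state.next_sequence = next_sequence
    return True
```

When `apply_header` returns False, the application would discard its saved position and reissue the request starting at `state.next_sequence`, which is now 1.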

Data Persistence and Recovery

The implementer of the Agent can decide on the strategy regarding the storage of Events, Condition, and Samples. In the simplest form, the Agent can persist no data and hold all the results in volatile memory. If the Agent has a method of persisting the data fast enough and has sufficient storage, it MAY save as much or as little data as is practical in a recoverable storage system.

If the Agent can recover data and sequence numbers from a storage system, it MUST NOT change the instanceId when it restarts. This will indicate to the application that it need not reset the next sequence number when it requests the next set of data from the Agent.

If the Agent persists no data, then it MUST change the instanceId to a different value when it restarts. This will ensure that every application receiving information from the Agent will know to reset the next sequence number.
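On the Agent side, the two rules above amount to a single decision at startup. The following sketch is one possible way to express it; the function and parameter names are hypothetical, and how the Agent stores its previous instanceId is left open.

```python
def instance_id_on_restart(previous_instance_id, data_recovered, fresh_instance_id):
    """Pick the instanceId an Agent should report after a restart.

    previous_instance_id: the value used before the restart, or None if unknown
    data_recovered:       True if data and sequence numbers were restored
    fresh_instance_id:    a newly generated candidate value
    """
    if data_recovered and previous_instance_id is not None:
        # Sequence numbers continue where they left off, so the instanceId
        # MUST NOT change and applications keep their next sequence number.
        return previous_instance_id
    if fresh_instance_id == previous_instance_id:
        # A restart without recovered data MUST be visible to applications.
        raise ValueError("new instanceId must differ from the previous one")
    return fresh_instance_id
```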

The instanceId can be any number that is guaranteed to change every time the Agent restarts. If the Agent takes longer than one second to start, the UNIX time (seconds since January 1, 1970) MAY be used as the instanceId to identify an instance of the MTConnect® Agent.
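A minimal sketch of the UNIX-time option follows. The function name is hypothetical, and the optional `now` argument exists only to make the example easy to test.

```python
import time


def unix_time_instance_id(now=None):
    """Return the UNIX time in whole seconds for use as an instanceId.

    Two starts more than one second apart always produce different values,
    which is why the Agent's startup must take longer than one second.
    """
    return int(time.time() if now is None else now)
```

The value is truncated to whole seconds, so an Agent that can restart within the same second would need a different scheme.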