Fault Tolerance

From MTConnect® User's Portal

Fault Tolerance and Recovery

MTConnect® does not provide a guaranteed delivery mechanism. The protocol places the responsibility for recovery on the application.

Application Failure

The application failure scenario is easy to manage if the application persists the next sequence number after it processes each response. The MTConnect® protocol provides a simple recovery strategy that only involves reissuing the previous request with the recovered next sequence number.
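
A minimal sketch of this strategy, assuming the application keeps its position in a small local file. The file layout, the helper names, and the `/sample?from=...&count=...` request shape are illustrative assumptions, not something this section specifies.

```python
import json
import os
from pathlib import Path


def save_next_sequence(path: Path, next_sequence: int) -> None:
    """Persist the next sequence number after a response has been processed."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"next_sequence": next_sequence}))
    # os.replace swaps the file in one step, so a crash never leaves a half-written state file.
    os.replace(tmp, path)


def load_next_sequence(path: Path, default: int = 1) -> int:
    """Recover the persisted position, or fall back to `default` if none exists."""
    try:
        return int(json.loads(path.read_text())["next_sequence"])
    except (FileNotFoundError, KeyError, ValueError):
        return default


def build_sample_request(base_url: str, next_sequence: int, count: int = 100) -> str:
    """Rebuild the previous request using the recovered sequence number (assumed URL shape)."""
    return f"{base_url}/sample?from={next_sequence}&count={count}"
```

On restart, the application would call `load_next_sequence` and reissue the request that `build_sample_request` produces.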

There is a risk of missing some Events, Samples, and Condition if more data arrives between requests than the Agent’s buffer can hold. In this case, there is no record of the missing information and it is lost. If the application restarts automatically after a failure, the intervening data can usually be recovered before it leaves the buffer. If this cannot be done, the application can retrieve the current state of the device and continue from that point onward. See Figure 1.

Figure 1: Application Failure and Recovery
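
The choice between the two recovery paths above can be sketched as follows. This assumes the application can learn the oldest sequence number still held in the Agent's buffer; the parameter name `first_sequence` and the returned labels are hypothetical.

```python
def choose_recovery_request(persisted_next: int, first_sequence: int) -> str:
    """Pick a recovery path after an application restart.

    Returns "sample" when the persisted position is still inside the Agent's
    buffer, so the intervening data can be requested; returns "current" when
    the buffer has already moved past it and the gap cannot be recovered.
    """
    if persisted_next >= first_sequence:
        return "sample"
    return "current"
```

For example, with a persisted position of 500 and a buffer that now starts at 900, the function returns "current" because sequences 500 through 899 are no longer available.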

Agent Failure

Agent failure is the more complex scenario and requires the use of the instanceId. The instanceId was created to facilitate recovery when the Agent fails and the application is unaware of the failure. Since HTTP is a connectionless protocol, the application cannot easily detect that the Agent has restarted, that the buffer has been lost, and that the sequence number has been reset to 1. In addition, all values are reinitialized to UNAVAILABLE upon Agent restart, except for data items that are constrained to single values. See Unavailability of Data below for a full explanation.

In Figure 2, the instanceId is increased from 1 to 2 indicating that there was a discontinuity in the sequence numbers and all values for the data items are reset to UNAVAILABLE. When the application detects the change in instanceId, it MUST reset its next sequence number and retry its request from sequence number 1. The next request will retrieve all data starting from the first available event or sample.

Figure 2: Agent Failure and Recovery
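
The instanceId check described above might look like the following on the application side. This is a sketch under stated assumptions: `ClientState` and `apply_header` are hypothetical names, and the two arguments stand for the instanceId and next sequence number reported with each response.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientState:
    instance_id: Optional[int] = None  # None until the first response is seen
    next_sequence: int = 1


def apply_header(state: ClientState, instance_id: int, next_sequence: int) -> bool:
    """Update `state` from a response; return True if an Agent restart was detected.

    When True is returned, the position has been reset to 1 and the caller
    should retry its request from sequence number 1.
    """
    if state.instance_id is not None and instance_id != state.instance_id:
        state.instance_id = instance_id
        state.next_sequence = 1
        return True
    state.instance_id = instance_id
    state.next_sequence = next_sequence
    return False
```

The first response only records the instanceId; a restart is reported only when a later response carries a different value.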

Data Persistence and Recovery

The implementer of the Agent can decide on the strategy regarding the storage of Events, Condition, and Samples. In the simplest form, the Agent can persist no data and hold all the results in volatile memory. If the Agent has a method of persisting the data fast enough and has sufficient storage, it MAY save as much or as little data as is practical in a recoverable storage system.

If the Agent can recover data and sequence numbers from a storage system, it MUST NOT change the instanceId when it restarts. This will indicate to the application that it need not reset the next sequence number when it requests the next set of data from the Agent.

If the Agent persists no data, then it MUST change the instanceId to a different value when it restarts. This will ensure that every application receiving information from the Agent will know to reset the next sequence number.

The instanceId can be any unique number that is guaranteed to change every time the Agent restarts. If the Agent takes longer than one second to start, the UNIX time (seconds since January 1, 1970) MAY be used as the instanceId to identify an instance of the MTConnect® Agent.
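
The rules above can be sketched as a single startup decision on the Agent side. The function name and the idea of passing in a recovered instanceId are assumptions for illustration; how an Agent actually stores and recovers its data is left to the implementer.

```python
import time
from typing import Optional


def choose_instance_id(recovered_instance_id: Optional[int],
                       now: Optional[float] = None) -> int:
    """Select the instanceId to report after a restart.

    If data and sequence numbers were recovered from storage, keep the old
    instanceId so applications do not reset their position. Otherwise use the
    UNIX time in whole seconds, which differs between restarts as long as the
    Agent takes longer than one second to start.
    """
    if recovered_instance_id is not None:
        return recovered_instance_id
    return int(time.time() if now is None else now)
```

The `now` parameter exists only so the behavior can be checked with a fixed clock value.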

Unavailability of Data

Every time the Agent is initialized, all values MUST be set to UNAVAILABLE unless they are constant-valued data items. This MUST occur even during restarts so that the application can detect a discontinuity in the data and easily determine the gap between the last reported valid values.
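
As a sketch of this initialization rule, suppose the Agent knows each data item's id and, for constant-valued items, the single value it is constrained to. The mapping shape and the item ids in the example are hypothetical.

```python
from typing import Dict, Optional

UNAVAILABLE = "UNAVAILABLE"


def initial_values(constants_by_id: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Build the Agent's starting values.

    `constants_by_id` maps each data item id to its constant value, or to None
    when the item is not constant-valued. Non-constant items start UNAVAILABLE.
    """
    return {
        item_id: constant if constant is not None else UNAVAILABLE
        for item_id, constant in constants_by_id.items()
    }
```

Calling `initial_values({"temp1": None, "fixed1": "ON"})` gives `{"temp1": "UNAVAILABLE", "fixed1": "ON"}`.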

If no data is available, the value for the data item in the stream MUST be UNAVAILABLE. UNAVAILABLE indicates that the value is currently indeterminate and that no assumptions can be made about it. MTConnect® supports multiple data sources per device, and for that reason, every data item MUST be considered independent and MUST maintain its own connection status.

In the following example, the data source for a temperature sensor becomes temporarily disconnected from the Agent. At this point the value changes from the current temperature to UNAVAILABLE, since the temperature can no longer be determined. In Figure 3, the temperature stays around 100 until the source disconnects; when the source later reconnects, the temperature is 30. Assumptions SHOULD NOT be made about the temperature between these two points, since no information was available.

Figure 3: Unavailable Data
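
One way an application might honor this rule is to split a series at every UNAVAILABLE value, so that nothing is interpolated across the gap. This is an illustrative sketch; the function name and the assumption that values arrive as strings in time order are not from this section.

```python
from typing import Iterable, List

UNAVAILABLE = "UNAVAILABLE"


def split_at_unavailable(values: Iterable[str]) -> List[List[float]]:
    """Split a time-ordered series into runs of known values.

    Each UNAVAILABLE value ends the current run, so later processing
    (averaging, plotting, interpolation) never spans a gap.
    """
    segments: List[List[float]] = []
    current: List[float] = []
    for value in values:
        if value == UNAVAILABLE:
            if current:
                segments.append(current)
                current = []
        else:
            current.append(float(value))
    if current:
        segments.append(current)
    return segments
```

For the scenario in Figure 3, readings near 100 followed by UNAVAILABLE and then 30 yield two separate segments rather than one line falling from 100 to 30.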

If data for multiple data items is delivered from one source and that source becomes unavailable, all data items associated with that source MUST have the value UNAVAILABLE. This MUST be a synchronous operation in which all related data items receive that value with the same timestamp. The value will remain UNAVAILABLE until the data source has reconnected.
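
A minimal sketch of the synchronous update, assuming the Agent tracks which data item ids belong to each source. The `Observation` type and the function name are hypothetical; the point is that one timestamp is taken once and shared by every affected item.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional

UNAVAILABLE = "UNAVAILABLE"


@dataclass(frozen=True)
class Observation:
    data_item_id: str
    value: str
    timestamp: datetime


def mark_source_unavailable(item_ids: Iterable[str],
                            timestamp: Optional[datetime] = None) -> List[Observation]:
    """Create UNAVAILABLE observations for every data item of a lost source.

    The timestamp is captured once, before the loop, so all observations
    carry exactly the same value.
    """
    ts = timestamp if timestamp is not None else datetime.now(timezone.utc)
    return [Observation(item_id, UNAVAILABLE, ts) for item_id in item_ids]
```

Data items fed by other sources are not passed in, so they keep their own values and connection status.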