Functional safety analysis of FlexRay bus

Electronic systems have been used in automobiles for decades and have greatly improved vehicle safety, energy efficiency, and environmental performance. As development has progressed, many of these systems need to share and exchange information. To save cabling, they are connected into distributed embedded systems that rely on communication. Currently, 90% of such systems worldwide are based on the CAN bus. FlexRay is the de facto standard for the next generation of communication protocols, so its functional safety is critical.

1 IEC61508 functional safety requirements

Currently, vehicle control systems are transitioning to x-by-wire technology, such as steer-by-wire and brake-by-wire. The ultimate goal of x-by-wire systems is to eliminate mechanical backups, because doing so can reduce costs, increase design flexibility, expand the scope of application, and create conditions for new features. However, eliminating the mechanical backup greatly raises the reliability required of the electronic system. A car is a moving object in a moving environment, and a malfunction may injure its occupants and others. Eliminating mechanical backups raises the requirement on the electronic system from today's fail-silent behavior to fail-safe behavior [1].

Internationally, the standard IEC 61508 has been established for functional safety in industrial applications; it is mainly concerned with the safety of the controlled equipment and its control system. Although it also applies to automobiles, a car has not only these functional safety problems but also safety concerns for the whole vehicle system arising from functional changes, so the automotive industry is developing a corresponding standard, ISO 26262. Automotive functional safety is divided into 4 levels. The highest is ASIL D, with a corresponding failure probability of <10^-8/h, which is equivalent to SIL3 of IEC 61508. Based on practical experience, the failure probability assigned to communication is <10^-10/h. An introduction can be found in reference [1].

The range of safety-critical applications is now expanding [1], and some systems that were not counted before are now included. For example, the seat adjustment subsystem in the pre-crash safety system (Pre-Safe), the lighting control subsystem in the brake assist system, and the telematics subsystem that automatically calls for assistance after a collision will all be regarded as safety-critical systems.

1.1 Communication failures that cause system safety risks

Communication failure manifests itself in five ways [11]. The first is a value domain error. The second is a time domain error, which is what distinguishes industrial from consumer applications: if a message cannot be delivered before its deadline, it loses its practical significance. For example, if sensor messages related to airbag deployment cannot be delivered within a few ms, a safety problem results. The third, which arises in multicast or broadcast communication, is a data consistency error (Byzantine error), in which the results received by the individual nodes are inconsistent. It causes systemic failures, and any strategy for dealing with it must consider all relevant nodes at the same time. The fourth is a system crash. Besides hardware failure, it can be caused by interference or by software, such as a babbling idiot that blocks communication. The fifth is frame loss and short-term failure, such as a recoverable bus-off state, an equivalent offline state induced by a bug, and a small group (clique) error.

1.2 Allowable failure rate of communication

For analyzing the impact of communication failure on system safety, reference [2] provides a method that calculates the duration of a communication failure from the possible duration of transient interference and then derives the system failure rate under an assumed communication failure rate. In its example, road sections where the electric field exceeds 100 V/m may cause communication failure, the failure rate there is approximately 5×10^-3, the vehicle speed is 90 km/h, and the identified possible failure time is about 74 s. The communication period is 6 ms, and frame loss in 7 consecutive cycles is regarded as a system failure. Under these conditions the system failure rate is 1.6409×10^-10, which is considered to meet the safety requirements of SIL4.

This analysis method is effective, but it rests on too many assumptions. The bit error rate varies over a wide range. A change of frame length changes the failure rate of a single transmission. The interference duration is assumed. The criterion of 7 consecutive lost frames also depends on the application: a vehicle at 90 km/h that loses brake control for 42 ms travels about 1 m, and the consequences may be entirely different in other situations. It is also assumed that the whole SIL4 budget is allocated to communication and that the failure rate of the CPU and its software is negligible. With software growing ever larger, this assumption is unreasonable.

On the other hand, other forms of communication failure should also be considered when determining the system failure rate. For example, when a small group error occurs [5], the time until a collision is detected depends on the relative clock drift: the more accurate the clocks, the longer it takes, and the longer the failure lasts. In reference [5], the conflict was detected 300 ms after a small group was artificially created, far beyond the 42 ms mentioned above.

Therefore, articles that discuss system safety in general (such as references [1] and [12]) specify the failure rate of communication separately, as 1/100 of the failure rate of the corresponding safety level.
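The two figures used above (the distance covered during a 42 ms outage and the 1/100 budget for communication) can be checked with a minimal sketch. The function and constant names are hypothetical and chosen only for illustration; the input values are the ones quoted in the text.

```python
# Values quoted in the text from the example in reference [2].
SPEED_KMH = 90.0      # vehicle speed
CYCLE_S = 6e-3        # communication period (6 ms)
LOST_CYCLES = 7       # consecutive lost cycles regarded as a system failure


def outage_distance_m(speed_kmh, cycle_s, lost_cycles):
    """Distance travelled while communication is lost."""
    return speed_kmh / 3.6 * cycle_s * lost_cycles


def communication_budget(level_failure_rate_per_h, share=1 / 100):
    """Failure-rate budget for communication: 1/100 of the safety level's rate."""
    return level_failure_rate_per_h * share


outage_ms = CYCLE_S * LOST_CYCLES * 1e3
distance = outage_distance_m(SPEED_KMH, CYCLE_S, LOST_CYCLES)
print(f"outage: {outage_ms:.0f} ms, distance: {distance:.2f} m")
print(f"budget for a 1e-7/h level: {communication_budget(1e-7):.1e}/h")
```

With a system target of 10^-7/h, the same rule yields the 10^-9/h communication share used for SIL2 in Section 3.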

1.3 Factors Affecting Communication Failure Rate

The functional safety level depends on the coverage of fault detection. If some faults are not detected (because they are not recognized or cannot be detected), the corresponding failure scenarios are not counted, and the assigned safety level is wrong.

Reference [1] introduces the concept of the Safe Failure Fraction (SFF). Failures are divided into dangerous (hazard) failures and safe failures, and each kind is either detected or undetected. SFF is the share of all failures made up of safe failures and detected dangerous failures. Diagnostic Coverage (DC) is the share of dangerous failures that are detected. It follows that SFF is a linear function of DC. In IEC 61508 the achievable SIL depends on SFF: SIL3 can be reached with a tolerance of one fault when SFF is 90%~99%. DC therefore also determines the SIL that can be achieved. According to reference [1], the probability of transient faults is two orders of magnitude larger than that of hardware failure, so it can be roughly inferred that the diagnostic coverage of transient faults should reach 90%~99%. Dangerous failures may be caused by communication failure, so diagnostic coverage becomes an important criterion for evaluating communication protocols.
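The linear relation between SFF and DC can be illustrated with a short sketch. The failure-rate values below are hypothetical and serve only as an example; the names lam_safe, lam_dd, and lam_du (safe, dangerous detected, dangerous undetected) are notation assumed here, not taken from reference [1].

```python
def sff(lam_safe, lam_dd, lam_du):
    """Safe Failure Fraction: (safe + dangerous detected) / all failures."""
    return (lam_safe + lam_dd) / (lam_safe + lam_dd + lam_du)


def dc(lam_dd, lam_du):
    """Diagnostic Coverage: dangerous detected / all dangerous failures."""
    return lam_dd / (lam_dd + lam_du)


def sff_from_dc(lam_safe, lam_danger, dc_value):
    """Same SFF written as a linear function of DC."""
    total = lam_safe + lam_danger
    return lam_safe / total + (lam_danger / total) * dc_value


# Hypothetical failure rates (arbitrary common unit).
lam_safe, lam_dd, lam_du = 40.0, 54.0, 6.0

coverage = dc(lam_dd, lam_du)
direct = sff(lam_safe, lam_dd, lam_du)
linear = sff_from_dc(lam_safe, lam_dd + lam_du, coverage)
print(f"DC = {coverage:.2f}, SFF = {direct:.2f}, SFF from DC = {linear:.2f}")
```

Both ways of computing SFF agree, and for fixed shares of safe and dangerous failures a higher DC raises SFF in direct proportion.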

In communication, a CRC can fail to detect some errors, which is an obvious gap in diagnostic coverage. The uncovered rate equals the rate of erroneous frames that go undetected, such as the undetected error frames of CAN [10].

Value domain errors, time domain errors, and frame loss in communication are dangerous (hazard) failures that can be diagnosed; they are the main object of this analysis. Counterfeit errors, Byzantine errors, and the like are undetected dangerous failures. When a small group error occurs, both frame loss and Byzantine errors may occur. The equivalent offline failure of CAN is also a dangerous failure not covered by diagnostics [9]. It is still difficult to calculate what proportion of all dangerous failures these uncovered failures represent, because a probability model for them is hard to establish. Qualitatively, however, diagnostic coverage can be improved (and the SIL level raised) only if counterfeit errors, Byzantine errors, and small group errors are diagnosed.

2 Introduction to FlexRay

X-by-wire technology can improve a car's handling, reduce production and operating costs, and improve safety, energy efficiency, environmental performance, and comfort, so it has become an important part of progress in vehicle technology. However, eliminating the mechanical or hydraulic backup greatly raises the reliability required of the control devices and their communication, which in turn places stricter requirements on the bandwidth and determinism of communication. The CAN bus cannot meet this bandwidth requirement and is not sufficiently deterministic, which led to the development of FlexRay.

According to the standard [3], FlexRay supports topologies such as bus, star, and tree. It provides a two-channel controller structure that can be configured flexibly, either for redundant communication or for independent operation of each channel. Each channel can be configured for up to 10 Mb/s. FlexRay is a time-triggered communication protocol synchronized by a distributed clock. The schedule of the system is defined by cycles, static slots, and minislots. A cycle has a fixed number of static slots and minislots, and their durations are equal, as determined by configuration. A node can occupy multiple static slots in a cycle. Static slots can also be multiplexed, that is, the same static slot can be used by different nodes in different cycles. The data field of a FlexRay frame can be up to 254 bytes. The header carries control information such as the identifier and frame length and has its own CRC, and the trailer has a 24-bit CRC covering the full frame. FlexRay has a Bus Guardian design to protect against time domain errors.
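The static-slot schedule and slot multiplexing described above can be pictured with a small sketch. It is only an illustration: the class, the field names base_cycle and repetition, and the node names are hypothetical, and the model ignores minislots and the second channel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotAssignment:
    node: str        # hypothetical node name
    slot: int        # static slot number within the cycle
    base_cycle: int  # first cycle in which the node owns the slot
    repetition: int  # the node owns the slot every `repetition` cycles


def owner(assignments, slot, cycle):
    """Return the node that may transmit in `slot` during `cycle`, or None."""
    for a in assignments:
        if a.slot == slot and cycle % a.repetition == a.base_cycle:
            return a.node
    return None


schedule = [
    SlotAssignment("C", slot=1, base_cycle=0, repetition=1),  # every cycle
    SlotAssignment("C", slot=2, base_cycle=0, repetition=1),  # one node, two slots
    SlotAssignment("A", slot=3, base_cycle=0, repetition=2),  # even cycles
    SlotAssignment("B", slot=3, base_cycle=1, repetition=2),  # odd cycles
]

for cycle in range(4):
    print(cycle, [owner(schedule, s, cycle) for s in (1, 2, 3)])
```

Node C occupies two static slots in every cycle, while nodes A and B share slot 3 by taking alternate cycles.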

Regarding the shortcomings of FlexRay, reference [4] mentions several points. Physical layer connection is difficult and affects signal integrity; an active star makes it easier but increases cost. The cycle design is subject to many constraints, which makes it difficult. The configuration of synchronization and startup nodes is tied to fault tolerance, which is a challenge. Because resources are limited, upgrading and evolving a system is difficult (contrary to the composability advantage previously claimed for time-triggered protocols; author's note). Reference [5] describes the possibility that separate clock synchronization small groups form in FlexRay: every node is communicating, but there is no effective communication between the two groups, which is a fault condition. The proposed solution is to use 3 cold start nodes and 3 synchronization nodes, but this contradicts the requirement for fault-tolerant time synchronization. Another proposal is to fill the schedule so that small groups cannot form, which conflicts with the requirement to leave room for future upgrades. In short, there is no complete solution. In addition, the clocks may all drift in the same direction [6], and the resulting difference from the application clock causes frames to fail or to be missed. Although FlexRay is designed for high dependability, errors in transmission are left to the application layer to handle, which brings new problems. This article analyzes what happens if they are not handled.

3 Functional safety levels for Audi and BMW FlexRay bus applications

BMW and Audi were the first to use the FlexRay bus in series production. Details of their usage are not available, but some of the parameters given in reference [7] can be used for a preliminary analysis.

3.1 Audi parameters

Audi's cycle is 5 ms. Each cycle has 62 static slots, each slot carries a frame with a 42-byte payload, and the static segment lasts 4.03 ms. There are 8 ECUs that transmit a total of 220 Protocol Data Units (PDUs). These PDUs are combined and finally transmitted in 27 slots. According to the period distribution provided, there are 8 messages with a 5 ms period, 1 with a 10 ms period, 7 with a 20 ms period, and 6 with a 40 ms period. The remaining messages with longer periods are ignored for now.

From the payload, the frame length can be calculated as 500 bits. Assuming a bit error rate of ber = 1×10^-7 (quite good for copper cable), the frame error rate is fer = 5×10^-5/frame.

The number of frames per hour calculated from the periods is n = 7.92×10^5 frames/h. Assuming that each frame is transmitted simultaneously on 2 channels, the probability that both fail is fer^2 = 2.5×10^-9/frame. The probability that all frames are successfully transmitted within 1 hour is P = (1 - fer^2)^n.

The probability of one or more errors in one hour is 1 - P ≈ fer^2 × n = 2.5×10^-9 × 7.92×10^5/h = 1.98×10^-3/h. SIL2 requires a system failure probability of 10^-7/h, of which 10^-9/h is allocated to communication, so there is a huge gap.
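The calculation above can be reproduced with a minimal sketch. The function names are hypothetical; the inputs (ber, frame length, and n) are the values stated in the text, and bit errors and channels are assumed to be independent. The sketch evaluates 1 - P exactly rather than through the linear approximation.

```python
import math


def frame_error_rate(ber, frame_bits):
    """Probability that at least one bit of a frame is corrupted."""
    return -math.expm1(frame_bits * math.log1p(-ber))


def hourly_failure_probability(fer, copies, frames_per_hour):
    """Probability that, for at least one frame in an hour, all copies fail."""
    all_copies_fail = fer ** copies
    return -math.expm1(frames_per_hour * math.log1p(-all_copies_fail))


BER = 1e-7          # assumed bit error rate, as in the text
FRAME_BITS = 500    # frame length from the text
N = 7.92e5          # frames per hour, as stated in the text

fer = frame_error_rate(BER, FRAME_BITS)
p_fail = hourly_failure_probability(fer, copies=2, frames_per_hour=N)
print(f"fer = {fer:.3e}/frame, 1 - P = {p_fail:.3e}/h")
print(f"gap to the 1e-9/h communication share: factor {p_fail / 1e-9:.1e}")
```

Using expm1 and log1p avoids the loss of precision that a direct evaluation of (1 - x)^n would suffer for such small x.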

3.2 Parameters of BMW

Reference [7] also indirectly gives the parameters of BMW. The cycle is 5 ms, each cycle has 91 static slots, each slot carries a frame with a 16-byte payload of which 8 bytes are actually used, and there are 227 PDUs in total. It is known that 4 ms of 2.5 ms messages and 10 slots are used, and these PDUs are not merged. According to the period distribution provided, there are 62 messages with a 5 ms period, 45 with a 10 ms period, 80 with a 20 ms period, and 38 with a 40 ms period. The remaining messages with longer periods are ignored for now.

The payload length differs from message to message. From this distribution, the frame lengths and frame error rates can be calculated for ber = 1×10^-7, giving an average frame error rate of fer = 1.51×10^-5/frame. Assuming that each frame is transmitted simultaneously on 2 channels, the probability that both fail is fer^2 = 2.28×10^-10/frame. The number of frames transmitted, calculated from the periods, is n = 2.79×10^6/h. Similarly, the probability of one or more errors in one hour is 1 - P ≈ fer^2 × n = 2.28×10^-10 × 2.79×10^6/h = 6.36×10^-4/h, which is also much larger than the share that SIL2 allocates to communication.

4 Feasibility of active retransmission scheme

Two authors have suggested a scheme for active retransmission, described in reference [8]. Active retransmission is, conceptually, redundancy in time: frames are sent not only on different physical channels but also in different time slots. The following applies it to the two cases of Section 3.

4.1 Audi

When each frame is scheduled in 2 static slots, the 2 channels provide 4 transmissions, and the probability that all of them fail is much smaller: fer^4 = (5×10^-5)^4 = 6.25×10^-18/frame. The number of frames actually transmitted is doubled, but the content is not, so the calculation still uses n. The probability of one or more errors within 1 h is 1 - P ≈ fer^4 × n = 6.25×10^-18 × 7.92×10^5/h = 4.95×10^-12/h. This meets the requirement that SIL2 allocates to communication.

In theory, the original application occupies 27/62 of the static slots, so doubling it to 27/31 still fits. However, because of the message delivery deadlines, scheduling becomes very difficult, and it is not certain that a solution exists. Little room is left for future expansion and upgrades, which already shows that FlexRay's bandwidth is insufficient.
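The effect of active retransmission on the Audi case can be sketched as follows. The function names are hypothetical; fer and n are the values from Section 3.1, and all transmissions of a frame are assumed to fail independently, which is the assumption behind multiplying the frame error rates.

```python
from fractions import Fraction

FER = 5e-5    # frame error rate from Section 3.1
N = 7.92e5    # frames per hour from Section 3.1


def failure_per_hour(fer, transmissions, frames_per_hour):
    """Linear approximation 1 - P ~ fer**transmissions * n (valid when small)."""
    return fer ** transmissions * frames_per_hour


def slot_utilisation(used_slots, total_slots, copies):
    """Share of static slots occupied when every frame is sent `copies` times."""
    return Fraction(used_slots * copies, total_slots)


two_channels = failure_per_hour(FER, 2, N)      # 2 channels, 1 slot each
with_retransmit = failure_per_hour(FER, 4, N)   # 2 channels, 2 slots each
utilisation = slot_utilisation(27, 62, copies=2)

print(f"without retransmission: {two_channels:.2e}/h")
print(f"with retransmission:    {with_retransmit:.2e}/h")
print(f"static slots used: {utilisation} ({float(utilisation):.0%})")
```

The gain in reliability is paid for with slot occupancy: the share of static slots in use rises from 27/62 to 27/31, which is what makes the scheduling problem hard.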

4.2 BMW

When active retransmission is used once, the probability of one or more errors within 1 h, with fer = 1.51×10^-5, is 1 - P ≈ fer^4 × n = 5.20×10^-20 × 2.79×10^6/h = 1.45×10^-13/h. This meets the requirement that SIL2 allocates to communication.

However, the original BMW schedule already takes up 2/3 of the static slots, and there are not enough free slots for active retransmission. For example, BMW's static segment is 3 ms, so a total of 0.5/3×91 = 15 slots can be arranged in the 2.5~3 ms window. The 2.5 ms messages already occupy 10 slots, so they cannot be transmitted redundantly. This also shows that FlexRay has insufficient bandwidth.

5 Comparison with CAN bus

The BMW system data in reference [7], if transmitted in CAN standard frames, can be inferred to require a bandwidth of at least 2.8 Mb/s, which clearly exceeds what a CAN bus can provide. However, the automatic retransmission-on-error mechanism of the CAN bus makes the communication reliability of the system far better than that of FlexRay.

For example, when ber = 1×10^-7 and the CAN frame length is 108 bits, the frame error rate is fer = 1.08×10^-5/frame. When the number of transmitted frames is n = 2.79×10^6/h (assuming that multiple buses are used to satisfy the bandwidth), about 31 frames per hour are in error. When these 31 frames are retransmitted twice, the total error probability is 31×fer^3 = 31×1.26×10^-15 = 3.9×10^-14, much smaller than the share that SIL2 allocates to communication.

Moreover, if the original scheduling analysis leaves enough time for the automatic retransmission of 2 frames, the impact on delivery time can be shown to be small. The delivery time changes mainly for low-priority messages, and high-priority messages are hardly affected. For example, the delivery time of ten messages with a 2.5 ms period is about 1.2 ms (taking bit stuffing and the interframe space into account), and automatically retransmitting 2 messages within 2.5 ms only increases the delivery time to about 1.5 ms.

The automatic retransmission mechanism of the CAN bus requires far less bandwidth than the active retransmission scheme: almost one ten-thousandth of the latter.

6 Error Tracking of FlexRay Bus

The strength of the CRC check is discussed in reference [12]. With k the checksum length, 2^-k is the upper bound on the fraction of undetected errors; for FlexRay, k = 24 and 2^-24 = 5.9×10^-8. If bit errors are uncorrelated, this bound is multiplied by (ber × frame length)^HD, where HD is the Hamming distance of the CRC polynomial. For a per-hour figure, the result is further multiplied by the number of frames in 1 h. According to the standard [3], HD = 6 when the payload is less than 248 bytes. The calculation here uses a frame length of 256 bytes = 2560 bits. Allowing for idle time, each frame takes 260 μs, so there are 3600/260×10^6 = 1.38×10^7 frames per hour. The number of undetected erroneous frames per hour is 1.38×10^7 × 5.9×10^-8 × (ber × 2560)^6 = 0.81 × (ber × 2560)^6. When ber = 10^-7 it is 2.27×10^-22, and when ber = 10^-5 it is 2.27×10^-10. When interference is not very strong and frames are short, the undetected error frames of FlexRay can therefore still meet the requirement that SIL2 allocates to communication.
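The estimate above can be sketched in a few lines. The function name and parameter names are hypothetical; the inputs are the values used in the text. Because the sketch uses unrounded intermediate values, its results differ from the figures in the text in the second significant digit.

```python
def undetected_frames_per_hour(ber, frame_bits, crc_bits, hamming_distance,
                               frame_time_s):
    """Upper-bound estimate of erroneous frames per hour that pass the CRC.

    Assumes uncorrelated bit errors, as in the text:
    frames/h * 2^-k * (ber * frame length)^HD
    """
    frames_per_hour = 3600.0 / frame_time_s
    crc_bound = 2.0 ** -crc_bits
    return frames_per_hour * crc_bound * (ber * frame_bits) ** hamming_distance


COMMON = dict(frame_bits=2560, crc_bits=24, hamming_distance=6,
              frame_time_s=260e-6)

good_link = undetected_frames_per_hour(ber=1e-7, **COMMON)
noisy_link = undetected_frames_per_hour(ber=1e-5, **COMMON)
print(f"ber = 1e-7: {good_link:.2e} undetected frames/h")
print(f"ber = 1e-5: {noisy_link:.2e} undetected frames/h")
```

Because the bit error rate enters with the power HD = 6, a hundredfold increase in ber raises the undetected-frame rate by twelve orders of magnitude.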

7 Summary

The calculations show that, at ber = 1×10^-7, the functional safety level of FlexRay communication falls far short of the requirement, and there are also problems such as small group errors and clock drift. In addition, because FlexRay lacks a simple and efficient error signaling mechanism like that of the CAN bus, without an active retransmission scheme there is a higher probability of failure from Byzantine errors, which arise when an error is local to some of the receiving nodes. From this point of view, much work remains before FlexRay fully realizes its design goals. In the longer term, it is necessary to solve the existing problems of FlexRay while achieving a speed of 100 Mb/s with Industrial Ethernet.

The methods discussed in this paper also apply to other fieldbuses and to Industrial Ethernet. Many protocols are time-triggered like FlexRay, and their safety relies on higher-level "safety protocols." These "black channel"-based safety protocols generally add countermeasures according to the European standard EN 50159-2 against errors such as repetition, frame loss, insertion, sequence error, data error, delay, and counterfeit, using methods such as a sequence number, timestamp, timer, identifier, address, and additional signature. Other safety protocols consider only hardware link failure and recovery, which is just one form of communication failure. However, these measures are still insufficient and do not cover all branches of the fault tree. Other forms are not handled, such as the Byzantine failure after a local error, the loss of service after a babbling idiot error, and the partial loss of service after a small group error. Some errors can be found by the application, but because of the timing characteristics of the host on which the application runs, the deadline may already have been missed and the error can no longer be corrected. At the communication level, these errors seriously reduce diagnostic coverage and directly affect the SIL level. Even in the process industry, where message cycles are long and an active retransmission scheme can reduce the consequences of errors (some applications may not do so yet), some errors (such as Byzantine errors) remain intolerable, especially where the transmission of logic signals is involved.
