In this 6th article in our SOTIF series, we turn our focus to the human and technology considerations for creating, maintaining, and utilizing software in functionally safe systems.
Safety Of The Intended Functionality (SOTIF) is defined in ISO 21448:2022, “Road vehicles — Safety of the intended functionality”, as the absence of unreasonable risk due to hazards resulting from functional insufficiencies of the intended functionality or from reasonably foreseeable misuse by persons. In contrast, ISO 26262, “Road vehicles – Functional safety”, is the international standard for the functional safety of electrical and/or electronic systems installed in series production road vehicles. SOTIF helps ensure that a system remains in a safe state regardless of whether things unfold as predicted. ISO 26262 works in close concert with SOTIF: insufficiencies identified in the performance of the system (its sensors, processors, software, and actuators) are addressed through complete and correct requirements.
It can be argued that, despite advancements and continuous improvements in every aspect of automotive design and production, the automotive industry is still trying to wrap its arms around the challenge of finding the optimal way to integrate SOTIF organically into the design process from the start. The industry has reached the point where it acknowledges that safety requirements must be achieved and maintained. And yes, the question of what defines a safe state has been clarified through standards that define what must be achieved. However, the best step-by-step process for how to weave SOTIF into a company’s product development workflow is still up for debate. In turn, this impacts how a company’s existing software teams interact with SOTIF requirements, as it is not uncommon for software teams to be in place doing other work before a company adopts SOTIF for the first time.
While some inroads have been made, an overall industry approach to adopting SOTIF has not yet been formalized or widely adopted. It is as if the SOTIF train is already pulling out of the station and the automotive industry can’t agree on the best way to hop on board. With no best practices defined for the SOTIF on-ramp, there is variation in adoption, and where there is variation, there is waste.
SOTIF drives inputs to the ISO 26262 process at the top, in the form of additional safety goals or functional safety requirements, but it does not place mandates on software developers directly. SOTIF places requirements on systems engineers, who may then pass requirements down to software developers. But they might also choose to satisfy a requirement through a subsystem, through hardware, or even through a mechanical solution.
How should we in the industry onboard SOTIF into our own companies in a manner that ensures, from the start, a safe design and the safe function of those systems? That question can be scoped down to the software development teams themselves. It could be argued that software developers are at an advantage because the functional safety requirements and cybersecurity requirements are defined for them, but it is only an advantage if they are already experienced in these areas.
Compared to many other automotive roles, software developers’ industry best practices are typically more robust and mature, in no small measure due to the sheer magnitude of the software development realm in general. On top of that, the increasing focus on cybersecurity highlights new threat vectors and safeguards every day, so best practices are constantly evolving. That is a lot of data points to draw upon, and a lot of lessons learned.
Given that the demands of SOTIF require complete traceability and the granular breakdown of safety risks to their most basic elements, code written to address safety requirements must, at a minimum, exhibit the highest level of hygiene. It should be clean, elegant code that professionals would be proud to put their name on. No slop, no shortcuts. This is due in part to the nature of the safety tasks: SOTIF is focused on the intended functionality, so the code in turn should reflect clear intent. This is true at all system levels.
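To make that concrete, here is a minimal sketch of intent-revealing safety code, written in C with invented names and thresholds that do not come from any real system: named constants with explicit units, a single clearly stated decision, and a default that fails toward the safe side.

```c
/* Illustrative only: intent-revealing safety code with named
 * constants, explicit units, and a fail-safe default. */
#include <stdbool.h>
#include <stdint.h>

#define MIN_SAFE_GAP_CM     200u  /* minimum following gap, in centimeters */
#define SENSOR_FAULT_GAP_CM 0u    /* sentinel: the sensor reported no data */

/* Returns true when the measured gap requires a braking intervention. */
bool gap_requires_intervention(uint32_t measured_gap_cm)
{
    /* Treat a missing reading as unsafe: fail toward the safe state. */
    if (measured_gap_cm == SENSOR_FAULT_GAP_CM) {
        return true;
    }
    return measured_gap_cm < MIN_SAFE_GAP_CM;
}
```

A reviewer can see the intent at a glance, and a safety requirement can trace directly to the named constant it constrains.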
SOTIF also defines the safe state for unintended consequences. Abandoned legacy code that is thought to be inert but is still lurking in the software carries the risk of becoming the source of all sorts of unforeseen consequences. The more junk in your code, the greater the risk of something unexpected impacting safety. Clean up the code, and the variables are reduced. Reduce the variables, and you reduce the risk.
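The danger is easy to underestimate. The hypothetical C sketch below shows one way “dead” code stays alive: a superseded handler was never removed from its dispatch table, so a single unexpected mode value resurrects it. Every name here is invented for illustration.

```c
/* Hypothetical sketch: "abandoned" code that is not actually inert. */
#include <stdio.h>

static void legacy_brake_handler(void)   /* believed dead since rev 2 */
{
    printf("legacy braking profile applied\n");
}

static void current_brake_handler(void)
{
    printf("current braking profile applied\n");
}

/* Entry 0 was supposed to be retired, but it was never removed. */
static void (*const brake_handlers[])(void) = {
    legacy_brake_handler,    /* still reachable */
    current_brake_handler,
};

void apply_braking(unsigned mode)
{
    if (mode < sizeof brake_handlers / sizeof brake_handlers[0]) {
        brake_handlers[mode]();  /* mode == 0 executes the "dead" code */
    }
}
```

Nothing in this file is unreachable as far as the compiler is concerned, so no unused-code warning will fire; only a human who cleans up the table eliminates the risk.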
To be fair, the industry has made significant strides in the past decade in formalizing the development of software, including the management of how software is created, maintained, revised, and validated. But ask the people on the inside who perform this work every day and, if they are being brutally honest, some might admit that a surprising amount of industry software development is still conducted using processes not far removed from high school or college projects.
A typical software developer’s meeting conversation might sound something like this: A problem is identified that needs to be solved by software. A developer is assigned the task and writes code. A few other people look at it, and it looks like it works, so they go with it. But where is it? Where is that piece of code being stored? “Oh, it’s on someone’s hard drive or network drive.” Well, what version are we using? “You know, I think we labeled it ‘_r2-final-final’ in that zip file after the last meeting. But that software engineer was on vacation last week, so we assigned the rest of the work to someone else, and they’ve revised it since then.” What is the file name now, and where is it?
Does that sound familiar?
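If it does, one common remedy is to make every build carry its own identity, so that “what version is running?” is answered by the binary itself rather than by a zip-file name. The C sketch below is illustrative; the macro names are invented, and in practice the build system would inject the real values at compile time (for example, a -D define derived from git rev-parse --short HEAD).

```c
/* Illustrative sketch: the build identifies itself. The macros are
 * placeholders that a build script would override with real values. */
#include <stdio.h>

#ifndef BUILD_GIT_HASH
#define BUILD_GIT_HASH "0000000"    /* placeholder when built by hand */
#endif
#ifndef BUILD_TIMESTAMP
#define BUILD_TIMESTAMP "unknown"
#endif

static const char software_id[] =
    "brake_controller " BUILD_GIT_HASH " built " BUILD_TIMESTAMP;

void report_software_version(void)
{
    printf("%s\n", software_id);    /* logged at startup and on request */
}
```

Paired with disciplined revision control and tagged releases, a habit like this makes the conversation above impossible: the running software always knows exactly what it is.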
We take pride in our work and tend to want to assume the best of our industry. But we can’t turn a blind eye to what we now know firsthand: poor revision control is often a factor in automotive recalls.
A recall is a significant and expensive admission of failure. It is what a company must do when, somehow, safety was not properly addressed and comprehensively validated. Somehow, the flaw made it all the way through design, testing, and manufacturing without being caught. It made it all the way out into the hands of the public, who unwittingly discovered the safety risk the hard way, by becoming victims. And then, only when the issue is forced, is the safety properly addressed. That is the least efficient, least trustworthy, least safe, and most expensive way to do business.
A recall is a reactive answer to a problem that never should have been allowed to see the light of day in the first place. Think about how many recalls our industry has seen in just the last five to ten years. How many of those got to the point of recall because of inadequate processes? How many were the result of one or more humans along the way taking shortcuts? How many additional millions of dollars have been expended to solve problems that were completely preventable in the first place? And in how many of those issues did poor-quality software play a role?
This is where software, when created and maintained properly, can help prevent problems from making it into the hands of the public. Software does not care about pride of authorship or hurt feelings. The only things it requires to be utilized in an optimal manner are adequate processor power and human discipline.
Software that is properly written, documented, structured, vetted, and maintained can have a huge positive impact on safety. SOTIF defines what the safe state is. Sensors provide inputs. Actuators execute the outputs. Processors do the thinking. But software tells the system how to think. Software is how the present state is detected and classified, and how the safe state is implemented. It is a pivotal system component that is worth doing right the first time.
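As a simplified sketch of that role, consider the illustrative C below. It is not any real controller’s logic, and the states and fields are invented, but it shows the shape of the job: classify the present situation from the inputs, then command the actuators toward the safe state whenever the nominal one cannot be maintained.

```c
/* Illustrative only: detect and classify the present state, then
 * implement the appropriate (safe) response. */
#include <stdbool.h>

typedef enum { STATE_NOMINAL, STATE_DEGRADED, STATE_SAFE_STOP } system_state_t;

typedef struct {
    bool sensor_data_valid;  /* all sensor self-checks passed */
    bool within_odd;         /* inside the operational design domain */
} situation_t;

/* Classify the present situation into a system state. */
system_state_t classify(const situation_t *s)
{
    if (!s->sensor_data_valid) return STATE_SAFE_STOP;  /* can't see: stop */
    if (!s->within_odd)        return STATE_DEGRADED;   /* hand control back */
    return STATE_NOMINAL;
}

/* Implement the response for the classified state. */
void command_actuators(system_state_t st)
{
    switch (st) {
    case STATE_NOMINAL:   /* normal control loop continues */            break;
    case STATE_DEGRADED:  /* reduce speed, alert the driver */           break;
    case STATE_SAFE_STOP: /* controlled stop: the defined safe state */  break;
    }
}
```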
A fundamental truth of business is that metrics are neither good nor bad, but they do drive behavior. Too often, however, the driven behavior ends up far removed from the initial intent, resulting in unanticipated consequences. One example of metrics that matter to software development is the number of bugs caught during each phase:

1. Code development and peer review
2. Unit testing
3. Integration testing
4. System and vehicle-level testing
As a bug slips from phase 1 through phase 4, it becomes more expensive to fix. Despite this, most bugs are caught in phases 3 and 4. This happens because, typically, very little formal peer code review is done in phases 1 and 2. Meeting a schedule is communicated as a higher priority than saving costs, and so the least expensive way to correct bugs goes unused.
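The kind of check that catches a bug in phases 1 and 2 does not need to be elaborate. The minimal C unit test below is invented for illustration, but a test this small can catch an overflow bug at the developer’s desk for a tiny fraction of what the same bug would cost at the vehicle level.

```c
/* Illustrative phase-1/2 safeguard: a unit test that catches an
 * overflow bug before the code ever reaches integration. */
#include <assert.h>
#include <stdint.h>

/* Saturating add, e.g. for accumulating bounded sensor counts. */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    uint16_t sum = (uint16_t)a + (uint16_t)b;
    return (sum > UINT8_MAX) ? UINT8_MAX : (uint8_t)sum;
}

int main(void)
{
    assert(sat_add_u8(200, 100) == 255);  /* must saturate, not wrap */
    assert(sat_add_u8(1, 2) == 3);        /* nominal path still works */
    return 0;
}
```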
Automotive is a cyclical business culture built upon model years. Our business thinks and acts in model year increments. Those intervals are subdivided into quarters. At these milestones, reports are given, and people are held accountable, usually in a manner that carries rather high visibility. It is a deeply ingrained practice.
Too often, the humans working in this realm allow the calendar to take precedence over the needs of safety. Shortcuts happen when the immediate payoff from pain avoidance is greater than the assumed possibility of pain down the road. It is typically not intentional, but rather the product of good people making unfortunate decisions, blinded by short-term metrics viewed from within a silo. The metrics become the priority, and hitting short-term goals becomes the product being made, rather than a functionally safe vehicle.
How many software development managers have reached the point where, a few days out from having to present their team report at the quarterly review, they find themselves sitting in their team meeting and thinking that they are in trouble:
“We were recently tasked with updating this code from a previous vendor. We don't know what has been done in this code up to now, or how this has been developed before being handed to us. We have had one of our developers analyze it, and they have discovered that this code has issues every way you look at it. It is poorly written, sloppy code, and we have been given no guidelines as to how it was created. Obviously, multiple different people have worked on it because certain styles are used to do some things and other styles are used to do other things, but none of it is documented. So clearly, there was little to no discipline upstream of us.
“This trash code filled with little bugs has been tossed over the fence at us, and I have been told to make something work. And we must have it done by the end of the quarter. High exposure, high pressure, and no excuses. Are new features required? We might have the team code a quick fix and wedge it in there. Cleanup? Forget it. We have not been given the time or staff or funding to strip out the old stuff. The old stuff is still in there, but it is now dead code, so hopefully it won’t cause a problem down the road.”
… until it does.
Far too often, this scenario is the norm. It isn’t intended that way. It is not pretty. Many don’t want to admit it. But most know that this reality is far too common. There isn’t the time or money to do it right the first time, but there always seems to be the time and money to do it over. And the money alone does not reflect the true cost. How much avoidable human suffering has been caused, and how many millions of dollars in corrective action expended, to correct errors that at their root were caused by little more than otherwise-good people trying to save face?
One of the foundational elements for autonomous vehicles is going to have to be transparency. Some companies are trying to pursue a proven-in-use argument: “Hey, we racked up millions of hours, and we’ve never had an accident.” But they are pushing an update out daily, if not several times a day, and the clock resets to zero every time that is done. You don’t have millions of hours of driving data; you have a day’s worth, maybe, depending on how fast the company is pushing out updates. And you don’t have a proven-in-use state until you lock that system down and never update it.
Companies taking this approach are going to have to come to terms with their systems and make them safe. They must stop pushing updates if they are going to pursue a proven-in-use argument.
Safety-related software is not only found on board these systems; it is also used to test them. Simulation holds the key to testing the broad spectrum of scenarios that must be analyzed to apply SOTIF to the known and the unknown. Among other things, it is a matter of bandwidth: there are so many scenarios that need to be tested that real-world testing can’t accommodate them all, and even if it could, it would be far too costly and time-intensive.
The process starts with a pure simulation environment where the scenarios are completely controlled. In essence, it is a sim game, and the ECU and the algorithms can be part of that game. Teams try to generate as many scenarios as they can. If everything works as expected, and it is confirmed that the algorithms run properly on the intended hardware, the environment is simulated further, with more real-world data used to create realistic environmental simulations.
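In miniature, that first stage can look like the illustrative C loop below: every scenario is fully defined and controlled, a stand-in for the algorithm under test runs against synthetic inputs, and each outcome is checked against the expected one. All names, values, and the toy braking rule are invented for the sketch.

```c
/* Toy pure-simulation stage: fully controlled scenarios run against
 * a stand-in for the algorithm under test. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char *name;
    double obstacle_distance_m;  /* synthetic input */
    bool expect_braking;         /* expected outcome for this scenario */
} scenario_t;

/* Stand-in for the real algorithm under test. */
static bool should_brake(double obstacle_distance_m)
{
    return obstacle_distance_m < 25.0;
}

int main(void)
{
    const scenario_t scenarios[] = {
        { "clear road",          120.0, false },
        { "stopped vehicle",      18.0, true  },
        { "pedestrian crossing",   9.5, true  },
    };
    for (size_t i = 0; i < sizeof scenarios / sizeof scenarios[0]; i++) {
        bool braked = should_brake(scenarios[i].obstacle_distance_m);
        printf("%-22s %s\n", scenarios[i].name,
               braked == scenarios[i].expect_braking ? "PASS" : "FAIL");
    }
    return 0;
}
```

Real scenario catalogs are generated and run at a vastly larger scale, but the principle is the same: control every variable, and check every expectation.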
Hardware-In-the-Loop (HIL) testing provides a real-world system with the ability to perceive a simulated environment. A real computer and real sensors are connected to the HIL rig, with only the scenarios being simulated. The sensors are mounted in the spots where they will be located on the production vehicle. This setup enables engineers to confirm that the sensors are able to receive information and respond without issues.
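Schematically, a HIL run can be pictured as the C loop below. Every interface in it is a hypothetical stub, since real rigs drive vendor-specific sensor stimulators and vehicle buses; the point is only the shape of the loop: play each simulated frame to the real hardware in real time, and verify that the real ECU responds within its deadline.

```c
/* Schematic HIL loop; all interfaces are hypothetical stubs. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { double target_range_m; } env_frame_t;

/* Stubs standing in for the rig's real, vendor-specific interfaces. */
static void stimulate_sensors(const env_frame_t *f) { (void)f; }
static bool ecu_responded_within_ms(unsigned deadline_ms) { (void)deadline_ms; return true; }

int main(void)
{
    const env_frame_t frames[] = { {80.0}, {40.0}, {15.0} };  /* closing target */

    for (size_t i = 0; i < sizeof frames / sizeof frames[0]; i++) {
        stimulate_sensors(&frames[i]);       /* real sensors perceive a simulated world */
        if (!ecu_responded_within_ms(50)) {  /* real ECU must react in time */
            printf("frame %zu: deadline missed\n", i);
            return 1;
        }
    }
    printf("all frames handled within the deadline\n");
    return 0;
}
```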
Ultimately, some level of road testing will be required, but the goal is to reduce road testing as much as is practical. If you start with 1,000 scenarios, by the time you are at the vehicle level, software and simulation might have reduced the list to 10 scenarios that still need to be tested on the road, and by then confidence is high that the system is going to work well.
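That winnowing step can be pictured as something as simple as the illustrative C function below: promote to a road test only the scenarios whose simulated confidence falls short of a threshold. The threshold, fields, and names are all invented for the sketch.

```c
/* Illustrative winnowing step: only scenarios the simulation could
 * not validate with high confidence are promoted to road tests. */
#include <stddef.h>
#include <stdio.h>

#define ROAD_TEST_THRESHOLD 0.95  /* invented cutoff */

typedef struct {
    const char *name;
    double sim_confidence;        /* 0.0 .. 1.0 from the simulation campaign */
} scenario_result_t;

static size_t select_road_tests(const scenario_result_t *r, size_t n)
{
    size_t selected = 0;
    for (size_t i = 0; i < n; i++) {
        if (r[i].sim_confidence < ROAD_TEST_THRESHOLD) {
            printf("road test needed: %s\n", r[i].name);
            selected++;
        }
    }
    return selected;
}

int main(void)
{
    const scenario_result_t results[] = {
        { "highway merge",      0.99 },
        { "low-sun pedestrian", 0.72 },
        { "construction zone",  0.91 },
    };
    size_t n = sizeof results / sizeof results[0];
    printf("%zu of %zu scenarios promoted to road testing\n",
           select_road_tests(results, n), n);
    return 0;
}
```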
It can be misleading to think of cybersecurity as a subset under the SOTIF umbrella. Cybersecurity is a robust and significant stand-alone set of processes and specifications with its own requirements and standards, not unlike the relationship between SOTIF and ISO 26262. It is a world in a constant state of change, with bad actors trying to break things in the real world, white hats trying to break things in sandboxed environments so the results can be shared, and developers trying to implement fixes and just keep up. See our series of blogs on cybersecurity for further details.
Software is the lifeblood of safety systems. It is instruction and learning, source and recipient. It is a world where best practices are known but not always applied. It has played a pivotal role in wonderful solutions and terrible missteps. And as software inherits an increasingly important role in creating functionally safe systems, the demands of SOTIF are challenging software developers to raise the bar on the quality and hygiene of their code. Software, more than any other element, has the potential to raise the quality of the entire system. Our functionally safe future is counting on it.