On December 21, 2022, just as the peak holiday travel season was getting underway, Southwest Airlines went through a cascading series of failures, initially triggered by severe winter weather in the Denver area. The problems then spread through the airline's network, and over the next 10 days the crisis stranded more than 2 million passengers and caused $750 million in losses for the airline.
How did a localized weather system end up triggering such a widespread failure? Researchers at MIT have examined this widely reported failure as an example of cases where systems that work smoothly most of the time suddenly break down and cause a domino effect of failures. They have now developed a computational system that combines sparse data about a rare failure event with much more extensive data on normal operations, in order to work backward and try to pinpoint the root causes of the failure and, ideally, help prevent such failures in the future.
The findings were presented at the International Conference on Learning Representations (ICLR), held in Singapore from April 24 to 28, by MIT's Charles Dawson, professor of aeronautics and astronautics Chuchu Fan, and colleagues from Harvard University and the University of Michigan.
“The motivation behind this work is that it’s really frustrating when we have to interact with these complicated systems, where it’s really hard to understand what’s going on behind the scenes that’s creating the issues or failures we’re observing.”
The new work builds on previous research from Fan’s laboratory, which looked at hypothetical failure-prediction problems, such as groups of robots working together on a task or complex systems such as the power grid, searching for ways to anticipate how such systems might fail. “The goal of this project,” says Fan, “was really to turn that into a diagnostic tool that we could use on real-world systems.”
The idea was to provide a way that someone could “give us data from a time when this real-world system had an issue or a failure,” says Dawson, “and we can try to diagnose the root causes and provide a little bit of a look behind the curtain at this complexity.”
The intention is for the methods they have developed “to work for a pretty general class of cyber-physical problems,” he says. These are problems in which “you have an automated decision-making component interacting with the messiness of the real world,” he explains. There are tools for testing software systems that operate on their own, but the complexity arises when that software has to interact with physical entities going about their activities in a real physical setting, whether that is aircraft scheduling, the movements and interactions of autonomous machines, or the control of inputs and outputs on an electric grid. In such systems, what often happens is that “the software might make a decision that looks OK at first, but then it has all these domino, knock-on effects that make things messier and much more uncertain.”
One key difference, however, is that in systems like robot teams, unlike aircraft scheduling, “we have access to a model in the robotics world,” says Fan, who is a principal investigator in MIT’s Laboratory for Information and Decision Systems (LIDS). “We do have some good understanding of the physics behind the robotics, and we do have ways of creating a model” that represents their activities with reasonable accuracy. Airline scheduling, by contrast, involves processes and systems that are proprietary business information, so the researchers had to find ways to infer what lay behind the decisions using only the relatively sparse publicly available information, which essentially consisted of just the actual arrival and departure times of each plane.
“We have all this flight data, but there is this entire scheduling system behind it, and we don’t know how the system is working,” says Fan. And the amount of data relating to the actual failure covers just a few days, compared with years of data on normal flight operations.
The impact of the weather events in Denver during the week of Southwest’s scheduling crisis showed up clearly in the flight data, simply from the longer-than-normal turnaround times between landing and takeoff at the Denver airport. But the way that impact cascaded through the system was less obvious and required more analysis. The key turned out to have to do with the concept of reserve aircraft.
Airlines typically keep some planes in reserve at various airports, so that if problems are found with a plane that is scheduled for a flight, another plane can be quickly substituted. Southwest uses only a single type of plane, so they are all interchangeable, which makes such substitutions easier. Most airlines, however, operate on a hub-and-spoke system, with a few designated hub airports where most of those reserve aircraft may be kept, whereas Southwest’s reserve planes are more dispersed throughout its network. And the way those planes were deployed turned out to play a major role in the unfolding crisis.
“The challenge is that there’s no public data available in terms of where the aircraft are stationed throughout the Southwest network,” says Dawson. “What we’re able to find using our method is, by looking at the public data on arrivals, departures, and delays, we can back out what the hidden parameters of those aircraft reserves could have been, to explain the observations that we were seeing.”
They found that the way the reserves were deployed was a “leading indicator” of the problems that cascaded into a nationwide crisis. Some parts of the network that were directly affected by the weather were able to recover quickly and get back on schedule. “But when we looked at other areas in the network, we saw that these reserves were just not available, and things just kept getting worse.”
For example, the data showed that Denver’s reserves were rapidly dwindling because of the weather delays, but then “it also allowed us to trace this failure from Denver to Las Vegas,” he says. Although there was no severe weather there, “our method was still showing us a steady decline in the number of aircraft that were able to serve flights out of Las Vegas.”
He says that “what we found were these circulations of aircraft within the Southwest network, where an aircraft might start the day in California, then fly to Denver, and then end the day in Las Vegas.” This storm broke that cycle. As a result, “this one storm in Denver breaks the cycle, and suddenly the reserves in Las Vegas, which is not affected by the weather, start to deteriorate.”
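The knock-on effect described above can be illustrated with a deliberately tiny simulation. This is only a sketch under stated assumptions, not the researchers' model: the function name, the idea that each blocked inbound aircraft consumes exactly one reserve plane, and all of the numbers are hypothetical.

```python
def simulate_downstream_reserves(initial_reserve, planned_arrivals, blocked_per_day):
    """Toy model of a downstream airport (think Las Vegas) whose inbound
    aircraft normally route through a storm-hit airport (think Denver).

    Each day, some inbound aircraft are blocked upstream. Reserve aircraft
    cover the gap until they run out; the rest of the gap goes uncovered.
    Returns a list of (reserve_left, uncovered_flights) tuples, one per day.
    """
    reserve = initial_reserve
    history = []
    for blocked in blocked_per_day:
        blocked = min(blocked, planned_arrivals)  # cannot lose more than planned
        covered = min(blocked, reserve)           # reserves substitute one-for-one
        reserve -= covered
        history.append((reserve, blocked - covered))
    return history


# Hypothetical numbers: 4 reserve aircraft, 10 planned arrivals per day,
# and a storm that blocks 1, 2, 3, then 2 inbound aircraft on successive days.
history = simulate_downstream_reserves(4, 10, [1, 2, 3, 2])
for day, (reserve_left, uncovered) in enumerate(history, start=1):
    print(f"day {day}: reserve={reserve_left}, uncovered flights={uncovered}")
```

Even in this toy version, the downstream airport sees no bad weather at all, yet its reserves drain steadily and flights start going uncovered once they hit zero.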
In the end, Southwest was forced to take a drastic measure to resolve the problem: it had to do a “hard reset” of its entire system, canceling all flights and flying empty aircraft around the country to rebalance its reserves.
Working with experts in air transportation systems, the researchers developed a model of how the scheduling system is supposed to work. Then, “what our method does is, we’re essentially trying to run the model backwards.” Looking at the observed outcomes, the model allows them to work back to see what kinds of initial conditions could have produced those outcomes.
While the data on the actual failures were sparse, the extensive data on typical operations helped in teaching the computational model “what is feasible, what is possible, what’s the realm of physical possibility here,” says Dawson. “That gives us the domain knowledge to then say, in this extreme event, given the space of what’s possible, what’s the most likely explanation” for the failure.
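The idea of running a model backwards, with normal-operations data constraining what is plausible, can be sketched as a small Bayesian inference problem. The example below is an illustration only and swaps in brute-force enumeration for whatever the researchers' tool actually does: the forward model, the Gaussian noise assumption, the prior, and every number are hypothetical.

```python
import math


def forward_model(reserves, disrupted_per_day):
    """Toy forward model: flights left uncovered when `reserves` spare
    aircraft are available each day to replace disrupted ones."""
    return [max(0, d - reserves) for d in disrupted_per_day]


def log_likelihood(observed, predicted, noise_std=1.0):
    """Gaussian log-likelihood (up to a constant) of the observations."""
    return sum(-0.5 * ((o - p) / noise_std) ** 2
               for o, p in zip(observed, predicted))


def posterior_over_reserves(observed, disrupted_per_day, prior):
    """Score every candidate reserve level: prior (standing in for knowledge
    learned from normal operations) times likelihood of the failure data."""
    log_post = {
        r: math.log(p) + log_likelihood(observed, forward_model(r, disrupted_per_day))
        for r, p in prior.items() if p > 0
    }
    peak = max(log_post.values())  # subtract the max for numerical stability
    weights = {r: math.exp(v - peak) for r, v in log_post.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


# Hypothetical prior: 0 to 6 reserve aircraft, all equally plausible.
prior = {r: 1 / 7 for r in range(7)}
disrupted = [2, 5, 8, 6]   # hypothetical disrupted aircraft per day
observed = [0, 2, 5, 3]    # hypothetical uncovered flights actually seen

posterior = posterior_over_reserves(observed, disrupted, prior)
best_guess = max(posterior, key=posterior.get)
print("most likely hidden reserve level:", best_guess)
```

The hidden quantity (the reserve level) is never observed directly; it is recovered as the value that best explains the publicly visible outcomes while staying within what the prior allows.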
This could lead to a real-time monitoring system, Fan says, in which data on normal operations are constantly compared with current data to determine what the trend looks like. “Are we trending toward normal, or are we trending toward extreme events?” Seeing signs of impending problems could allow for preemptive measures, such as redeploying reserve aircraft in advance to areas where problems are anticipated.
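A very simple version of such monitoring compares current measurements with a baseline built from normal operations. The sketch below uses a plain z-score with an alert rule; it is an assumption-laden illustration, not the system Fan's laboratory is developing, and the turnaround times, threshold, and function names are all hypothetical.

```python
from statistics import mean, stdev


def drift_scores(baseline, current):
    """How many baseline standard deviations each current value sits
    above the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [(x - mu) / sigma for x in current]


def first_alert(scores, threshold=3.0, run_length=2):
    """Index of the first reading that completes `run_length` consecutive
    scores above `threshold`, or None if that never happens."""
    run = 0
    for i, score in enumerate(scores):
        run = run + 1 if score > threshold else 0
        if run >= run_length:
            return i
    return None


# Hypothetical turnaround times in minutes: a normal-operations baseline
# and a recent stretch that drifts upward.
baseline = [40, 42, 38, 41, 39, 40, 43, 37]
current = [41, 45, 47, 52, 60]

scores = drift_scores(baseline, current)
alert_index = first_alert(scores)
print("scores:", [round(s, 2) for s in scores])
print("alert at reading index:", alert_index)
```

Requiring several consecutive high readings before alerting is one simple way to avoid reacting to a single noisy measurement.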
Fan says her laboratory is working on developing such systems. In the meantime, the team has produced a tool for analyzing failures, called CalNF, which is available for anyone to use. Dawson, who earned his doctorate last year, is now working as a postdoc, applying the methods developed in this work to understanding failures in power networks.
The research team also included Max Li from the University of Michigan and Van Tran from Harvard University. The work was supported by NASA, the Air Force Office of Scientific Research, and the MIT-DSTA program.