| rfc9940v1.txt | rfc9940.txt | |||
|---|---|---|---|---|
| Internet Engineering Task Force (IETF) N. Davis, Ed. | Internet Engineering Task Force (IETF) N. Davis, Ed. | |||
| Request for Comments: 9940 Ciena | Request for Comments: 9940 Ciena | |||
| Category: Informational A. Farrel, Ed. | Category: Informational A. Farrel, Ed. | |||
| ISSN: 2070-1721 Old Dog Consulting | ISSN: 2070-1721 Old Dog Consulting | |||
| T. Graf | T. Graf | |||
| Swisscom | Swisscom | |||
| Q. Wu | Q. Wu | |||
| Huawei | ||||
| C. Yu | C. Yu | |||
| Huawei Technologies | Huawei | |||
| February 2026 | February 2026 | |||
| Some Key Terms for Network Fault and Problem Management | Some Key Terms for Network Fault and Problem Management | |||
| Abstract | Abstract | |||
| This document sets out some terms that are fundamental to a common | This document sets out some terms that are fundamental to a common | |||
| understanding of network fault and problem management within the | understanding of network fault and problem management within the | |||
| IETF. | IETF. | |||
| skipping to change at line 85 ¶ | skipping to change at line 84 ¶ | |||
| Successful operation of large networks depends on effective network | Successful operation of large networks depends on effective network | |||
| management. This requires a virtuous circle of network control, | management. This requires a virtuous circle of network control, | |||
| network observability, network analytics, network assurance, and back | network observability, network analytics, network assurance, and back | |||
| to network control. Network fault and problem management [RFC6632] | to network control. Network fault and problem management [RFC6632] | |||
| is an important aspect of network management and control solutions. | is an important aspect of network management and control solutions. | |||
| It deals with the detection, reporting, inspection, isolation, | It deals with the detection, reporting, inspection, isolation, | |||
| correlation, and management of events within the network. The | correlation, and management of events within the network. The | |||
| intention of this document is to focus on those events that have a | intention of this document is to focus on those events that have a | |||
| negative effect on the network's ability to forward traffic according | negative effect on the network's ability to forward traffic according | |||
| to expected behavior and so deliver services, the ability to control | to expected behaviors that may reduce the network's ability to | |||
| and operate the network, and other faults that reduce the quality or | deliver services. Such events may also impact the ability to control | |||
| reliability of the delivered service. The concept of fault and | and operate the network. The document also considers other faults | |||
| problem management extends to include actions taken to determine the | that reduce the quality or reliability of the delivered service. The | |||
| causes of problems and to work toward recovery of expected network | concept of fault and problem management extends to include actions | |||
| behavior. | taken to determine the causes of problems and to work toward recovery | |||
| of expected network behavior. | ||||
| A number of work efforts within the IETF seek to provide components | A number of work efforts within the IETF seek to provide components | |||
| of a fault management system, such as YANG data models or management | of a fault management system, such as YANG data models or management | |||
| protocols. It is important that a common terminology be used so that | protocols. It is important that a common terminology be used so that | |||
| there is a clear understanding of how the elements of the management | there is a clear understanding of how the elements of the management | |||
| and control solutions fit together and how faults and problems will | and control solutions fit together and how faults and problems will | |||
| be handled. | be handled. | |||
| This document sets out some terms that are fundamental to a common | This document sets out some terms that are fundamental to a common | |||
| understanding of network fault and problem management. While | understanding of network fault and problem management. While | |||
| skipping to change at line 178 ¶ | skipping to change at line 178 ¶ | |||
| process of collecting operational network data categorized | process of collecting operational network data categorized | |||
| according to the network plane (e.g., Layer 3, Layer 2, and Layer | according to the network plane (e.g., Layer 3, Layer 2, and Layer | |||
| 1) from which it was derived. Data collected through the Network | 1) from which it was derived. Data collected through the Network | |||
| Telemetry process does not contain any data related to service | Telemetry process does not contain any data related to service | |||
| definitions (i.e., "intent" per Section 3.1 of [RFC9315]). | definitions (i.e., "intent" per Section 3.1 of [RFC9315]). | |||
| Network Monitoring: This is the process of keeping a continuous | Network Monitoring: This is the process of keeping a continuous | |||
| record of functions related to a network topology. It involves | record of functions related to a network topology. It involves | |||
| tracking various aspects such as traffic patterns, device health, | tracking various aspects such as traffic patterns, device health, | |||
| performance metrics, and overall network behavior. This approach | performance metrics, and overall network behavior. This approach | |||
| differentiates network monitoring from resource or device | differentiates Network Monitoring from resource or device | |||
| monitoring, which focuses on individual resources or components | monitoring, which focuses on individual resources or components | |||
| (Section 3.2). | (Section 3.2). | |||
| Network Analytics: This is the process of deriving analytical | Network Analytics: This is the process of deriving analytical | |||
| insights from operational network data. A process could be | insights from operational network data. A process could be | |||
| executed by a piece of software, a system, or a human that | executed by a piece of software, a system, or a human that | |||
| analyzes operational data and outputs new analytical data related | analyzes operational data and outputs new analytical data related | |||
| to the operational data -- for example, a symptom. | to the operational data -- for example, a symptom. | |||
| Network Observability: This is the process of enabling network | Network Observability: This is the process of enabling network | |||
| behavioral assessment through analysis of observed operational | behavioral assessment through analysis of observed operational | |||
| network data (logs, alarms, traces, etc.) with the aim of | network data (logs, alarms, traces, etc.) with the aim of | |||
| detecting symptoms of network behavior, and to identify anomalies | detecting symptoms of network behavior, and identifying anomalies | |||
| and their causes. Network Observability begins with information | and their causes. Network Observability begins with information | |||
| gathered using Network Monitoring tools and that may be further | gathered using Network Monitoring tools. That information may be | |||
| enriched with other operational data. The expected outcome of the | further enriched with other operational data. The expected | |||
| observability processes is identification and analysis of | outcome of the observability processes is identification and | |||
| deviations in observed state versus the expected state of a | analysis of deviations in observed state versus the expected state | |||
| network. | of a network. | |||
| Thus, there is a cascaded sequence where the following relationships | Thus, there is a cascaded sequence where the following relationships | |||
| apply: | apply: | |||
| * Network Telemetry is the process of collecting operational data | * Network Telemetry is the process of collecting operational data | |||
| from a network. | from a network. | |||
| * Network Monitoring is the process of creating/keeping a record of | * Network Monitoring is the process of creating/keeping a record of | |||
| data gathered in Network Telemetry. | data gathered in Network Telemetry. | |||
| skipping to change at line 231 ¶ | skipping to change at line 231 ¶ | |||
| Resource: An element of a network system. | Resource: An element of a network system. | |||
| * Resource is a recursive concept so that a Resource may be a | * Resource is a recursive concept so that a Resource may be a | |||
| collection of other Resources (for example, a network node | collection of other Resources (for example, a network node | |||
| comprises a collection of network interfaces). | comprises a collection of network interfaces). | |||
| Characteristic: Observable or measurable aspect or behavior | Characteristic: Observable or measurable aspect or behavior | |||
| associated with a Resource. | associated with a Resource. | |||
| * A Characteristic may be considered to be built on facts (see | * A Characteristic may be considered to be built on facts (see | |||
| 'Value', below) and the contexts and descriptors that identify | "Value", below) and the contexts and descriptors that identify | |||
| and give meaning to the facts. | and give meaning to the facts. | |||
| * The term "Metric" [RFC9417] is another word for a measurable | * The term "Metric" (see "metric" in [RFC9417]) is another word | |||
| Characteristic which may also be thought of as analogous to a | for a measurable Characteristic, which may also be thought of | |||
| 'variable'. | as analogous to a "variable". | |||
| Value: A measure of a Characteristic associated with a Resource. It | Value: A measure of a Characteristic associated with a Resource. It | |||
| may be in the form of a categorization (e.g., high or low), an | may be in the form of a categorization (e.g., high or low), an | |||
| integer (e.g., a count or gauge), or a reading of a continuous | integer (e.g., a count or gauge), or a reading of a continuous | |||
| variable (e.g., an analog measurement), etc. | variable (e.g., an analog measurement), etc. | |||
| Change: In the context of Network Monitoring, the variation in the | Change: In the context of Network Monitoring, the variation in the | |||
| Value of a Characteristic associated with a Resource. A Change | Value of a Characteristic associated with a Resource. A Change | |||
| may arise over a period of time. | may arise over a period of time. | |||
| skipping to change at line 284 ¶ | skipping to change at line 284 ¶ | |||
| may determine the State of the router, such as shortage of memory. | may determine the State of the router, such as shortage of memory. | |||
| * While a State may be observed at a specific moment in time, it | * While a State may be observed at a specific moment in time, it | |||
| is actually determined by summarizing measurement over time in | is actually determined by summarizing measurement over time in | |||
| a process sometimes called State compression. | a process sometimes called State compression. | |||
| * It may be helpful to qualify this as "Resource State" to make | * It may be helpful to qualify this as "Resource State" to make | |||
| clear the distinction between this and other uses of "state" | clear the distinction between this and other uses of "state" | |||
| such as "protocol state". | such as "protocol state". | |||
| * This term may be contrasted with "Operational State" as used in | * This term may be contrasted with "operational state" as used in | |||
| [RFC8342]. For example, the state of a link might be up/down/ | [RFC8342]. For example, the state of a link might be up/down/ | |||
| degraded, but the operational state of the link would include a | degraded, but the operational state of the link would include a | |||
| collection of Values of Characteristics of the link. | collection of Values of Characteristics of the link. | |||
| Detect (hence Detected, Detection): To notice the presence of | Detect (hence Detected, Detection): To notice the presence of | |||
| something (State, Change, Event, activity, etc.) and hence also to | something (State, Change, Event, activity, etc.) | |||
| notice a Change (from the perspective of an observer such as a | ||||
| monitoring system). | * Also to notice a Change (from the perspective of an observer | |||
| such as a monitoring system). | ||||
| Relevance: Consideration of an Event, State, or Value (through the | Relevance: Consideration of an Event, State, or Value (through the | |||
| application of policy, relative to a specific perspective, intent, | application of policy, relative to a specific perspective or | |||
| and in relation to other Events, States, and Values) to determine | intent, and in relation to other Events, States, and Values) to | |||
| whether it is of note to the system that controls or manages the | determine whether it is of note to the system that controls or | |||
| network. Note, for example, that not all Changes are Relevant. | manages the network. Note, for example, that not all Changes are | |||
| Relevant. | ||||
| * This term may also be used as "Relevant Event", "Relevant | * This term may also be used as "Relevant Event", "Relevant | |||
| State", or "Relevant Value". | State", or "Relevant Value". | |||
| Occurrence: A Relevant Event or a particular Relevant Change. | Occurrence: A Relevant Event or a particular Relevant Change. | |||
| * An Occurrence may be an aggregation or abstraction of multiple | * An Occurrence may be an aggregation or abstraction of multiple | |||
| fine-grained Events or Changes. | fine-grained Events or Changes. | |||
| * An Occurrence may occur at any macro or micro scale because | * An Occurrence may occur at any macro or micro scale because | |||
| Resources are a recursive concept, and may be perceived, | Resources are a recursive concept. An Occurrence may be | |||
| depending on the scope of observation (i.e., according to the | perceived, depending on the scope of observation (i.e., | |||
| level of Resource recursion that is examined). That is, | according to the level of Resource recursion that is examined). | |||
| Occurrences, themselves, are a recursive concept. | That is, Occurrences, themselves, are a recursive concept. | |||
| Fault: An Occurrence (i.e., an Event or a Change) that is not | Fault: An Occurrence (i.e., an Event or a Change) that is not | |||
| desired/required (as it may be indicative of a current or future | desired/required (as it may be indicative of a current or future | |||
| undesired State). Thus, a Fault happens at a moment in time. A | undesired State). Thus, a Fault happens at a moment in time. A | |||
| Fault can potentially be associated with a Cause. See [RFC8632] | Fault can potentially be associated with a Cause. See [RFC8632] | |||
| for a more detailed discussion of network faults. | for a more detailed discussion of network faults. | |||
| * Note that there is a distinction between a Fault and a Problem | * Note that there is a distinction between a Fault and a Problem | |||
| that depends on context. For example, in a connectivity | that depends on context. For example, in a connectivity | |||
| service where redundancy is present, a link down is a Problem, | service where redundancy is present, a link down is a Problem, | |||
| skipping to change at line 461 ¶ | skipping to change at line 463 ¶ | |||
| Change at a time Change over time Change over time | Change at a time Change over time Change over time | |||
| Figure 2: Characteristics and Changes | Figure 2: Characteristics and Changes | |||
| Figure 3 shows the workflow progress for Events. As noted above, an | Figure 3 shows the workflow progress for Events. As noted above, an | |||
| Event is a Change in the Value of a Characteristic at a time. The | Event is a Change in the Value of a Characteristic at a time. The | |||
| Event may be evaluated (considering policy, relative to a specific | Event may be evaluated (considering policy, relative to a specific | |||
| perspective, with a view to intent, and in relation to other Events, | perspective, with a view to intent, and in relation to other Events, | |||
| States, and Values) to determine if it is an Occurrence and possibly | States, and Values) to determine if it is an Occurrence and possibly | |||
| to indicate a Change of State. An Occurrence may be undesirable (a | to indicate a Change of State. An Occurrence may be undesirable (a | |||
| Fault) and that can cause an Alert to be generated, may be evidence | Fault), which might cause an Alert to be generated. Or, an | |||
| of a Problem and could directly indicate a Cause. In some cases, an | Occurrence may be evidence of a Problem and could directly indicate a | |||
| Alert may give rise to an Alarm highlighting the potential or actual | Cause. In some cases, an Alert may give rise to an Alarm | |||
| presence of a Problem. | highlighting the potential or actual presence of a Problem. | |||
| Alert - - - > Alarm | Alert - - - > Alarm | |||
| ^ | ^ | |||
| | | | | |||
| | -----> Cause | | -----> Cause | |||
| | | | | | | |||
| |----------> Problem | |----------> Problem | |||
| | | | | |||
| | | | | |||
| Fault | Fault | |||
| skipping to change at line 500 ¶ | skipping to change at line 502 ¶ | |||
| progress for States. As shown in Figure 2, Change noted at a | progress for States. As shown in Figure 2, Change noted at a | |||
| particular time gives rise to State. The State may be deemed to have | particular time gives rise to State. The State may be deemed to have | |||
| Relevance considering policy, relative to a specific perspective, | Relevance considering policy, relative to a specific perspective, | |||
| with a view to intent, and in relation to other Events, States, and | with a view to intent, and in relation to other Events, States, and | |||
| Values. A Relevant State may be deemed a Problem, or it may indicate | Values. A Relevant State may be deemed a Problem, or it may indicate | |||
| a Problem or potential Problem. | a Problem or potential Problem. | |||
| Problems may be considered based on Symptoms and may map directly or | Problems may be considered based on Symptoms and may map directly or | |||
| indirectly to Causes. An Incident results from one or more Problems. | indirectly to Causes. An Incident results from one or more Problems. | |||
| An Alarm may be raised as the result of a Problem, and the transition | An Alarm may be raised as the result of a Problem, and the transition | |||
| to an Alarmed state may give rise to an Alert. | to an alarmed State may give rise to an Alert. | |||
| Alarm - - -> Alert | Alarm - - -> Alert | |||
| ^ | ^ | |||
| | ------> Incident | | ------> Incident | |||
| | | | | | | |||
| | | ---> Cause | | | ---> Cause | |||
| | | | | | | | | |||
| Problem---------> Symptom | Problem---------> Symptom | |||
| ^ | ^ | |||
| | | | | |||
| skipping to change at line 560 ¶ | skipping to change at line 562 ¶ | |||
| Events and States (and the Alerts that they might give rise to) must | Events and States (and the Alerts that they might give rise to) must | |||
| be treated with caution to dampen any "flapping" (so that consistent | be treated with caution to dampen any "flapping" (so that consistent | |||
| States may be observed) and to avoid overwhelming management | States may be observed) and to avoid overwhelming management | |||
| processes or systems. Analog Values may be read or notified from the | processes or systems. Analog Values may be read or notified from the | |||
| Resource and could transition a threshold, be deemed Relevant Values, | Resource and could transition a threshold, be deemed Relevant Values, | |||
| or be evaluated over time. Events may be counted, and the Count may | or be evaluated over time. Events may be counted, and the Count may | |||
| cross a threshold or reach a Relevant Value. | cross a threshold or reach a Relevant Value. | |||
| The Threshold Process may be implementation specific and subject to | The Threshold Process may be implementation specific and subject to | |||
| policies. When a threshold is crossed and any other conditions are | policies. When a threshold is crossed and any other conditions are | |||
| matched, an Event may be determined and may be treated like any other | matched, an Event may be determined and treated like any other Event. | |||
| Event. | ||||
| Occurrence | Occurrence | |||
| ^ | ^ | |||
| | | | | |||
| |---------------------> State | |---------------------> State | |||
| | | | | |||
| | ------- Relevance | | ------- Relevance | |||
| |------>| Count |-----------------------------> Value | |------>| Count |-----------------------------> Value | |||
| | ------- | ^ | | ------- | ^ | |||
| | | | | | | | | | | |||
| skipping to change at line 689 ¶ | skipping to change at line 690 ¶ | |||
| [RFC9417] Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T. | [RFC9417] Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T. | |||
| Arumugam, "Service Assurance for Intent-Based Networking | Arumugam, "Service Assurance for Intent-Based Networking | |||
| Architecture", RFC 9417, DOI 10.17487/RFC9417, July 2023, | Architecture", RFC 9417, DOI 10.17487/RFC9417, July 2023, | |||
| <https://www.rfc-editor.org/info/rfc9417>. | <https://www.rfc-editor.org/info/rfc9417>. | |||
| Acknowledgments | Acknowledgments | |||
| The authors would like to thank Med Boucadair, Wanting Du, Joe | The authors would like to thank Med Boucadair, Wanting Du, Joe | |||
| Clarke, Javier Antich, Benoit Claise, Christopher Janz, Sherif | Clarke, Javier Antich, Benoit Claise, Christopher Janz, Sherif | |||
| Mostafa, Kristian Larsson, Dirk Hugo, Carsten Bormann, Hilarie Orman, | Mostafa, Kristian Larsson, Dirk Von Hugo, Carsten Bormann, Hilarie | |||
| Stewart Bryant, Bo Wu, Paul Kyzivat, Jouni Korhonen, Reshad Rahman, | Orman, Stewart Bryant, Bo Wu, Paul Kyzivat, Jouni Korhonen, Reshad | |||
| Rob Wilton, Mahesh Jethanandani, Tim Bray, Paul Aitken, and Deb | Rahman, Rob Wilton, Mahesh Jethanandani, Tim Bray, Paul Aitken, and | |||
| Cooley for their helpful comments. | Deb Cooley for their helpful comments. | |||
| Special thanks to the team that met at a side meeting at IETF 120 to | Special thanks to the team that met at a side meeting at IETF 120 to | |||
| discuss some of the thorny issues: | discuss some of the thorny issues: | |||
| * Benoit Claise | * Benoit Claise | |||
| * Watson Ladd | * Watson Ladd | |||
| * Brad Peters | * Brad Peters | |||
| * Bo Wu | * Bo Wu | |||
| * Georgios Karagiannis | * Georgios Karagiannis | |||
| * Olga Havel | * Olga Havel | |||
| skipping to change at line 740 ¶ | skipping to change at line 741 ¶ | |||
| Qin Wu | Qin Wu | |||
| Huawei | Huawei | |||
| 101 Software Avenue, Yuhua District | 101 Software Avenue, Yuhua District | |||
| Nanjing | Nanjing | |||
| Jiangsu, 210012 | Jiangsu, 210012 | |||
| China | China | |||
| Email: bill.wu@huawei.com | Email: bill.wu@huawei.com | |||
| Chaode Yu | Chaode Yu | |||
| Huawei Technologies | Huawei | |||
| Email: yuchaode@huawei.com | Email: yuchaode@huawei.com | |||
| End of changes. 17 change blocks. | ||||
| 43 lines changed or deleted | 44 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. | ||||
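
The Event workflow and the Threshold Process described in the compared text can be illustrated with a small sketch. It is not part of RFC 9940 and is not a definitive implementation. It only shows, under stated assumptions, how a Value crossing a threshold may be determined to be an Event, how Relevance turns an Event into an Occurrence, and how an undesired Occurrence is a Fault. The class and function names (`ThresholdCounter`, `evaluate`), the resource name `router-1`, the characteristic `memory_used_pct`, the 90.0 limit, and the "three consecutive readings" dampening rule are all hypothetical choices made for illustration; the document states only that the Threshold Process may be implementation specific and subject to policies.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Event:
    """A Change in the Value of a Characteristic of a Resource at a time."""
    resource: str
    characteristic: str
    old_value: float
    new_value: float
    timestamp: float


@dataclass(frozen=True)
class Occurrence:
    """A Relevant Event; it is a Fault when it is not desired."""
    event: Event
    is_fault: bool


class ThresholdCounter:
    """Hypothetical Threshold Process with simple flap dampening.

    Assumption: an Event is determined only when the Value has been above
    `limit` for exactly `min_count` consecutive readings.
    """

    def __init__(self, limit: float, min_count: int) -> None:
        self.limit = limit
        self.min_count = min_count
        self.count = 0  # the Count of consecutive readings above the limit

    def observe(self, resource: str, characteristic: str,
                old: float, new: float, ts: float) -> Optional[Event]:
        # Increment the Count while the Value stays above the limit,
        # and reset it as soon as the Value drops back.
        self.count = self.count + 1 if new > self.limit else 0
        # Report once, at the moment the Count reaches min_count.
        if self.count == self.min_count:
            return Event(resource, characteristic, old, new, ts)
        return None


def evaluate(event: Event,
             is_relevant: Callable[[Event], bool],
             is_undesired: Callable[[Event], bool]) -> Optional[Occurrence]:
    """Apply policy: only a Relevant Event becomes an Occurrence."""
    if not is_relevant(event):
        return None
    return Occurrence(event=event, is_fault=is_undesired(event))


# Illustrative readings of an analog Value (made-up numbers).
readings = [85.0, 92.0, 95.0, 97.0, 99.0]
counter = ThresholdCounter(limit=90.0, min_count=3)

events: List[Event] = []
previous = readings[0]
for ts, value in enumerate(readings[1:], start=1):
    ev = counter.observe("router-1", "memory_used_pct", previous, value, float(ts))
    if ev is not None:
        events.append(ev)
    previous = value

# Hypothetical policy: every memory Event is Relevant, and a rising Value
# is undesired, so the resulting Occurrence is treated as a Fault.
occurrences = [
    occ for occ in (
        evaluate(e,
                 is_relevant=lambda e: e.characteristic == "memory_used_pct",
                 is_undesired=lambda e: e.new_value > e.old_value)
        for e in events
    ) if occ is not None
]
```

In this sketch the dampening rule keeps a single reading above the limit from producing an Event, which mirrors the caution about "flapping" in the text. Whether a Fault then gives rise to an Alert or an Alarm is left out, since that depends on policy the document does not specify.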