Network Working Group T. Przygienda Internet-Draft T. Li Intended status: Experimental Juniper Networks Expires: 9 November 2025 8 May 2025 ISIS Hierarchical SNPs draft-prz-lsr-hierarchical-snps-00 Abstract The document introduces an optional, new type of SNPs called Hierarchical SNP (HSNP) that can compress the traditional CSNPs exchange into a variant of merkle tree, hence allowing to support very large databases and adjacency numbers. Such an approach should lead in case of inconsistencies to much faster re-synchronization since only a subset of packets compared to full scale CSNP exchange is necessary to correct the entropy present. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on 9 November 2025. Copyright Notice Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved. Przygienda & Li Expires 9 November 2025 [Page 1] Internet-Draft ISIS Hierarchical SNPs May 2025 This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/ license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 2. Suggested Dynamic Leaf Partitioning . . . . . . . . . . . . . 3 2.1. Repacking . . . . . . . . . . . . . . . . . . . . . . . . 4 3. Fast, incremental, self-inverse hashing function . . . . . . 5 4. HSNP PDU Format . . . . . . . . . . . . . . . . . . . . . . . 8 5. Procedures . . . . . . . . . . . . . . . . . . . . . . . . . 9 5.1. Maximum supported level negotiation in IIH . . . . . . . 9 5.2. Advertising HSNPs . . . . . . . . . . . . . . . . . . . . 9 6. Further Considerations . . . . . . . . . . . . . . . . . . . 10 6.1. Impact of Packet Losses . . . . . . . . . . . . . . . . . 10 6.2. Decompression and Caching/Comparison Optimizations . . . 10 7. Security Considerations . . . . . . . . . . . . . . . . . . . 10 8. IANA Section . . . . . . . . . . . . . . . . . . . . . . . . 10 9. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 11 10. Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . 11 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 11 11.1. Normative References . . . . . . . . . . . . . . . . . . 11 11.2. Informative References . . . . . . . . . . . . . . . . . 11 Appendix A. Reference Implementation of Fragment Hashing . . . . 11 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12 1. Introduction The document introduces an optional, new type of SNPs called Hierarchical SNP (HSNP) that can compress the traditional CSNPs exchange into a variant of merkle tree [MERKLE] and skip lists [SKIP], hence allowing to support very large databases and adjacency numbers. Such an approach should lead in case of inconsistencies to much faster re-synchronization since only a subset of packets compared to full scale CSNP exchange is necessary to correct the entropy present. Przygienda & Li Expires 9 November 2025 [Page 2] Internet-Draft ISIS Hierarchical SNPs May 2025 Although the scheme can be applied recursively to the point where a single merkle hash represents the whole database, for practical purposes a two level tree allows compression for LSDB sizes of the 1E5 order and hence the document limits itself in examples to such magnitudes. Further considerations mentioned in Section 6 seem to make this limit practically prudent. We call CSNP entries of LSPs the zero level merkle hash (where we basically use lsp id, seq# and checksum as "hash") and the hash summarizing a set of those first level merkle hash and so on recursively. 2. Suggested Dynamic Leaf Partitioning Practically speaking, the most interesting problem is the correct subdivision of the database into first level collection of leaf nodes (CSNP entries) that on one hand can provide a good compression, on the other hand the subdivision changes as little as possible. Otherwise the neighbors receiving the information may have to recompute the hashes rather than relying on a cache representing its own merkle tree. The subdivision should also produce enough "bins" no matter the distribution of the fragment IDs in the network. This is important to prevent such things as packing of all the fragments into a single checksum when e.g. a hash function degenerates. And ideally, a hash mismatch should produce not more than a single packet or two with lower level checksums or CSNPs to optimize re-convergence while minimizing amount of packets exchanged. To start with, in IS-IS networks we can fit into the prevailing 1500 bytes somewhat less than 100 of zero level entries for the foreseeable future. This is the consequence of CSNP entries consuming LSP ID + Fragment + Seq# + CSUM + Lifetime length which amounts to 8 + 2 + 2 + 2 = 14 bytes each. Subsequently, first and higher level hashes will occupy (as shown in Figure 2) the length of 7 + 7 + 4 = 18 bytes per hash and hence around 80 of those hashes fit into a packet. Those considerations lead to the suggested partitioning and packing scheme below. To start with, as stated, it is desirable to produce one packet on a miss on the merkle hash of first level leaf and hence such leaves will pack initially 80 LSP fragments with exceptions following later. We do not try to maximize the "initial packing". LSDB may grow and to maximize the chances of same "leaf packing" on both sides of an adjacency, even during flooding transitions, some "slack" is advisable. Przygienda & Li Expires 9 November 2025 [Page 3] Internet-Draft ISIS Hierarchical SNPs May 2025 At higher levels, to begin with 60 hashes should be summarized by a higher level hash for the same reason. The packing will always put all fragments of a system into same leaf (which of course can exceed the advisable 80 fragments sometimes) and a first order leaf will be considered "full" if addition of next System ID fragments would exceed this size (except obviously, when the leaf is empty). 2.1. Repacking During flooding transitions the databases on different nodes may be obviously different for some period of time. As stated, even in such situations it is beneficial for efficient implementations using caching to agree on both sides on the ranges of the Merkle hashes advertised. Or alternately, have a fast way to reconstruct the internal merkle hash for a range. E.g., in the suggested packing scheme, since the nodes will always agree on the "system-ID" boundary, a merkle hash per system ID can be easily kept and if the received ranges do not agree with cached ranges the necessary merkle hash recomputed very quickly. As example, in first level the worst case of 2 leaves containing ~80 nodes each and "intersecting" the cached internal ranges an implementation will have to merkle hash about 160 hashes (one per System ID) to validate the received value. Obviously, it is still highly preferred for the ranges on advertised merkle hashes to agree on their system ID ranges precisely and it is of most importance at the top level. Under stable conditions those are the merkle hashes reducing the CSNP exchange to minimal amount of packets and processing effort. To settle on the same ranges in HSNPs an implementation of the suggested packing should let a leaf that drops under 50% occupancy "start robbing" system IDs from "left" of the next leaf until the current leaf meets the "full condition". This is of course a recursive action that may ultimately generate less leaves, remove some and in a recursive fashion lead to the same "greedy robbery from the left" in the next level up. The "left" is colloquial here for starting with lowest System IDs under normal sorting criteria. On the other end of the spectrum a leaf that holds more than 150% of usual capacity (i.e. 80 * 1.5 LSPs or 60 * 1.5 Hashes) should be preferably split into two leaves unless it holds a single System ID with more than 80 fragments. Splitting the leaves may cause a repacking at a higher level again in a recursive fashion. The rebalancing of ranges to agree across all nodes and hence reduce hashing load is a trade-off in terms of possibly large recomputation vs. suffering a penalty of recomputing some hashes on disagreeing Przygienda & Li Expires 9 November 2025 [Page 4] Internet-Draft ISIS Hierarchical SNPs May 2025 ranges on every exchange. Other solutions are of course possible such as internal caches that keep the recomputed hashes for the neighbor's ranges. Precise splitting/merging algorithms agreed upon increase the likelihood of nodes ending up on precisely same ranges. A possibly simpler idea to discuss is to simply "repack" the whole thing on some balance violations or periodically. Another idea is to simply use ISIS fragment sliding but this may lead in worst case to first level checksumming a single fragment over time. Overall, different partitioning and packing approaches are possible but if system ID as natural partition is not used, this will likely change the packet format since the partition boundaries will necessarily reflect which of the fragments are covered by the hashes. Although, given that ordering of fragments has to be preserved it is hard to imagine anything else but start and end consisting of fragment IDs. 3. Fast, incremental, self-inverse hashing function Since we need to generate massive amount of hashes over sets of ever- changing fragments it is very desirable to use a specialized hash function that provides means for very fast incremental adjustment of hash result on fragments arriving, aging out or changing their checksums. Also, splitting or merging of leaves should allow to generate first and higher order hashes very quickly. Deeper considerations on such hashes can be found in [HASHES] but our design space is simplified due to irrelevance of security involved. We use a large prime number constant to prime all hashes. The hash function is a very fast XOR operation which incorporates the system IDs, checksums and numbering of fragments. It is 64 bits long for simplicity. Symbol << is used for rotation (not shift) to the left. A hash of a first order leaf is nothing but (where all used values are converted to 64 bit values first with any remaining high order bits zero'ed out) the usual XOR(hash, (system-id-of-fragment + 1) << fragment#) followed by XOR(hash, (fragment# + 1) << 32 | (fragment- seq# + 1) << 16 | (fragment-csum + 1)) for all contained fragments . Obviously, such an operation can be easily reversed to yield the previous hash with the fragment removed which allows for very fast removal, addition of fragments and updates of the relevant fields. Appendix A includes reference code of the introduced hashing. Przygienda & Li Expires 9 November 2025 [Page 5] Internet-Draft ISIS Hierarchical SNPs May 2025 Example of adding multiple fragments and removing those (in different order) which yields ultimately the same initial hash is found below: Initial Hash: CDDC72CF Adding Fragment 0100.0000.0000.00 #: 0 CSUM: 1 SEQNR: 0 Hash after Addition 10001CDDD72CC Adding Fragment 0100.0000.0000.00 #: 1 CSUM: 2 SEQNR: 1 Hash after Addition 1010003CDDF72CE Adding Fragment 0200.0000.0000.00 #: 1 CSUM: 3 SEQNR: 2 Hash after Addition 3010001CDDC72CB Removing Fragment 0100.0000.0000.00 #: 0 CSUM: 1 SEQNR: 0 Hash after Removal 3000000CDDD72C8 Removing Fragment 0100.0000.0000.00 #: 1 CSUM: 2 SEQNR: 1 Hash after Removal 2000002CDDF72CA Removing Fragment 0200.0000.0000.00 #: 1 CSUM: 3 SEQNR: 2 Hash after Removal CDDC72CF Figure 1 A hash of second order leaf is nothing but XOR of all the hashes of all the contained first order leaves. Again, this allows to update the hash when adding, removing a leaf or changing its checksum in a very fast and simple manner. The packing of second order leaves is determined by how many first order hashes can be fitted into a PDU normally and based on Section 4 we can set a second order leaf size to 70. This will cause a mismatch in the hash of a 2nd order leaf to advertise roughly a PDU full of first order leaves. Ultimately, and fairly obviously, third order hash uses second order hash logic to keep its hash. This all means that every time a first order leaf changes the contained system IDs for some reason the merkle hashes will have to be readjusted recursively in according 2nd and third order leaves. This is in itself nothing particular since _any_ change on first order leaf hash forces change on second order and consequently third order leaf hash. This is how Merkle trees work after all. A first example will serve well here. We limit ourselves in the examples to consideration of a LSDB with 512 nodes with system identifiers of 1000.0000.00 <2 digits node-id> each holding 32 fragments numbered 0 to 31. We leave the uneven node identifiers out to have some "holes" in the numbering to hit some corner cases in further examples. We disregard the pseudo node byte as simply another byte of system identifier since it does not contribute further details to the scheme and use value 0 in further text. Przygienda & Li Expires 9 November 2025 [Page 6] Internet-Draft ISIS Hierarchical SNPs May 2025 In a stable state we can expect the following 128 first order leaves (each holding 2 systems worth of fragments) generating three packets. And as logical consequence two single second order leaves. First of the three first order packets will look roughly like this ... +--------------------------------------------+ | Start System ID: 0000.0000.0000.00 | +--------------------------------------------+ | End System ID: 0000.0000.00A0.00 | // 80 hashes for 160 systems +--------------------------------------------+ | HSNP Level: 0 | +--------------------------------------------+ | Start System ID: 1000.0000.0000.00 | +--------------------------------------------+ | End System ID: 1000.0000.0002.00 | // 64 fragments over 2 systems +--------------------------------------------+ | Merkle Hash | +--------------------------------------------+ .. | Start System ID: 1000.0000.008E.00 | +--------------------------------------------+ | End System ID: 1000.0000.00A0.00 | +--------------------------------------------+ | Merkle Hash | +--------------------------------------------+ Figure 2 The 2nd order packet will look like this Przygienda & Li Expires 9 November 2025 [Page 7] Internet-Draft ISIS Hierarchical SNPs May 2025 ... +--------------------------------------------+ | Start System ID: 0000.0000.0000.00 | +--------------------------------------------+ | End System ID: FFFF.FFFF.FFFF.FF | +--------------------------------------------+ | HSNP Level: 1 | +--------------------------------------------+ | Start System ID: 1000.0000.0000.00 | +--------------------------------------------+ | End System ID: 1000.0000.00A0.00 | +--------------------------------------------+ | Merkle Hash | +--------------------------------------------+ | Start System ID: 1000.0000.00A2.00 | +--------------------------------------------+ | End System ID: 1000.0000.0200.00 | +--------------------------------------------+ | Merkle Hash | +--------------------------------------------+ Figure 3 4. HSNP PDU Format HSNP PDU Format follows closely CSNP format where instead of CSNP entries the according merkle hashes are propagated. ... +--------------------------------------------+ | PDU Length | +--------------------------------------------+ | Source ID | +--------------------------------------------+ | Start System ID | +--------------------------------------------+ | End System ID | +--------------------------------------------+ | HSNP Level | +--------------------------------------------+ | Variable Length Fields | +--------------------------------------------+ Start and End System IDs are or the usual ID Length + 1 byte length and indicate, just like CSNP do, the range that the HSNP covers. Przygienda & Li Expires 9 November 2025 [Page 8] Internet-Draft ISIS Hierarchical SNPs May 2025 HSNP level consists of 1 byte of level with 0 indicating "first level hashes". The variable length fields are a sorted sequence of covered ranges in the following format +--------------------------------------------+ | Start System ID | +--------------------------------------------+ | End System ID | +--------------------------------------------+ | Merkle Hash | +--------------------------------------------+ End System ID LSPs are included in the hash. Merkle hash consists of 4 bytes of the 64-bit computed hash with upper and lower 4 bytes XOR'ed together. This makes an entry in typical deployment scenarios 7 + 7 + 4 = 18 bytes long. 5. Procedures 5.1. Maximum supported level negotiation in IIH IIH of nodes supporting this extension MUST include in IIH a new TLV that will indicate support for reception of HSNPs. Additionally, the TLV will carry the maximum level of HSNPs that the node will advertise. Each node MUST pick the minimum of advertised levels on the adjacency and use that for the HSNPs it will advertise. All nodes on the adjacency MUST advertise the TLV on their IIHs, otherwise HSNPs are not used. 5.2. Advertising HSNPs Advertising normal CSNPs is replaced with advertisement of HSNPs at highest negotiated level. Under normal stable network condition this will be enough to maintain database integrity across all nodes with a minimum of transmission and processing. In case a node receives HSNPs where the merkle hash ranges are not the same, the node MUST either to compute and verify the hashes over the ranges indicated or disaggregate the overlapped ranges to a lower level HSNPs. The disaggregation is less preferred since in case of range mismatches over all levels and both sides using this strategy this can lead to a 'ping-pong' ending with CSNPs ultimately. Przygienda & Li Expires 9 November 2025 [Page 9] Internet-Draft ISIS Hierarchical SNPs May 2025 A node receiving an HSNP where the hash received does not match on recomputation or comparison the result on its own LSDB SHOULD send immediately HSNPs of one level below with the Merkle hashes for the ranges where the hash mismatch was detected. Alternately, a node MAY choose to immediately send according CSNPs, PSNPs or flood the LSPs that have been detected as not matching the merkle hashes. Sending CSNPs may be preferable if the mismatch covers relatively few LSPs. In case a node detects that it holds Merkle hashes for LSPs that are not covered by the received HSNP (start and end range and "holes" between ranges can be used to detect that condition), it MUST trigger the same behavior as triggered by CSNP with this condition, i.e. flood the missing LSPs. 6. Further Considerations 6.1. Impact of Packet Losses Many levels of aggregation will be more susceptible to packet losses since each loss covers a much larger part of the LSDB. Additionally, during disaggregation where multiple HSNP levels must be triggered, the loss will force a delay until the higher level HSNP is regenerated. 6.2. Decompression and Caching/Comparison Optimizations As mentioned above a node may apply many strategies to speed up decompression. LSPs missing in HSNPs as not covered by ranges are clearly "missing in action" and can be reflooded, small ranges where merkle mismatched can generate CSNPs, PSNPs or lead to flooding. Caching of hashes can be applied at many levels since the merkle hashes suggested here are easily computed, even if certain elements must be removed from them, so e.g. on receiving a range while the node already holds < A -- B & next-after-B> can be simply adjusted by removing 'next-after-B' merkle hash defined in this document from the cached result. 7. Security Considerations TBD 8. IANA Section TBD Przygienda & Li Expires 9 November 2025 [Page 10] Internet-Draft ISIS Hierarchical SNPs May 2025 9. Contributors TBD 10. Acknowledgement The discussions about "compressing CSNPs" go very long way back, allegedly to the times Radia Perlman was stalking the halls with insomniac Dave Katz. Recent efforts to push the scale of the protocol to new heights made the effort worth the effort (pun intended) to codify it into a standardized, practical engineering solution. 11. References 11.1. Normative References [HASHES] Phan, C.-W., "Security considerations for incremental hash functions based on pair block chaining", Computers and Security 25 , 2006. [MERKLE] Merkle, R.C., "A Digital Signature Based on a Conventional Encryption Function", Advances in Cryptology – CRYPTO '87 , 1988. [SKIP] Phan, C.-W., "Skip lists: A probabilistic alternative to balanced trees", Communications of the ACM , 1990. 11.2. Informative References Appendix A. Reference Implementation of Fragment Hashing Przygienda & Li Expires 9 November 2025 [Page 11] Internet-Draft ISIS Hierarchical SNPs May 2025 #[derive(Eq, PartialEq, Hash)] struct IncrementalHash(u64); const SEED: u64 = 0xCDDC72CF; impl Default for IncrementalHash { fn default() -> Self { IncrementalHash(SEED) } } impl IncrementalHash { // takes arbitrary number of byte arrays fn hash_in(&mut self, add: &[u64]) { for f in add { self.0 ^= f; } } fn hash_in_fragment(&mut self, sid: &SystemID, fragment_number: FragmentNr, fragment: &Arc) { let len = 8.min(sid.id.len()); let mut sidc: [u8; 8] = [0; 8]; sidc[(8-len) ..].copy_from_slice(&sid.id); sidc.rotate_left(fragment_number as _); self.hash_in(&[u64::from_be_bytes(sidc) + 1]); let frnr = fragment_number as u64 + 1; let frsn = fragment.seqnr as u64 + 1; let frcs = fragment.csum as u64 + 1; let hin = frnr << 32 | frsn << 16 | frcs; self.hash_in(&[hin]); } } Authors' Addresses Tony Przygienda Juniper Networks Email: prz@juniper.net Tony Li Juniper Networks Przygienda & Li Expires 9 November 2025 [Page 12] Internet-Draft ISIS Hierarchical SNPs May 2025 Email: tli@juniper.net Przygienda & Li Expires 9 November 2025 [Page 13]