February 23, 2017

General Network Challenges, and IP/TCP/UDP Operations

Having fundamental knowledge of what affects TCP, UDP, and IP itself helps you to better troubleshoot the network when things go wrong. I feel like most of the lower-level network-oriented certifications barely touch on these topics, if at all. However, the current Cisco CCNP and CCIE Routing & Switching exams do expect you to know this. This post is geared toward Cisco’s implementation and defaults regarding the various topics. However, whether you are studying for a certification or not, this is all good information to have.

General Network Challenges

Unicast Flooding:

Unicast flooding occurs when there is no entry for the destination MAC address in a device’s Layer 2 table (CAM), and so the incoming unicast frame is flooded to all ports within the VLAN. This is normal when a frame is first received on a device, such as a switch, and the destination address is not yet known. The switch floods the unicast frame out all ports in the VLAN (except the port it was received on), and when the destination address is discovered, an entry is created in the CAM table. Because both ARP and CAM entries age out, one solution to this problem is to make sure ARP entries expire before CAM entries, so that the ARP table can be re-populated before the CAM entry expires, which will reduce flooding.

You can block unicast flooding on a port with switchport block unicast which will prevent unknown unicasts from being flooded to ports configured with this command.

Out of Order Packets:

Also referred to as “out of sequence” packets, these are packets that arrive in a different order from the one in which they were sent. There are many potential causes, such as asymmetric routing and packet loss among diverse paths in the network. TCP includes sequence numbers in an attempt to alleviate this issue by allowing the receiver to buffer received TCP segments and reassemble them in the correct order. However, packets that arrive out of order can still dramatically reduce network performance. For example, the TCP receiver sends duplicate ACKs for segments that arrive ahead of a gap, which can trigger the fast retransmit algorithm. The TCP sender, upon receiving the duplicate ACKs, assumes packets were lost in transit and reduces its congestion window, which reduces the TCP throughput.

Forwarding schemes that implement per-packet load distribution often result in out-of-order packets being received at the destination.

Asymmetric Routing:

Routing is considered asymmetric when the traffic return path is different than the transmit path. “Hot potato” routing, where traffic traversing a network egresses at the nearest exit point, and link redundancy/alternate network paths, are two common causes of asymmetric routing. This can also be caused by NAT and firewall traversal. When the number of routers that traffic must pass through increases, the likelihood of asymmetric routing also increases.

Asymmetric routing is not normally an issue unless the traffic passes through a device that is expecting to see both sending and return traffic from a single IP flow. For example, if traffic is flowing through a firewall, it will generally expect to see traffic returned to it. If the traffic is sent to a different firewall where the session has not been established, the traffic generally will be dropped.

Impact of Microbursts:

Microbursts are short spikes of traffic in the network and are typically encountered when traffic enters a higher-speed interface and exits a lower-speed interface, or in “fan-in” scenarios where traffic from multiple ports is trying to reach a single port. The impact to network traffic is usually observed as jitter, latency, and packet drops. Packet drops can be mitigated with larger buffers in the network devices at the cost of increasing the packet latency.

The impact of microbursts can also be observed when the networking device is unable to fully process all of the traffic at line rate. For example, a software-based router may have gigabit interfaces, but may be unable to handle a sudden burst of traffic at a full gigabit of speed. After the packet buffers have filled, packets will begin to drop, which can be observed on interface statistics as output drops.
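The buffer trade-off described above can be sketched with a toy queue model. This is only an illustration, not how any particular platform schedules packets: the function name, the per-tick time model, and the numbers are all assumptions made for the example.

```python
def simulate_queue(arrivals, drain_per_tick, buffer_size):
    """Toy egress queue: returns (dropped_packets, deepest_queue_seen).

    arrivals       -- packets arriving in each time tick (hypothetical units)
    drain_per_tick -- packets the egress interface can send per tick
    buffer_size    -- packets the queue can hold before tail-dropping
    """
    queue = 0
    drops = 0
    max_depth = 0
    for arriving in arrivals:
        queue += arriving
        if queue > buffer_size:
            # Buffer is full: everything beyond it is dropped (output drops).
            drops += queue - buffer_size
            queue = buffer_size
        max_depth = max(max_depth, queue)
        # The interface transmits what it can during this tick.
        queue = max(0, queue - drain_per_tick)
    return drops, max_depth


if __name__ == "__main__":
    burst = [0, 0, 50, 0, 0]  # a single 50-packet microburst
    print(simulate_queue(burst, drain_per_tick=10, buffer_size=20))
    print(simulate_queue(burst, drain_per_tick=10, buffer_size=64))
```

With the small buffer, 30 packets of the burst are dropped. With the larger buffer nothing is dropped, but the queue grows to 50 packets, so the last packet in the burst waits several ticks before it is sent.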

IP Operations

ICMP Unreachables & Redirects:

The ICMP Destination Unreachable message acts as a feedback mechanism for a router to let a source device know that it has no method to communicate with the desired destination. However, it is up to the sending device, upon receiving the ICMP Destination Unreachable message, whether or not to take any particular action. The ICMP Destination Unreachable (Type 3) message uses a code to indicate what type of failure is present. The six codes originally defined in RFC 792 are:

Code 0 “Network Unreachable” means the packet could not be delivered to the network specified.
Code 1 “Host Unreachable” means the packet can be routed to the destination network, but the specified host does not exist on the destination network.
Code 2 “Protocol Unreachable” is sent when the router receives a nonbroadcast message destined for itself that uses an unknown protocol.
Code 3 “Port Unreachable” means the destination host has no process listening on the specified transport-layer port.
Code 4 “Fragmentation needed and the DF bit is set” means the packet is too large for the next link but may not be fragmented.
Code 5 “Source Route Failed” means a source-routed packet could not be forwarded along the specified route.

ICMPv6 uses Type 1 for Destination Unreachable messages.

ICMP Redirect messages (ICMP Type 5) are used when there are multiple gateways on the same network segment, and a host sends a packet to one of the gateways, but a different gateway on the same network segment has a better metric to the destination. When the gateway configured on the host receives the packet, but determines via its routing table that a different gateway on the same network segment is closer, it forwards the packet to the better gateway, and sends an ICMP Redirect message back to the host indicating that it should use the better gateway to reach the destination for future packets. ICMPv6 uses Type 137.

Defaults:

ICMP unreachable messages are disabled by default for security. However, disabling the unreachable messages also disables IP Path MTU Discovery because it works by having the software send unreachable messages.
ICMP unreachable messages are limited to one message per 500ms by default.
ICMP redirects are disabled by default if HSRP is configured on the interface
ICMP redirects are sent by Cisco routers by default when all of the following conditions are met:
- The ingress interface is also the egress interface for routing
- The source IP address is on the same subnet as the next-hop IP address of the routed packet
- The packet is not source-routed
- The kernel is configured to send redirects (which is a default in itself)
ICMP redirects are of the “subnet” type by default (as opposed to “host” type).

Configuration:

ICMP unreachable messages are disabled by default, enable on individual interfaces:
- ip unreachables
ICMP redirects are enabled by default, disable on individual interfaces:
- no ip redirects
Change the ICMP redirect type from the default “subnet” to “host” globally:
- ip icmp redirect host
- Some hosts do not understand the ICMP redirect subnet type and must use the host type.
Control the rate of sending ICMP unreachable messages globally:
- ip icmp rate-limit unreachable

Verification:

View ICMP messages in realtime:
- debug ip icmp

IPv4 Options and IPv6 Extension Headers:

IP options provide additional flexibility in how packets are handled. All devices using IPv4 must be able to handle IP packets with options. IP packets may contain zero or more options, which makes the total length of the Options field in the IPv4 header variable.

Each option can be either a single byte, or multiple bytes, depending on how much information the option needs to convey. When multiple options are present, they appear together in the options field, including any necessary padding to make the options field a multiple of 32 bits.

Individual options are generally structured as TLVs (type-length-value), except for options where the option type itself conveys all the required information, in which case the option length and option data subfields are omitted. The option type octet has three subfields:

Copied flag, where a value of 1 means that the option is copied into all the fragments upon packet fragmentation
Option class, where 0 indicates the control class, 2 indicates debugging and measurement, and 1 & 3 are reserved
Option number, which is a 5-bit field to indicate 32 different options (defined by IANA)
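As a rough sketch of how those three subfields pack into one octet (1 bit, 2 bits, and 5 bits, from most to least significant), the following helper splits an option type value apart. The names `OptionType` and `decode_option_type` are made up for this example.

```python
from typing import NamedTuple


class OptionType(NamedTuple):
    copied: int      # 1 bit: copy the option into every fragment?
    opt_class: int   # 2 bits: 0 = control, 2 = debugging and measurement
    number: int      # 5 bits: option number assigned by IANA


def decode_option_type(octet: int) -> OptionType:
    """Split an IPv4 option type octet into its three subfields."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("option type must fit in one octet")
    return OptionType(
        copied=(octet >> 7) & 0x1,
        opt_class=(octet >> 5) & 0x3,
        number=octet & 0x1F,
    )


if __name__ == "__main__":
    for name, value in [("Record Route", 7),
                        ("Loose Source Route", 131),
                        ("Strict Source Route", 137),
                        ("Router Alert", 148)]:
        print(name, decode_option_type(value))
```

For example, Router Alert has the option type value 148, which decodes to copied = 1, class = 0, number = 20, while Record Route (7) is not copied into fragments.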

Common options are:

Record Route, which allows the source to create an empty list of IPv4 addresses and requests each router along the path to add its IPv4 address to the list
Strict Source Route, where the complete path the datagram must follow to its destination is specified
Loose Source Route, where the path is specified, and all routers in the list must be traversed, but additional routers may also be in the path
Router Alert, which causes each router along the path to examine the packet, even if the router is not the ultimate destination

For IPv6, compared to IPv4, most of the options have been removed or altered, and are placed after the main IPv6 header in one or more Extension Headers. By doing this, the main IPv6 packet header remains a fixed size of 40 bytes, which increases the speed of packet processing. Extension headers are not examined or processed by any node along the packet’s delivery path, until the packet reaches the node(s) identified in the Destination Address (DA) field of the IPv6 header, with the exception of “Hop-by-Hop Options”, which must be the very first extension header if it is present and is examined by every node along the path.

Within the main IPv6 header, the “Next Header” field indicates the type of header to follow, based on the header code. All extension header types include a Next Header field which logically links together all of the extension headers, with the final extension header’s Next Header field pointing to the payload itself.

Common Next Header value codes are:

0: Hop-by-Hop Options: this special option is examined by all devices along the path, unlike the other options
43: Routing: used similarly to the loose source routing option in IPv4
44: Fragment: includes fragment offset, identification, and more fragments fields
50: ESP: Encapsulating Security Payload
51: AH: Authentication Header
60: Destination Options: options intended to be examined only by the destination node.

When multiple extension headers are present, they should be placed into the following order:

Hop-By-Hop Options
Destination Options (for options to be processed by the destination as well as devices specified in a Routing header)
Routing
Fragment
AH
ESP
Destination Options (for options to be processed by the final destination only)

Whether in the main IPv6 header, or the final IPv6 Extension Header, the payload itself is referred to by its IANA-assigned protocol number. For example, TCP is protocol number 6, UDP is 17, and so on.
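To make the Next Header chaining more concrete, here is a simplified sketch that follows the chain until it reaches an upper-layer protocol. It assumes the input starts immediately after the 40-byte IPv6 header, and it deliberately handles only Hop-by-Hop, Routing, Destination Options, and Fragment headers (AH and ESP use different length rules and are left out). The function and table names are hypothetical.

```python
# Extension headers whose size is carried in a "Hdr Ext Len" byte,
# counted in 8-byte units beyond the first 8 bytes.
VARIABLE_LENGTH = {0: "Hop-by-Hop Options", 43: "Routing", 60: "Destination Options"}
FRAGMENT = 44  # the Fragment header is always 8 bytes long
UPPER_LAYER = {6: "TCP", 17: "UDP", 58: "ICMPv6"}


def walk_next_headers(first_next_header: int, data: bytes) -> list:
    """Return the names of the headers found, ending with the payload protocol.

    first_next_header -- Next Header value from the main IPv6 header
    data              -- bytes that follow the main IPv6 header
    """
    chain = []
    next_header, offset = first_next_header, 0
    while next_header in VARIABLE_LENGTH or next_header == FRAGMENT:
        if offset + 8 > len(data):
            raise ValueError("truncated extension header")
        if next_header == FRAGMENT:
            chain.append("Fragment")
            length = 8
        else:
            chain.append(VARIABLE_LENGTH[next_header])
            length = (data[offset + 1] + 1) * 8
        # The first byte of every extension header is its own Next Header field.
        next_header = data[offset]
        offset += length
    chain.append(UPPER_LAYER.get(next_header, f"protocol {next_header}"))
    return chain


if __name__ == "__main__":
    hop_by_hop = bytes([43, 0, 1, 4, 0, 0, 0, 0])  # Next Header = Routing
    routing = bytes([6, 0, 0, 0, 0, 0, 0, 0])      # Next Header = TCP
    print(walk_next_headers(0, hop_by_hop + routing))
```

With the two hand-built 8-byte headers in the example, the walk yields Hop-by-Hop Options, then Routing, then TCP.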

IPv4 and IPv6 Fragmentation:

When an IP packet is larger than the MTU of the outgoing data link, it must be fragmented, unless the DF (Don’t Fragment) bit is set, in which case the packet is dropped. When a packet is dropped because fragmentation is not allowed, an ICMP Destination Unreachable message may be returned.

The destination device must keep track of all the fragments to determine the proper order for reassembly. IP packets have a “More Fragments” (MF) bit and a “Fragment Offset” (FO) field to help keep track of everything. If an IP packet is fragmented, the MF bit is set on all fragments except the final one. The FO field is 13 bits, with each increment representing 8 bytes. For example, if the FO field has a value of 200, it means that this fragment begins 200 x 8 bytes (1600 bytes) into the payload.

When a packet is fragmented, the header of the original packet is transformed into the header of the first fragmented packet, and new headers are created for the additional fragments, each containing the same identification value, but with different FO values.
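The offset arithmetic can be sketched as follows. This is a simplified model that assumes a 20-byte IPv4 header with no options and only tracks the Fragment Offset value, the data length, and the More Fragments bit. The function name `fragment` is just for illustration.

```python
def fragment(payload_len: int, mtu: int, header_len: int = 20):
    """Return a list of (fragment_offset_field, data_len, more_fragments).

    payload_len -- bytes of data carried by the original packet
    mtu         -- MTU of the outgoing link
    header_len  -- IPv4 header size (assumed 20 bytes, no options)
    """
    # Every fragment except the last must carry a multiple of 8 data bytes,
    # because the Fragment Offset field counts in 8-byte units.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    sent = 0
    while sent < payload_len:
        size = min(max_data, payload_len - sent)
        more = sent + size < payload_len
        fragments.append((sent // 8, size, more))
        sent += size
    return fragments


if __name__ == "__main__":
    # A 4000-byte packet (3980 bytes of data) crossing a 1500-byte MTU link.
    for offset, size, more in fragment(3980, 1500):
        print(f"FO={offset:4d}  data={size:4d} bytes  MF={int(more)}")
```

For a 4000-byte packet crossing a link with a 1500-byte MTU, this gives three fragments with FO values of 0, 185, and 370, carrying 1480, 1480, and 1020 bytes, with MF cleared only on the last one.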

A packet may need to be fragmented multiple times during transmission if the MTU decreases multiple times along the path. Routers along the path do not perform fragmentation reassembly, even when a fragment is fragmented again due to an even lower MTU along the path. It is up to the TCP/IP stack in the end device to reassemble the fragments. Fragmentation in the network introduces extra overhead, since only the ultimate destination device can re-assemble the fragments.

Part of the reason for leaving reassembly to the end device is because fragments may take different paths in the network, and therefore the intermediary routers may not see all fragments. The ultimate destination device uses a buffer to collect and reassemble the fragments back into the original payload. However, if any of the fragments are not received when the reassembly timer expires, the entire packet is dropped, and it is up to the higher-layer protocol (such as TCP) to inform the sender that the packet was not received.

Cisco IOS supports a feature called “IP Virtual Fragment Reassembly” where all of the fragments are collected and reassembled for further special processing (such as with ACLs), which is needed for some applications like NAT and security-related processing. After the processing is performed, the fragments are forwarded like normal. VFR is disabled by default, but is enabled automatically when needed, such as when using NAT.

With IPv6, only the source device may fragment IP packets. Intermediary devices such as routers do not fragment IPv6 packets. When fragments are present, IPv6 uses a Fragment extension header. Some extension headers are considered unfragmentable and must be present in each fragment (such as Hop-by-Hop Options). Other extension headers (such as AH and ESP) may be fragmented along with the payload.

Defaults:

Cisco IOS IP virtual fragment reassembly is not enabled by default.
When enabled, VFR has the following defaults:
- Max number of IP datagrams that can be reassembled at any given time is 16
- Max fragments allowed per IP datagram is 32
- IP datagram is dropped if all fragments are not received within 3 seconds

Configuration:

VFR is configured per-interface:
- ip virtual-reassembly

Verification:

Display the VFR configuration and statistical information for an interface:
- show ip virtual-reassembly

TTL:

The 8-bit Time-To-Live field in the IP header was originally used to limit the amount of time in seconds that a packet could exist on the internetwork. Today, it is used as a hop count, where each router decrements the value by 1 as it passes through. When the TTL reaches 0, the packet is dropped, and an ICMP Time Exceeded message may be generated in response.

Since packets are dropped when they reach a TTL of 0, this acts as a method to break logical loops in the network, because IP packets cannot be present in the network indefinitely, unlike Layer 2 Ethernet frames, which have no TTL field. Layer 2 Ethernet frames have the potential to loop around the network forever.

Traceroute uses TTL to determine the path through the network by sending UDP packets with a low TTL, causing routers along the path to drop the packets and send back ICMP Time Exceeded messages. The device performing the traceroute sends a UDP packet with a TTL of 1 toward the destination. The first-hop router decrements the TTL by 1, and since the TTL reaches 0 at that point, a Time Exceeded message is returned.

Then the device performing the traceroute sends a UDP packet with a TTL of 2 toward the destination, which the first-hop router decrements by 1 and passes on to the next hop. The next hop decrements the TTL, where it becomes 0 again, and the router returns a Time Exceeded message. This process repeats with increasing TTL values until the final destination is reached, upon which time the destination will usually report back an ICMP Destination Unreachable message, due to a random UDP port number having been chosen by the device initiating the traceroute for the traceroute probe.
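The probing loop can be modeled without touching the network at all. The sketch below is a pure simulation under simple assumptions: the path is a fixed list of hypothetical router names ending with the destination host, every router answers, and the destination has nothing listening on the probe's UDP port.

```python
def simulate_traceroute(path, max_ttl=30):
    """Return a list of (ttl, responder, icmp_message) tuples.

    path -- routers in order, with the destination host as the last entry
    """
    results = []
    for ttl in range(1, max_ttl + 1):
        remaining = ttl
        for hop in path[:-1]:
            remaining -= 1  # each router decrements the TTL by 1
            if remaining == 0:
                results.append((ttl, hop, "ICMP Time Exceeded"))
                break
        else:
            # The TTL survived every router, so the probe reached the
            # destination, where the random UDP port is not listening.
            results.append((ttl, path[-1], "ICMP Destination Unreachable"))
            return results
    return results


if __name__ == "__main__":
    for row in simulate_traceroute(["R1", "R2", "R3", "server"]):
        print(row)
```

With three routers in front of the server, TTL values 1 through 3 produce Time Exceeded messages from R1, R2, and R3, and the probe with a TTL of 4 produces the Destination Unreachable message from the server.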

When performing a traceroute from Cisco devices, which send three probes to each hop by default, the second probe in the final hop usually times out. This is due to the default ICMP rate limiting of Cisco IOS. The error messages returned from the intermediate routers are “TTL Exceeded”, whereas the message returned by the ultimate destination is “Destination Unreachable”.
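Why it is the second probe in particular that times out can be seen with a small model of the rate limiter. The sketch assumes the destination is a Cisco device that sends at most one unreachable message per 500 ms, and that the traceroute source waits for a 3-second timeout before giving up on a probe. The send times and the function name are made up for the example.

```python
def probe_replies(send_times, min_interval=0.5):
    """Return True/False per probe: did the destination send an unreachable?

    send_times   -- times (in seconds) at which each probe arrives
    min_interval -- minimum spacing between unreachable messages
    """
    replies = []
    last_reply = None
    for t in send_times:
        allowed = last_reply is None or t - last_reply >= min_interval
        replies.append(allowed)
        if allowed:
            last_reply = t
    return replies


if __name__ == "__main__":
    # Probe 1 is answered within milliseconds, so probe 2 follows right away.
    # Probe 2 gets no answer, so probe 3 is only sent after the 3 s timeout.
    print(probe_replies([0.000, 0.002, 3.002]))
```

The first probe is answered, the second arrives well within the 500 ms window and is silently ignored, and by the time the third is sent the limiter has long since reset.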

The way traceroute truly works is platform-dependent. For example, some platforms use ICMP Echo messages instead of UDP probes, but the general concept of gradually increasing the TTL of the sent packets remains the same.

IP MTU:

The Maximum Transmission Unit is the total maximum number of bytes supported in the payload of the transmission. While MTU is often associated with the data-link layer, it also applies to other protocol layers. The higher-layer MTU must fit within the lower-layer MTU. With the data link layer, each technology (Ethernet, serial DS3, etc.) has its own frame format and its own supported MTU.

For example, the default Ethernet MTU is 1500 bytes in most implementations. When an IP packet carrying a TCP segment needs to be sent, 20 bytes are used for the IP header, and 20 for the TCP header, which leaves 1460 bytes for the actual data payload. When setting the MTU, some platforms (like Classic IOS) do not take into account the Layer 2 header, while others (like IOS-XR) do. The default MTU of 1500 for an Ethernet interface on Classic IOS is equivalent to the default Ethernet MTU of 1514 on IOS-XR.

If the data to be sent is larger than the supported MTU on an interface, it must be either fragmented or dropped.

Larger MTU values reduce protocol overhead at the expense of having to retransmit more data when data is lost or corrupted during transport.

MTU can be an issue for IP when different tunneling protocols are used on top of IP. For example, IP-in-IP adds another 20 bytes of overhead, effectively reducing the MTU of the payload by 20.

Defaults:

The default MTU for Ethernet and serial interfaces on Cisco devices is 1500 bytes.

Configuration:

Cisco routers support changing the MTU per interface, and per-xconnect sub-interface:
- mtu BYTES
The Catalyst access switch platforms do not support setting different MTU values per interface; the MTU must be configured globally:
- Change the MTU for all gigabit and 10-gigabit switched interfaces:
  - system mtu jumbo BYTES
  - Note: the switch must be reloaded for the new MTU to take effect.
- Change the MTU for routed ports:
  - system mtu routing BYTES
  - Note: This takes effect immediately, and is used by routing protocols such as OSPF in their advertisements.

Verification:

Display the system MTU settings:
- show system mtu

TCP Operations

IPv4 and IPv6 Path MTU:

To avoid fragmentation when two devices need to communicate over an IPv4/IPv6 network, packets must either be no larger than the smallest link MTU along the entire path, or be kept to a conservative size: 576 bytes for IPv4 (the datagram size every IPv4 host must be able to accept, although a link may technically have an MTU as low as 68 bytes), and 1280 bytes for IPv6 (the minimum link MTU that IPv6 requires).

Path MTU Discovery for IPv4 works by setting the DF bit in the IP header, and any device along the path that cannot support the MTU will drop the packet, but should send back ICMP Type 3 Code 4 (Fragmentation Needed) with its MTU size.

Path MTU Discovery for IPv6 is implemented in the sending device, which starts with the assumption that the path MTU is that of the sending device’s connected link. If a device along the path has a smaller MTU, it will drop the packet and send back an ICMPv6 Type 2 (Packet Too Big) message, containing the MTU.

Most end devices using pMTUd will periodically send new probes to see if the MTU has increased. The default of most implementations, as recommended in RFC 1191, is 10 minutes.

A drawback of pMTUd is that different packets may take different paths in the network, each with their own different MTU sizes.
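The discovery loop can be sketched as a simulation over a list of link MTUs. This assumes a single fixed path and that every device that drops a packet reports its MTU back to the sender, which is exactly what breaks in practice when ICMP is filtered. The function name is hypothetical.

```python
def discover_path_mtu(link_mtus):
    """Return (path_mtu, sizes_tried) for a path made of the given links.

    link_mtus -- MTU of each link, starting with the sender's own link
    """
    if not link_mtus:
        raise ValueError("path needs at least one link")
    size = link_mtus[0]  # start from the MTU of the directly connected link
    attempts = []
    while True:
        attempts.append(size)
        # The first link that cannot carry the packet drops it and reports
        # its MTU (ICMP Type 3 Code 4, or ICMPv6 Packet Too Big).
        too_small = next((mtu for mtu in link_mtus if mtu < size), None)
        if too_small is None:
            return size, attempts
        size = too_small


if __name__ == "__main__":
    print(discover_path_mtu([1500, 1400, 1500, 1300]))
```

For a path with link MTUs of 1500, 1400, 1500, and 1300, the sender tries 1500, then 1400, and settles on 1300, so two packets are lost before the path MTU is known.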

MSS:

The TCP Maximum Segment Size represents the largest amount of TCP payload data a host will accept in a single segment. If the MSS plus the IP and TCP header overhead is larger than the MTU, the datagram must be fragmented at the IP layer.

The MSS of a host is sent in the TCP SYN. Hosts do not negotiate the MSS; each side simply announces its own value, and the connection normally ends up using the lower of the two values. Most hosts will take this a step further and use the outgoing interface’s MTU as part of the MSS calculation. For example, with a typical Ethernet MTU of 1500, the MSS is calculated as 1460 bytes because of the 20 bytes of IP header overhead, and 20 bytes of TCP header overhead. This is done to attempt to avoid fragmentation at the IP layer.

Likewise, when implementing tunneling, the TCP MSS is often adjusted to avoid fragmentation at the IP layer because of the overhead associated with the tunneling protocol(s).

Cisco IOS supports changing the MSS of TCP SYN packets that are sent through the router. This is commonly used with PPPoE, which supports an MTU of 1492 bytes.

The standard TCP Default MSS is 536 bytes (576 bytes for the minimum IP MTU, minus 20 bytes for the IP header, minus 20 bytes for the TCP header). However, most implementations using Ethernet-based networks set the default TCP MSS to 1460 bytes.
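The arithmetic behind these MSS values is simple enough to express as a one-line helper. The sketch assumes IPv4 and TCP headers without options (20 bytes each). The `tunnel_overhead` parameter and the function name are additions for illustration only.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(ip_mtu: int, tunnel_overhead: int = 0) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return ip_mtu - tunnel_overhead - IPV4_HEADER - TCP_HEADER


if __name__ == "__main__":
    print(mss_for_mtu(1500))                      # plain Ethernet
    print(mss_for_mtu(1492))                      # PPPoE
    print(mss_for_mtu(576))                       # standard default
    print(mss_for_mtu(1500, tunnel_overhead=20))  # IP-in-IP over Ethernet
```

This gives 1460 for Ethernet, 1452 for PPPoE, 536 for the standard default, and 1440 when IP-in-IP adds another 20-byte header on a 1500-byte link.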

Defaults:

The MSS is determined by the originating host by default. The router does not adjust the MSS unless configured to do so.

Configuration:

The TCP MSS is adjusted on the interface:
- ip tcp adjust-mss MSS
  - Where the MSS can be from 500 - 1460.
The TCP MSS adjustment is often configured with the MTU of the interface:
- ip mtu MTU
With a typical PPPoE deployment, the MTU is set for 1492 while the MSS is set for 1452.

Verification:

Display the IP-related settings of an interface:
- show ip interface INTERFACE

TCP Latency:

TCP latency is often defined by the round-trip time (RTT), the length of time it takes to receive back a response to a TCP message. For example, establishing a new TCP session involves sending a SYN and expecting to receive a SYN/ACK in response.

Latency begins with the propagation delay, which is no faster than the speed of light. Serialization delay, and intermediary device processing also add to the overall latency.

TCP has an inverse relationship between latency and throughput – when the latency increases, the throughput is decreased. When the latency is increased, the sender may be idle while waiting for acknowledgements. Packet loss combined with latency further compounds the effect on the overall throughput.
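A common back-of-the-envelope way to see this relationship is that a sender can have at most one window of data in flight per round trip, so throughput is bounded by the window size divided by the RTT. The helper below is only that upper bound, ignoring loss, slow start, and header overhead. The function name is hypothetical.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput: one full window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    for rtt_ms in (1, 10, 100):
        bps = max_tcp_throughput_bps(65535, rtt_ms / 1000)
        print(f"RTT {rtt_ms:3d} ms -> at most {bps / 1e6:.1f} Mbps")
```

With an unscaled 65,535-byte window, the bound is roughly 524 Mbps at 1 ms, 52 Mbps at 10 ms, and only about 5.2 Mbps at 100 ms of round-trip time, regardless of how fast the links are.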

UDP traffic does not suffer from these same throughput issues in the presence of latency because it is connectionless and is not expecting to receive back acknowledgements.

Windowing:

TCP is a reliable, connection-oriented protocol, which acknowledges the successful receipt of packets. However, if TCP had to send an acknowledgement for each individual packet, the overhead would be increased, and the performance would be decreased, which is why windowing is implemented.

Windowing allows a single acknowledgement to refer to multiple TCP packets. The window size specifies how many bytes may be sent before an ACK is required. TCP uses cumulative acknowledgement, which means a single value is sent to acknowledge a range of data (without making use of the selective acknowledgement feature). For example, if the window size is 1000 and bytes 1 - 1000 have been received successfully, an ACK with the value 1001 is sent, indicating the starting point for the next set of data.

The sliding window refers to a reference point within the entire TCP stream. For example, if the window size is 1000, and 900 contiguous bytes were acknowledged, the window referencing the entire TCP stream can be shifted 900 bytes to the right, and 900 more bytes can now be sent.
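Cumulative acknowledgement can be sketched from the receiver's point of view: the ACK number only advances across contiguous data, so a gap holds it back even when later bytes have already arrived. This is a simplified model that ignores selective acknowledgement and sequence number wraparound. The function name and the (first_byte, length) segment format are assumptions for the example.

```python
def cumulative_ack(next_expected: int, segments) -> int:
    """Return the ACK number to send after receiving the given segments.

    next_expected -- first byte the receiver has not yet received
    segments      -- iterable of (first_byte, length) pairs, in any order
    """
    for start, length in sorted(segments):
        if start > next_expected:
            # There is a gap: later data is buffered but cannot be
            # acknowledged until the missing bytes arrive.
            break
        next_expected = max(next_expected, start + length)
    return next_expected


if __name__ == "__main__":
    print(cumulative_ack(1, [(1, 500), (501, 500)]))               # in order
    print(cumulative_ack(1, [(1, 500), (1001, 500)]))              # gap at 501
    print(cumulative_ack(1, [(1, 500), (1001, 500), (501, 500)]))  # gap filled
```

The first call returns 1001, matching the example above. With bytes 501 - 1000 missing, the ACK stays at 501, and once the missing segment arrives it jumps straight to 1501.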

The window size can be adjusted by both the sender and the recipient based on how much data they are willing to accept. For example, a device will only have so much room in its buffers, which must be cleared (whether partially or completely) before more data can be accepted.

The window size is indicated by a 16-bit integer in the TCP header, which limits the window to 64KB. However, through the use of the Window Scale TCP option, this value can be shifted left by up to 14 bits, for a maximum window of roughly 1GB. TCP window scaling is often used as a method to alleviate the symptoms of networks with a large bandwidth delay product (LFNs, Long Fat Networks), such as high-speed high-delay satellite links.

Bandwidth Delay Product:

The Bandwidth Delay Product refers to the amount of data that can be in transit at any time between hosts and is calculated by multiplying the capacity of the link in bits per second by its round-trip delay time in seconds.

Networks with large bandwidth delay products are known as Long Fat Networks (LFNs). An example is a high-speed satellite link: the link may have high bandwidth capacity, but it also has a larger delay, which may cause issues with TCP windowing. TCP window scaling is used to alleviate the issue. This can also occur in ultra-high-speed networks.
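As a worked example, the calculation and its link to window scaling can be sketched like this. The link speed and delay are made-up numbers, the function names are hypothetical, and the scaling helper assumes the RFC 7323 limit of a 14-bit shift applied to the 16-bit window field.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Amount of data that can be in flight on the path, in bytes."""
    return bandwidth_bps * rtt_seconds / 8


def window_scale_shift(required_window_bytes: float) -> int:
    """Smallest shift count (0-14) for which 65535 << shift covers the window."""
    for shift in range(15):
        if (65535 << shift) >= required_window_bytes:
            return shift
    raise ValueError("window exceeds the maximum scaled window (about 1 GB)")


if __name__ == "__main__":
    # Hypothetical satellite path: 1 Gbps with a 600 ms round-trip time.
    bdp = bandwidth_delay_product_bytes(1e9, 0.6)
    print(f"BDP: {bdp / 1e6:.0f} MB, window scale shift: {window_scale_shift(bdp)}")
```

A 1 Gbps path with 600 ms of round-trip delay has a BDP of about 75 MB, far beyond the 64KB an unscaled window allows. Filling that path needs a window scale shift of 11.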

Global Synchronization:

Global synchronization refers to multiple TCP streams on a link gradually expanding their window sizes until the link is congested and starts dropping traffic, which causes all TCP senders to reduce their window sizes and repeat the process. This results in a sawtooth-shaped graph of bandwidth utilization on the link.

Global synchronization results from the way TCP uses slow-start and windowing, combined with tail-drop queuing on the router. One way to alleviate these symptoms is to use Random Early Detection queuing, where packets in a queue approaching congestion are randomly discarded, which causes the individual TCP stream to reduce its window size temporarily. By performing this action randomly on individual TCP streams, instead of on all TCP streams at once (tail drop), the bandwidth of the link is used more efficiently.
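The core idea of Random Early Detection can be sketched as a drop probability that ramps up linearly as the average queue depth moves between a minimum and a maximum threshold. This is a simplified version of the classic RED calculation and not Cisco's WRED implementation. The thresholds, the maximum probability, and the function name are all illustrative assumptions.

```python
def red_drop_probability(avg_queue: float, min_th: float, max_th: float,
                         max_p: float) -> float:
    """Probability of dropping an arriving packet for a given average depth.

    avg_queue -- average queue depth (packets)
    min_th    -- below this depth nothing is dropped
    max_th    -- at or above this depth everything is dropped (tail drop)
    max_p     -- drop probability reached just before max_th
    """
    if min_th >= max_th:
        raise ValueError("min_th must be below max_th")
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


if __name__ == "__main__":
    for depth in (10, 25, 30, 39, 40):
        p = red_drop_probability(depth, min_th=20, max_th=40, max_p=0.1)
        print(f"average depth {depth:2d} -> drop probability {p:.3f}")
```

Because only a small, random share of packets is dropped while the queue is filling, only some TCP streams back off at any one time, which is what keeps the senders from synchronizing.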

TCP Options:

Within the TCP header is an Options field that can be of variable length from 0 to 320 bits, aligned to 32-bit boundaries. The first byte is the Option-Kind which identifies the type of option, and by association whether or not the option consists of more than a single byte. Multi-byte options also include a 1-byte Option-Length field, and a variable-length Option-Data field.

Many options are present only in the initial SYN request packet. Common options are:

0: End of Option List
2: MSS Value
3: Window Scale
4: SACK Permitted
5: Blocks of selectively-acknowledged data
8: Timestamp
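The Kind/Length/Data layout can be illustrated with a small parser. This is a minimal sketch that returns raw (kind, data) pairs without interpreting them. In addition to the kinds listed above, it handles kind 1 (No-Operation), a single-byte option used as padding. The function name and the sample bytes are made up for the example.

```python
def parse_tcp_options(raw: bytes):
    """Return a list of (kind, data) tuples from a TCP Options field."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:  # End of Option List: single byte, stop parsing
            options.append((0, b""))
            break
        if kind == 1:  # No-Operation: single byte of padding
            options.append((1, b""))
            i += 1
            continue
        # Every other option carries a length byte covering kind + length + data.
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        options.append((kind, raw[i + 2:i + length]))
        i += length
    return options


if __name__ == "__main__":
    # MSS 1460, NOP, Window Scale 7, SACK Permitted, End of Option List + pad
    syn_options = bytes.fromhex("020405b4" "01" "030307" "0402" "0000")
    for kind, data in parse_tcp_options(syn_options):
        print(kind, data.hex())
```

In the sample, the MSS option carries the two data bytes 05 b4 (1460), Window Scale carries a shift count of 7, and SACK Permitted has a length of 2 with no data at all.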

UDP Operations

Starvation:

TCP Starvation / UDP Dominance occurs when TCP and UDP streams occupy the same queue. When congestion begins to occur and packets are dropped, TCP reacts by reducing its window size and thereby decreasing its transmission speed temporarily. UDP does not have a similar method of traffic control inherent to the protocol, and as the bandwidth utilization by TCP decreases, the UDP traffic increases. When congestion occurs again, TCP is further starved, and UDP further dominates.

The solution to this issue is to create separate queues for TCP and UDP traffic.

Supporting links:

Cisco: Enterprise QoS SRND Guide: MPLS VPN QoS Design

UDP Latency:

Unlike TCP, UDP does not expect to receive acknowledgements and does not suffer from the same throughput-related issues that TCP experiences in the presence of latency and packet loss.

UDP latency can affect application performance, such as voice quality issues in VoIP systems. However, UDP does not implement windowing, as TCP does, so overall throughput is not lost in the presence of latency in the same way that affects TCP.

UDP is commonly used for real-time applications. These applications can be sensitive to both latency and jitter, which is the variation in latency. These applications often attempt to alleviate the symptoms of latency and jitter through the use of buffers, which collect the data and then present it to the application.

Supporting links:

Smutz.us: Network Latency

RTP / RTCP Concepts:

RTP (Real-time Transport Protocol) provides end-to-end delivery services for real-time data like interactive audio and video. RTP typically uses UDP for the underlying transport but can use other protocols. RTP provides additional capabilities on top of the underlying transport, such as UDP, by including sequence numbers and timestamps which can be used by the recipient to properly re-order packets if necessary. RTP also conveys the format of the payload carried in the stream.

RTCP (Real-time Transport Control Protocol) is a companion protocol that provides additional information and statistics about the stream, such as quality and synchronization. This is provided through report messages.

RTP sessions are established for each multimedia stream, which are commonly negotiated with a separate signaling protocol like SIP. Normally, RTP data is sent on even-numbered UDP ports, with the companion RTCP messages being sent on the next-higher odd-numbered port.
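To show where the sequence number, timestamp, and payload format live, here is a sketch that unpacks the fixed 12-byte RTP header laid out in RFC 3550. It ignores CSRC entries and header extensions, and the function name and sample values are made up for the example.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte RTP header at the start of a UDP payload."""
    if len(packet) < 12:
        raise ValueError("RTP header is at least 12 bytes")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,   # identifies the payload format
        "sequence": seq,             # lets the receiver detect loss/reordering
        "timestamp": timestamp,      # sampling instant, used for playout
        "ssrc": ssrc,                # identifies the stream's source
    }


if __name__ == "__main__":
    # Version 2, payload type 0, sequence 1000, timestamp 160, sample SSRC.
    sample = struct.pack("!BBHII", 0x80, 0, 1000, 160, 0x1234)
    print(parse_rtp_header(sample))
```

A receiver's jitter buffer would typically sort on the sequence field and schedule playout from the timestamp field, which is how RTP compensates for UDP not providing either.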

Supporting links:

RFC 3550: RTP: A Transport Protocol for Real-Time Applications (July 2003)