Understanding Transmission Control Protocol Fundamentals
TCP is the most commonly used transport protocol for applications that run on enterprise networks and the Internet today. TCP provides the following functions:
- Connection-oriented service between application processes on two nodes that are exchanging data
- Guaranteed delivery of data between these two processes
- Bandwidth discovery and congestion avoidance, so that connections share available bandwidth fairly based on current utilization and WAN capacity
Before data can be sent between application processes on two different nodes, a connection must first be established. Once the connection is established, TCP provides guaranteed, reliable delivery of data between the two application processes.
Connection-Oriented Service
The TCP connection is established through a three-way handshake between two sockets on two nodes that wish to exchange data. A socket is defined as the network identifier of a node coupled with the port number associated with the application process that wants to communicate with a peer. The use of TCP sockets is displayed in Figure 6-2.
Figure 6-2 TCP Socket
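To make the socket definition concrete, the following minimal Python sketch opens a TCP connection and prints the two sockets (address and port pairs) that identify its endpoints. The host example.com and the timeout are illustrative assumptions, not drawn from the text.

```python
# Sketch: a connection is identified by a pair of sockets (IP, port).
import socket

with socket.create_connection(("example.com", 80), timeout=5) as s:
    local_ip, local_port = s.getsockname()[:2]    # this node's socket
    remote_ip, remote_port = s.getpeername()[:2]  # the peer's socket
    print(f"local socket:  {local_ip}:{local_port}")
    print(f"remote socket: {remote_ip}:{remote_port}")
```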
During the establishment of the connection, the two nodes exchange information relevant to the parameters of the conversation (illustrated in the header sketch following this list). This information includes:
- Source and destination TCP ports: The ports that are associated with application processes on each of the nodes that would like to exchange application data.
- Initial sequence number: Each device notifies the other what sequence number should be used for the beginning of the transmission.
- Window size: The advertised receive window size; that is, the amount of data that the advertising node can safely hold in its socket receive buffer.
- Options: Optional header fields commonly used for extensions to TCP behavior. For instance, these can include features such as window scaling and selective acknowledgment, which were not part of the original TCP specification but can augment TCP behavior (an authoritative list of TCP options can be found at http://www.iana.org/assignments/tcp-parameters).
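As a rough illustration of where these parameters live on the wire, the following Python sketch hand-packs and then unpacks a TCP header. Every value here (the ports, the initial sequence number, the window, the window-scale option) is an arbitrary example chosen for the sketch.

```python
# Sketch: the TCP header fields exchanged during connection establishment.
import struct

SRC_PORT, DST_PORT = 51515, 80     # ephemeral client port -> HTTP
ISN = 1                            # initial sequence number (illustrative)
WINDOW = 65535                     # advertised receive window, in bytes
SYN_FLAG = 0x02                    # SYN bit, set during the handshake

# Data offset = 6 x 32-bit words (20-byte header + 4 bytes of options).
offset_flags = (6 << 12) | SYN_FLAG
# Option: window scale (kind=3, length=3, shift=7), padded with a NOP (1).
options = bytes([3, 3, 7, 1])

header = struct.pack("!HHIIHHHH", SRC_PORT, DST_PORT, ISN, 0,
                     offset_flags, WINDOW, 0, 0) + options

# Unpack it the way a receiver reads the handshake parameters.
sport, dport, seq, ack, _, window, _, _ = struct.unpack("!HHIIHHHH",
                                                        header[:20])
print(sport, dport, seq, window, header[20:])
# -> 51515 80 1 65535 b'\x03\x03\x07\x01'
```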
For instance, if an Internet user wants to use Internet Explorer to access http://www.cisco.com, the user's computer would first have to resolve the name www.cisco.com to an IP address and then attempt to establish a TCP connection to the web server hosting www.cisco.com, using the well-known port for HTTP (TCP port 80) unless a different port number was specified. If the web server hosting www.cisco.com is accepting connections on TCP port 80, the connection would likely establish successfully. During connection establishment, the server and client tell one another how much data they can receive into their socket buffers (window size) and what initial sequence number to use for the initial transmission of data. As data is exchanged, the sequence number advances by the amount of data sent, which allows the receiving node to verify the appropriate ordering of data. During the life of the connection, TCP employs a checksum in each segment to provide a fairly accurate measure of the integrity of the data.
Once the connection is established between two nodes (IP addresses) and two application process identifiers (TCP ports), the application processes using those ports can begin to exchange application layer data. For instance, once a connection is established, a user can submit a GET request to the web server to which it is connected to begin downloading objects from a web page, or begin exchanging control messages using SMTP or POP3 to transmit or receive an e-mail message. TCP connection establishment is shown in Figure 6-3.
Figure 6-3 TCP Connection Establishment
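A minimal Python sketch of the example above follows: the name is resolved first, the three-way handshake occurs inside connect(), and only then can application data (an HTTP GET) flow. This assumes network reachability, and the live server may well redirect or refuse a plain-HTTP request.

```python
# Sketch: resolve a name, establish a TCP connection, exchange HTTP data.
import socket

host = "www.cisco.com"
addr = socket.gethostbyname(host)     # name resolution precedes TCP
print(f"{host} resolved to {addr}")

with socket.create_connection((addr, 80), timeout=5) as s:
    # connect() has completed the three-way handshake; data may now flow.
    s.sendall(f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
              .encode("ascii"))
    print(s.recv(200).decode("ascii", errors="replace"))  # first bytes back
```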
Guaranteed Delivery
Once transmission commences, application data is drained from the application buffers of the transmitting process into the node's socket transmit buffer. TCP then negotiates the transmission of data from the socket transmit buffer to the recipient node (that is, the draining of the buffer) based on the recipient node's ability to receive data, as dictated by the initially advertised window size and the current window size. Given that application data blocks may be quite large, TCP breaks the data into segments, each with a sequence number that identifies the relative ordering of the portions of data that have been transmitted. If the node receives segments out of order, TCP can reorder them according to the sequence number. If TCP buffers become full for one of the following reasons, a blocking condition can occur:
- TCP transmit buffer becomes full: The transmit buffer on the transmitting node can become full if network conditions prevent delivery of data or if the recipient is overwhelmed and cannot receive additional data. While the recipient is unable to receive more data, applications may continue to add data to the transmit buffer to await service. With data stranded in the transmit buffer, unable to be transmitted, applications on the transmitting node may become blocked (that is, subject to momentary or prolonged pauses in transmission): new data cannot be written into the transmit buffer unless space is available, and space generally cannot be freed until the recipient is able to receive more data or the network is able to deliver data again. (This condition is demonstrated in the sketch following Figure 6-4.)
- TCP receive buffer becomes full: Commonly caused by the receiving application not extracting data from the socket receive buffer quickly enough. For instance, an overloaded server, that is, one receiving data at a rate greater than the rate at which it can process that data, would exhibit this characteristic. As the receive buffer becomes full, new data cannot be accepted from the network for this socket and must be dropped, which signals a congestion event to the transmitting node.
Figure 6-4 shows how TCP acts as an intermediary buffer between the network and applications within a node.
Figure 6-4 TCP Buffering Between the Network and Applications
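The transmit-buffer case can be observed directly. In the following simplified Python sketch (an illustration, not production code), a loopback peer never reads, so the sender's socket transmit buffer eventually fills and a non-blocking send is refused; a blocking socket would simply stall at the same point. The buffer size is an illustrative assumption.

```python
# Sketch: fill a TCP transmit buffer by sending to a peer that never reads.
import socket

listener = socket.create_server(("127.0.0.1", 0))   # loopback listener
sender = socket.create_connection(listener.getsockname())
receiver, _ = listener.accept()                     # accepted, never read

sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)  # small buffer
sender.setblocking(False)

sent = 0
try:
    while True:
        sent += sender.send(b"x" * 1024)  # drain application data into TCP
except BlockingIOError:
    # The transmit buffer (and the peer's receive buffer) is full; the
    # application is now blocked until the receiver drains some data.
    print(f"transmit buffer full after {sent} bytes")
finally:
    sender.close(); receiver.close(); listener.close()
```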
When data is successfully placed into a recipient node's TCP receive buffer, TCP generates an acknowledgment (ACK) whose value identifies the tail of the sequence just received; specifically, the ACK carries the sequence number of the next byte expected. For instance, if the initial sequence number was "1" and 1 KB of data was transmitted, then when the data is placed into the recipient's socket receive buffer, the recipient TCP stack will issue an ACK with a value of 1025, indicating that bytes 1 through 1024 have arrived. As data is extracted from the recipient node's socket receive buffer by the application process associated with that socket, the TCP stack on the recipient will generate an ACK with the same acknowledgment number but will also indicate an increase in the available window capacity. Given that applications today are generally able to extract data immediately from a TCP receive buffer, it is likely that the acknowledgment and window relief would come simultaneously.
The next segment that is sent from the transmitting node will have a sequence number equal to the previous sequence number plus the amount of data sent in the previous segment (1025 in this example), and can be transmitted only if there is available window capacity on the recipient node as dictated by the acknowledgments sent from the recipient. As data is acknowledged and the window value is increased (data in the TCP socket buffer must be extracted by the application process, thereby relieving buffer capacity and thus window capacity), the sender is allowed to continue to send additional data to the recipient up to the maximum capacity of the recipient's window (the recipient also has the ability to send dynamic window updates indicating increases or decreases in window size).
This process of transmitting data based on the recipient's ability to receive and previously acknowledged segments is commonly referred to as the TCP sliding window. In essence, as the recipient continues to receive and acknowledge or otherwise notify of an increase in window size, the window on the transmitting node shifts to allow new data to be sent. If at any point buffers become full or the window is exhausted, the recipient must first service data that has been previously received and acknowledge the sender before any new data can be sent. An example of this process is shown in Figure 6-5.
Figure 6-5 TCP Operation
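The following toy model (an illustration written for this discussion, not the book's code) walks through the sliding-window arithmetic described above: sequence numbers advance by the bytes sent, a cumulative ACK names the next byte expected, and the advertised window caps how much data may remain unacknowledged in flight.

```python
# Toy sliding-window model; window and segment sizes are illustrative.
RECV_WINDOW = 4096          # receiver's advertised window, in bytes
SEGMENT = 1024              # bytes carried per segment

next_seq = 1                # initial sequence number (as in the example)
in_flight = 0               # bytes sent but not yet acknowledged

def send_segment():
    global next_seq, in_flight
    if in_flight + SEGMENT > RECV_WINDOW:
        return False        # window exhausted; sender must wait for an ACK
    print(f"send seq={next_seq} len={SEGMENT}")
    next_seq += SEGMENT
    in_flight += SEGMENT
    return True

def receive_ack():
    global in_flight
    ack = next_seq          # cumulative ACK: the next byte expected
    in_flight = 0           # the window slides; all sent data acknowledged
    print(f"ack={ack}, window reopens")

while send_segment():       # fills the window: seq 1, 1025, 2049, 3073
    pass
receive_ack()               # ack=4097; the window slides forward
send_segment()              # transmission resumes at seq 4097
```

Running the model shows the behavior of Figure 6-5 in miniature: four segments fill the 4096-byte window, the cumulative ACK slides it forward, and sending resumes.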
Additionally, TCP provides a flag that allows an application process to notify TCP to send data immediately rather than wait for a larger amount of data to accumulate in the socket buffer. This flag, called a push, or PSH, instructs TCP to immediately send all data that is buffered for a particular destination. This push of data still requires that the previous conditions be met, including availability of a window on the recipient node. When the data is transmitted, the PSH flag in the TCP header is set to a value of 1, which also instructs the recipient to deliver the data to the receiving application process promptly rather than hold it in the socket receive buffer waiting for additional data.
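Note that the standard sockets API does not let an application set the PSH flag directly; the TCP stack sets the flag itself when it flushes buffered data. The closest commonly used application-level control is the TCP_NODELAY socket option, which disables Nagle's algorithm so that small writes are transmitted immediately rather than held back to coalesce, as sketched below.

```python
# Sketch: request "send immediately" behavior at the application level.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# From this point, each send() is handed to the network without waiting
# for additional data to accumulate, much as a push requests.
s.close()
```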
Nodes that are transmitting also use the acknowledgment number, sequence number, and window value as a gauge of how long to retain data that has been previously transmitted. Each segment that has been transmitted and is awaiting acknowledgment is placed in a retransmission queue and is considered unacknowledged by the recipient. When a segment is placed in the retransmission queue, a retransmission timer is started that indicates how long the sender will wait for an acknowledgment. If an acknowledgment is not received in time, the segment is retransmitted. Given that the window size is generally larger than a single segment, many segments are likely to be outstanding in the network awaiting acknowledgment at any given time.
From a transport layer perspective, the loss of a segment might not prevent transmission from continuing. However, given that the application layer is really dictating transport layer behavior (for instance, an upper-layer protocol acknowledgment), the loss of a segment may indeed prevent transmission from continuing.
The purpose of this retransmission queue is twofold:
- It allows the transmitting node to allocate memory capacity to retain the segments that have been previously transmitted. If a segment is lost (to congestion or packet loss), it can be retransmitted from the retransmission queue, where it remains until acknowledged by the recipient.
- It allows the original segment, once placed in the retransmission queue, to be removed from the original transmission queue. This in effect allows TCP to continually extract data from the local transmitting application process without compromising the transmitting node's ability to retransmit should a segment become lost or otherwise unacknowledged.
An example of TCP retransmission management is shown in Figure 6-6.
Figure 6-6 TCP Retransmission Management
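A simplified sketch of this bookkeeping, assuming cumulative ACKs and a fixed per-segment timeout (both illustrative choices, not the book's implementation), might look like the following.

```python
# Toy retransmission queue: retain each transmitted segment with a timer,
# release it on a covering ACK, and re-send it when the timer expires.
import time

class RetransmissionQueue:
    def __init__(self, timeout=1.0):
        self.timeout = timeout
        self.pending = {}                 # seq -> (segment bytes, send time)

    def on_transmit(self, seq, segment):
        # Retain a copy so the original transmit queue can be reused.
        self.pending[seq] = (segment, time.monotonic())

    def on_ack(self, ack):
        # Cumulative ACK: any segment ending at or before `ack` is released.
        for seq in [s for s, (seg, _) in self.pending.items()
                    if s + len(seg) <= ack]:
            del self.pending[seq]

    def to_retransmit(self):
        # Segments whose timers have expired must be sent again.
        now = time.monotonic()
        return [(seq, seg) for seq, (seg, sent) in self.pending.items()
                if now - sent > self.timeout]

q = RetransmissionQueue()
q.on_transmit(1, b"x" * 1024)             # bytes 1..1024 awaiting ACK
q.on_ack(1025)                            # cumulative ACK releases them
print(q.pending)                          # {}
```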
Bandwidth Discovery
The TCP sliding window acts as a throttling mechanism, ensuring that data is transmitted at a rate that aligns with the available buffer capacity and window of the two devices exchanging data. TCP also includes mechanisms that throttle transmission based on the capacity of the network and any conditions encountered within it.
In some cases, the nodes exchanging data can send more data than the network can handle, and in other cases (more common today), the nodes cannot send as much data as the network can handle. When nodes transmit more data than the network is prepared to handle, congestion and loss occur, and TCP has mechanisms built in to account for these situations. In this way, TCP is adaptive: it bridges the gap between the transmission and receive requirements and limitations of the application processes and the congestion and loss characteristics of the network. This throttling also balances applications against the network, continually leveraging whatever network capacity is available while attempting to maximize application throughput.
Discovering the steady state for an application process, meaning the balance between available network capacity and the rate of data exchange between application processes and socket buffers, is largely subjective, because application behavior is dictated by input, that is, the rate at which the application attempts to send or receive data through the socket buffers. Network throughput, by contrast, is more deterministic and objective, governed by factors such as bandwidth, latency, congestion, and packet loss, even if the interplay of these factors can make it appear nondeterministic and subjective.
The terms congestion and loss are used separately here, even though the two generally go hand in hand. In some situations, congestion can simply refer to a delay in service caused by saturated buffers or queues, queues that are not completely full, but full enough to slightly delay the delivery of a segment of data. These factors are deterministic, based on network utilization and physics rather than on a set of input criteria, as would be the case with an application. In any case, applications drive the utilization of network bandwidth, so congestion and loss remain tightly coupled to application behavior.
TCP provides capabilities to respond to network conditions, thus allowing it to perform the following basic but critical functions:
- Initially find a safe level at which data can be transmitted and continually adapt to changes in network conditions
- Respond to congestion or packet loss events through retransmission and make adjustments to throughput levels
- Provide fairness when multiple concurrent users are contending for the same shared resource (bandwidth)
These capabilities are implemented in two TCP functions that are discussed in the next two sections: TCP slow start and TCP congestion avoidance.
TCP Slow Start
TCP is responsible for initially finding the amount of network capacity available to the connection. This function, as introduced in Chapter 2, "Barriers to Application Performance," is provided by a TCP mechanism called slow start (also known as bandwidth discovery), which is also employed in longer-lived connections when the available window falls below a value known as the slow-start threshold. Despite its name, slow start is arguably a misnomer: it ramps up the transmission rate exponentially and is "slow" only in that it begins from a very small window rather than the receiver's full advertised window.
Slow start uses an exponential increase in the number of segments that can be sent per successful round trip, and this mechanism is employed at the beginning of a connection to find the initial available network capacity. In a successful round trip, data is transmitted based on the current value of the congestion window (cwnd) and an acknowledgment is received from the recipient. (The cwnd correlates to the number of segments that can be sent and remain unacknowledged in the network and is a dynamic value that cannot exceed the window size of the recipient.) The cwnd value is bound to an upper threshold defined by the receiver window size and the transmission buffer capacity of the sender.
With TCP slow start, the transmitting node starts by sending a single segment and waits for acknowledgment. Upon acknowledgment, the transmitting node doubles the number of segments that are sent and again awaits acknowledgment (in practice, cwnd grows by one segment for each acknowledgment received, which doubles it once per round trip when every segment is acknowledged). This process repeats until one of the following two scenarios is encountered:
- An acknowledgment is not received, thereby signaling packet loss or excessive congestion
- The number of segments that can be sent (cwnd) is equal to the window size of the recipient or equal to the maximum capacity of the sender
The first case is encountered only in the following circumstances:
- The capacity of the network is less than the transmission capabilities of the sender and receive capabilities of the receiver
- A loss event is detected (no acknowledgment received within the time allocated to the transmitted segment as it is placed in the retransmission queue)
- Congestion delays the delivery of the segment long enough to allow the retransmission queue timer for the transmitted segment to expire
The second case is encountered only when the capacity of the network is equal to (with no loss or congestion) or greater than the transmission capabilities of the sender and the receive capabilities of the receiver. In this case, the sender or the receiver cannot fully capitalize on the available network capacity because of the limits imposed by window capacity or buffer sizes.
The result of TCP slow start is an exponential increase in throughput, up to the available network capacity or up to the available transmit/receive throughput of the two nodes as dictated by buffer size or advertised window size. This is a relatively accurate process for finding the initially available bandwidth, and it is generally skewed only when the network exhibits high packet loss, which may cut the slow-start process short. When one of the two previously listed cases is encountered, the connection exits TCP slow start, knowing the initially available network capacity, and enters congestion avoidance mode. Slow start is not entered again unless the cwnd of the connection falls below the slow-start threshold value.
If the first case is encountered (loss of a packet, or excessive delay causing the retransmission timer to expire), standard TCP implementations sharply reduce the cwnd value: a loss signaled by duplicate acknowledgments halves cwnd, while an expired retransmission timer collapses cwnd to a single segment and sets the slow-start threshold to half the previous value. If the second case is encountered (no loss), cwnd settles into a steady state equal to the receiver's advertised window size. TCP slow start is shown in Figure 6-7.
Figure 6-7 TCP Slow Start
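The following toy model reproduces slow start under the simplified rules above: cwnd, counted in segments, doubles each successful round trip until either loss is signaled or cwnd reaches the receiver's advertised window. The window size and loss point are illustrative assumptions.

```python
# Toy slow start: exponential growth per round trip until loss or rwnd.
RWND_SEGMENTS = 64          # receiver's advertised window, in segments

def slow_start(loss_at=None):
    cwnd, rtt = 1, 0
    while cwnd < RWND_SEGMENTS:
        rtt += 1
        print(f"RTT {rtt}: send {cwnd} segment(s)")
        if loss_at is not None and cwnd >= loss_at:
            print("loss detected: exit slow start into congestion avoidance")
            return cwnd // 2              # simplified 50 percent reduction
        cwnd = min(cwnd * 2, RWND_SEGMENTS)
    print("cwnd reached receiver window: exit slow start")
    return cwnd

slow_start()                # no loss: 1, 2, 4, ... up to the 64-segment rwnd
slow_start(loss_at=32)      # network capacity found at about 32 segments
```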
TCP Congestion Avoidance
Once the TCP connection exits slow start (bandwidth discovery), it then enters a mode known as congestion avoidance. Congestion avoidance is a mechanism that allows the TCP implementation to react to situations encountered in the network. Packet loss and delay signal congestion in the network, which could be indicative of a number of factors:
- Bandwidth allocation change: For instance, a change in a variable-bandwidth circuit can result in the network being able to service more or less overall throughput, depending on the direction and nature of the change.
- Network oversubscription: When a shared network connection between upstream devices is used by multiple concurrent users, it can become congested to the point of loss or delay.
- Congestion of device queues: Similar to network oversubscription, a shared device such as a router can have its queues exhausted to the point that it cannot accept new packets. The same effect can result from a QoS configuration that dictates maximum bandwidth utilization for a specific traffic class, or drop policies for that class when congestion is encountered.
- Overload of destination: Destination socket buffers can become full because of an application's inability to drain data from them in a timely fashion, potentially because of the server being overwhelmed.
This list only begins to scratch the surface of why packets can be lost or otherwise delayed. The good news is that TCP was designed to work in an unreliable, lossy network (one with high levels of packet loss) and uses these events to its advantage, throttling its transmission characteristics to adapt to network conditions.
While in congestion avoidance mode, standard TCP increments the number of segments that can be transmitted without acknowledgment by one for every successful round trip, up to the point where cwnd has parity with the recipient's advertised window size. Any time a loss is detected, standard TCP reduces cwnd: a loss signaled by duplicate acknowledgments (fast retransmit) halves cwnd, while an expired retransmission timer collapses it to a single segment. Reducing cwnd limits the amount of data that can be unacknowledged in the network (in other words, in transit), which directly constrains the sender's transmission rate. TCP congestion avoidance is shown in Figure 6-8.
Figure 6-8 TCP Congestion Avoidance
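The sawtooth of congestion avoidance can be sketched in a few lines: cwnd grows by one segment per successful round trip (additive increase) and is cut when a loss event occurs (multiplicative decrease). The loss probability, seed, and window values are arbitrary illustrative choices, and this model halves cwnd on every loss, a simplification of the duplicate-ACK versus timeout distinction noted above.

```python
# Toy additive-increase/multiplicative-decrease (AIMD) model.
import random

random.seed(42)             # arbitrary seed for a reproducible run
RWND = 64                   # receiver's advertised window, in segments
cwnd = 32                   # starting point after slow start exits

for rtt in range(1, 21):
    if random.random() < 0.1:            # ~10% chance of a loss event
        cwnd = max(1, cwnd // 2)         # multiplicative decrease
        print(f"RTT {rtt}: loss, cwnd halved to {cwnd}")
    else:
        cwnd = min(cwnd + 1, RWND)       # additive increase, capped at rwnd
        print(f"RTT {rtt}: cwnd grows to {cwnd}")
```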
Through the use of slow start and congestion avoidance mode, TCP can discover available network capacity for a connection and adapt the transmission characteristics of that connection to the situations encountered in the network.