From 2ffc960e67d7fe130cc9092428909c86b6935e52 Mon Sep 17 00:00:00 2001
From: Michael Brown <mcb30@etherboot.org>
Date: Fri, 27 May 2005 11:44:46 +0000
Subject: [PATCH] Added diatribe about the mismatch between the PXE spec and
 the TFTP protocol, and how we will work around it.

---
 src/interface/pxe/pxe_tftp.c | 105 ++++++++++++++++++++++++++++-------
 1 file changed, 86 insertions(+), 19 deletions(-)

diff --git a/src/interface/pxe/pxe_tftp.c b/src/interface/pxe/pxe_tftp.c
index 3f07aef4..029f7f42 100644
--- a/src/interface/pxe/pxe_tftp.c
+++ b/src/interface/pxe/pxe_tftp.c
@@ -30,14 +30,14 @@
  * @v tftp_open				Pointer to a struct s_PXENV_TFTP_OPEN
  * @v s_PXENV_TFTP_OPEN::ServerIPAddress TFTP server IP address
  * @v s_PXENV_TFTP_OPEN::GatewayIPAddress Relay agent IP address, or 0.0.0.0
- * @v s_PXENV_TFTP_OPEN::Filename	Name of file to open
+ * @v s_PXENV_TFTP_OPEN::FileName	Name of file to open
  * @v s_PXENV_TFTP_OPEN::TFTPPort	TFTP server UDP port
  * @v s_PXENV_TFTP_OPEN::PacketSize	TFTP blksize option to request
  * @ret #PXENV_EXIT_SUCCESS		File was opened
  * @ret #PXENV_EXIT_FAILURE		File was not opened
  * @ret s_PXENV_TFTP_OPEN::Status	PXE status code
- * @ret s_PXENV_TFTP_OPEN::PacketSize	Negotiated 
- * @err .......				..........
+ * @ret s_PXENV_TFTP_OPEN::PacketSize	Negotiated blksize
+ * @err #PXENV_STATUS_TFTP_INVALID_PACKET_SIZE Requested blksize too small
  *
  * Opens a TFTP connection for downloading a file a block at a time
  * using pxenv_tftp_read().
@@ -46,11 +46,21 @@
  * routing will take place.  See the relevant
  * @ref pxe_routing "implementation note" for more details.
  *
- * s_PXENV_TFTP_OPEN::PacketSize must be at least 512.
+ * The blksize negotiated with the TFTP server will be returned in
+ * s_PXENV_TFTP_OPEN::PacketSize, and will be the size of data blocks
+ * returned by subsequent calls to pxenv_tftp_read().  The TFTP server
+ * may negotiate a smaller blksize than the caller requested.
+ *
+ * Some TFTP servers do not support TFTP options, and will therefore
+ * not be able to use anything other than a fixed 512-byte blksize.
+ * The PXE specification version 2.1 requires the caller to pass in
+ * s_PXENV_TFTP_OPEN::PacketSize with a value of 512 or
+ * greater.
  *
  * You can only have one TFTP connection open at a time, because the
- * PXE API requires the PXE stack to keep state about the open TFTP
- * connection (rather than letting the caller do so).
+ * PXE API requires the PXE stack to keep state (e.g. local and remote
+ * port numbers, data block index) about the open TFTP connection,
+ * rather than letting the caller do so.
  *
  * It is unclear precisely what constitutes a "TFTP open" operation.
  * Clearly, we must send the TFTP open request to the server.  Since
@@ -65,7 +75,15 @@
  * solution to this problem.
  *
  * 
-
+ * @note If you pass in a value less than 512 for
+ * s_PXENV_TFTP_OPEN::PacketSize, Etherboot will attempt to negotiate
+ * this blksize with the TFTP server, even though such a value is not
+ * permitted according to the PXE specification.  If the TFTP server
+ * ends up dictating a blksize larger than the value requested by the
+ * caller (which is very probable in the case of a requested blksize
+ * less than 512), then Etherboot will return the error
+ * #PXENV_STATUS_TFTP_INVALID_PACKET_SIZE.
+ *
  * @note According to the PXE specification version 2.1, this call
  * "opens a file for reading/writing", though how writing is to be
  * achieved without the existence of an API call %pxenv_tftp_write()
@@ -253,44 +271,48 @@ file" operations.  The problem is the unreliable nature of UDP
 transmissions and the lock-step mechanism employed by TFTP to
 guarantee file transfer.  The lock-step mechanism requires that if we
 time out waiting for a packet to arrive, we must trigger its
-retransmission by retransmitting our previously transmitted packet.
+retransmission by retransmitting our own previously transmitted
+packet.
 
 For example, suppose that pxenv_tftp_read() is called to read the
 first data block of a file from a server that does not support TFTP
 options, and that no data block is received within the timeout period.
-In order to trigger the retransmission of this data block
+In order to trigger the retransmission of this data block,
 pxenv_tftp_read() must retransmit the TFTP open request.  However, the
 information used to build the TFTP open request is not available at
-this time; it was provided only to the pxenv_tftp_open() call.
+this time; it was provided only to the pxenv_tftp_open() call.  Even
+if we were able to retransmit a TFTP open request, we would have to
+allocate a new local port number (and be prepared for data to arrive
+from a new remote port number) in order to avoid violating the TFTP
+protocol specification.
 
 The question of when to transmit the ACK packets is also awkward.  At
 a first glance, it would seem to be fairly simple: acknowledge a
 packet immediately after receiving it.  However, since the ACK packet
 may itself be lost, the next call to pxenv_tftp_read() must be
-prepared to re-acknowledge the packet.
+prepared to retransmit the acknowledgement.
 
 Another problem to consider is that the pxenv_tftp_open() API call
 must return an indication of whether or not the TFTP open request
 succeeded.  In the case of a TFTP server that doesn't support TFTP
 options, the only indication of a successful open is the reception of
 the first data block.  However, the pxenv_tftp_open() API provides no
-way to return this data block at this time.  Pretending that we lost
-the data block and requesting retransmission is problematic, because
-the only way to request retransmission of the first data block in such
-a case is to reissue the TFTP open request, which has side effects
-such as requiring the allocation of a new local port number.
+way to return this data block at this time.
 
 At least some PXE stacks (e.g. NILO) solve this problem by violating
 the TFTP protocol and never bothering with retransmissions, relying on
 the TFTP server to retransmit when it times out waiting for an ACK.
-This approach is dubious at best.
+This approach is dubious at best; if, for example, the initial TFTP
+open request is lost, then NILO will believe that it has opened the
+file and will eventually time out and give up while waiting for the
+first packet to arrive.
 
 The only viable solution seems to be to allocate a buffer for the
 storage of the first data packet returned by the TFTP server, since we
 may receive this packet during the pxenv_tftp_open() call but have to
 return it from the subsequent pxenv_tftp_read() call.  This buffer
 must be statically allocated and must be dedicated to providing a
-temporary home to TFTP packets.  There is nothing in the PXE
+temporary home for TFTP packets.  There is nothing in the PXE
 specification that prevents a caller from calling
 e.g. pxenv_undi_transmit() between calls to the TFTP API, so we cannot
 use the normal transmit/receive buffer for this purpose.
@@ -334,6 +356,51 @@ acknowledgement packet.)
 In order to set up this invariant condition for the first call to
 pxenv_tftp_read(), pxenv_tftp_open() must do the following:
 
-  - 
+  - Construct and transmit the TFTP open request.
+
+  - Retransmit the TFTP open request (using a new local port number as
+    necessary) until a response (DATA, OACK, or ERROR) is received.
+
+  - If the response is an OACK, acknowledge the OACK and retransmit
+    the acknowledgement until the first DATA packet arrives.
+
+  - If we have a DATA packet, store it in a buffer ready for the first
+    call to pxenv_tftp_read().
+
+This approach has the advantage of being fully compliant with both
+RFC1350 (TFTP) and RFC2347 (TFTP options).  It avoids unnecessary
+retransmissions.  The cost is approximately 1500 bytes of
+uninitialised storage.  Since there is demonstrably no way to avoid
+paying this cost without either violating the protocol specifications
+or introducing unnecessary retransmissions, we deem this to be a cost
+worth paying.
+
+A small performance gain may be obtained by adding a single extra
+"send ACK" in both pxenv_tftp_open() and pxenv_tftp_read() immediately
+after receiving the DATA packet and copying it into the internal
+buffer.  The sequence of events for pxenv_tftp_read() then becomes:
+
+  - Copy the data packet from our buffer to the caller's buffer.
+
+  - If this was the last data packet, return immediately.
+
+  - Check to see if a TFTP data packet is waiting.  If not, send an
+    ACK for the data packet that we have just copied, and retransmit
+    this ACK until the next data packet arrives.
+
+  - Copy the packet into our internal buffer, ready for the next call
+    to pxenv_tftp_read().
+
+  - Send a single ACK for this data packet.
+
+Sending the ACK at this point allows the server to transmit the next
+data block while our caller is processing the current packet.  If this
+ACK is lost, or the DATA packet it triggers is lost or is consumed by
+something other than pxenv_tftp_read() (e.g. by calls to
+pxenv_undi_isr()), then the next call to pxenv_tftp_read() will not
+find a TFTP data packet waiting and will retransmit the ACK anyway.
+
+Note to future API designers at Intel: try to understand the
+underlying network protocol first!
 
 */
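
The blksize rule described in the pxenv_tftp_open() comment could be
sketched roughly as follows.  This is an illustration only, not code
from the patch: sketch_check_blksize(), the sketch_status values and
SKETCH_DEFAULT_BLKSIZE are hypothetical names, and the convention that
an oack_blksize of 0 means "the server sent no OACK" is an assumption
made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch only; the real
 * PXENV_STATUS_* values are defined by the PXE specification. */
enum sketch_status {
	SKETCH_STATUS_SUCCESS = 0,
	SKETCH_STATUS_INVALID_PACKET_SIZE = 1,
};

/* Block size used by a server that does not support TFTP options */
#define SKETCH_DEFAULT_BLKSIZE 512

/*
 * requested:    blksize passed in by the caller
 * oack_blksize: blksize from the server's OACK, or 0 if the server
 *               replied with DATA directly (assumed convention)
 * negotiated:   set to the blksize actually in use, on success only
 */
static enum sketch_status sketch_check_blksize ( uint16_t requested,
						 uint16_t oack_blksize,
						 uint16_t *negotiated ) {
	uint16_t actual;

	assert ( negotiated != NULL );

	/* No OACK means the server will use fixed 512-byte blocks */
	actual = ( oack_blksize ? oack_blksize : SKETCH_DEFAULT_BLKSIZE );

	/* Server dictated a larger blksize than the caller asked for */
	if ( actual > requested )
		return SKETCH_STATUS_INVALID_PACKET_SIZE;

	*negotiated = actual;
	return SKETCH_STATUS_SUCCESS;
}
```

With this sketch, a request for 1432 that is answered with an OACK for
1024 succeeds with a negotiated blksize of 1024, while a request for
256 against a server without option support fails, since that server
can only send 512-byte blocks.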