david/ipxe - ipxe - git.socialnerds.org

Archived

Author	SHA1	Message	Date
Michael Brown	75bb948008	[tcp] Use correct length for memset() Signed-off-by: Michael Brown <mcb30@ipxe.org>	2017-03-22 15:11:05 +02:00
Michael Brown	188789eb3c	[tcp] Send TCP keepalives on idle established connections In some circumstances, intermediate devices may lose state in a way that temporarily prevents the successful delivery of packets from a TCP peer. For example, a firewall may drop a NAT forwarding table entry. Since iPXE spends most of its time downloading files (and hence purely receiving data, sending only TCP ACKs), this can easily happen in a situation in which there is no reason for iPXE's TCP stack to generate any retransmissions. The temporary loss of connectivity can therefore effectively become permanent. Work around this problem by sending TCP keepalives after a period of inactivity on an established connection. TCP keepalives usually send a single garbage byte in sequence number space that has already been ACKed by the peer. Since we do not need to elicit a response from the peer, we instead send pure ACKs (with no garbage data) in order to keep the transmit code path simple. Originally-implemented-by: Ladi Prosek <lprosek@redhat.com> Debugged-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2016-06-13 09:58:32 +01:00
Michael Brown	fef8e34b6f	[tcp] Guard against malformed TCP options Signed-off-by: Michael Brown <mcb30@ipxe.org>	2016-01-27 23:06:50 +00:00
Michael Brown	9546b0c17b	[tcp] Ensure FIN is actually sent if connection is closed while idle Signed-off-by: Michael Brown <mcb30@ipxe.org>	2015-07-22 21:16:40 +01:00
Michael Brown	38afcc51ea	[tcp] Gracefully close connections during shutdown We currently do not wait for a received FIN before exiting to boot a loaded OS. In the common case of booting from an HTTP server, this means that the TCP connection is left consuming resources on the server side: the server will retransmit the FIN several times before giving up. Fix by initiating a graceful close of all TCP connections and waiting (for up to one second) for all connections to finish closing gracefully (i.e. for the outgoing FIN to have been sent and ACKed, and for the incoming FIN to have been received and ACKed at least once). Signed-off-by: Michael Brown <mcb30@ipxe.org>	2015-07-04 12:51:23 +01:00
Michael Brown	c117b25e0b	[tcp] Do not shrink window when discarding received packets We currently shrink the TCP window permanently if we are ever forced (by a low-memory condition) to discard a previously received TCP packet. This behaviour was intended to reduce the number of retransmissions in a lossy network, since lost packets might potentially result in the entire window contents being retransmitted. Since commit `e0fc8fe` ("[tcp] Implement support for TCP Selective Acknowledgements (SACK)") the cost of lost packets has been reduced by around one order of magnitude, and the reduction in the window size (which affects the maximum throughput) is now the more significant cost. Remove the code which reduces the TCP maximum window size when a received packet is discarded. Reported-by: Wissam Shoukair <wissams@mellanox.com> Tested-by: Wissam Shoukair <wissams@mellanox.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2015-06-25 10:20:48 +01:00
Michael Brown	e0fc8fe781	[tcp] Implement support for TCP Selective Acknowledgements (SACK) The TCP Selective Acknowledgement option (specified in RFC2018) provides a mechanism for the receiver to indicate packets that have been received out of order (e.g. due to earlier dropped packets). iPXE often operates in environments in which there is a high probability of packet loss. For example, the legacy USB keyboard emulation in some BIOSes involves polling the USB bus from within a system management interrupt: this introduces an invisible delay of around 500us which is long enough for around 40 full-length packets to be dropped. Similarly, almost all 1Gbps USB2 devices will eventually end up dropping packets because the USB2 bus does not provide enough bandwidth to sustain a 1Gbps stream, and most devices will not provide enough internal buffering to hold a full TCP window's worth of received packets. Add support for sending TCP Selective Acknowledgements. This provides the sender with more detailed information about which packets have been lost, and so allows for a more efficient retransmission strategy. We include a SACK-permitted option in our SYN packet, since experimentation shows that at least Linux peers will not include a SACK-permitted option in the SYN-ACK packet if one was not present in the initial SYN. (RFC2018 does not seem to mandate this behaviour, but it is consistent with the approach taken in RFC1323.) We ignore any received SACK options; this is safe to do since SACK is only ever advisory and we never have to send non-trivial amounts of data. Since our TCP receive queue is a candidate for cache discarding under low memory conditions, we may end up discarding data that has been reported as received via a SACK option. This is permitted by RFC2018. We follow the stricture that SACK blocks must not report data which is no longer held by the receiver: previously-reported blocks are validated against the current receive queue before being included within the current SACK block list. Experiments in a qemu VM using forced packet drops (by setting NETDEV_DISCARD_RATE to 32) show that implementing SACK improves throughput by around 400%. Experiments with a USB2 NIC (an SMSC7500) show that implementing SACK improves throughput by around 700%, increasing the download rate from 35Mbps up to 250Mbps (which is approximately the usable bandwidth limit for USB2). Signed-off-by: Michael Brown <mcb30@ipxe.org>	2015-03-11 23:14:39 +00:00
Michael Brown	2f020a8df3	[legal] Relicense files under GPL2_OR_LATER_OR_UBDL These files cannot be automatically relicensed by util/relicense.pl since they either contain unusual but trivial contributions (such as the addition of __nonnull function attributes), or contain lines dating back to the initial git revision (and so require manual knowledge of the code's origin). Signed-off-by: Michael Brown <mcb30@ipxe.org>	2015-03-02 16:35:29 +00:00
Michael Brown	d28bb51f44	[tcp] Defer sending ACKs until all received packets have been processed When running inside a virtual machine (or when using the UNDI driver), transmitting packets can be expensive. When we receive several packets in one poll (e.g. because a slow BIOS timer interrupt routine has caused us to fall behind in processing), we can safely send just a single ACK to cover all of the received packets. This reduces the time spent transmitting and allows us to clear the backlog much faster. Various RFCs (starting with RFC1122) state that there should be an ACK for at least every second segment. We choose not to enforce this rule. Under normal operation each poll should find at most one received packet, and we will then not delay any ACKs. We delay (i.e. omit) ACKs only when under sufficiently heavy load that we are finding multiple packets per poll; under these conditions it is important to clear the backlog quickly since any delay may lead to dropped packets. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2014-05-12 17:19:26 +01:00
Michael Brown	767f2acb98	[tcp] Profile transmit and receive datapaths Signed-off-by: Michael Brown <mcb30@ipxe.org>	2014-04-28 12:30:57 +01:00
Michael Brown	859664ea2a	[tcp] Update window even if ACK does not acknowledge new data iPXE currently ignores ACKs which do not acknowledge any new data. (In particular, it does not stop the retransmission timer; this is done to prevent an immediate retransmission if a duplicate ACK is received while the transmit queue is non-empty.) If a peer provides a window size of zero and later sends a duplicate ACK to update the window size, this update will therefore be ignored and iPXE will never be able to transmit data. Fix by updating the window size even for ACKs which do not acknowledge new data. Reported-by: Wissam Shoukair <wissams@mellanox.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2014-03-07 17:30:01 +00:00
Michael Brown	e191298a1d	[tcp] Calculate correct MSS from peer address iPXE currently advertises a fixed MSS of 1460, which is correct only for IPv4 over Ethernet. For IPv6 over Ethernet, the value should be 1440 (allowing for the larger IPv6 header). For non-Ethernet link layers, the value should reflect the MTU of the underlying network device. Use tcpip_mtu() to calculate the transport-layer MTU associated with the peer address, and calculate the MSS to allow for an optionless TCP header as per RFC 6691. As a side benefit, we can now fail a connection immediately with a meaningful error message if we have no route to the destination address. Reported-by: Anton D. Kachalov <mouse@yandex-team.ru> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2014-03-04 13:23:29 +00:00
Michael Brown	9f324cf9a5	[tcp] Add AF_INET6 socket opener Signed-off-by: Michael Brown <mcb30@ipxe.org>	2013-10-21 14:34:02 +01:00
Michael Brown	6bf36f57a0	[tcpip] Pass through network device to transport layer protocols NDP requires knowledge of the network device on which a packet was received. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2013-09-03 02:02:58 +01:00
Michael Brown	252d28f098	[tcpip] Allow binding to unspecified privileged ports (below 1024) Originally-implemented-by: Marin Hannache <git@mareo.fr> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2013-08-06 15:56:54 +01:00
Michael Brown	d4f8e56bb4	[tcp] Fix comment to match code behaviour Reported-by: Thomas Miletich <thomas.miletich@gmail.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2013-07-12 11:15:42 +02:00
Michael Brown	18d0818f94	[tcp] Do not send RST for unrecognised connections On large networks with substantial numbers of monitoring agents, unwanted TCP connection attempts may end up flooding iPXE's ARP cache. Fix by silently dropping packets received for unrecognised TCP connections. This should not cause problems, since many firewalls will also silently drop any such packets. Reported-by: Jarrod Johnson <jarrod.b.johnson@gmail.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2013-07-12 03:20:05 +02:00
Michael Brown	a5d16a91af	[tcp] Truncate TCP window to prevent future packet discards Whenever memory pressure causes a queued packet to be discarded (and so retransmitted), reduce the maximum TCP window to a size that would have prevented the discard. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2012-07-09 10:13:47 +01:00
Michael Brown	024247317d	[arp] Try to avoid discarding ARP cache entries Discarding the active ARP cache entry in the middle of a download will substantially disrupt the TCP stream. Try to minimise any such disruption by treating ARP cache entries as expensive, and discarding them only when nothing else is available to discard. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2012-07-09 10:08:38 +01:00
Michael Brown	55f52bb77a	[tcp] Avoid potential NULL pointer dereference Commit `ea61075` ("[tcp] Add support for TCP window scaling") introduced a potential NULL pointer dereference by referring to the connection's send window scale before checking whether or not the connection is known. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2012-06-30 19:03:07 +01:00
Michael Brown	49ac629821	[tcp] Use a zero window size for RST packets Signed-off-by: Michael Brown <mcb30@ipxe.org>	2012-06-30 19:00:05 +01:00
Michael Brown	ea61075c60	[tcp] Add support for TCP window scaling The maximum unscaled TCP window (64kB) implies a maximum bandwidth of around 300kB/s on a WAN link with an RTT of 200ms. Add support for the TCP window scaling option to remove this upper limit. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2012-06-29 15:05:33 +01:00
Michael Brown	5482b0abb6	[tcp] Mark any unacknowledged transmission as a pending operation Signed-off-by: Michael Brown <mcb30@ipxe.org>	2012-06-09 18:56:07 +01:00
Michael Brown	8a0331c29b	[tcp] Discard all TCP connections on shutdown Allow detection of genuine memory leaks by ensuring that all TCP connections are freed on shutdown. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2012-05-08 12:49:01 +01:00
Michael Brown	52dd4bacad	[tcp] Fix potential NULL pointer dereference Detected using Valgrind. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2012-05-08 12:49:01 +01:00
Michael Brown	469bd11f39	[tcp] Allow sufficient headroom for TCP headers TCP currently neglects to allow sufficient space for its own headers when allocating I/O buffers. This problem is masked by the fact that the maximum link-layer header size (802.11) is substantially larger than the common Ethernet link-layer header. Fix by allowing sufficient space for any TCP headers, as well as the network-layer and link-layer headers. Reported-by: Scott K Logan <logans@cottsay.net> Debugged-by: Scott K Logan <logans@cottsay.net> Tested-by: Scott K Logan <logans@cottsay.net> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2011-09-19 15:52:54 +01:00
Michael Brown	c68bf14559	[tcp] Send xfer_window_changed() when window opens Signed-off-by: Michael Brown <mcb30@ipxe.org>	2011-06-28 14:45:08 +01:00
Michael Brown	c4369eb6c2	[tcp] Update ts_recent whenever window is advanced Commit `3f442d3` ("[tcp] Record ts_recent on first received packet") failed to achieve its stated intention. Fix this (and reduce the code size) by moving the ts_recent update to tcp_rx_seq(). This is the code responsible for advancing the window, called by both tcp_rx_syn() and tcp_rx_data(), and so the window check is now redundant. Reported-by: Frank Weed <zorbustheknight@gmail.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2011-04-03 00:44:22 +01:00
Michael Brown	3f442d3f60	[tcp] Record ts_recent on first received packet Commit `6861304` ("[tcp] Handle out-of-order received packets") introduced a regression in which ts_recent would not be updated until the first packet is received in the ESTABLISHED state, i.e. the timestamp from the SYN+ACK packet would be ignored. This causes the connection to be dropped by strictly-conforming TCP peers, such as FreeBSD. Fix by delaying the timestamp window check until after processing the received SYN flag. Reported-by: winders@sonnet.com Signed-off-by: Michael Brown <mcb30@ipxe.org>	2011-03-26 15:02:41 +00:00
Michael Brown	d012f87018	[tcp] Use MAX_LL_NET_HEADER_LEN instead of defining our own MAX_HDR_LEN Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-11-19 16:08:05 +00:00
Michael Brown	67dc832d15	[tcp] Set PSH flag only on packets containing data Suggested-by: Yelena Kadach <klenusik@hotmail.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-11-11 01:14:05 +00:00
Michael Brown	ea631f6fb8	[list] Add list_first_entry() There are several points in the iPXE codebase where list_for_each_entry() is (ab)used to extract only the first entry from a list. Add a macro list_first_entry() to make this code easier to read. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-11-08 03:15:28 +00:00
Michael Brown	28934eef81	[retry] Hold reference while timer is running and during expiry callback Guarantee that a retry timer cannot go out of scope while the timer is running, and provide a guarantee to the expiry callback that the timer will remain in scope during the entire callback (similar to the guarantee provided to interface methods). Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-09-03 21:28:43 +01:00
Piotr Jaroszyński	02e6092cd5	[tcp] Fix a 64bit compile time error Signed-off-by: Piotr Jaroszyński <p.jaroszynski@gmail.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-07-22 21:25:40 +01:00
Michael Brown	1d3b6619e5	[tcp] Allow out-of-order receive queue to be discarded Allow packets in the receive queue to be discarded in order to free up memory. This avoids a potential deadlock condition in which the missing packet can never be received because the receive queue is occupying all of the memory available for further RX buffers. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-07-21 12:01:50 +01:00
Michael Brown	68613047f0	[tcp] Handle out-of-order received packets Maintain a queue of received packets, so that lost packets need not result in retransmission of the entire TCP window. Increase the TCP window to 8kB, in order that we can potentially transmit enough duplicate ACKs to trigger Fast Retransmission at the sender. Using a 10MB HTTP download in qemu-kvm with an artificial drop rate of 1 in 64 packets, this reduces the download time from around 26s to around 4s. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-07-21 00:00:38 +01:00
Michael Brown	f033694356	[tcp] Treat ACKs as sent only when successfully transmitted iPXE currently forces sending (i.e. sends a pure ACK even in the absence of fresh data to send) only in response to packets that consume sequence space or that lie outside of the receive window. This ignores the possibility that a previous ACK was not actually sent (due to, for example, the retransmission timer running). This does not cause incorrect behaviour, but does cause unnecessary retransmissions from our peer. For example: 1. Peer sends final data packet (ack 106 seq 521..523) 2. We send FIN (seq 106..107 ack 523) 3. Peer sends FIN (ack 106 seq 523..524) 4. We send nothing since retransmission timer is running for our FIN 5. Peer ACKs our FIN (ack 107 seq 524..524) 6. We send nothing since this packet consumes no sequence space 7. Peer retransmits FIN (ack 107 seq 523..524) 8. We ACK peer's FIN (seq 107..107 ack 524) What should happen at step (6) is that we should ACK the peer's FIN, since we can deduce that we have never sent this ACK. Fix by maintaining an "ACK pending" flag that is set whenever we are made aware that our peer needs an ACK (whether by consuming sequence space or by sending a packet that appears out of order), and is cleared only when the ACK packet has been transmitted. Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-07-15 19:59:57 +01:00
Michael Brown	75505942ac	[tcp] Merge boolean flags into a single "flags" field Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-07-15 19:59:57 +01:00
Michael Brown	c57e26381c	[tcp] Use a dedicated timer for the TIME_WAIT state iPXE currently repurposes the retransmission timer to hold the TCP connection in the TIME_WAIT state (i.e. waiting for up to 2*MSL in case we are required to re-ACK our peer's FIN due to a lost ACK). However, the fact that this timer is running will prevent such an ACK from ever being sent, since the logic in tcp_xmit() assumes that a running timer indicates that we ourselves are waiting for an ACK and so blocks the transmission. (We always wait for an ACK before sending our next packet, to keep our transmit data path as simple as possible.) Fix by using an entirely separate timer for the TIME_WAIT state, so that packets can still be sent. Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-07-15 19:59:34 +01:00
Guo-Fu Tseng	1e7e4c9a61	[tcp] Randomise local TCP port Signed-off-by: Guo-Fu Tseng <cooldavid@cooldavid.org> Modified-by: Michael Brown <mcb30@ipxe.org> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-07-13 17:29:54 +01:00
Michael Brown	73e3672468	[tcp] Fix typos by changing ntohl() to htonl() where appropriate Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-07-13 17:19:37 +01:00
Michael Brown	43450342a9	[tcp] Store local port in host byte order Every other scalar integer value in struct tcp_connection is in host byte order; change the definition of local_port to match. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-07-13 17:15:57 +01:00
Michael Brown	68c2f07f15	[tcp] Fix potential use-after-free when accessing timestamp option Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-07-07 12:57:08 +01:00
Michael Brown	4327d5d39f	[interface] Convert all data-xfer interfaces to generic interfaces Remove data-xfer as an interface type, and replace data-xfer interfaces with generic interfaces supporting the data-xfer methods. Filter interfaces (as used by the TLS layer) are handled using the generic pass-through interface capability. A side-effect of this is that deliver_raw() no longer exists as a data-xfer method. (In practice this doesn't lose any efficiency, since there are no instances within the current codebase where xfer_deliver_raw() is used to pass data to an interface supporting the deliver_raw() method.) Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-06-22 15:50:31 +01:00
Michael Brown	5fa6775b61	[retry] Use start_timer_fixed() instead of direct timeout manipulation Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-06-22 14:32:49 +01:00
Michael Brown	c760ac3022	[retry] Add timer_init() wrapper function Standardise on using timer_init() to initialise an embedded retry timer, to match the coding style used by other embedded objects. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-06-22 14:30:20 +01:00
Michael Brown	4bfd5b52c1	[refcnt] Add ref_init() wrapper function Standardise on using ref_init() to initialise an embedded reference count, to match the coding style used by other embedded objects. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-06-22 14:26:40 +01:00
Michael Brown	9ff8229693	[tcp] Update received sequence number before delivering received data iPXE currently updates the TCP sequence number after delivering the data to the application via xfer_deliver_iob(). If the application responds to the received data by transmitting more data, this would result in a stale ACK number appearing in the transmitted packet, which potentially causes retransmissions and also gives the undesirable appearance of violating causality (by sending a response to a message that we claim not to have yet received). Reported-by: Guo-Fu Tseng <cooldavid@cooldavid.org> Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-05-22 00:45:49 +01:00
Michael Brown	8406115834	[build] Rename gPXE to iPXE Access to the gpxe.org and etherboot.org domains and associated resources has been revoked by the registrant of the domain. Work around this problem by renaming project from gPXE to iPXE, and updating URLs to match. Also update README, LOG and COPYRIGHTS to remove obsolete information. Signed-off-by: Michael Brown <mcb30@ipxe.org>	2010-04-19 23:43:39 +01:00
Michael Brown	5552a1b202	[tcp] Avoid printf format warnings on some compilers In several places, we currently use size_t to represent a difference between TCP sequence numbers. This can cause compiler warnings relating to printf format specifiers, since the result of (uint32_t+size_t) may be an unsigned long on some compilers. Fix by using uint32_t for all variables that represent a difference between TCP sequence numbers. Tested-by: Joshua Oreman <oremanj@xenon.get-linux.org>	2009-08-02 22:44:57 +01:00
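
Two of the figures quoted in the commit messages above can be checked with simple arithmetic: the MSS values of 1460 and 1440 from e191298a1d, and the "around 300kB/s" limit for an unscaled window from ea61075c60. The following is a minimal sketch in Python, not iPXE code. The function and constant names are hypothetical, and it assumes a 1500-byte Ethernet MTU, option-free IPv4 and IPv6 headers of 20 and 40 bytes, an option-free 20-byte TCP header, and at most one full window in flight per round trip.

```python
# Illustrative arithmetic only; none of these names come from the iPXE source.

ETHERNET_MTU = 1500    # bytes; assumed link MTU for standard Ethernet
IPV4_HEADER_LEN = 20   # bytes; IPv4 header without options
IPV6_HEADER_LEN = 40   # bytes; fixed IPv6 header
TCP_HEADER_LEN = 20    # bytes; TCP header without options


def mss_for(link_mtu: int, ip_header_len: int) -> int:
    """Return the MSS that leaves room for the IP header and an optionless TCP header."""
    return link_mtu - ip_header_len - TCP_HEADER_LEN


def max_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Return the throughput ceiling in bytes/s when one window is sent per round trip."""
    return window_bytes / rtt_seconds


if __name__ == "__main__":
    print("IPv4 MSS:", mss_for(ETHERNET_MTU, IPV4_HEADER_LEN))
    print("IPv6 MSS:", mss_for(ETHERNET_MTU, IPV6_HEADER_LEN))
    # Largest unscaled window is 65535 bytes (16-bit window field); RTT of 200ms.
    print("Unscaled ceiling (bytes/s):", round(max_throughput(65535, 0.2)))
```

With these assumptions the unscaled ceiling comes out a little above 300kB/s, consistent with the commit message. Window scaling raises that ceiling by letting the advertised window exceed 16 bits.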

132 Commits