802.11 multicast hashing is the same as standard Ethernet hashing, so
just expose and use eth_mc_hash().
Signed-off-by: Joshua Oreman <oremanj@rwcr.net>
The recent change to process_add() to detect duplicate process
additions relies on the fact that all processes will be initialized
using process_init_stopped() before being passed to that function.
The autoassociation process was not initialized in this fashion, so
process_add() erroneously detected it as a duplicate.
Fix by using process_init_stopped() to initialize the autoassociation
process instead of setting the step member directly.
Signed-off-by: Michael Brown <mcb30@etherboot.org>
For IPoIB, the chaddr field is too small (16 bytes) to contain the
20-byte IPoIB link-layer address. RFC4390 mandates that we should
pass an empty chaddr field and rely on the DHCP client identifier
instead. This has many problems, not least of which is that a client
identifier containing an IPoIB link-layer address is not very useful
from the point of view of creating DHCP reservations, since the QPN
component is assigned at runtime and may vary between boots.
Leave the DHCP client identifier as-is, to avoid breaking existing
setups as far as possible, but expose the real hardware address (the
port GUID) via the DHCP chaddr field, using the broadcast flag to
instruct the DHCP server not to use this chaddr value as a link-layer
address.
This makes it possible (at least with ISC dhcpd) to create DHCP
reservations using host declarations such as:
host duckling {
fixed-address 10.252.252.99;
hardware unknown-32 00:02:c9:02:00:25:a1:b5;
}
IPoIB has a 20-byte link-layer address, of which only eight bytes
represent anything relating to a "hardware address".
The PXE and EFI SNP APIs expect the permanent address to be the same
size as the link-layer address, so fill in the "permanent address"
field with the initial link layer address (as generated by
register_netdev() based upon the real hardware address).
The hardware address is an intrinsic property of the hardware, while
the link-layer address can be changed at runtime. This separation is
exposed via APIs such as PXE and EFI, but is currently elided by gPXE.
Expose the hardware and link-layer addresses as separate properties
within a net device. Drivers should now fill in hw_addr, which will
be used to initialise ll_addr at the time of calling
register_netdev().
There is diagnostic value in being able to disambiguate between the
various reasons why an IB CM has rejected a connection attempt. In
particular, reason 8 "invalid service ID" can be used to identify an
incorrect SRP service_id root-path component, and reason 28 "consumer
reject" corresponds to a genuine SRP login rejection IU, which can be
passed up to the SRP layer.
For rejection reasons other than "consumer reject", we should not pass
through the private data, since it is most likely generated by the CM
without any protocol-specific knowledge.
Generate errors within individual MAD transaction consumers such as
ib_pathrec.c and ib_mcast.c, rather than within ib_mi.c. This allows
for more meaningful error messages to eventually be displayed to the
user.
SRP is the SCSI RDMA Protocol. It allows for a method of SAN booting
whereby the target is responsible for reading and writing data using
Remote DMA directly to the initiator's memory. The software initiator
merely sends and receives SCSI commands; it never has to touch the
actual data.
The minimal-surprise behaviour, when no explicit SRP initiator device
is specified, will probably be to use the most recently opened
Infiniband device. This matches our behaviour with using the most
recently opened net device for PXE, iSCSI, AoE, NBI, etc.
SRP over Infiniband uses a protocol whereby data is sent via a
combination of the CM private data fields and the RC queue pair
itself. This seems sufficiently generic that it's worth having
available as a separate protocol.
We will terminate our transaction as soon as we receive the first CM
REP, since that provides all the state that we need. However, the
peer may resend the REP if it didn't see our RTU, and if we don't
respond with another RTU we risk being disconnected. (This protocol
appears not to handle retries gracefully.)
Fix by adding a management agent that will listen for these duplicate
REPs and send back an RTU.
When a probe found no results, the list head of beacons would not be
freed, leaking 16 bytes of memory per probe.
Signed-off-by: Michael Brown <mcb30@etherboot.org>
Some cards (such as ath5k) always need to tune to a particular channel
when they are reset; the reset may happen upon open(), which is before
the channels array would be set up (in prepare_probe()). Avoid tuning
the card to an inconsistent state by copying the hardware
supported-channels array to the 802.11 device's allowable-channels
array even before channels are "properly" set up.
Signed-off-by: Michael Brown <mcb30@etherboot.org>
The prior net80211 model of physical-layer behavior for drivers was
overly simplistic and limited the drivers that could be written. To
be more flexible, split the driver-provided list of supported rates by
band, and add a means for specifying a list of supported channels.
Allow drivers to specify a hardware channel value that will be tied to
uses of the channel.
Expose net80211_duration() to drivers, and make the rate it uses in
its computations configurable, so that it can be used in calculating
durations that must be set in hardware for ACK and CTS packets. Add
net80211_cts_duration() for the common case of calculating the
duration for a CTS packet.
Signed-off-by: Michael Brown <mcb30@etherboot.org>
A management interface is the component through which both local and
remote management agents are accessed.
This new implementation of a management interface allows for the user
to react to timed-out transactions, and also allows for cancellation
of in-progress transactions.
The IBA specification refers to management "interfaces" and "agents".
The interface is the component that connects to the queue pair and
sends and receives MADs; the agent is the component that constructs
the reply to the MAD.
Rename the IB_{QPN,QKEY,QPT} constants as a first step towards making
this separation in gPXE.
In several places, we currently use size_t to represent a difference
between TCP sequence numbers. This can cause compiler warnings
relating to printf format specifiers, since the result of
(uint32_t+size_t) may be an unsigned long on some compilers.
Fix by using uint32_t for all variables that represent a difference
between TCP sequence numbers.
Tested-by: Joshua Oreman <oremanj@xenon.get-linux.org>
This is required for all modern 802.11 devices, and allows drivers
to be written for them with minimally more effort than is required
for a wired NIC.
Signed-off-by: Michael Brown <mcb30@etherboot.org>
Modified-by: Michael Brown <mcb30@etherboot.org>
Queue pairs are now assumed to be created in the INIT state, with a
call to ib_modify_qp() required to bring the queue pair to the RTS
state.
ib_modify_qp() no longer takes a modification list; callers should
modify the relevant queue pair parameters (e.g. qkey) directly and
then call ib_modify_qp() to synchronise the changes to the hardware.
The packet sequence number is now a property of the queue pair, rather
than of the device.
Each queue pair may have an associated address vector. For RC queue
pairs, this is the address vector that will be programmed in to the
hardware as the remote address. For UD queue pairs, it will be used
as the default address vector if none is supplied to ib_post_send().
Now that MAD handlers no longer return a status code, we can allow
them to return a pointer to a MAD structure if and only if they want
to send a response. This provides a more natural and flexible
approach than using a "response method" field within the handler's
descriptor.
MAD handlers have to set the status fields within the MAD itself
anyway, in order to provide a meaningful response MAD; the additional
gPXE return status code is just noise.
Note that we probably don't need to ever explicitly set the status to
IB_MGMT_STATUS_OK, since it should already have this value from the
request. (By not explicitly setting the status in this way, we can
safely have ib_sma_set_xxx() call ib_sma_get_xxx() in order to
generate the GetResponse MAD without worrying that ib_sma_get_xxx()
will clear any error status set by ib_sma_set_xxx().)
Most IB hardware seems not to allow allocation of the genuine QPNs 0
and 1, so allow for the externally-visible QPN (as constructed and
parsed by ib_packet, where used) to differ from the real
hardware-allocated QPN.
The queue key is stored as a property of the queue pair, and so can
optionally be added by the Infiniband core at the time of calling
ib_post_send(), rather than always having to be specified by the
caller.
This allows IPoIB to avoid explicitly keeping track of the data queue
key.
Generalise the subnet management agent into a general management agent
capable of sending and responding to MADs, including support for
retransmissions as necessary.
Currently, all Infiniband users must create a process for polling
their completion queues (or rely on a regular hook such as
netdev_poll() in ipoib.c).
Move instead to a model whereby the Infiniband core maintains a single
process calling ib_poll_eq(), and polling the event queue triggers
polls of the applicable completion queues. (At present, the
Infiniband core simply polls all of the device's completion queues.)
Polling a completion queue will now implicitly refill all attached
receive work queues; this is analogous to the way that netdev_poll()
implicitly refills the RX ring.
Infiniband users no longer need to create a process just to poll their
completion queues and refill their receive rings.
IPoIB and the SMA have separate constants for the packet size to be
used to I/O buffer allocations. Merge these into the single
IB_MAX_PAYLOAD_SIZE constant.
(Various other points in the Infiniband stack have hard-coded
assumptions of a 2048-byte payload; we don't currently support
variable MTUs.)
IPoIB has a link-layer broadcast address that varies according to the
partition key. We currently go through several contortions to pretend
that the link-layer address is a fixed constant; by making the
broadcast address a property of the network device rather than the
link-layer protocol it will be possible to simplify IPoIB's broadcast
handling.
Move the icky call to step() from aoe.c to ata.c; this takes it at
least one step further away from where it really doesn't belong.
Unfortunately, AoE has the ugly aoe_discover() mechanism which means
that we still have a step() loop in aoe.c for now; this needs to be
replaced at some future point.
Expand the NETDEV_LINK_UP bit into a link_rc status code field,
allowing specific reasons for link failure to be reported via
"ifstat".
Originally-authored-by: Joshua Oreman <oremanj@rwcr.net>
Commit 558c1a4 ("[tcp] Improve robustness in the presence of duplicated
received packets") introduced a regression in that an old duplicate
ACK received while in the ESTABLISHED state would pass through normal
ACK processing, including updating tcp->snd_seq.
Fix by ensuring that ACK processing ignores all duplicate ACKs.
All TCP errors or unusual events should now generate a debugging
message at DBGLVL_LOG, with enough information (SEQ and ACK numbers)
to be able to identify the corresponding packet (or missing packet) in
a network trace from the remote end.
This makes it possible to leave TCP debugging enabled in order to see
interesting TCP events, without flooding the console with at least one
message per packet.
In order to construct outgoing link-layer frames or parse incoming
ones properly, some protocols (such as 802.11) need more state than is
available in the existing variables passed to the link-layer protocol
handlers. To remedy this, add struct net_device *netdev as the first
argument to each of these functions, so that more information can be
fetched from the link layer-private part of the network device.
Updated all three call sites (netdevice.c, efi_snp.c, pxe_undi.c) and
both implementations (ethernet.c, ipoib.c) of ll_protocol to use the
new argument.
Signed-off-by: Michael Brown <mcb30@etherboot.org>
gPXE responds to duplicated ACKs with an immediate retransmission,
which can lead to a sorceror's apprentice syndrome. It also responds
to out-of-range (or old duplicate) ACKs with a RST, which can cause
valid connections to be dropped.
Fix the sorceror's apprentice syndrome by leaving the retransmission
timer running (and so inhibiting the immediate retransmission) when we
receive a potential duplicate ACK. This seems to match the behaviour
of Linux observed via wireshark traces.
Fix the RST issue by sending RST only on out-of-range ACKs that occur
before the connection is fully established, as per RFC 793.
These problems were exposed during development of the 802.11 wireless
link layer; the 802.11 protocol has a failure mode that can easily
cause duplicated packets. The fixes were tested in a controlled way
by faking large numbers of duplicated packets in the rtl8139 driver.
Originally-fixed-by: Joshua Oreman <oremanj@rwcr.net>
If the ProxyDHCPOFFER already includes PXE options (i.e. option 60 is
set to "PXEClient" and option 43 is present) then assume that the
ProxyDHCPREQUEST can be sent to port 67, rather than port 4011. This
is a reasonable assumption, since in that case the ProxyDHCP server
has already demonstrated by responding to the DHCPDISCOVER that it is
listening on port 67. (If the ProxyDHCP server were not listening on
port 67, then the standard DHCP server would have been configured to
respond with option 60 set to "PXEClient" but no option 43 present.)
The PXE specification is ambiguous on this point; the specified
behaviour covers only the cases in which option 43 is *not* present in
the ProxyDHCPOFFER. In these cases, we will continue to send the
ProxyDHCPREQUEST to port 4011.
This change is required in order to allow us to interoperate with
dnsmasq, which listens only on port 67. (dnsmasq relies on
unspecified behaviour of the Intel PXE stack, which it seems will
retain the ProxyDHCPOFFER as an options source and never issue a
ProxyDHCPREQUEST, thereby enabling dnsmasq to omit listening on port
4011.)
IBM Tivoli PXE Server 5.1.0.3 is reported to send trailing garbage
bytes at the end of the OACK packet, which causes gPXE to reject the
packet and abort the TFTP transfer.
Work around the problem by processing as much as possible of the OACK,
and treating name/value parsing errors as non-fatal.
Reported-by: Shao Miller <Shao.Miller@yrdsb.edu.on.ca>
We currently send all boot server discovery requests to port 4011.
Section 2.2.1 of the PXE spec states that boot server discovery
packets should be "sent broadcast (port 67), multicast (port 4011), or
unicast (port 4011)". Adjust our behaviour so that any boot server
discovery packets that are sent to the broadcast address are directed
to port 67 rather than port 4011.
This is required for operation with dnsmasq as a PXE server, since
dnsmasq listens only on port 67, and relies upon this (specified)
behaviour.
This change may break some setups using the (itself very broken) Linux
PXE server from kano.org.uk. This server will, in its default
configuration, listen only on port 4011. It never constructs a boot
server list (PXE_BOOT_SERVERS, option 43.8), and uses the wrong
definitions for the discovery control bits (PXE_DISCOVERY_CONTROL,
option 43.6). The upshot is that it will always instruct the client
to perform multicast and broadcast discovery only. In setups lacking
a valid multicast route on the server side, this used to work because
gPXE would eventually give up on the (non-responsive) multicast
address and send a broadcast request to port 4011, which the Linux PXE
server would respond to. Now that gPXE correctly sends this broadcast
request to port 67 instead, it is never seen by the Linux PXE server,
and the boot fails. The fix is to either (a) set up a multicast route
correctly on the server side before starting the PXE server, or (b)
edit /etc/pxe.conf to contain the server's unicast address in the
"multicast_address" field (a hack that happens to work).
Suggested-by: Simon Kelley <simon@thekelleys.org.uk>
This prevents gPXE from wasting time attempting to contact a ProxyDHCP
server on port 4011 if the DHCP response already contains the relevant
PXE options. This behaviour is hinted at (though not explicitly
specified) in the PXE spec, and seems to match what the Intel client
does.
Suggested-by: Simon Kelley <simon@thekelleys.org.uk>
Eliminate the potential for mismatches between table names and the
table entry data type by incorporating the data type into the
definition of the table, rather than specifying it explicitly in each
table accessor method.
Intel's C compiler (icc) chokes on the zero-length arrays that we
currently use as part of the mechanism for accessing linker table
entries. Abstract away the zero-length arrays, to make a port to icc
easier.
Introduce macros such as for_each_table_entry() to simplify the common
case of iterating over all entries in a linker table.
Represent table names as #defined string constants rather than
unquoted literals; this avoids visual confusion between table names
and C variable or type names, and also allows us to force a
compilation error in the event of incorrect table names.
Some firewall devices seem to regard SYN,PSH as an invalid flag
combination and reject the packet. Fix by setting PSH only if SYN is
not set.
Reported-by: DSE Incorporated <dseinc@gmail.com>
Avoid passing credentials in the iBFT that were available but not
required for login. This works around a problem in the Microsoft
iSCSI initiator, which will refuse to initiate sessions if the CHAP
password is fewer than 12 characters, even if the target ends up not
asking for CHAP authentication.
It is a programming error, not a runtime error, if we attempt to use
block ciphers with an incorrect blocksize, so use an assert() rather
than an error status return.
The various types of cryptographic algorithm are fundamentally
different, and it was probably a mistake to try to handle them via a
single common type.
pubkey_algorithm is a placeholder type for now.
The documentation in xfer.h and xfer.c does not say that the metadata
parameter is optional in calls such as xfer_deliver_iob_meta() and the
deliver_iob() method. However, some code in net/ is prepared to
accept a NULL pointer, and xfer_deliver_as_iob() passes a NULL pointer
directly to the deliver_iob() method.
Fix this mess of conflicting assumptions by making everything assume
that the metadata parameter is mandatory, and fixing
xfer_deliver_as_iob() to pass in a dummy metadata structure (as is
already done in xfer_deliver_iob()).
Various combinations of options 43.6, 43.7 and 43.8 dictate which
servers we send Boot Server Discovery requests to, and which servers
we should accept responses from. Obey these options, and remove the
explicit specification of a single Boot Server from start_pxebs() and
dependent functions.
There are many functions that take ownership of the I/O buffer they
are passed as a parameter. The caller should not retain a pointer to
the I/O buffer. Use iob_disown() to automatically nullify the
caller's pointer, e.g.:
xfer_deliver_iob ( xfer, iob_disown ( iobuf ) );
This will ensure that iobuf is set to NULL for any code after the call
to xfer_deliver_iob().
iob_disown() is currently used only in places where it simplifies the
code, by avoiding an extra line explicitly setting the I/O buffer
pointer to NULL. It should ideally be used with each call to any
function that takes ownership of an I/O buffer. (The SSA
optimisations will ensure that use of iob_disown() gets optimised away
in cases where the caller makes no further use of the I/O buffer
pointer anyway.)
If gcc ever introduces an __attribute__((free)), indicating that use
of a function argument after a function call should generate a
warning, then we should use this to identify all applicable function
call sites, and add iob_disown() as necessary.
A TFTP DATA packet with a block number of zero (representing a
negative offset within the file) could potentially cause problems.
Fixed by explicitly rejecting such packets.
Identified by Stefan Hajnoczi <stefanha@gmail.com>.
The DHCP client code now implements only the mechanism of the DHCP and
PXE Boot Server protocols. Boot Server Discovery can be initiated
manually using the "pxebs" command. The menuing code is separated out
into a user-level function on a par with boot_root_path(), and is
entered in preference to a normal filename boot if the DHCP vendor
class is "PXEClient" and the PXE boot menu option exists.
Try to qualify relative names in the DNS resolver using the DHCP Domain
Name. For example:
DHCP Domain Name: etherboot.org
(Relative) Name: www
yields:
www.etherboot.org
Only names with no dots ('.') will be modified. A name with one or more
dots is unchanged.
pxe_tftp.c assumes that the first seek on its data-transfer interface
represents the block size. Apart from being an ugly hack, this will
also screw up file size calculation for files smaller than one block.
The proper solution would be to extend the data-transfer interface to
support the reporting of stat()-like data. This is not going to
happen until the cost of adding interface methods is reduced (a fix I
have planned since June 2008).
In the meantime, abuse the xfer_window() method to return the block
size, since it is not being used for anything else and is vaguely
justifiable.
Astonishingly, having returned the incorrect TFTP blocksize via
PXENV_TFTP_OPEN for almost a year seems not to have affected any of
the test cases run during that time; this bug was found only when
someone tried running the heavily-patched version of pxegrub found in
OpenSolaris.
PXE dictates a mechanism for boot menuing, involving prompting the
user with a variable message, waiting for a predefined keypress,
displaying a boot menu, and waiting for a selection.
This breaks the currently desirable abstraction that DHCP is a process
that can happen in the background without any user interaction.
Remove the lazy assumption that ProxyDHCP == "DHCP with option 60 set
to PXEClient", and explicitly separate the notion of ProxyDHCP from
the notion of packets containing PXE options.
It is possible to configure a DHCP server to hand out PXE options
without a ProxyDHCP server present. This requires setting option 60
to "PXEClient", which will cause gPXE to attempt ProxyDHCP.
We assume in several places that dhcp->proxydhcpack is set to the
DHCPACK packet containing option 60 set to "PXEClient". When we
transition into ProxyDHCPREQUEST, set dhcp->proxydhcpack=dhcp->dhcpack
so that this assumption holds true.
We ought to rename several references to "proxydhcp" to something more
accurate, such as "pxedhcp". Treating a single DHCP response as
potentially both DHCPOFFER and ProxyDHCPOFFER does make the code
smaller, but the variable names get confusing.
Pick out the first boot menu item from the boot menu (option 43.9) and
pass it to the boot server as the boot menu item (option 43.71).
Also improve DHCP debug messages to include more details of the
packets being transmitted.
Apparently this can cause a major speedup on some iSCSI targets, which
will otherwise wait for a timer to expire before responding. It
doesn't seem to hurt other simple TCP test cases (e.g. HTTP
downloads).
Problem and solution identified by Shiva Shankar <802.11e@gmail.com>
Some PXE configurations require us to perform a third DHCP transaction
(in addition to the real DHCP transaction and the ProxyDHCP
transaction) in order to retrieve information from a "Boot Server".
This is an experimental implementation, since the actual behaviour is
not well specified in the PXE spec.