MC/S vs MPIO
MC/S (Multiple Connections per Session) is a feature of iSCSI protocol, which allows to combine several connections inside a single session for performance and failover purposes. Let's consider what practical value this feature has comparing with OS level multipath (MPIO) and try to answer why none of Open Source OS'es neither still support it, despite of many years since iSCSI protocol started being actively used, nor going to implement it in the future.
MC/S is done on the iSCSI level, while MPIO is done on the higher level. Hence, all MPIO infrastructure is shared among all SCSI transports, including Fibre Channel, SAS, etc.
MC/S was designed at time, when most OS'es didn't have standard OS level multipath. Instead, each vendor had its own implementation, which created huge interoperability problems. So, one of the goals of MC/S was to address this issue and standardize the multipath area in a single standard. But nowadays almost all OS'es has OS level multipath implemented using standard SCSI facilities, hence this purpose of MC/S isn't valid anymore.
Usually it is claimed, than MC/S has the following 2 advantages over MPIO:
- Faster failover recovery.
- Better performance.
Let's look how realistic those claims are.
Failover recovery time
Let's consider a single target exporting a single device over 2 links.
For MC/S failover recovery is quite simple: all outstanding SCSI commands reassigned to another connection. No other actions are necessary, because session (i.e. I_T Nexus) remains the same. Consequently, all reservations and other SCSI states as well as other initiators connected to the device remain unaffected.
For MPIO failover recovery is much more complicated. This is because it involves transfer of all outstanding commands and SCSI states from one I_T Nexus to another. The first thing, which initiator will do for that is to abort all outstanding commands on the faulted I_T Nexus. There are 2 approaches for that: CLEAR TASK SET and LUN RESET task management functions.
CLEAR TASK SET function aborts all commands on the device. Unfortunately, it has limitations: it isn't always supported by device and having single task set shared over initiators isn't always appropriate for application.
LUN RESET function resets the device.
Both CLEAR TASK SET and LUN RESET functions can somehow harm other initiators, because all commands from all initiators, not only from one doing the failover recovery, will be aborted. Additionally, LUN RESET resets all SCSI settings for all connected initiators to the initial state and, if device had reservation from any initiator, it will be cleared.
But the harm is minimal:
- With TAS bit set on Control Mode page, all the aborted commands will be returned to all affected initiators with TASK ABORTED status, so they can simply immediately retry them. For CLEAR TASK SET if TAS isn't set all affected initiators will be notified by Unit Attention COMMANDS CLEARED BY ANOTHER INITIATOR, so they also can immediately retry all outstanding commands.
- In case of the device reset the affected initiators will be notified via the corresponding Unit Attention about reset of all SCSI settings to the initial state. Then the initiators can do necessary recovery actions. Usually no recovery actions are needed, except for the reservation holder, whose reservation was cleared. For it recovery might be not trivial. But Persistent Reservations solve this issue, because they are not cleared by the device reset.
Thus, with Persistent Reservations or using CLEAR TASK SET function additional failover recovery time, which MPIO has comparing to MC/S, is time to wait for reset or commands abort finished and time to retry all the aborted commands. On a properly configured system it should be less than few seconds, which is well acceptable on practice. If Linux storage stack improved to allow to abort all submitted to it commands (currently only wait for their completion is possible), then time to abort all the commands can be decreased to a fraction of second.
At first, neither MC/S, nor MPIO can improve performance if there is only one SCSI command sent to target at time. For instance, in case of tape backup and restore. Both MC/S and MPIO work on the commands level, so can't split data transfers for a single command over several links. Only bonding (also known as NIC teaming or Link Aggregation) can improve performance in this case, because it works on the link level.
MC/S over several links preserves commands execution order, i.e. with it commands executed in the same order as they were submitted. MPIO can't preserve this order, because it can't see, which command on which link was submitted earlier. Delays in links processing can change commands order in the place where target receives them.
Since initiators usually send commands in the optimal for performance order, reordering can somehow hurt performance. But this can happen only with naive target implementation, which can't recover the optimal commands execution order. Currently Linux is not naive and quite good on this area. See, for instance, section "SEQUENTIAL ACCESS OVER MPIO" in those measurements. Don't look at the absolute numbers, look at %% of performance improvement using the second link. The result equivalent to 200 MB/s over 2 1Gbps links, which is close to possible maximum.
If free commands reorder is forbidden for a device, either by use of ORDERED tag, or if the Queue Algorithm Modifier in the Control Mode Page is set to 0, then MPIO will have to maintain commands order by sending commands over only a single link. But on practice this case is really rare and 99.(9)% of OS'es and applications allow free commands reorder and it is enabled by default.
From other side, strictly preserving commands order as MC/S does has a downside as well. It can lead to so called "commands ordering bottleneck", when newer commands have to wait before one or more older commands get executed, although it would be better for performance to reorder them. As result, MPIO sometimes has better performance, than MC/S, especially in setups, where maximum IOPS number is important. See, for instance, here.
When MC/S is better than MPIO
For sake of completeness, we should mention that there are marginal cases, where MPIO can't be used or will not provide any benefit, but MC/S can be successful:
- When strict commands order is required.
- When aborted commands can't be retried.
For disks both of them are always false. However for some tape drives and backup applications one or both can be true. But on practice:
- There are neither known tape drives, nor backup applications, which can use multiple outstanding commands at time. All them support and use only one single outstanding command at time. MC/S can't increase performance for them, only bonding can. So, in this case there no difference between MC/S and MPIO.
- The lack of ability to retry commands is rather a limitation of legacy tape drives, which support only implicit address commands, not of MPIO. Modern tape drives and backup applications can use explicit address commands, which you can abort and then retry, hence they are compatible with MPIO.
- Cost to develop MC/S is high, but benefits of it are marginal and with future MPIO improvements can be fully eliminated.
- MPIO allows to utilize existing infrastructure for all transports, not only iSCSI.
- All transports can benefit from improvements in MPIO.
- With MPIO there is no need to create multiple layers doing very similar functionality.
- MPIO doesn't have commands ordering bottleneck, which MC/S has.
Simply, MC/S is rather a workaround done on the wrong level for some deficiencies of existing SCSI standards used for MPIO, namely the lack of possibility to group several I_T Nexuses with ability to reassign commands between them and preserve commands order among them. If in future those features added in the SCSI standards, MC/S will not be needed at all, hence, all investments in it will be voided. No surprise then that no Open Source OS'es neither support, nor going to implement it. Moreover, when back to 2005 there was an attempt to add MC/S capable iSCSI initiator in Linux, it was rejected. See for more details here and here.