Commit 4f6aaade authored by Tony Battersby's avatar Tony Battersby Committed by Martin K. Petersen
Browse files

scsi: qla2xxx: Fix lost interrupts with qlini_mode=disabled



When qla2xxx is loaded with qlini_mode=disabled,
ha->flags.disable_msix_handshake is used before it is set, resulting in
the wrong interrupt handler being used on certain HBAs
(qla2xxx_msix_rsp_q_hs() is used when qla2xxx_msix_rsp_q() should be
used).  The only difference between these two interrupt handlers is that
the _hs() version writes to a register to clear the "RISC" interrupt,
whereas the other version does not.  So this bug results in the RISC
interrupt being cleared when it should not be.  This occasionally causes
a different interrupt handler qla24xx_msix_default() for a different
vector to see ((stat & HSRX_RISC_INT) == 0) and ignore its interrupt,
which then causes problems like:

qla2xxx [0000:02:00.0]-d04c:6: MBX Command timeout for cmd 20,
  iocontrol=8 jiffies=1090c0300 mb[0-3]=[0x4000 0x0 0x40 0xda] mb7 0x500
  host_status 0x40000010 hccr 0x3f00
qla2xxx [0000:02:00.0]-101e:6: Mailbox cmd timeout occurred, cmd=0x20,
  mb[0]=0x20. Scheduling ISP abort
(the cmd varies; sometimes it is 0x20, 0x22, 0x54, 0x5a, 0x5d, or 0x6a)

This problem can be reproduced with a 16 or 32 Gbps HBA by loading
qla2xxx with qlini_mode=disabled and running a high IOPS test while
triggering frequent RSCN database change events.

While analyzing the problem I discovered that even with
disable_msix_handshake forced to 0, it is not necessary to clear the
RISC interrupt from qla2xxx_msix_rsp_q_hs() (more below).  So just
completely remove qla2xxx_msix_rsp_q_hs() and the logic for selecting
it, which also fixes the bug with qlini_mode=disabled.

The test below describes the justification for not needing
qla2xxx_msix_rsp_q_hs():

Force disable_msix_handshake to 0:
qla24xx_config_rings():
if (0 && (ha->fw_attributes & BIT_6) && (IS_MSIX_NACK_CAPABLE(ha)) &&
    (ha->flags.msix_enabled)) {

In qla24xx_msix_rsp_q() and qla2xxx_msix_rsp_q_hs(), check:
  (rd_reg_dword(&reg->host_status) & HSRX_RISC_INT)

Count the number of calls to each function with HSRX_RISC_INT set and
the number with HSRX_RISC_INT not set while performing some I/O.

If qla2xxx_msix_rsp_q_hs() clears the RISC interrupt (original code):
qla24xx_msix_rsp_q:    50% of calls have HSRX_RISC_INT set
qla2xxx_msix_rsp_q_hs:  5% of calls have HSRX_RISC_INT set
(# of qla2xxx_msix_rsp_q_hs interrupts) =
    (# of qla24xx_msix_rsp_q interrupts) * 3

If qla2xxx_msix_rsp_q_hs() does not clear the RISC interrupt (patched
code):
qla24xx_msix_rsp_q:    100% of calls have HSRX_RISC_INT set
qla2xxx_msix_rsp_q_hs:   9% of calls have HSRX_RISC_INT set
(# of qla2xxx_msix_rsp_q_hs interrupts) =
    (# of qla24xx_msix_rsp_q interrupts) * 3

In the case of the original code, qla24xx_msix_rsp_q() was seeing
HSRX_RISC_INT set only 50% of the time because qla2xxx_msix_rsp_q_hs()
was clearing it when it shouldn't have been.  In the patched code,
qla24xx_msix_rsp_q() sees HSRX_RISC_INT set 100% of the time, which
makes sense if that interrupt handler needs to clear the RISC interrupt
(which it does).  qla2xxx_msix_rsp_q_hs() sees HSRX_RISC_INT only 9% of
the time, which is just overlap from the other interrupt during the
high IOPS test.

Tested with SCST on:
QLE2742  FW:v9.08.02 (32 Gbps 2-port)
QLE2694L FW:v9.10.11 (16 Gbps 4-port)
QLE2694L FW:v9.08.02 (16 Gbps 4-port)
QLE2672  FW:v8.07.12 (16 Gbps 2-port)
both initiator and target mode

Signed-off-by: default avatarTony Battersby <tonyb@cybernetics.com>
Link: https://patch.msgid.link/56d378eb-14ad-49c7-bae9-c649b6c7691e@cybernetics.com


Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
parent 8f58fc64
Loading
Loading
Loading
Loading
+0 −1
Original line number Diff line number Diff line
@@ -3503,7 +3503,6 @@ struct isp_operations {
#define QLA_MSIX_RSP_Q			0x01
#define QLA_ATIO_VECTOR		0x02
#define QLA_MSIX_QPAIR_MULTIQ_RSP_Q	0x03
#define QLA_MSIX_QPAIR_MULTIQ_RSP_Q_HS	0x04

#define QLA_MIDX_DEFAULT	0
#define QLA_MIDX_RSP_Q		1
+1 −1
Original line number Diff line number Diff line
@@ -766,7 +766,7 @@ extern int qla2x00_dfs_remove(scsi_qla_host_t *);

/* Globa function prototypes for multi-q */
extern int qla25xx_request_irq(struct qla_hw_data *, struct qla_qpair *,
	struct qla_msix_entry *, int);
	struct qla_msix_entry *);
extern int qla25xx_init_req_que(struct scsi_qla_host *, struct req_que *);
extern int qla25xx_init_rsp_que(struct scsi_qla_host *, struct rsp_que *);
extern int qla25xx_create_req_que(struct qla_hw_data *, uint16_t, uint8_t,
+3 −29
Original line number Diff line number Diff line
@@ -4467,32 +4467,6 @@ qla2xxx_msix_rsp_q(int irq, void *dev_id)
	return IRQ_HANDLED;
}

irqreturn_t
qla2xxx_msix_rsp_q_hs(int irq, void *dev_id)
{
	struct qla_hw_data *ha;
	struct qla_qpair *qpair;
	struct device_reg_24xx __iomem *reg;
	unsigned long flags;

	qpair = dev_id;
	if (!qpair) {
		ql_log(ql_log_info, NULL, 0x505b,
		    "%s: NULL response queue pointer.\n", __func__);
		return IRQ_NONE;
	}
	ha = qpair->hw;

	reg = &ha->iobase->isp24;
	spin_lock_irqsave(&ha->hardware_lock, flags);
	wrt_reg_dword(&reg->hccr, HCCRX_CLR_RISC_INT);
	spin_unlock_irqrestore(&ha->hardware_lock, flags);

	queue_work(ha->wq, &qpair->q_work);

	return IRQ_HANDLED;
}

/* Interrupt handling helpers. */

struct qla_init_msix_entry {
@@ -4505,7 +4479,6 @@ static const struct qla_init_msix_entry msix_entries[] = {
	{ "rsp_q", qla24xx_msix_rsp_q },
	{ "atio_q", qla83xx_msix_atio_q },
	{ "qpair_multiq", qla2xxx_msix_rsp_q },
	{ "qpair_multiq_hs", qla2xxx_msix_rsp_q_hs },
};

static const struct qla_init_msix_entry qla82xx_msix_entries[] = {
@@ -4792,9 +4765,10 @@ qla2x00_free_irqs(scsi_qla_host_t *vha)
}

int qla25xx_request_irq(struct qla_hw_data *ha, struct qla_qpair *qpair,
	struct qla_msix_entry *msix, int vector_type)
	struct qla_msix_entry *msix)
{
	const struct qla_init_msix_entry *intr = &msix_entries[vector_type];
	const struct qla_init_msix_entry *intr =
		&msix_entries[QLA_MSIX_QPAIR_MULTIQ_RSP_Q];
	scsi_qla_host_t *vha = pci_get_drvdata(ha->pdev);
	int ret;

+1 −3
Original line number Diff line number Diff line
@@ -899,9 +899,7 @@ qla25xx_create_rsp_que(struct qla_hw_data *ha, uint16_t options,
	    rsp->options, rsp->id, rsp->rsp_q_in,
	    rsp->rsp_q_out);

	ret = qla25xx_request_irq(ha, qpair, qpair->msix,
		ha->flags.disable_msix_handshake ?
		QLA_MSIX_QPAIR_MULTIQ_RSP_Q : QLA_MSIX_QPAIR_MULTIQ_RSP_Q_HS);
	ret = qla25xx_request_irq(ha, qpair, qpair->msix);
	if (ret)
		goto que_failed;