Add a loop versioning pass

This patch adds a pass that versions loops with variable index strides
for the case in which the stride is 1.  E.g.:

    for (int i = 0; i < n; ++i)
      x[i * stride] = ...;

becomes:

    if (stride == 1)
      for (int i = 0; i < n; ++i)
        x[i] = ...;
    else
      for (int i = 0; i < n; ++i)
        x[i * stride] = ...;

This is useful for both vector code and scalar code, and in some cases
can enable further optimisations like loop interchange or pattern
recognition.

The pass gives a 7.6% improvement on Cortex-A72 for 554.roms_r at -O3
and a 2.4% improvement for 465.tonto.  I haven't found any SPEC tests
that regress.

Sizewise, there's a 10% increase in .text for both 554.roms_r and
465.tonto.  That's obviously a lot, but in tonto's case it's because
the whole program is written using assumed-shape arrays and pointers,
so a large number of functions really do benefit from versioning.
roms likewise makes heavy use of assumed-shape arrays, and that
improvement in performance IMO justifies the code growth.

The next biggest .text increase is 4.5% for 548.exchange2_r.  I did see
a small (0.4%) speed improvement there, but although both 3-iteration
runs produced stable results, that might still be noise.  There was a
slightly larger (non-noise) improvement for a 256-bit SVE model.

481.wrf and 521.wrf_r .text grew by 2.8% and 2.5% respectively, but
without any noticeable improvement in performance.  No other test grew
by more than 2%.

Although the main SPEC beneficiaries are all Fortran tests, the
benchmarks we use for SVE also include some C and C++ tests that
benefit.

Using -frepack-arrays gives the same benefits in many Fortran cases.
The problem is that using that option inappropriately can force a full
array copy for arguments that the function only reads once, and so it
isn't really something we can turn on by default.  The new pass is
supposed to give most of the benefits of -frepack-arrays without
the risk of unnecessary repacking.

The patch therefore enables the pass by default at -O3.

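As a rough illustration of the transformation described above, the sketch below writes the versioned loop out by hand in C and checks that it stores to the same elements as the original loop. It is not the pass's implementation; the function names (fill_strided, fill_versioned, same_result) and the 64-element buffers are assumptions made only for this example.

```c
#include <assert.h>
#include <string.h>

/* Original form: the variable stride is applied on every access.  */
static void
fill_strided (double *x, int stride, int n)
{
  for (int i = 0; i < n; ++i)
    x[i * stride] = 100;
}

/* Hand-versioned form (hypothetical helper, written by hand here to
   mirror the shape of the code that the commit message describes).  */
static void
fill_versioned (double *x, int stride, int n)
{
  if (stride == 1)
    /* Contiguous accesses: the form that is easy to vectorize.  */
    for (int i = 0; i < n; ++i)
      x[i] = 100;
  else
    /* Fallback: identical to the original loop.  */
    for (int i = 0; i < n; ++i)
      x[i * stride] = 100;
}

/* Return 1 if both forms write exactly the same elements.  */
static int
same_result (int stride, int n)
{
  double a[64] = { 0 };
  double b[64] = { 0 };

  /* Keep every access inside the 64-element buffers.  */
  assert (stride > 0 && n >= 0 && (n - 1) * stride < 64);

  fill_strided (a, stride, n);
  fill_versioned (b, stride, n);
  return memcmp (a, b, sizeof a) == 0;
}
```

Calling same_result (1, 16) exercises the stride-1 version and same_result (3, 16) exercises the fallback. To see what the pass itself decides for a loop like fill_strided, the tests in this patch compile with -O3 -fdump-tree-lversion-details and scan the dump for "versioned this loop"; -fno-version-loops-for-strides turns the pass off.
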
2018-12-17  Richard Sandiford  <richard.sandiford@arm.com>
	    Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
	    Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

gcc/
	* doc/invoke.texi (-fversion-loops-for-strides): Document
	(loop-versioning-group-size, loop-versioning-max-inner-insns)
	(loop-versioning-max-outer-insns): Document new --params.
	* Makefile.in (OBJS): Add gimple-loop-versioning.o.
	* common.opt (fversion-loops-for-strides): New option.
	* opts.c (default_options_table): Enable fversion-loops-for-strides
	at -O3.
	* params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE)
	(PARAM_LOOP_VERSIONING_MAX_INNER_INSNS)
	(PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters.
	* passes.def: Add pass_loop_versioning.
	* timevar.def (TV_LOOP_VERSIONING): New time variable.
	* tree-ssa-propagate.h
	(substitute_and_fold_engine::substitute_and_fold): Add an optional
	block parameter.
	* tree-ssa-propagate.c
	(substitute_and_fold_engine::substitute_and_fold): Likewise.
	When passed, only walk blocks dominated by that block.
	* tree-vrp.h (range_includes_p): Declare.
	(range_includes_zero_p): Turn into an inline wrapper around
	range_includes_p.
	* tree-vrp.c (range_includes_p): New function, generalizing...
	(range_includes_zero_p): ...this.
	* tree-pass.h (make_pass_loop_versioning): Declare.
	* gimple-loop-versioning.cc: New file.

gcc/testsuite/
	* gcc.dg/loop-versioning-1.c: New test.
	* gcc.dg/loop-versioning-10.c: Likewise.
	* gcc.dg/loop-versioning-11.c: Likewise.
	* gcc.dg/loop-versioning-2.c: Likewise.
	* gcc.dg/loop-versioning-3.c: Likewise.
	* gcc.dg/loop-versioning-4.c: Likewise.
	* gcc.dg/loop-versioning-5.c: Likewise.
	* gcc.dg/loop-versioning-6.c: Likewise.
	* gcc.dg/loop-versioning-7.c: Likewise.
	* gcc.dg/loop-versioning-8.c: Likewise.
	* gcc.dg/loop-versioning-9.c: Likewise.
	* gfortran.dg/loop_versioning_1.f90: Likewise.
	* gfortran.dg/loop_versioning_2.f90: Likewise.
	* gfortran.dg/loop_versioning_3.f90: Likewise.
	* gfortran.dg/loop_versioning_4.f90: Likewise.
	* gfortran.dg/loop_versioning_5.f90: Likewise.
	* gfortran.dg/loop_versioning_6.f90: Likewise.
	* gfortran.dg/loop_versioning_7.f90: Likewise.
	* gfortran.dg/loop_versioning_8.f90: Likewise.

From-SVN: r267197
2018-12-17 10:05:51 +00:00
parent fb2974dcf5
commit 13e08dc939
39 changed files with 3204 additions and 12 deletions
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,31 @@
+2018-12-17  Richard Sandiford  <richard.sandiford@arm.com>
+
+	* doc/invoke.texi (-fversion-loops-for-strides): Document
+	(loop-versioning-group-size, loop-versioning-max-inner-insns)
+	(loop-versioning-max-outer-insns): Document new --params.
+	* Makefile.in (OBJS): Add gimple-loop-versioning.o.
+	* common.opt (fversion-loops-for-strides): New option.
+	* opts.c (default_options_table): Enable fversion-loops-for-strides
+	at -O3.
+	* params.def (PARAM_LOOP_VERSIONING_GROUP_SIZE)
+	(PARAM_LOOP_VERSIONING_MAX_INNER_INSNS)
+	(PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS): New parameters.
+	* passes.def: Add pass_loop_versioning.
+	* timevar.def (TV_LOOP_VERSIONING): New time variable.
+	* tree-ssa-propagate.h
+	(substitute_and_fold_engine::substitute_and_fold): Add an optional
+	block parameter.
+	* tree-ssa-propagate.c
+	(substitute_and_fold_engine::substitute_and_fold): Likewise.
+	When passed, only walk blocks dominated by that block.
+	* tree-vrp.h (range_includes_p): Declare.
+	(range_includes_zero_p): Turn into an inline wrapper around
+	range_includes_p.
+	* tree-vrp.c (range_includes_p): New function, generalizing...
+	(range_includes_zero_p): ...this.
+	* tree-pass.h (make_pass_loop_versioning): Declare.
+	* gimple-loop-versioning.cc: New file.
+
 2018-12-15  Jan Hubicka  <hubicka@ucw.cz>

 	* ipa-fnsummary.c (remap_edge_change_prob): Do not ICE when changes
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1320,6 +1320,7 @@ OBJS = \
 	gimple-laddress.o \
 	gimple-loop-interchange.o \
 	gimple-loop-jam.o \
+	gimple-loop-versioning.o \
 	gimple-low.o \
 	gimple-pretty-print.o \
 	gimple-ssa-backprop.o \
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2775,6 +2775,10 @@ fsplit-loops
 Common Report Var(flag_split_loops) Optimization
 Perform loop splitting.

+fversion-loops-for-strides
+Common Report Var(flag_version_loops_for_strides) Optimization
+Version loops based on whether indices have a stride of one.
+
 funwind-tables
 Common Report Var(flag_unwind_tables) Optimization
 Just generate unwind tables for exception handling.
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -8220,7 +8220,8 @@ by @option{-O2} and also turns on the following optimization flags:
 -ftree-partial-pre @gol
 -ftree-slp-vectorize @gol
 -funswitch-loops @gol
--fvect-cost-model}
+-fvect-cost-model @gol
+-fversion-loops-for-strides}

 @item -O0
 @opindex O0
@@ -10772,6 +10773,30 @@ of the loop on both branches (modified according to result of the condition).

 Enabled by @option{-fprofile-use} and @option{-fauto-profile}.

+@item -fversion-loops-for-strides
+@opindex fversion-loops-for-strides
+If a loop iterates over an array with a variable stride, create another
+version of the loop that assumes the stride is always one.  For example:
+
+@smallexample
+for (int i = 0; i < n; ++i)
+  x[i * stride] = @dots{};
+@end smallexample
+
+becomes:
+
+@smallexample
+if (stride == 1)
+  for (int i = 0; i < n; ++i)
+    x[i] = @dots{};
+else
+  for (int i = 0; i < n; ++i)
+    x[i * stride] = @dots{};
+@end smallexample
+
+This is particularly useful for assumed-shape arrays in Fortran where
+(for example) it allows better vectorization assuming contiguous accesses.
+
 @item -ffunction-sections
 @itemx -fdata-sections
 @opindex ffunction-sections
@@ -11981,6 +12006,15 @@ Hardware autoprefetcher scheduler model control flag.
 Number of lookahead cycles the model looks into; at '
 ' only enable instruction sorting heuristic.

+@item loop-versioning-max-inner-insns
+The maximum number of instructions that an inner loop can have
+before the loop versioning pass considers it too big to copy.
+
+@item loop-versioning-max-outer-insns
+The maximum number of instructions that an outer loop can have
+before the loop versioning pass considers it too big to copy,
+discounting any instructions in inner loops that directly benefit
+from versioning.

 @end table
 @end table
--- a/gcc/gimple-loop-versioning.cc
+++ b/gcc/gimple-loop-versioning.cc
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -556,6 +556,7 @@ static const struct default_options default_options_table[] =
    { OPT_LEVELS_3_PLUS, OPT_ftree_slp_vectorize, NULL, 1 },
    { OPT_LEVELS_3_PLUS, OPT_funswitch_loops, NULL, 1 },
    { OPT_LEVELS_3_PLUS, OPT_fvect_cost_model_, NULL, VECT_COST_MODEL_DYNAMIC },
+    { OPT_LEVELS_3_PLUS, OPT_fversion_loops_for_strides, NULL, 1 },

    /* -Ofast adds optimizations to -O3.  */
    { OPT_LEVELS_FAST, OPT_ffast_math, NULL, 1 },
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1365,6 +1365,19 @@ DEFPARAM(PARAM_LOGICAL_OP_NON_SHORT_CIRCUIT,
 	 "True if a non-short-circuit operation is optimal.",
 	 -1, -1, 1)

+DEFPARAM(PARAM_LOOP_VERSIONING_MAX_INNER_INSNS,
+	 "loop-versioning-max-inner-insns",
+	 "The maximum number of instructions in an inner loop that is being"
+	 " considered for versioning.",
+	 200, 0, 0)
+
+DEFPARAM(PARAM_LOOP_VERSIONING_MAX_OUTER_INSNS,
+	 "loop-versioning-max-outer-insns",
+	 "The maximum number of instructions in an outer loop that is being"
+	 " considered for versioning, on top of the instructions in inner"
+	 " loops.",
+	 100, 0, 0)
+
 /*

 Local variables:
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -265,6 +265,7 @@ along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_tree_unswitch);
 	  NEXT_PASS (pass_scev_cprop);
 	  NEXT_PASS (pass_loop_split);
+	  NEXT_PASS (pass_loop_versioning);
 	  NEXT_PASS (pass_loop_jam);
 	  /* All unswitching, final value replacement and splitting can expose
 	     empty loops.  Remove them now.  */
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,25 @@
+2018-12-17  Richard Sandiford  <richard.sandiford@arm.com>
+
+	* gcc.dg/loop-versioning-1.c: New test.
+	* gcc.dg/loop-versioning-10.c: Likewise.
+	* gcc.dg/loop-versioning-11.c: Likewise.
+	* gcc.dg/loop-versioning-2.c: Likewise.
+	* gcc.dg/loop-versioning-3.c: Likewise.
+	* gcc.dg/loop-versioning-4.c: Likewise.
+	* gcc.dg/loop-versioning-5.c: Likewise.
+	* gcc.dg/loop-versioning-6.c: Likewise.
+	* gcc.dg/loop-versioning-7.c: Likewise.
+	* gcc.dg/loop-versioning-8.c: Likewise.
+	* gcc.dg/loop-versioning-9.c: Likewise.
+	* gfortran.dg/loop_versioning_1.f90: Likewise.
+	* gfortran.dg/loop_versioning_2.f90: Likewise.
+	* gfortran.dg/loop_versioning_3.f90: Likewise.
+	* gfortran.dg/loop_versioning_4.f90: Likewise.
+	* gfortran.dg/loop_versioning_5.f90: Likewise.
+	* gfortran.dg/loop_versioning_6.f90: Likewise.
+	* gfortran.dg/loop_versioning_7.f90: Likewise.
+	* gfortran.dg/loop_versioning_8.f90: Likewise.
+
 2018-12-16  Steven G. Kargl  <kargl@gcc.gnu.org>

 	PR fortran/88116
--- a/gcc/testsuite/gcc.dg/loop-versioning-1.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-1.c
@@ -0,0 +1,92 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* The simplest IV case.  */
+
+void
+f1 (double *x, int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    x[stepx * i] = 100;
+}
+
+void
+f2 (double *x, int stepx, int limit)
+{
+  for (int i = 0; i < limit; i += stepx)
+    x[i] = 100;
+}
+
+void
+f3 (double *x, int stepx, int limit)
+{
+  for (double *y = x; y < x + limit; y += stepx)
+    *y = 100;
+}
+
+void
+f4 (double *x, int stepx, unsigned int n)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    x[stepx * i] = 100;
+}
+
+void
+f5 (double *x, int stepx, unsigned int limit)
+{
+  for (unsigned int i = 0; i < limit; i += stepx)
+    x[i] = 100;
+}
+
+void
+f6 (double *x, int stepx, unsigned int limit)
+{
+  for (double *y = x; y < x + limit; y += stepx)
+    *y = 100;
+}
+
+double x[10000];
+
+void
+g1 (int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    x[stepx * i] = 100;
+}
+
+void
+g2 (int stepx, int limit)
+{
+  for (int i = 0; i < limit; i += stepx)
+    x[i] = 100;
+}
+
+void
+g3 (int stepx, int limit)
+{
+  for (double *y = x; y < x + limit; y += stepx)
+    *y = 100;
+}
+
+void
+g4 (int stepx, unsigned int n)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    x[stepx * i] = 100;
+}
+
+void
+g5 (int stepx, unsigned int limit)
+{
+  for (unsigned int i = 0; i < limit; i += stepx)
+    x[i] = 100;
+}
+
+void
+g6 (int stepx, unsigned int limit)
+{
+  for (double *y = x; y < x + limit; y += stepx)
+    *y = 100;
+}
+
+/* { dg-final { scan-tree-dump-times {want to version containing loop} 12 "lversion" } } */
+/* { dg-final { scan-tree-dump-times {versioned this loop} 12 "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-10.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-10.c
@@ -0,0 +1,52 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* Test that we can version a gather-like operation in which a variable
+   stride is applied to the index.  */
+
+int
+f1 (int *x, int *index, int step, int n)
+{
+  int res = 0;
+  for (int i = 0; i < n; ++i)
+    res += x[index[i] * step];
+  return res;
+}
+
+int
+f2 (int *x, int *index, int step, int n)
+{
+  int res = 0;
+  for (int i = 0; i < n; ++i)
+    {
+      int *ptr = x + index[i] * step;
+      res += *ptr;
+    }
+  return res;
+}
+
+int x[1000];
+
+int
+g1 (int *index, int step, int n)
+{
+  int res = 0;
+  for (int i = 0; i < n; ++i)
+    res += x[index[i] * step];
+  return res;
+}
+
+int
+g2 (int *index, int step, int n)
+{
+  int res = 0;
+  for (int i = 0; i < n; ++i)
+    {
+      int *ptr = x + index[i] * step;
+      res += *ptr;
+    }
+  return res;
+}
+
+/* { dg-final { scan-tree-dump-times {address term [^\n]* \* loop-invariant} 4 "lversion" } } */
+/* { dg-final { scan-tree-dump-times {want to version containing loop} 4 "lversion" } } */
+/* { dg-final { scan-tree-dump-times {versioned this loop} 4 "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-11.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-11.c
@@ -0,0 +1,29 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* Test that we don't try to version for something that is never 1.  */
+
+void
+f1 (double *x, int stepx, int n)
+{
+  if (stepx == 1)
+    for (int i = 0; i < n; ++i)
+      x[i] = 100;
+  else
+    for (int i = 0; i < n; ++i)
+      x[stepx * i] = 100;
+}
+
+void
+f2 (double *x, int stepx, int n)
+{
+  if (stepx <= 1)
+    for (int i = 0; i < n; ++i)
+      x[i] = 100;
+  else
+    for (int i = 0; i < n; ++i)
+      x[stepx * i] = 100;
+}
+
+/* { dg-final { scan-tree-dump-times {want to version containing loop} 2 "lversion" } } */
+/* { dg-final { scan-tree-dump-times {can never be 1} 2 "lversion" } } */
+/* { dg-final { scan-tree-dump-not {versioned} "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-12.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-12.c
@@ -0,0 +1,149 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* Test that we don't try to version for a step of 1 when that would
+   cause the iterations to overlap.  */
+
+void
+f1 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx] = 100;
+      x[i * stepx + 1] = 99;
+    }
+}
+
+void
+f2 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx)
+    {
+      x[i] = 100;
+      x[i + 1] = 99;
+    }
+}
+
+void
+f3 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx - 16] = 100;
+      x[i * stepx - 15] = 99;
+    }
+}
+
+void
+f4 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx)
+    {
+      x[i - 16] = 100;
+      x[i - 15] = 99;
+    }
+}
+
+void
+f5 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx - 16] = 100;
+      x[i * stepx + 15] = 99;
+    }
+}
+
+void
+f6 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx)
+    {
+      x[i - 16] = 100;
+      x[i + 15] = 99;
+    }
+}
+
+void
+f7 (unsigned short *x, int stepx, int n)
+{
+  for (unsigned short *y = x; y < x + n; y += stepx)
+    {
+      y[0] = 100;
+      y[1] = 99;
+    }
+}
+
+unsigned short x[1000];
+
+void
+g1 (int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx] = 100;
+      x[i * stepx + 1] = 99;
+    }
+}
+
+void
+g2 (int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx)
+    {
+      x[i] = 100;
+      x[i + 1] = 99;
+    }
+}
+
+void
+g3 (int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx - 16] = 100;
+      x[i * stepx - 15] = 99;
+    }
+}
+
+void
+g4 (int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx)
+    {
+      x[i - 16] = 100;
+      x[i - 15] = 99;
+    }
+}
+
+void
+g5 (int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx - 16] = 100;
+      x[i * stepx + 15] = 99;
+    }
+}
+
+void
+g6 (int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx)
+    {
+      x[i - 16] = 100;
+      x[i + 15] = 99;
+    }
+}
+
+void
+g7 (int stepx, int n)
+{
+  for (unsigned short *y = x; y < x + n; y += stepx)
+    {
+      y[0] = 100;
+      y[1] = 99;
+    }
+}
+
+/* { dg-final { scan-tree-dump-not {want to version} "lversion" } } */
+/* { dg-final { scan-tree-dump-not {versioned} "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-13.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-13.c
@@ -0,0 +1,109 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* Test that we do version for a step of 1 when that would lead the
+   iterations to access consecutive groups.  */
+
+void
+f1 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx * 2] = 100;
+      x[i * stepx * 2 + 1] = 99;
+    }
+}
+
+void
+f2 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx * 2)
+    {
+      x[i] = 100;
+      x[i + 1] = 99;
+    }
+}
+
+void
+f3 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx * 2 - 16] = 100;
+      x[i * stepx * 2 - 15] = 99;
+    }
+}
+
+void
+f4 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx * 2)
+    {
+      x[i - 16] = 100;
+      x[i - 15] = 99;
+    }
+}
+
+void
+f5 (unsigned short *x, int stepx, int n)
+{
+  for (unsigned short *y = x; y < x + n; y += stepx * 2)
+    {
+      y[0] = 100;
+      y[1] = 99;
+    }
+}
+
+unsigned short x[1000];
+
+void
+g1 (int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx * 2] = 100;
+      x[i * stepx * 2 + 1] = 99;
+    }
+}
+
+void
+g2 (int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx * 2)
+    {
+      x[i] = 100;
+      x[i + 1] = 99;
+    }
+}
+
+void
+g3 (int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx * 2 - 16] = 100;
+      x[i * stepx * 2 - 15] = 99;
+    }
+}
+
+void
+g4 (int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx * 2)
+    {
+      x[i - 16] = 100;
+      x[i - 15] = 99;
+    }
+}
+
+void
+g5 (int stepx, int n)
+{
+  for (unsigned short *y = x; y < x + n; y += stepx * 2)
+    {
+      y[0] = 100;
+      y[1] = 99;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times {want to version containing loop} 10 "lversion" } } */
+/* { dg-final { scan-tree-dump-times {versioned this loop} 10 "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-14.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-14.c
@@ -0,0 +1,149 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* Test that we don't try to version for a step of 1 when that would
+   cause the iterations to leave a gap between accesses.  */
+
+void
+f1 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx * 4] = 100;
+      x[i * stepx * 4 + 1] = 99;
+    }
+}
+
+void
+f2 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx * 4)
+    {
+      x[i] = 100;
+      x[i + 1] = 99;
+    }
+}
+
+void
+f3 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx * 4 - 16] = 100;
+      x[i * stepx * 4 - 15] = 99;
+    }
+}
+
+void
+f4 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx * 4)
+    {
+      x[i - 16] = 100;
+      x[i - 15] = 99;
+    }
+}
+
+void
+f5 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx * 64 - 16] = 100;
+      x[i * stepx * 64 + 15] = 99;
+    }
+}
+
+void
+f6 (unsigned short *x, int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx * 64)
+    {
+      x[i - 16] = 100;
+      x[i + 15] = 99;
+    }
+}
+
+void
+f7 (unsigned short *x, int stepx, int n)
+{
+  for (unsigned short *y = x; y < x + n; y += stepx * 4)
+    {
+      y[0] = 100;
+      y[1] = 99;
+    }
+}
+
+unsigned short x[1000];
+
+void
+g1 (int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx * 4] = 100;
+      x[i * stepx * 4 + 1] = 99;
+    }
+}
+
+void
+g2 (int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx * 4)
+    {
+      x[i] = 100;
+      x[i + 1] = 99;
+    }
+}
+
+void
+g3 (int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx * 4 - 16] = 100;
+      x[i * stepx * 4 - 15] = 99;
+    }
+}
+
+void
+g4 (int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx * 4)
+    {
+      x[i - 16] = 100;
+      x[i - 15] = 99;
+    }
+}
+
+void
+g5 (int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[i * stepx * 64 - 16] = 100;
+      x[i * stepx * 64 + 15] = 99;
+    }
+}
+
+void
+g6 (int stepx, int n)
+{
+  for (int i = 0; i < n; i += stepx * 64)
+    {
+      x[i - 16] = 100;
+      x[i + 15] = 99;
+    }
+}
+
+void
+g7 (int stepx, int n)
+{
+  for (unsigned short *y = x; y < x + n; y += stepx * 4)
+    {
+      y[0] = 100;
+      y[1] = 99;
+    }
+}
+
+/* { dg-final { scan-tree-dump-not {want to version} "lversion" } } */
+/* { dg-final { scan-tree-dump-not {versioned} "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-2.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-2.c
@@ -0,0 +1,73 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* Versioning for step == 1 in these loops would allow loop interchange,
+   but otherwise isn't worthwhile.  At the moment we decide not to version.  */
+
+void
+f1 (double x[][100], int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[j * step][i] = 100;
+}
+
+void
+f2 (double x[][100], int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[j][i * step] = 100;
+}
+
+void
+f3 (double x[][100], int step, int limit)
+{
+  for (int i = 0; i < 100; ++i)
+    for (int j = 0; j < limit; j += step)
+      x[j][i] = 100;
+}
+
+void
+f4 (double x[][100], int step, int limit)
+{
+  for (int i = 0; i < limit; i += step)
+    for (int j = 0; j < 100; ++j)
+      x[j][i] = 100;
+}
+
+double x[100][100];
+
+void
+g1 (int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[j * step][i] = 100;
+}
+
+void
+g2 (int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[j][i * step] = 100;
+}
+
+void
+g3 (int step, int limit)
+{
+  for (int i = 0; i < 100; ++i)
+    for (int j = 0; j < limit; j += step)
+      x[j][i] = 100;
+}
+
+void
+g4 (int step, int limit)
+{
+  for (int i = 0; i < limit; i += step)
+    for (int j = 0; j < 100; ++j)
+      x[j][i] = 100;
+}
+
+/* { dg-final { scan-tree-dump-not {want to version} "lversion" } } */
+/* { dg-final { scan-tree-dump-not {versioned} "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-3.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-3.c
@@ -0,0 +1,24 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* Versioning these loops for when both steps are 1 allows loop
+   interchange, but otherwise isn't worthwhile.  At the moment we decide
+   not to version.  */
+
+void
+f1 (double x[][100], int step1, int step2, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[j * step1][i * step2] = 100;
+}
+
+void
+f2 (double x[][100], int step1, int step2, int limit)
+{
+  for (int i = 0; i < limit; i += step1)
+    for (int j = 0; j < limit; j += step2)
+      x[j][i] = 100;
+}
+
+/* { dg-final { scan-tree-dump-not {want to version} "lversion" } } */
+/* { dg-final { scan-tree-dump-not {versioned} "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-4.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-4.c
@@ -0,0 +1,39 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* These shouldn't be versioned; it's extremely likely that the code
+   is emulating two-dimensional arrays.  */
+
+void
+f1 (double *x, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[i * step + j] = 100;
+}
+
+void
+f2 (double *x, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[j * step + i] = 100;
+}
+
+void
+f3 (double *x, int *offsets, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[i * step + j + offsets[i]] = 100;
+}
+
+void
+f4 (double *x, int *offsets, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[j * step + i + offsets[i]] = 100;
+}
+
+/* { dg-final { scan-tree-dump-not {want to version} "lversion" } } */
+/* { dg-final { scan-tree-dump-not {versioned} "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-5.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-5.c
@@ -0,0 +1,17 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* There's no information about whether STEP1 or STEP2 is innermost,
+   so we should assume the code is sensible and version for the inner
+   evolution, i.e. when STEP2 is 1.  */
+
+void
+f1 (double *x, int step1, int step2, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[i * step1 + j * step2] = 100;
+}
+
+/* { dg-final { scan-tree-dump-times {want to version containing loop for when step2} 1 "lversion" } } */
+/* { dg-final { scan-tree-dump-times {want to version containing loop} 1 "lversion" } } */
+/* { dg-final { scan-tree-dump-times {versioned this loop} 1 "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-6.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-6.c
@@ -0,0 +1,31 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* The read from y in f1 will be hoisted to the outer loop.  In general
+   it's not worth versioning outer loops when the inner loops don't also
+   benefit.
+
+   This test is meant to be a slight counterexample, since versioning
+   does lead to cheaper outer-loop vectorization.  However, the benefit
+   isn't enough to justify the cost.  */
+
+void
+f1 (double *restrict x, double *restrict y, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[i + j] = y[i * step];
+}
+
+/* A similar example in which the read can't be hoisted, but could
+   for example be handled by vectorizer alias checks.  */
+
+void
+f2 (double *x, double *y, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[i + j] = y[i * step];
+}
+
+/* { dg-final { scan-tree-dump-not {want to version} "lversion" } } */
+/* { dg-final { scan-tree-dump-not {versioned} "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-7.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-7.c
@@ -0,0 +1,32 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* Check that versioning can handle arrays of structures.  */
+
+struct foo {
+  int a, b, c;
+};
+
+void
+f1 (struct foo *x, int stepx, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[stepx * i].a = 1;
+      x[stepx * i].b = 2;
+      x[stepx * i].c = 3;
+    }
+}
+
+void
+f2 (struct foo *x, int stepx, int limit)
+{
+  for (int i = 0; i < limit; i += stepx)
+    {
+      x[i].a = 1;
+      x[i].b = 2;
+      x[i].c = 3;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times {want to version containing loop} 2 "lversion" } } */
+/* { dg-final { scan-tree-dump-times {versioned this loop} 2 "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-8.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-8.c
@@ -0,0 +1,43 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* Versioning for step == 1 in these loops would allow loop interchange,
+   but otherwise isn't worthwhile.  At the moment we decide not to version.  */
+
+struct foo {
+  int a[100];
+};
+
+void
+f1 (struct foo *x, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[j * step].a[i] = 100;
+}
+
+void
+f2 (struct foo *x, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    for (int j = 0; j < n; ++j)
+      x[j].a[i * step] = 100;
+}
+
+void
+f3 (struct foo *x, int step, int limit)
+{
+  for (int i = 0; i < 100; ++i)
+    for (int j = 0; j < limit; j += step)
+      x[j].a[i] = 100;
+}
+
+void
+f4 (struct foo *x, int step, int limit)
+{
+  for (int i = 0; i < limit; i += step)
+    for (int j = 0; j < 100; ++j)
+      x[j].a[i] = 100;
+}
+
+/* { dg-final { scan-tree-dump-not {want to version} "lversion" } } */
+/* { dg-final { scan-tree-dump-not {versioned} "lversion" } } */
--- a/gcc/testsuite/gcc.dg/loop-versioning-9.c
+++ b/gcc/testsuite/gcc.dg/loop-versioning-9.c
@@ -0,0 +1,48 @@
+/* { dg-options "-O3 -fdump-tree-lversion-details" } */
+
+/* Check that versioning can handle small groups of accesses.  */
+
+void
+f1 (int *x, int *y, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    x[i] = y[i * step * 2] + y[i * step * 2 + 1];
+}
+
+void
+f2 (int *x, int *y, __INTPTR_TYPE__ step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    x[i] = y[i * step * 2] + y[i * step * 2 + 1];
+}
+
+void
+f3 (int *x, int *y, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    x[i] = y[i * step * 3] + y[i * step * 3 + 2];
+}
+
+void
+f4 (int *x, int *y, __INTPTR_TYPE__ step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    x[i] = y[i * step * 3] + y[i * step * 3 + 2];
+}
+
+void
+f5 (int *x, int *y, int step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    x[i] = y[i * step * 4] + y[i * step * 4 + 3];
+}
+
+void
+f6 (int *x, int *y, __INTPTR_TYPE__ step, int n)
+{
+  for (int i = 0; i < n; ++i)
+    x[i] = y[i * step * 4] + y[i * step * 4 + 3];
+}
+
+/* { dg-final { scan-tree-dump-times {want to version containing loop} 6 "lversion" } } */
+/* { dg-final { scan-tree-dump-times {versioned this loop} 6 "lversion" } } */
--- a/gcc/testsuite/gcc.dg/vect/slp-43.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-43.c
@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-additional-options "-O3" } */
+/* { dg-additional-options "-O3 -fno-version-loops-for-strides" } */

 #include <string.h>
 #include "tree-vect.h"
--- a/gcc/testsuite/gcc.dg/vect/slp-45.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-45.c
@@ -1,5 +1,5 @@
 /* { dg-require-effective-target vect_int } */
-/* { dg-additional-options "-O3" } */
+/* { dg-additional-options "-O3 -fno-version-loops-for-strides" } */

 #include <string.h>
 #include "tree-vect.h"
--- a/gcc/testsuite/gfortran.dg/loop_versioning_1.f90
+++ b/gcc/testsuite/gfortran.dg/loop_versioning_1.f90
@@ -0,0 +1,28 @@
+! { dg-options "-O3 -fdump-tree-lversion-details" }
+
+! The simplest IV case.
+
+subroutine f1(x)
+  real :: x(:)
+  x(:) = 100
+end subroutine f1
+
+subroutine f2(x, n, step)
+  integer :: n, step
+  real :: x(n * step)
+  do i = 1, n
+     x(i * step) = 100
+  end do
+end subroutine f2
+
+subroutine f3(x, limit, step)
+  integer :: limit, step
+  real :: x(limit)
+  do i = 1, limit, step
+     x(i) = 100
+  end do
+end subroutine f3
+
+! { dg-final { scan-tree-dump-times {likely to be the innermost dimension} 1 "lversion" } }
+! { dg-final { scan-tree-dump-times {want to version containing loop} 3 "lversion" } }
+! { dg-final { scan-tree-dump-times {versioned this loop} 3 "lversion" } }
--- a/gcc/testsuite/gfortran.dg/loop_versioning_2.f90
+++ b/gcc/testsuite/gfortran.dg/loop_versioning_2.f90
@@ -0,0 +1,39 @@
+! { dg-options "-O3 -fdump-tree-lversion-details -fno-frontend-loop-interchange" }
+
+! We could version the loop for when the first dimension has a stride
+! of 1, but at present there's no real benefit.  The gimple loop
+! interchange pass couldn't handle the versioned loop, and interchange
+! is instead done by the frontend (but disabled by the options above).
+
+subroutine f1(x)
+  real :: x(:, :)
+  do i = lbound(x, 1), ubound(x, 1)
+     do j = lbound(x, 2), ubound(x, 2)
+        x(i, j) = 100
+     end do
+  end do
+end subroutine f1
+
+subroutine f2(x, n, step)
+  integer :: n, step
+  real :: x(100, 100)
+  do i = 1, n
+     do j = 1, n
+        x(i * step, j) = 100
+     end do
+  end do
+end subroutine f2
+
+subroutine f3(x, n, step)
+  integer :: n, step
+  real :: x(n * step, n)
+  do i = 1, n
+     do j = 1, n
+        x(i * step, j) = 100
+     end do
+  end do
+end subroutine f3
+
+! { dg-final { scan-tree-dump-times {likely to be the innermost dimension} 1 "lversion" } }
+! { dg-final { scan-tree-dump-not {want to version} "lversion" } }
+! { dg-final { scan-tree-dump-not {versioned} "lversion" } }
--- a/gcc/testsuite/gfortran.dg/loop_versioning_3.f90
+++ b/gcc/testsuite/gfortran.dg/loop_versioning_3.f90
@@ -0,0 +1,30 @@
+! { dg-options "-O3 -fdump-tree-lversion-details -fno-frontend-loop-interchange" }
+
+! Test a case in which the outer loop iterates over the inner dimension.
+! The options above prevent the frontend from interchanging the loops.
+
+subroutine f1(x, limit, step, n)
+  integer :: limit, step, n
+  real :: x(limit, n)
+  do i = 1, limit, step
+     do j = 1, n
+        x(i, j) = 100
+     end do
+  end do
+end subroutine f1
+
+subroutine f2(x, n, limit, step)
+  integer :: n, limit, step
+  real :: x(limit, n)
+  do i = 1, n
+     do j = 1, limit, step
+        x(j, i) = 100
+     end do
+  end do
+end subroutine f2
+
+! FIXME: The frontend doesn't give us enough information to tell which loop
+! is iterating over the innermost dimension, so we optimistically
+! assume the inner one is.
+! { dg-final { scan-tree-dump-not {want to version} "lversion" { xfail *-*-* } } }
+! { dg-final { scan-tree-dump-not {versioned} "lversion" { xfail *-*-* } } }
--- a/gcc/testsuite/gfortran.dg/loop_versioning_4.f90
+++ b/gcc/testsuite/gfortran.dg/loop_versioning_4.f90
@@ -0,0 +1,95 @@
+! { dg-options "-O3 -fdump-tree-lversion-details -fno-frontend-loop-interchange" }
+
+! Test cases in which versioning is useful for a two-dimensional array.
+
+subroutine f1(x)
+  real :: x(:, :)
+  x(:, :) = 100
+end subroutine f1
+
+subroutine f2(x)
+  real :: x(:, :)
+  do i = lbound(x, 1), ubound(x, 1)
+     do j = lbound(x, 2), ubound(x, 2)
+        x(j, i) = 100
+     end do
+  end do
+end subroutine f2
+
+subroutine f3(x, n, step)
+  integer :: n, step
+  real :: x(100, 100)
+  do i = 1, n
+     do j = 1, n
+        x(j * step, i) = 100
+     end do
+  end do
+end subroutine f3
+
+subroutine f4(x, n, step)
+  integer :: n, step
+  real :: x(n * step, n)
+  do i = 1, n
+     do j = 1, n
+        x(j * step, i) = 100
+     end do
+  end do
+end subroutine f4
+
+subroutine f5(x, n, limit, step)
+  integer :: n, limit, step
+  real :: x(limit, n)
+  do i = 1, n
+     do j = 1, limit, step
+        x(j, i) = 100
+     end do
+  end do
+end subroutine f5
+
+subroutine f6(x, y)
+  real :: x(:, :), y(:)
+  do i = lbound(x, 1), ubound(x, 1)
+     do j = lbound(x, 2), ubound(x, 2)
+        x(j, i) = 100
+     end do
+     y(i) = 200
+  end do
+end subroutine f6
+
+subroutine f7(x, y, n, step)
+  integer :: n, step
+  real :: x(100, 100), y(100)
+  do i = 1, n
+     do j = 1, n
+        x(j * step, i) = 100
+     end do
+     y(i * step) = 200
+  end do
+end subroutine f7
+
+subroutine f8(x, y, n, step)
+  integer :: n, step
+  real :: x(n * step, n), y(n * step)
+  do i = 1, n
+     do j = 1, n
+        x(j * step, i) = 100
+     end do
+     y(i * step) = 200
+  end do
+end subroutine f8
+
+subroutine f9(x, n, limit, step)
+  integer :: n, limit, step
+  real :: x(limit, n), y(limit)
+  do i = 1, n
+     do j = 1, limit, step
+        x(j, i) = 100
+     end do
+     y(i) = 200
+  end do
+end subroutine f9
+
+! { dg-final { scan-tree-dump-times {likely to be the innermost dimension} 3 "lversion" } }
+! { dg-final { scan-tree-dump-times {want to version containing loop} 9 "lversion" } }
+! { dg-final { scan-tree-dump-times {hoisting check} 9 "lversion" } }
+! { dg-final { scan-tree-dump-times {versioned this loop} 9 "lversion" } }
--- a/gcc/testsuite/gfortran.dg/loop_versioning_5.f90
+++ b/gcc/testsuite/gfortran.dg/loop_versioning_5.f90
@@ -0,0 +1,57 @@
+! { dg-options "-O3 -fdump-tree-lversion-details -fno-frontend-loop-interchange" }
+
+! Make sure that in a "badly nested" loop, we don't treat the inner loop
+! as iterating over the inner dimension with a variable stride.
+
+subroutine f1(x, n)
+  integer :: n
+  real :: x(100, 100)
+  do i = 1, n
+     do j = 1, n
+        x(i, j) = 100
+     end do
+  end do
+end subroutine f1
+
+subroutine f2(x, n, step)
+  integer :: n, step
+  real :: x(100, 100)
+  do i = 1, n
+     do j = 1, n
+        x(i, j * step) = 100
+     end do
+  end do
+end subroutine f2
+
+subroutine f3(x, n)
+  integer :: n
+  real :: x(n, n)
+  do i = 1, n
+     do j = 1, n
+        x(i, j) = 100
+     end do
+  end do
+end subroutine f3
+
+subroutine f4(x, n, step)
+  integer :: n, step
+  real :: x(n, n * step)
+  do i = 1, n
+     do j = 1, n
+        x(i, j * step) = 100
+     end do
+  end do
+end subroutine f4
+
+subroutine f5(x, n, limit, step)
+  integer :: n, limit, step
+  real :: x(n, limit)
+  do i = 1, n
+     do j = 1, limit, step
+        x(i, j) = 100
+     end do
+  end do
+end subroutine f5
+
+! { dg-final { scan-tree-dump-not {want to version} "lversion" } }
+! { dg-final { scan-tree-dump-not {versioned} "lversion" } }
--- a/gcc/testsuite/gfortran.dg/loop_versioning_6.f90
+++ b/gcc/testsuite/gfortran.dg/loop_versioning_6.f90
@@ -0,0 +1,93 @@
+! { dg-options "-O3 -fdump-tree-lversion-details" }
+
+! Check that versioning can handle small groups of accesses.
+
+subroutine f1(x)
+  real :: x(:)
+  do i = lbound(x, 1), ubound(x, 1) / 2
+     x(i * 2) = 100
+     x(i * 2 + 1) = 101
+  end do
+end subroutine f1
+
+subroutine f2(x, n, step)
+  integer :: n, step
+  real :: x(n * step * 2)
+  do i = 1, n
+     x(i * step * 2) = 100
+     x(i * step * 2 + 1) = 101
+  end do
+end subroutine f2
+
+subroutine f3(x, limit, step)
+  integer :: limit, step
+  real :: x(limit * 2)
+  do i = 1, limit, step
+     x(i * 2) = 100
+     x(i * 2 + 1) = 101
+  end do
+end subroutine f3
+
+subroutine f4(x)
+  real :: x(:)
+  do i = lbound(x, 1), ubound(x, 1) / 3
+     x(i * 3) = 100
+     x(i * 3 + 1) = 101
+     x(i * 3 + 2) = 102
+  end do
+end subroutine f4
+
+subroutine f5(x, n, step)
+  integer :: n, step
+  real :: x(n * step * 3)
+  do i = 1, n
+     x(i * step * 3) = 100
+     x(i * step * 3 + 1) = 101
+     x(i * step * 3 + 2) = 102
+  end do
+end subroutine f5
+
+subroutine f6(x, limit, step)
+  integer :: limit, step
+  real :: x(limit * 3)
+  do i = 1, limit, step
+     x(i * 3) = 100
+     x(i * 3 + 1) = 101
+     x(i * 3 + 2) = 102
+  end do
+end subroutine f6
+
+subroutine f7(x)
+  real :: x(:)
+  do i = lbound(x, 1), ubound(x, 1) / 4
+     x(i * 4) = 100
+     x(i * 4 + 1) = 101
+     x(i * 4 + 2) = 102
+     x(i * 4 + 3) = 103
+  end do
+end subroutine f7
+
+subroutine f8(x, n, step)
+  integer :: n, step
+  real :: x(n * step * 4)
+  do i = 1, n
+     x(i * step * 4) = 100
+     x(i * step * 4 + 1) = 101
+     x(i * step * 4 + 2) = 102
+     x(i * step * 4 + 3) = 103
+  end do
+end subroutine f8
+
+subroutine f9(x, limit, step)
+  integer :: limit, step
+  real :: x(limit * 4)
+  do i = 1, limit, step
+     x(i * 4) = 100
+     x(i * 4 + 1) = 101
+     x(i * 4 + 2) = 102
+     x(i * 4 + 3) = 103
+  end do
+end subroutine f9
+
+! { dg-final { scan-tree-dump-times {want to version containing loop} 9 "lversion" } }
+! { dg-final { scan-tree-dump-times {versioned this loop} 9 "lversion" } }
--- a/gcc/testsuite/gfortran.dg/loop_versioning_7.f90
+++ b/gcc/testsuite/gfortran.dg/loop_versioning_7.f90
@@ -0,0 +1,67 @@
+! { dg-options "-O3 -fdump-tree-lversion-details" }
+
+! Check that versioning can handle small groups of accesses, with the
+! group being a separate array dimension.
+
+subroutine f1(x, n, step)
+  integer :: n, step
+  real :: x(2, n * step)
+  do i = 1, n
+     x(1, i * step) = 100
+     x(2, i * step) = 101
+  end do
+end subroutine f1
+
+subroutine f2(x, limit, step)
+  integer :: limit, step
+  real :: x(2, limit)
+  do i = 1, limit, step
+     x(1, i) = 100
+     x(2, i) = 101
+  end do
+end subroutine f2
+
+subroutine f3(x, n, step)
+  integer :: n, step
+  real :: x(3, n * step)
+  do i = 1, n
+     x(1, i * step) = 100
+     x(2, i * step) = 101
+     x(3, i * step) = 102
+  end do
+end subroutine f3
+
+subroutine f4(x, limit, step)
+  integer :: limit, step
+  real :: x(3, limit)
+  do i = 1, limit, step
+     x(1, i) = 100
+     x(2, i) = 101
+     x(3, i) = 102
+  end do
+end subroutine f4
+
+subroutine f5(x, n, step)
+  integer :: n, step
+  real :: x(4, n * step)
+  do i = 1, n
+     x(1, i * step) = 100
+     x(2, i * step) = 101
+     x(3, i * step) = 102
+     x(4, i * step) = 103
+  end do
+end subroutine f5
+
+subroutine f6(x, limit, step)
+  integer :: limit, step
+  real :: x(4, limit)
+  do i = 1, limit, step
+     x(1, i) = 100
+     x(2, i) = 101
+     x(3, i) = 102
+     x(4, i) = 103
+  end do
+end subroutine f6
+
+! { dg-final { scan-tree-dump-times {want to version containing loop} 6 "lversion" } }
+! { dg-final { scan-tree-dump-times {versioned this loop} 6 "lversion" } }
--- a/gcc/testsuite/gfortran.dg/loop_versioning_8.f90
+++ b/gcc/testsuite/gfortran.dg/loop_versioning_8.f90
@@ -0,0 +1,13 @@
+! { dg-options "-O3 -fdump-tree-lversion-details" }
+
+! Check that versioning is applied to a gather-like reduction operation.
+
+function f(x, index, n)
+  integer :: n
+  real :: x(:)
+  integer :: index(n)
+  f = sum(x(index(:)))
+end function f
+
+! { dg-final { scan-tree-dump-times {want to version containing loop} 1 "lversion" } }
+! { dg-final { scan-tree-dump-times {versioned this loop} 1 "lversion" } }
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -234,6 +234,7 @@ DEFTIMEVAR (TV_DSE1                  , "dead store elim1")
 DEFTIMEVAR (TV_DSE2                  , "dead store elim2")
 DEFTIMEVAR (TV_LOOP                  , "loop analysis")
 DEFTIMEVAR (TV_LOOP_INIT	     , "loop init")
+DEFTIMEVAR (TV_LOOP_VERSIONING	     , "loop versioning")
 DEFTIMEVAR (TV_LOOP_MOVE_INVARIANTS  , "loop invariant motion")
 DEFTIMEVAR (TV_LOOP_UNROLL           , "loop unrolling")
 DEFTIMEVAR (TV_LOOP_DOLOOP           , "loop doloop")
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -362,6 +362,7 @@ extern gimple_opt_pass *make_pass_fix_loops (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_no_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_init (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_loop_versioning (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_linterchange (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt);
--- a/gcc/tree-ssa-propagate.c
+++ b/gcc/tree-ssa-propagate.c
@@ -1154,6 +1154,10 @@ substitute_and_fold_dom_walker::before_dom_children (basic_block bb)


 /* Perform final substitution and folding of propagated values.
+   Process the whole function if BLOCK is null, otherwise only
+   process the blocks that BLOCK dominates.  In the latter case,
+   it is the caller's responsibility to ensure that dominator
+   information is available and up-to-date.

   PROP_VALUE[I] contains the single value that should be substituted
   at every use of SSA name N_I.  If PROP_VALUE is NULL, no values are
@@ -1170,16 +1174,24 @@ substitute_and_fold_dom_walker::before_dom_children (basic_block bb)
   Return TRUE when something changed.  */

 bool
-substitute_and_fold_engine::substitute_and_fold (void)
+substitute_and_fold_engine::substitute_and_fold (basic_block block)
 {
  if (dump_file && (dump_flags & TDF_DETAILS))
    fprintf (dump_file, "\nSubstituting values and folding statements\n\n");

  memset (&prop_stats, 0, sizeof (prop_stats));

-  calculate_dominance_info (CDI_DOMINATORS);
+  /* Don't call calculate_dominance_info when iterating over a subgraph.
+     Callers that are using the interface this way are likely to want to
+     iterate over several disjoint subgraphs, and it would be expensive
+     in enable-checking builds to revalidate the whole dominance tree
+     each time.  */
+  if (block)
+    gcc_assert (dom_info_state (CDI_DOMINATORS));
+  else
+    calculate_dominance_info (CDI_DOMINATORS);
  substitute_and_fold_dom_walker walker (CDI_DOMINATORS, this);
-  walker.walk (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+  walker.walk (block ? block : ENTRY_BLOCK_PTR_FOR_FN (cfun));

  /* We cannot remove stmts during the BB walk, especially not release
     SSA names there as that destroys the lattice of our callers.
--- a/gcc/tree-ssa-propagate.h
+++ b/gcc/tree-ssa-propagate.h
@@ -104,7 +104,7 @@ class substitute_and_fold_engine
  virtual bool fold_stmt (gimple_stmt_iterator *) { return false; }
  virtual tree get_value (tree) { return NULL_TREE; }

-  bool substitute_and_fold (void);
+  bool substitute_and_fold (basic_block = NULL);
  bool replace_uses_in (gimple *);
  bool replace_phi_args_in (gphi *);
 };
--- a/gcc/tree-vrp.c
+++ b/gcc/tree-vrp.c
@@ -1173,15 +1173,14 @@ value_inside_range (tree val, tree min, tree max)
 }


-/* Return TRUE if *VR includes the value zero.  */
+/* Return TRUE if *VR includes the value X.  */

 bool
-range_includes_zero_p (const value_range_base *vr)
+range_includes_p (const value_range_base *vr, HOST_WIDE_INT x)
 {
  if (vr->varying_p () || vr->undefined_p ())
    return true;
-  tree zero = build_int_cst (vr->type (), 0);
-  return vr->may_contain_p (zero);
+  return vr->may_contain_p (build_int_cst (vr->type (), x));
 }

 /* If *VR has a value range that is a single constant value return that,
--- a/gcc/tree-vrp.h
+++ b/gcc/tree-vrp.h
@@ -243,7 +243,7 @@ struct assert_info
 extern void register_edge_assert_for (tree, edge, enum tree_code,
 				      tree, tree, vec<assert_info> &);
 extern bool stmt_interesting_for_vrp (gimple *);
-extern bool range_includes_zero_p (const value_range_base *);
+extern bool range_includes_p (const value_range_base *, HOST_WIDE_INT);
 extern bool infer_value_range (gimple *, tree, tree_code *, tree *);

 extern bool vrp_bitmap_equal_p (const_bitmap, const_bitmap);
@@ -285,4 +285,12 @@ extern tree get_single_symbol (tree, bool *, tree *);
 extern void maybe_set_nonzero_bits (edge, tree);
 extern value_range_kind determine_value_range (tree, wide_int *, wide_int *);

+/* Return TRUE if *VR includes the value zero.  */
+
+inline bool
+range_includes_zero_p (const value_range_base *vr)
+{
+  return range_includes_p (vr, 0);
+}
+
 #endif /* GCC_TREE_VRP_H */