I try to drill into my high school students that in classical statistics we make probability statements about our methods, not about the unknown real-world parameters. (I can't believe what I first wrote in the post. I would mark it wrong if a student wrote it.)
When we use a significance level, or alpha, of .05, we are saying that there is a chance of 1 out of 20 that our study will not replicate. So, in the universe of studies published using alpha of .05 as the criterion, we would expect that a number would fail to replicate.
And what does fail to replicate mean? It does not mean that next time you fail to find a result that is significant at the .05 level. It does not mean that next time you fail to find an effect of X and instead only find an effect of half of X. It means that you fail to find any effect at all! If you find any effect at all, then in the ordinary way these methods are used, you have replicated your original study.
Another point that I drill in is that there is a difference between quantitative significance and statistical significance. If an effect of X is significant quantitatively but an effect of 0.5X is not significant quantitatively, then it is not really useful to report that the effect of X is significantly different from zero.
Some of these problems could be alleviated if it were standard to report confidence intervals rather than p-values. If you saw that a 90 percent confidence interval did not include zero but included very low values for the effect, you might be less unnerved by a subsequent study finding a low value for the effect. However, given the various biases I discussed in my previous post, I would want to see a wider confidence interval reported--something like the 99.5 percent interval.
If we continue to use "statistical significance," then I recommend against a binary "is or is not" classification. Instead, I would suggest three ranges: A p-value greater than 0.15 means that you were unable to control for other factors well enough to have confidence in the effect you found. The experimental method is not promising in that sense. A p-value between .001 and .15 means that your results should be treated as intriguing, justifying an attempt to replicate your study in order to verify that other factors did not cause the result. A p-value below .001 suggests that other factors are unlikely to have caused your result, as long as there was no flaw in the design or execution of the study.