I have some trouble understanding the concepts explained in this video on a deeper level. In this video Sal talks a lot about the "variation in y" and "how much of the variation in y is described". However, I found myself wondering what this variation in y really is: what does it describe, and why do we care about this number?

I've always thought that the variance (or variation) of something matters when that something has a central tendency, with points scattering randomly around that centre; the variance quantifies how much they scatter. Here, however, we have y's that are positively correlated with the x's, which means that if you pick higher and higher values for x, you also get higher and higher values for y. So there is really no central tendency for the y values, and in fact the values you calculate for mean_y and for the "variation in y" depend on which x values you choose: if we take points with higher x's, our mean_y will increase, and if we take points with a wider range in x, our "variation in y" will also increase! It seems to me that this "variation in y" has no real meaning in this context - it's an arbitrary number that depends on which x values we happen to choose. So why would we care how big this number is, or how much of it is "explained" - whatever that even means?
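A quick numeric sketch of that worry (the linear process y = 2x plus noise is invented purely for illustration): the same underlying relationship, sampled over different x windows, yields different values of mean_y and of the "variation in y" (the sum of squared deviations from mean_y).

```python
import random

random.seed(0)

def sample(x_values):
    # Invented process: y rises with x (y = 2x plus small Gaussian noise).
    return [2 * x + random.gauss(0, 1) for x in x_values]

for xs in ([1, 2, 3, 4, 5], [10, 11, 12, 13, 14], [0, 5, 10, 15, 20]):
    ys = sample(xs)
    mean_y = sum(ys) / len(ys)
    # Sal's "variation in y": total squared deviation from the mean.
    variation = sum((y - mean_y) ** 2 for y in ys)
    print(f"x in {min(xs)}..{max(xs)}: mean_y = {mean_y:6.2f}, variation in y = {variation:8.2f}")
```

Higher x's shift mean_y up, and a wider x range inflates the variation, even though the process never changed - which is exactly the point of the question above.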
Now, I've managed to explain this to myself in a different way, and it kind of makes intuitive sense to me. Unlike the variation in y, the Standard Error is a much more significant concept in this context: SE_line measures the error that one commits with one's estimate of the relation between x and y (the regression line). The variation in y, as it was defined, measures the error from mean_y, so it is equivalent to the error one commits by fitting the points with the horizontal line y = mean_y. That now makes sense to me: it's what one would do with no better tool for fitting a line to a bunch of points - "we want to fit a line to a bunch of points? Hey, why don't we just take a horizontal line that goes through the mean of the y values we have." In fact, y = mean_y is the line of the form y = constant that minimizes the SE. Still, a constant line is the most basic model one could come up with: a linear function, an exponential function, a quadratic function can all adapt better to the points and have more "degrees of freedom" (more parameters to play with) than a line y = constant. So SE_y can be seen as the error committed by fitting the points with the worst - or most basic - model available, and SE_line / SE_y measures how much better a fit our model achieves compared to that most basic model.
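That reading of R2 can be checked directly. The sketch below uses invented toy data and the standard closed-form least-squares line; it computes SE_y (the error of the flat line y = mean_y), SE_line (the error of the regression line), and R2 = 1 - SE_line / SE_y, and numerically confirms that no other constant line beats y = mean_y (which follows from setting the derivative of the sum of (y_i - c)^2 with respect to c to zero, giving c = mean_y).

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3]  # toy data, roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares line y = m*x + b.
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - m * mean_x

se_y = sum((y - mean_y) ** 2 for y in ys)                      # flat-line error
se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))  # regression-line error
r2 = 1 - se_line / se_y

print(f"SE_y = {se_y:.3f}, SE_line = {se_line:.3f}, R^2 = {r2:.4f}")

# y = mean_y is the best constant line: any other c has a larger squared error.
assert all(sum((y - c) ** 2 for y in ys) > se_y
           for c in (mean_y - 1, mean_y - 0.1, mean_y + 0.1, mean_y + 1))
```

So R2 is the fraction by which the regression line shrinks the error of the most basic model, which is why "fraction of variation explained" is a sensible gloss even when the y's have no central tendency of their own.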
R2 only measures how well a line approximates points on a graph. How likely it is that a model is correct depends on many things and is the subject of hypothesis testing (covered in future videos). It is possible (and common, even in science) that a linear model describes the data perfectly even though it is the wrong model for whatever process generated the data. Say I am trying to model outdoor air temperature over time, but I only measure the temperature once a day, and only during spring. If the data turn out linear (they probably would not), a best-fit line could have a high R2, but the line would not describe the 24-hour variation in temperature caused by night/day cycles. More importantly, the line would predict that temperature increases forever (since it was warming in spring, when we sampled), which clearly is not true even under the most dire global-warming predictions, and makes no sense at all. R2 only matters if you pick the right model and sample at the right resolution.
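A minimal sketch of that trap, with an invented "true" process (a yearly sinusoid for mean daily temperature, sampled once a day during spring only, with no noise, so the numbers are purely illustrative):

```python
import math

def true_temp(day):
    # Invented yearly cycle: coldest near day 15, warmest near day 197.
    return 10 - 15 * math.cos(2 * math.pi * (day - 15) / 365)

days = list(range(60, 152))            # one reading per day, spring only
temps = [true_temp(d) for d in days]

n = len(days)
mean_d = sum(days) / n
mean_t = sum(temps) / n
m = (sum((d - mean_d) * (t - mean_t) for d, t in zip(days, temps))
     / sum((d - mean_d) ** 2 for d in days))
b = mean_t - m * mean_d

ss_line = sum((t - (m * d + b)) ** 2 for d, t in zip(days, temps))
ss_tot = sum((t - mean_t) ** 2 for t in temps)

print(f"R^2 over spring data: {1 - ss_line / ss_tot:.3f}")   # close to 1
print(f"line's forecast for day 730: {m * 730 + b:.1f} C")   # absurdly hot
print(f"true model at day 730:       {true_temp(730):.1f} C")
```

The line fits the spring window almost perfectly, yet its forecast two years out is nonsense: a high R2 certifies the fit to the sampled points, not the model.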
What does the average y (y_bar) represent, and how does the mean of y (y_bar) subtracted from any given y represent an error? This would make sense if the y value were a constant, say 6: you could measure the total error by taking the difference between each measured y and the value 6. Likewise, if the measured y's were all for the same x value, then a variation in y could be measured as an error. But the average, at least to me, really does not represent anything here. What if, as x increases, there IS an upward trend? If y has a relationship with x such that it increases as x increases, how does y - y_bar represent error in any sense? How can a measured value of y minus the average of all measured y's represent an error of anything? Suppose, for example, you decide to experimentally determine the resistance of a component by measuring its i-V (current, voltage) curve (response): the voltage readings rise steadily with the current, so what would deviations from their mean even measure?
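To make that concrete, here is a minimal sketch under the i-V scenario, with invented readings for a component of roughly 100 ohms. The spread of the voltages around mean_V mostly reflects the i-V trend itself, while the spread around the fitted line V = R*i + offset reflects the actual measurement scatter; R2 is precisely the comparison of the two.

```python
currents = [0.01, 0.02, 0.03, 0.04, 0.05]   # amperes (invented readings)
voltages = [1.02, 1.98, 3.05, 3.96, 5.01]   # volts, roughly V = 100 * i

n = len(currents)
mean_i = sum(currents) / n
mean_v = sum(voltages) / n

# Least-squares estimate of the resistance in V = R*i + offset.
R = (sum((i - mean_i) * (v - mean_v) for i, v in zip(currents, voltages))
     / sum((i - mean_i) ** 2 for i in currents))
offset = mean_v - R * mean_i

around_mean = sum((v - mean_v) ** 2 for v in voltages)
around_line = sum((v - (R * i + offset)) ** 2 for i, v in zip(currents, voltages))

print(f"estimated R: {R:.1f} ohm")
print(f"spread around mean_V:  {around_mean:.4f} (mostly the trend, not error)")
print(f"spread around the fit: {around_line:.4f} (actual scatter of the readings)")
print(f"R^2 = {1 - around_line / around_mean:.4f}")
```

So y - y_bar taken alone is not the error of any single measurement; it is the baseline spread that the regression line is asked to account for, and R2 reports how much of it the line does account for.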