Linear Regression

Finding a line that best fits given data.

Purpose of exercise

Activities

Mathematics overview

Compute the slope, b, and y-intercept, a, of the line that minimizes the sum of the squared deviations in a given data set from the line specified by the equation y = bx + a

The data set will consist of pairs of x and y coordinates. So, the slope and intercept must be chosen to minimize the sum

Sum-over-i(bxi + a - yi)2 where the pairs (xi, yi) are those in the data set and the sum runs across all pairs in the data set.

Use the following formulas to compute b and a

Note: r is related to what is normally called correlation, but is not the same.

Requirements

  1. Define a function, (running-avg n avg xs): a tail-recursive function that is (a) based on the running-average recurrence and (b) assumes avg is the average of some sequence ys of rational numbers, and delivers the average of the sequence consisting of the numbers in ys together with the numbers in xs, which are also rational numbers (a function that uses the traditional formula for the average, that is, the sum divided by the number terms, does not meet the requirements of this part of the exercise)
  2. Define a function (avg xs): a non-recursive function that uses the function running-avg to compute the average of xs, which is a nonempty list of rational numbers (a function that uses the traditional formula for the average, that is, the sum divided by the number terms, does not meet the requirements of this part of the exercise)
  3. Define a function (vector+scalar xs s): delivers the list formed by adding the number, s, to each element of the list, xs (assuming xs is a list of numbers)
  4. Define a function (vector*vector xs ys): delivers the list of products of corresponding elements from the lists xs and ys (assuming xs and ys are lists of numbers and contain the same number of elements)

Properties

  1. The product of the nth element of xs and the nth element of ys is the nth element of (vector*vector xs ys)
  2. The value (avg xs) is equal to the sum of the numbers in xs divided by the number of elements in xs

Lemma. You will probably need at least one lemma to reach a proof of the property that relates (avg xs) to sums. A lemma that may help is one that asserts that the value delivered by the running-avg function is the same as the value of a formula based on the traditional formula for the average and the key to understanding the running-average method.

Input/Output

Useful books of theorems