Linear Regression

Finding a line that best fits given data.

Purpose of exercise

• Practice writing equation-based programs
• Practice specifying correctness properties
• Use DrACuLa DoubleCheck for predicate-based, random testing
• Certify correctness properties using ACL2

Activities

• Define the functions required below
• Package the functions in a book and certify it
• Define the predicates required below and package them in another book (the "predicate book")
• Express the properties stated below in DoubleCheck form, package them in another book (the "test book"), and run the tests
• Convert the properties from Double-Check form to ACL2 theorems, package them in another book (the "verification book"), and certify it

Mathematics overview

Compute the slope, b, and y-intercept, a, of the line that minimizes the sum of the squared deviations in a given data set from the line specified by the equation y = bx + a

The data set will consist of pairs of x and y coordinates. So, the slope and intercept must be chosen to minimize the sum

Sum-over-i(bxi + a - yi)2 where the pairs (xi, yi) are those in the data set and the sum runs across all pairs in the data set.

Use the following formulas to compute b and a

• b = r(x, y)/xVar
• a = yAvg - b*xAvg
• xAvg is the average of the xi values
• yAvg is the average of the yi
• r(x, y) is the average of the products of corresponding deviations, (xi - xAvg)(yi - yAvg)
• xVar is the variance of the xi values, which is the average of the squared deviations (xi - xAvg)2

Note: r is related to what is normally called correlation, but is not the same.

Requirements

1. Define a function, (running-avg n avg xs): a tail-recursive function that is (a) based on the running-average recurrence and (b) assumes avg is the average of some sequence ys of rational numbers, and delivers the average of the sequence consisting of the numbers in ys together with the numbers in xs, which are also rational numbers (a function that uses the traditional formula for the average, that is, the sum divided by the number terms, does not meet the requirements of this part of the exercise)
2. Define a function (avg xs): a non-recursive function that uses the function running-avg to compute the average of xs, which is a nonempty list of rational numbers (a function that uses the traditional formula for the average, that is, the sum divided by the number terms, does not meet the requirements of this part of the exercise)
3. Define a function (vector+scalar xs s): delivers the list formed by adding the number, s, to each element of the list, xs (assuming xs is a list of numbers)
4. Define a function (vector*vector xs ys): delivers the list of products of corresponding elements from the lists xs and ys (assuming xs and ys are lists of numbers and contain the same number of elements)

Properties

1. The product of the nth element of xs and the nth element of ys is the nth element of (vector*vector xs ys)
2. The value (avg xs) is equal to the sum of the numbers in xs divided by the number of elements in xs

Lemma. You will probably need at least one lemma to reach a proof of the property that relates (avg xs) to sums. A lemma that may help is one that asserts that the value delivered by the running-avg function is the same as the value of a formula based on the traditional formula for the average and the key to understanding the running-average method.

Input/Output

• Input file: a sequence of lines, each of which contains one (x, y)-pair in the data set. The number x will occur first on the line, then y, separated by one or more blanks. Numbers in the file will be in ordinary decimal notation, such as 3.1416 or 185 or -22.9 or some other decimal numeral.
• Output file: same name as input file, but with addiotional segment "withLRcoeffs"
• First line, simply a label: Linear Regression Coefficients (slope, intercept)
• Second line: the regression coefficients, themselves (slope followed by intercept, expressed as decimal numerals with as many significant digits as the longest numeral in the input file).
• Third line, simply a label: :x-y Data
• Remaining lines: data from the input file

Useful books of theorems

• arithmetic-3/top