Linear Regression
Finding a line that best fits given data.
Purpose of exercise
- Practice writing equation-based programs
- Practice specifying correctness properties
- Use DrACuLa DoubleCheck for predicate-based, random testing
- Certify correctness properties using ACL2
Activities
- Define the functions required below
- Package the functions in a book and certify it
- Define the predicates required below and package them in another book (the
"predicate book")
- Express the properties stated below in DoubleCheck form, package them in
another book (the "test book"), and run the tests
- Convert the properties from Double-Check form to ACL2 theorems, package them in another book (the "verification book"), and certify it
Mathematics overview
Compute the slope, b, and y-intercept,
a, of the
line that minimizes the sum of the squared deviations in a given data set from the
line specified by the equation y = bx + a
The data set will consist of pairs of x and y coordinates.
So, the slope and intercept must be chosen to minimize the sum
Sum-over-i(bxi + a - yi)2
where the pairs (xi, yi) are those
in the data set and the sum runs across all pairs in the data set.
Use the following formulas to compute b and a
- b = r(x, y)/xVar
- a = yAvg - b*xAvg
- xAvg is the average of the
xi values
- yAvg is the average of the
yi
-
r(x,
y) is the average of the products of corresponding deviations, (xi
- xAvg)(yi - yAvg)
- xVar is the variance of the xi values, which is the
average of the squared deviations (xi -
xAvg)2
Note: r is related to what is
normally called correlation, but is not the same.
Requirements
- Define a function, (running-avg n avg xs): a tail-recursive
function that is (a) based on the running-average recurrence and (b) assumes
avg is the average of some sequence ys of rational numbers, and delivers the
average of the sequence consisting of the numbers in ys together with the
numbers in xs, which are also rational numbers (a function that uses the
traditional formula for the average, that is, the sum divided by the
number terms, does not meet the requirements of this part of the exercise)
- Define a function (avg
xs): a non-recursive function that uses the function running-avg to compute the average of
xs, which is a nonempty list of rational
numbers (a function that uses the traditional formula for the average,
that is, the sum divided by the number terms, does not meet the
requirements of this part of the exercise)
- Define a function (vector+scalar xs s): delivers the list formed by adding the number, s, to each
element of the list, xs (assuming xs is a list of numbers)
- Define a function (vector*vector xs ys): delivers the list of products of corresponding elements from
the lists xs and ys (assuming xs and ys are lists of numbers and contain
the same number of elements)
Properties
- The product of the nth
element of xs and the nth element of ys is the nth
element of (vector*vector xs ys)
- The value (avg xs) is equal to the sum of the numbers
in xs divided by the number of elements in xs
Lemma. You will probably need at least one lemma to reach a
proof of the property that relates (avg xs) to sums. A lemma that may help is one that asserts that the value
delivered by the running-avg function is the same as the value of a formula
based on the traditional formula for the average and the key to understanding
the running-average method.
Input/Output
- Input file: a sequence of lines, each of which
contains one (x, y)-pair in
the data set.
The number x will occur first on the line, then y,
separated by one or more blanks. Numbers in the file will be in ordinary
decimal notation, such as 3.1416 or 185 or –22.9 or some other decimal numeral.
- Output file: same name as input file, but with addiotional segment
"withLRcoeffs"
- First line, simply a label:
Linear Regression Coefficients (slope, intercept)
- Second line: the regression coefficients, themselves (slope followed by intercept,
expressed as decimal numerals with as many significant digits as the longest
numeral in the input file).
- Third line, simply a label: :x-y Data
- Remaining lines: data from the input file
Useful books of theorems