AWS Configuration

This semester, we will be using an Amazon Web Services (AWS)-based compute cluster for our homework and project work. This will give us a uniform environment in which to work. Although we will start with a single computer, the size and capabilities of the cluster will be increased once we begin our more substantial work.

Compute Cluster Software

The compute cluster consists of a set of virtual Linux machines. You will use ssh to connect to a command-line shell, through which you will do some of your work. See this Linux command-line cheat sheet for typical commands. Common tools available on this compute cluster include:

python (versions 2 and 3)
- scikit-learn
- tensorflow
Editors:
- vi
- emacs
- gedit
Development environments:
- jupyter

Other tools can be installed, as needed.

Compute Cluster Machines

Once up and running, the compute cluster machines will be mlfdsX.cs.ou.edu, where X is a number (0, 1, ...). All of the machines share the same accounts and user file system (all user accounts are located in /home2). In general, user accounts are configured so that only the user and the instructor have access to the account files. Common materials are placed in the /home2/ubuntu directory and are readable by all. In particular, the data sets that we will be using for homework assignments are located in /home2/ubuntu/datasets.
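
As a quick check that the shared directory is visible from your account, the following minimal sketch lists its contents from Python. The function name list_datasets is made up for this illustration; only the directory path comes from the description above.

```python
from pathlib import Path


def list_datasets(root="/home2/ubuntu/datasets"):
    """Return the sorted names of the entries in the shared data set directory."""
    return sorted(entry.name for entry in Path(root).iterdir())


# On a cluster node, with the default path:
# print(list_datasets())
```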

Master Node. The master node of the cluster is mlfds0. This node will be used for testing new configurations before they are rolled out to the other cluster machines.

Status: mlfds0 is up

Hostname: mlfds0.cs.ou.edu
Varies between 2 and 4 processors, depending on the day.
Varies between 2 and 16 GB of memory.
32GB of swap.

Status: mlfds1 is up

Hostname: mlfds1.cs.ou.edu
Varies between 2 and 4 processors, depending on the day.
Varies between 2 and 16 GB of memory.
32GB of swap.

SSH Access

We use key-based authentication for the compute cluster. This means that access is tied to the specific machines and accounts from which you connect. You will not use a password for access (although you will be asked for a passphrase if your local private key is encrypted).

SSH Installation

Linux: installed by default
OSX: installed by default
Windows: there are a number of ssh clients out there. One is Putty (I am open to others)
Windows alternative: install the Windows Subsystem for Linux, then follow the Linux instructions.

Create SSH Keys

If you already have a .ssh/id_rsa.pub in your home (user) directory, then you are all set and can skip the generation step.

Generate a public/private key pair on your local machine:
- Unix: use ssh-keygen. It is okay to use an empty passphrase, but doing so means that your private key is unencrypted (this is often okay, since it is stored on your local machine only).
- Windows: use Putty Keygen
Email your public key (.ssh/id_rsa.pub) to the instructor. Do not email the private key or you will compromise your key pair.
After you have access to the cluster, you can add other machines to the cluster access list. This is done by appending the contents of other id_rsa.pub files to the .ssh/authorized_keys file on your cluster account.
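
The append step for additional machines can be sketched in Python as follows. This is only an illustration, not a required tool: the function name append_public_key and the file name in the usage comment are hypothetical, and it assumes you have already copied the other machine's id_rsa.pub to your cluster account. It skips keys that are already listed and restricts the file permissions, since SSH servers are commonly configured to ignore an authorized_keys file that other users can write to.

```python
from pathlib import Path


def append_public_key(pub_key_file, authorized_keys_file):
    """Append a public key to authorized_keys unless it is already listed.

    Returns True if the key was added and False if it was already present.
    """
    key = Path(pub_key_file).read_text().strip()
    auth = Path(authorized_keys_file)
    # Create the .ssh directory if it does not exist yet
    auth.parent.mkdir(mode=0o700, parents=True, exist_ok=True)

    text = auth.read_text() if auth.exists() else ""
    if key in (line.strip() for line in text.splitlines()):
        return False

    # Make sure the new key starts on its own line
    separator = "" if text == "" or text.endswith("\n") else "\n"
    with auth.open("a") as f:
        f.write(separator + key + "\n")
    # Readable and writable by the owner only
    auth.chmod(0o600)
    return True


# Hypothetical usage on the cluster:
# append_public_key("laptop2_id_rsa.pub", Path.home() / ".ssh" / "authorized_keys")
```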

Cluster Access

Once your account has been confirmed, you may open up an ssh connection to one of the cluster nodes (one of the host names above).

Unix: ssh USERNAME@HOSTNAME
Putty: Create a profile for HOSTNAME, specifying your USERNAME (use the default ssh port number for this)
Open the connection

This gives you terminal access to the node, which allows you to list/view/edit files within your home directory (and list/view files in some other directories).

Jupyter

Jupyter is an interactive environment for writing and executing python (and Julia and R) code. Here are a few references:

High-level tutorial (note that we have already done all of the installation for you).
Notebook tutorial (including plotting in-line)
Full Documentation

Jupyter in the Cluster

One of the benefits of Jupyter is that the user interface executes in your browser. This means that the Jupyter server can be executing on one of our cluster machines and the interface itself can be on your laptop or desktop. Here is the procedure:

Initial configuration (do this once!):

Execute the following lines in the shell:
source activate python3
jupyter notebook --generate-config
Edit .jupyter/jupyter_notebook_config.py (use gedit, emacs or vi)
- Find the line containing c.NotebookApp.port
- Remove the comment symbol at the front of the line (the #) and set the port to 90XX, where XX are two digits that have been assigned to you.
- Save the file and exit the editor
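
As a sketch of the result, assuming a hypothetical assignment of XX = 42, the edited line would look like the excerpt below. The 8888 in the "before" comment is Jupyter's default port; the exact spacing of the generated line may differ between versions. This is a fragment of the configuration file, not a standalone script: the name c is provided by Jupyter when it loads the file.

```python
# Excerpt from .jupyter/jupyter_notebook_config.py

# Before (as generated, commented out):
# c.NotebookApp.port = 8888

# After (hypothetical assignment XX = 42):
c.NotebookApp.port = 9042
```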

Every time you wish to use Jupyter:

Unix: on your local machine, execute:
ssh -L 90XX:127.0.0.1:90XX USERNAME@HOSTNAME
where XX is your assigned two digits, USERNAME is your cluster user name and HOSTNAME is the name of the cluster machine on which you will run your Jupyter server. This sets up an encrypted tunnel from your local machine to the port that the Jupyter server is listening on.
Putty: on your local machine: add a tunnel to your profile
Port: PORTNUMBER (your assigned 90XX port)
Host: 127.0.0.1:PORTNUMBER
Don't forget to click the "Add" button.
Open the connection to the cluster machine.
On the cluster machine: execute the following lines in the shell:
source activate python3
jupyter notebook
This will start the Jupyter server and print a URL in your shell.
Point your browser to this URL.
When you are done using Jupyter, make sure that everything is saved. Then, stop the Jupyter server (^C in the shell in which the server was started, and answer "y").
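
The same 90XX port appears in the configuration file, in the tunnel command and in the URL that the server prints, so it is worth getting it right once. The following minimal sketch builds the port number and the Unix tunnel command from your assigned digits. The function names and the user name jdoe are made up for this illustration.

```python
def jupyter_port(xx):
    """Return the 90XX port for a two-digit assignment given as a string, e.g. "07"."""
    xx = str(xx)
    if len(xx) != 2 or not (xx.isascii() and xx.isdigit()):
        raise ValueError("XX must be exactly two digits, e.g. '07'")
    return int("90" + xx)


def tunnel_command(xx, username, hostname):
    """Return the ssh command that forwards the local 90XX port to the cluster node."""
    port = jupyter_port(xx)
    return f"ssh -L {port}:127.0.0.1:{port} {username}@{hostname}"


# Hypothetical user "jdoe" with assigned digits 07
print(tunnel_command("07", "jdoe", "mlfds0.cs.ou.edu"))
# prints: ssh -L 9007:127.0.0.1:9007 jdoe@mlfds0.cs.ou.edu
```

Passing the digits as a string keeps a leading zero intact, which an integer such as 7 would lose.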

Starting a New Jupyter Notebook

A notebook represents a single interactive session, an experiment or an entire processing pipeline. When you first open the Jupyter URL, you will be presented with a file browser (the default directory is the directory in which the server was started).

Create a new notebook: click on the New menu. This will open a drop-down menu. From here, you can create new directories (folders), as well as new notebooks. In order to create a new notebook, you must select the python environment (kernel) with which the notebook will be associated. For now, select Python [conda root]. This environment includes scikit-learn and many other tools.
Executing commands / code: each In block is an area in which you may type arbitrary python code. Once you have completed the code, you may press Run to execute the selected block. If the block generates output, then it will be shown in the associated Out block.
The code within an In block can be edited and re-executed. You can also execute an entire set of blocks.
Example: type the following into an In block and execute it:
```
import pandas as pd

# Load the housing data set from the shared data set directory
housing = pd.read_csv("/home2/ubuntu/datasets/housing.csv")

# Display the first five rows
housing.head()
```
This will load the housing data set from the book and display the first few rows.
Notebooks currently being edited will be saved periodically. However, I suggest making sure that all notebooks are saved before killing the server.

andrewhfagg@gmail.com

Last modified: Tue Feb 13 12:07:25 2018