
Neural Network with Keras

Introduction

TensorFlow, developed by Google, is a brilliant tool with lots of power and flexibility. However, for quick prototyping work it can be a bit verbose. Keras is a higher-level library which operates over either TensorFlow or Theano, and is intended to streamline the process of building deep learning networks. This post will show you how to design neural networks with Keras. Whenever we work with machine learning algorithms that use a stochastic process (e.g. random numbers), it is a good idea to set the random number seed so that you can run the same code again and again and get the same result. This is useful if you need to demonstrate a result, compare algorithms using the same source of randomness, or debug a part of your code. You can initialize the random number generator with any seed you like, for example:

        
import numpy as np
# Fix random seed for reproducibility
np.random.seed(7)
        
    

We are going to use the Pima Indians onset of diabetes dataset. This is a standard machine learning dataset from the UCI Machine Learning repository. It describes patient medical record data for Pima Indians and whether they had an onset of diabetes within five years. As such, it is a binary classification problem (onset of diabetes as 1 or not as 0). All of the input variables that describe each patient are numerical. This makes it easy to use directly with neural networks that expect numerical input and output values, and ideal for our first neural network in Keras. You can now load the file directly using the NumPy function loadtxt(). There are eight input variables and one output variable (the last column). Once loaded we can split the dataset into input variables (X) and the output class variable (Y).

        
# load pima indians dataset
dataset = np.loadtxt("pima-indians-diabetes.txt", delimiter=",")

#  split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
        
    

We have initialized our random number generator to ensure our results are reproducible and loaded our data. We are now ready to define our neural network model.

Keras Dense Neural Network

Models in Keras are defined as a sequence of layers. We create a Sequential model and add layers one at a time until we are happy with our network topology. The first thing to get right is to ensure the input layer has the right number of inputs. This can be specified when creating the first layer with the input_dim argument and setting it to $8$ for the $8$ input variables.

How do we know the number of layers and their types?

This is a very hard question. There are heuristics that we can use, and often the best network structure is found through a process of trial-and-error experimentation. Generally, you need a network large enough to capture the structure of the problem. In this example, we will use a fully-connected network structure with three layers.

        
# Create your first MLP in Keras
from keras.models import Sequential
from keras.layers import Dense
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
        
    

As we can see, the fully connected layers are defined using the Dense class:

  • We can specify the number of neurons in the layer as the first argument, the weight initialization method with the kernel_initializer argument (called init in older versions of Keras), and the activation function with the activation argument; a sketch with the initialization written out explicitly is given after this list.
  • In this case, we initialize the network weights to small random numbers drawn from a uniform distribution (‘uniform‘), which by default in Keras lie in the range $[-0.05, 0.05]$.
  • Another traditional alternative would be ‘normal’ for small random numbers generated from a Gaussian distribution.
  • We will use the rectifier (‘relu‘) activation function on the first two layers and the sigmoid function in the output layer. It used to be the case that sigmoid and tanh activation functions were preferred for all layers.
  • These days, better performance is achieved using the rectifier activation function. We use a sigmoid on the output layer to ensure our network output is between 0 and 1 and easy to map to either a probability of class $1$ or snap to a hard classification of either class with a default threshold of $0.5$.
  • We can piece it all together by adding each layer. The first layer has $12$ neurons and expects $8$ input variables. The second hidden layer has $8$ neurons and finally, the output layer has $1$ neuron to predict the class (onset of diabetes or not).
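
A minimal sketch of the same model with the weight initialization written out explicitly (this assumes Keras 2, where the argument is called kernel_initializer; older Keras versions used init):

# Same topology as above, with the uniform weight initialization made explicit
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))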

The model is now defined. We can compile it. Compiling the model uses the efficient numerical libraries under the covers (the so-called backend) such as Theano or TensorFlow. The backend automatically chooses the best way to represent the network for training and making predictions to run on your hardware, such as CPU or GPU or even distributed. When compiling, we must specify some additional properties required when training the network. Remember training a network means finding the best set of weights to make predictions for this problem.
We must specify:

  • the loss function to use to evaluate a set of weights,
  • the optimizer used to search through different weights for the network and any optional metrics we would like to collect and report during training.
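
For reference, the logarithmic loss that we will use for this binary classification problem can be written, for $N$ examples with true labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{y}_i$, as

$$ \mathrm{loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] $$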

        
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        
    

In this case, we will use logarithmic loss, which for a binary classification problem is defined in Keras as binary_crossentropy. We will also use the efficient gradient descent algorithm adam, for no other reason than that it is an efficient default. You can learn more about the Adam optimization algorithm in the references. Finally, because it is a classification problem, we will collect and report the classification accuracy as the metric. We have defined our model and compiled it, ready for efficient computation. Now it is time to execute the model on some data. We can train or fit our model on our loaded data by calling the fit() function on the model.

        
# Fit the model
model.fit(X, Y, epochs=150, batch_size=10)
        
    

This is where the work happens on your CPU or GPU.

  • The training process will run for a fixed number of iterations through the dataset, called epochs, which we must specify using the epochs argument. We can also set the number of instances that are evaluated before a weight update in the network is performed, called the batch size, set using the batch_size argument. For this problem, we will run for a small number of epochs ($150$) and use a relatively small batch size of $10$. Again, these can be chosen experimentally by trial and error.
  • We have trained our neural network on the entire dataset and we can evaluate the performance of the network on the same dataset. This only gives us an idea of how well we have modeled the dataset (e.g. train accuracy), but no idea of how well the algorithm might perform on new data. We have done this for simplicity, but ideally you should separate your data into train and test datasets for training and evaluation of your model (a sketch of such a split is given after the evaluation below).
  • You can evaluate your model on your training dataset using the evaluate() function on your model and pass it the same input and output used to train the model. This will generate a prediction for each input and output pair and collect scores, including the average loss and any metrics you have configured, such as accuracy.

        
# evaluate the model
scores = model.evaluate(X, Y)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
        
    

Running this example, you should see a message for each of the 150 epochs printing the loss and accuracy for each, followed by the final evaluation of the trained model on the training dataset. It takes about $10$ seconds to execute on my laptop running on the CPU with a TensorFlow backend.
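
As mentioned in the list above, a more honest estimate of performance requires evaluating on data the model has not seen during training. A minimal sketch of such a split, using scikit-learn's train_test_split (scikit-learn is an extra dependency here, and the 67/33 split and random_state are arbitrary choices):

from sklearn.model_selection import train_test_split

# hold out 33% of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

# re-create and re-compile the model as above before fitting, for a clean comparison
model.fit(X_train, Y_train, epochs=150, batch_size=10)
loss, acc = model.evaluate(X_test, Y_test)
print("Test accuracy: %.2f%%" % (acc * 100))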

Exercise: Spam/Ham prediction

We are going to use the Spam and Ham sms dataset. Text mining, or deriving information from text, is a wide field which has gained popularity with the huge amount of text data being generated. A number of applications such as sentiment analysis, document classification, topic classification, text summarization and machine translation have been automated using machine learning models. Spam filtering is a beginner's example of a document classification task, which involves classifying an sms or email as spam or non-spam (a.k.a. ham). The spam box in your e-mail account is the best example of this. So let's get started building a spam filter on a publicly available sms corpus. The extracted subset on which we will be working can be downloaded from here. We will walk through the following steps to build this application:

  • 1. Preparing the text data.
  • 2. Creating the word dictionary.
  • 3. Extracting the features.
  • 4. Training the classifier.

Further, we will check the results on the test set of the subset created.

Preparing the text data

The data-set used here has to be split into a training set and a test set. You must divide it equally between spam and ham sms; you will easily recognize spam and ham sms. In any text mining problem, text cleaning is the first step, where we remove from the documents those words which may not contribute to the information we want to extract. Sms may contain a lot of undesirable characters like punctuation marks, stop words or digits, which may not be helpful in detecting spam. We'll use the bag-of-words approach, where each unique word in a text will be represented by one number. As a first step, we have to split a message into its individual words. You have to process the sms in the following ways:
a) Removal of stop words - Stop words like "and", "the", "of", etc. are very common in all English sentences and are not very meaningful in deciding spam or legitimate status, so these words have to be removed from the sms.
b) Lemmatization - It is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, "include", "includes" and "included" would all be represented as "include". The context of the sentence is also preserved in lemmatization, as opposed to stemming (another buzzword in text mining, which does not consider the meaning of the sentence). We still need to remove non-words like punctuation marks or special characters from the sms documents. There are several ways to do it. Here, we will remove such words after creating the dictionary, which is a very convenient method since, once you have a dictionary, you need to remove each such word only once. A minimal preprocessing sketch is given below.
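
A minimal sketch of these two cleaning steps, assuming the NLTK library is available (and that its stopwords and wordnet corpora have been downloaded with nltk.download()):

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_sms(message):
    # lower-case, strip punctuation and split into words
    words = message.lower().translate(str.maketrans('', '', string.punctuation)).split()
    # drop stop words and digits, lemmatize what remains
    return [lemmatizer.lemmatize(w) for w in words if w not in stop_words and not w.isdigit()]

print(clean_sms("Nah I don't think he goes to usf, he lives around here though"))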

Creating word dictionary

A sample sms in the data-set looks like this:
1. ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2. ham Ok lar... Joking wif u oni...
3. ham U dun say so early hor... U c already then say...
4. ham Nah I don't think he goes to usf, he lives around here though
5. spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, $1.50 to rcv

This corpus will be our labeled training set. Using these ham/spam examples, we'll train a machine learning model to learn to discriminate between ham and spam automatically. Then, with a trained model, we'll be able to classify arbitrary unlabeled messages as ham or spam. It can be seen that the first word in each line is ham or spam, and the content of the sms follows on the same line. You will only perform text analytics on the content to detect the spam sms.

  • As a first step, we need to read the data from the sms.txt file and create the target and data list.

  •                 # open the file in the directory
                    file = open("C:/data/sms.txt", "r")
                    target = []
                    data = []
                    while True:
                        # split each line into a list of words; the delimiters are removed
                        line = file.readline().split()
                        # stop reading when the length of the line is equal to zero
                        if len(line) == 0:
                            break
                        # the first word of each line is the label (ham/spam), the rest is the sms content
                        target.append(line[0])
                        del(line[0])
                        data.append(line)
                    file.close()
                
  • As a second step, we need to create a dictionary of words and their frequency. For this task, half of the sms corpus is used as the training set.

  •             from collections import Counter

                def make_Dictionary(data):
                    allwords = []
                    # create a list of all words appearing in the training sms
                    for i in range(len(data)):
                        for j in range(len(data[i])):
                            allwords.append(data[i][j])
                    # count the frequency of each word
                    dictionary = Counter(allwords)
                    # Paste code for non-word removal here
                    return dictionary
                
  • Once the dictionary is created, we can add just a few lines to the above function to remove the non-words we talked about in the first step. You also have to remove single characters from the dictionary, which are irrelevant here.

  •             # iterate over a copy of the keys, since entries are deleted along the way
                list_to_remove = list(dictionary.keys())
                for item in list_to_remove:
                    if item.isalpha() == False:
                        del dictionary[item]
                    elif len(item) == 1:
                        del dictionary[item]
                # keep only the 3000 most frequent words
                dictionary = dictionary.most_common(3000)
                
  • Once the dictionary is ready, we can extract a word count vector (our feature here) of 3000 dimensions for each sms of the training set. Each word count vector contains the frequency of the 3000 dictionary words in the corresponding sms; of course, most of the entries will be zero. Let us take an example. Suppose we have 500 words in our dictionary; each word count vector then contains the frequency of those 500 dictionary words. If the text of a training sms was "Get the work done, work done", it would be encoded as [0,0,0,0,0,...,0,0,2,0,0,0,...,0,0,1,0,0,...,0,0,1,0,0,...,2,0,0,0,0,0]. Here, the word counts are placed at the 296th, 359th, 415th and 495th indices of the 500-length word count vector, and the rest are zero.

  • Implement a model with Keras so as to classify ham and spam; one possible sketch is given below.
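
One possible sketch of the feature extraction and of a small Keras classifier, in the spirit of the Pima example above (the layer sizes, number of epochs and batch size are arbitrary choices; data, target and dictionary are the objects built in the previous steps, dictionary being the list of (word, count) pairs returned by most_common(3000)):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def extract_features(data, dictionary):
    # one row per sms, one column per dictionary word, holding the word count
    features = np.zeros((len(data), len(dictionary)))
    for i, sms in enumerate(data):
        for j, (word, count) in enumerate(dictionary):
            features[i, j] = sms.count(word)
    return features

X = extract_features(data, dictionary)
# encode the labels: ham -> 0, spam -> 1
y = np.array([1 if label == 'spam' else 0 for label in target])

model = Sequential()
model.add(Dense(32, input_dim=X.shape[1], activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=32)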

Keras CNN example

We will apply the CNN methodology to the MNIST digits dataset. The aim is to see how to increase the efficiency of the Neural Network by using the CNN. The code below is the “guts” of the CNN structure that will be used:

        
# Create your first CNN in Keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D
# create model
model = Sequential()
model.add(Conv2D(32, kernel_size=(5, 5), strides=(1, 1),
                 activation='relu',
                 input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(64, (5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1000, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
        
    

I’ll go through most of the lines in turn, explaining as we go.

  • Models in Keras can come in two forms – Sequential and via the Functional API. For most deep learning networks that you build, the Sequential model is likely what you will use. It allows you to easily stack sequential layers (and even recurrent layers) of the network in order from input to output. The functional API allows you to build more complicated architectures, and it won’t be covered in this tutorial. The first line model = Sequential() declares the model type as Sequential().

  • Next, we add a $2D$ convolutional layer to process the $2D$ MNIST input images. The first argument passed to the Conv2D() layer function is the number of output channels – in this case we have $32$ output channels (as per the architecture shown at the beginning).

  • The next input is the kernel_size, which in this case we have chosen to be a $5$x$5$ moving window, followed by the strides in the $x$ and $y$ directions $(1, 1)$.

  • Next, the activation function is a rectified linear unit and finally we have to supply the model with the size of the input to the layer (which is declared in another part of the code). Declaring the input shape is only required of the first layer – Keras is good enough to work out the size of the tensors flowing through the model from there. Also notice that we don’t have to declare any weights or bias variables like we do in TensorFlow, Keras sorts that out for us.

  • Next we add a 2D max pooling layer: model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2))). The definition of the layer is dead easy. We simply specify the size of the pooling in the $x$ and $y$ directions – $(2, 2)$ in this case – and the strides.

  • Next we add another convolutional + max pooling layer, with $64$ output channels. The default strides argument in the Conv2D() function is $(1, 1)$ in Keras, so we can leave it out. The default strides argument in MaxPooling2D() is to make it equal to the pool size, so again, we can leave it out. The input tensor for this layer is (batch_size, $12$, $12$, $32$): the first $5$x$5$ convolution (with the default 'valid' padding) reduces the $28$ x $28$ image to $24$ x $24$, the $2$x$2$ pooling halves it to $12$ x $12$, and the $32$ is the number of output channels from the previous layer (you can check these shapes with model.summary(), as shown after this list). However, notice we don’t have to explicitly detail what the shape of the input is – Keras will work it out for us. This allows rapid assembling of network architectures without having to worry too much about the sizes of the tensors flowing around our networks.

  • Now that we’ve built our convolutional layers in this Keras tutorial, we want to flatten the output from these to enter our fully connected layers (all this is detailed in the convolutional neural network tutorial in TensorFlow). In TensorFlow, we had to figure out what the size of our output tensor from the convolutional layers was in order to flatten it, and also to determine explicitly the size of our weight and bias variables. Sure, this isn’t too difficult – but it just makes our life easier not to have to think about it too much.

  • The next two lines declare our fully connected layers – using the Dense() layer in Keras. Again, it is very simple. First we specify the size – in line with our architecture, we specify $1000$ nodes, each activated by a ReLU function. The second is our soft-max classification, or output layer, which is the size of the number of our classes (10 in this case, for our 10 possible hand-written digits).
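
If you want to check the shapes Keras infers for each layer (for example the (batch_size, $12$, $12$, $32$) tensor entering the second convolution), a quick way is to print the model summary, as is also done at the end of the exercise code below:

# prints one row per layer with its output shape and number of parameters
model.summary()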

That’s it – we have successfully developed the architecture of our CNN in only $8$ lines. Now let’s see what we have to do to train the model and perform predictions.

Training and evaluating our convolutional neural network

We have now developed the architecture of the CNN in Keras, but we haven’t specified the loss function, or told the framework what type of optimiser to use (e.g. gradient descent, the Adam optimiser, ...). In Keras, this can be performed in one command:

        
import keras

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.SGD(lr=0.01),
              metrics=['accuracy'])
        
    

Keras supplies many loss functions (or you can build your own) as can be seen here. In this case, we will use the standard cross entropy for categorical classification, keras.losses.categorical_crossentropy. Keras also supplies many optimisers, as can be seen here. The snippet above uses plain stochastic gradient descent (keras.optimizers.SGD) with a learning rate of $0.01$; the Adam optimizer (keras.optimizers.Adam) that we used earlier is another efficient choice. Finally, we can specify a metric that will be calculated when we run evaluate() on the model. In TensorFlow we would have to define an accuracy-calculating operation which we would need to call in order to assess the accuracy. In this case, Keras makes it easy for us. See here for a list of metrics that can be used.
Next, we want to train our model. This can be done by again running a single command in Keras:

        
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
        
    

This command looks similar to the syntax used in the very popular scikit-learn Python machine learning library. We first pass in all of our training data – in this case x_train and y_train. The next argument is the batch size – we don’t have to explicitly handle the batching up of our data during training in Keras, rather we just specify the batch size and it does it for us. In this case we are using a batch size of $128$. Next we pass the number of training epochs ($10$ in this case). The verbose flag, set to $1$ here, specifies whether you want detailed information printed to the console about the progress of the training. During training, if verbose is set to $1$, the following is output to the console:

3328/60000 [>.............................] - ETA: 87s - loss: 0.2180 - acc: 0.9336
3456/60000 [>.............................] - ETA: 87s - loss: 0.2158 - acc: 0.9349
3584/60000 [>.............................] - ETA: 87s - loss: 0.2145 - acc: 0.9350
3712/60000 [>.............................] - ETA: 86s - loss: 0.2150 - acc: 0.9348


Finally, we pass the validation or test data to the fit function so Keras knows what data to test the metric against when evaluate() is run on the model. Keras also accepts a callbacks argument here, which can be used, for instance, to record the training history, but we will not use it in this tutorial.

Once the model is trained, we can then evaluate it and print the results.
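
A minimal sketch of this final step, mirroring the evaluate() call used in the exercise code below:

# evaluate on the held-out test set and report loss and accuracy
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])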

Exercise: MNIST Application

  • Apply the previous code to the MNIST example in the dense case only. Try to adapt the parameters so as to optimize the efficiency of the model.

  •             import keras
                from keras.datasets import mnist
                from keras.models import Sequential, Model
                from keras.layers import Dense, Dropout, Activation
                from keras.optimizers import RMSprop
                from keras.utils import np_utils
    
                batch_size = 128
                num_classes = 10
                epochs = 15
    
                # the data, shuffled and split between train and test sets
                (x_train, y_train), (x_test, y_test) = mnist.load_data()
    
                X_train = x_train.reshape(60000, 784)
                X_test = x_test.reshape(10000, 784)
                X_train = X_train.astype('float32')
                X_test = X_test.astype('float32')
                X_train /= 255
                X_test /= 255
    
                nb_classes=num_classes
                # convert class vectors to binary class matrices
                Y_train = np_utils.to_categorical(y_train, num_classes)
                Y_test = np_utils.to_categorical(y_test, num_classes)
    
                model = Sequential()
                model.add(Dense(512, input_shape=(784,)))
                model.add(Activation('relu'))
                model.add(Dropout(0.2))
                model.add(Dense(512))
                model.add(Activation('relu'))
                model.add(Dropout(0.3))
                model.add(Dense(10))
                model.add(Activation('softmax')) 
    
                #model.compile(loss='categorical_crossentropy',
                #              optimizer=RMSprop(),
                #              metrics=['accuracy'])
    
                model.compile(loss=keras.losses.categorical_crossentropy,
                            optimizer=keras.optimizers.Adadelta(),
                            metrics=['accuracy'])
    
                model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs,
                          verbose=1, validation_data=(X_test, Y_test))
    
    
                score = model.evaluate(X_test, Y_test, verbose=0)
                print('Test loss:', score[0])
                print('Test accuracy:', score[1])
                model.summary()
                
  • Apply the convolution strategy to the MNIST dataset. Try to adapt the parameters so as to optimize the model.

  •             import keras
                from keras.datasets import mnist
                from keras.models import Sequential
                from keras.layers import Dense, Dropout, Flatten
                from keras.layers import Conv2D, MaxPooling2D
                from keras import backend as K
    
                batch_size = 128
                num_classes = 10
                epochs = 15
    
                # input image dimensions
                img_rows, img_cols = 28, 28
    
                # the data, split between train and test sets
                (x_train, y_train), (x_test, y_test) = mnist.load_data()
    
                if K.image_data_format() == 'channels_first':
                    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
                    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
                    input_shape = (1, img_rows, img_cols)
                else:
                    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
                    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
                    input_shape = (img_rows, img_cols, 1)
    
                x_train = x_train.astype('float32')
                x_test = x_test.astype('float32')
                x_train /= 255
                x_test /= 255
                print('x_train shape:', x_train.shape)
                print(x_train.shape[0], 'train samples')
                print(x_test.shape[0], 'test samples')
    
                # convert class vectors to binary class matrices
                y_train = keras.utils.to_categorical(y_train, num_classes)
                y_test = keras.utils.to_categorical(y_test, num_classes)
    
                model = Sequential()
                model.add(Conv2D(32, kernel_size=(3, 3),
                                activation='relu',
                                input_shape=input_shape))
                model.add(Conv2D(64, (3, 3), activation='relu'))
                model.add(MaxPooling2D(pool_size=(2, 2)))
                model.add(Dropout(0.25))
                model.add(Flatten())
                model.add(Dense(128, activation='relu'))
                model.add(Dropout(0.5))
                model.add(Dense(num_classes, activation='softmax'))
    
                model.compile(loss=keras.losses.categorical_crossentropy,
                            optimizer=keras.optimizers.Adadelta(),
                            metrics=['accuracy'])
    
                model.fit(x_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        verbose=1,
                        validation_data=(x_test, y_test))
                score = model.evaluate(x_test, y_test, verbose=0)
                print('Test loss:', score[0])
                print('Test accuracy:', score[1])
                model.summary()
                

References

  1. Frank Rosenblatt (1962), Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.
  2. Diederik P. Kingma and Jimmy Ba (2015), Adam: A Method for Stochastic Optimization.