The first step in any machine learning project is to familiarize yourself with the data. You’ll use the Pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data.
The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database.
Pandas has powerful methods for most things you’ll want to do with this type of data.
import pandas as pd
# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)
# print a summary of the data in Melbourne data
melbourne_data.describe()
Building Your Model
You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.
The steps to building and using a model are:
- Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- Fit: Capture patterns from provided data. This is the heart of modeling.
- Predict: Use the fitted model to generate predictions for new data.
- Evaluate: Determine how accurate the model’s predictions are.
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)
# Fit model
melbourne_model.fit(X, y)
Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won’t depend meaningfully on exactly what value you choose.
We now have a fitted model that we can use to make predictions.
In practice, you’ll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we’ll make predictions for the first few rows of the training data to see how the predict function works.
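As a minimal sketch of what predicting on the first few training rows looks like: the tiny DataFrame below and its column names (Rooms, Landsize, Price) are invented for illustration and only stand in for melbourne_data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the Melbourne data (all values are invented)
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2],
    "Landsize": [150, 300, 450, 280, 120],
    "Price": [500000, 750000, 1100000, 720000, 450000],
})
X = toy_data[["Rooms", "Landsize"]]  # features
y = toy_data["Price"]                # prediction target

toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(X, y)

# Predict for the first few rows of the training data
first_rows = X.head(3)
predictions = toy_model.predict(first_rows)
print(first_rows)
print(predictions)
```

Because the tree is unrestricted and these rows were used for fitting, the predictions here reproduce the known prices exactly. That says little about how the model would do on houses it has not seen.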
The scikit-learn library has a function train_test_split to break up the data into two pieces. We’ll use some of that data as training data to fit the model, and we’ll use the other data as validation data to calculate mean_absolute_error.
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)
# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
Since we care about accuracy on new data, which we estimate from our validation data, we want to find the sweet spot between underfitting and overfitting. Visually, we want the low point of the (red) validation curve in the figure below.
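One way to look for that sweet spot is to fit trees of several sizes and compare their validation error. The sketch below assumes tree size is controlled through the max_leaf_nodes argument of DecisionTreeRegressor, and it uses synthetic data invented for illustration. The helper name get_mae is also just a choice made here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (invented for illustration): a noisy curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 2, size=400)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes):
    """Fit a tree of limited size and return its MAE on the validation data."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Very small trees tend to underfit, very large trees tend to overfit
scores = {n: get_mae(n) for n in [2, 5, 25, 100, 300]}
best_size = min(scores, key=scores.get)
for n, mae in scores.items():
    print(f"max_leaf_nodes={n:>3}  validation MAE={mae:.2f}")
print("best size:", best_size)
```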
Pipelines
Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.
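A minimal sketch of such a bundle, assuming scikit-learn's Pipeline with a SimpleImputer as the only preprocessing step. The toy arrays are invented for illustration, and the step names "imputer" and "model" are arbitrary labels.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Invented toy data with one missing value, so the preprocessing step has work to do
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0], [4.0, 500.0]])
y = np.array([100.0, 200.0, 300.0, 400.0])

my_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),       # preprocessing
    ("model", DecisionTreeRegressor(random_state=0)),  # modeling
])

# fit and predict run every step in order, as if the bundle were one model
my_pipeline.fit(X, y)
preds = my_pipeline.predict(X)
print(preds)
```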
Cross-validation
In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality.
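As a rough sketch, scikit-learn's cross_val_score can run this process over several folds. The data below is synthetic and invented for illustration, and the choice of five folds is arbitrary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (invented for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(0, 1, size=100)

model = DecisionTreeRegressor(random_state=0)

# scikit-learn reports negative MAE, so multiply by -1 to get MAE per fold
scores = -1 * cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
print("average MAE:", scores.mean())
```

Each of the five scores comes from holding out a different fifth of the data, so their spread gives a sense of how stable the estimate of model quality is.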
Deep Learning
Deep learning is an approach to machine learning characterized by deep stacks of computations. This depth of computation is what has enabled deep learning models to disentangle the kinds of complex and hierarchical patterns found in the most challenging real-world datasets.
from tensorflow import keras
from tensorflow.keras import layers
# Create a network with 1 linear unit
model = keras.Sequential([
layers.Dense(units=1, input_shape=[3])
])
With the first argument, units, we define how many outputs we want.
With the second argument, input_shape, we tell Keras the dimensions of the inputs.
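To make the arithmetic concrete, here is a NumPy sketch of what a single linear unit with three inputs computes. The weight and bias values are invented for illustration (Keras would initialize and learn its own), and linear_unit is a hypothetical helper, not a Keras function.

```python
import numpy as np

# Invented weights for illustration; a real layer learns these during training
w = np.array([[0.5], [-1.0], [2.0]])  # shape (3, 1): 3 inputs, 1 unit
b = np.array([0.1])                   # one bias per unit

def linear_unit(x):
    """Compute x @ w + b, which is what a dense layer with no activation does."""
    return x @ w + b

x = np.array([[1.0, 2.0, 3.0]])  # a batch holding one example with 3 features
y = linear_unit(x)
print(y.shape)  # (1, 1)
print(y)
```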
Many Kinds of Layers
A “layer” in Keras is a very general kind of thing. A layer can be, essentially, any kind of data transformation. Many layers, like the convolutional and recurrent layers, transform data through use of neurons and differ primarily in the pattern of connections they form. Others, though, are used for feature engineering or just simple arithmetic. There’s a whole world of layers to discover.
It turns out, however, that two dense layers with nothing in between are no better than a single dense layer by itself. Dense layers by themselves can never move us out of the world of lines and planes. What we need is something nonlinear. What we need are activation functions.
Without activation functions, neural networks can only learn linear relationships. In order to fit curves, we’ll need to use activation functions.
An activation function is simply some function we apply to each of a layer’s outputs (its activations). The most common is the rectifier function max(0, x).
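The claim that stacked dense layers collapse into one can be checked numerically. The sketch below uses NumPy rather than Keras, with random weights invented for illustration. It shows that two linear layers equal a single linear layer with combined weights, and that placing the rectifier between them changes the result.

```python
import numpy as np

def relu(x):
    """Rectifier: max(0, x) applied elementwise."""
    return np.maximum(0, x)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)  # first dense layer
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)  # second dense layer
x = rng.normal(size=(5, 3))                           # 5 examples, 3 features

# Two dense layers with nothing in between...
stacked = (x @ W1 + b1) @ W2 + b2
# ...are the same as one dense layer with combined weights and bias
W, b = W1 @ W2, b1 @ W2 + b2
single = x @ W + b
print(np.allclose(stacked, single))  # True

# With a ReLU in between, the network is no longer a single linear map
with_relu = relu(x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, single))
```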
Stochastic Gradient Descent
In addition to the training data, we need two more things:
- A “loss function” that measures how good the network’s predictions are.
- An “optimizer” that can tell the network how to change its weights.
The optimizer is an algorithm that adjusts the weights to minimize the loss.
Virtually all of the optimization algorithms used in deep learning belong to a family called stochastic gradient descent. They are iterative algorithms that train a network in steps. One step of training goes like this:
- Sample some training data and run it through the network to make predictions.
- Measure the loss between the predictions and the true values.
- Finally, adjust the weights in a direction that makes the loss smaller.
Then just do this over and over until the loss is as small as you like (or until it won’t decrease any further).
Each iteration’s sample of training data is called a minibatch (or often just “batch”), while a complete round of the training data is called an epoch. The number of epochs you train for is how many times the network will see each training example.
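The training loop described above can be sketched in plain NumPy. This is plain minibatch SGD on a one-feature linear model, not the Adam optimizer. It uses mean squared error as the loss because its gradient is simple to write down. The data, learning rate, batch size and epoch count are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented data: y = 2*x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(256, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.05, size=256)

w, b = 0.0, 0.0
learning_rate, batch_size, epochs = 0.1, 32, 50

for epoch in range(epochs):                  # one epoch = one pass over the data
    order = rng.permutation(len(X))          # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]           # 1. sample a minibatch
        error = (w * xb + b) - yb            # 2. compare predictions to true values
        grad_w = 2 * np.mean(error * xb)     # gradient of mean squared error w.r.t. w
        grad_b = 2 * np.mean(error)          # gradient of mean squared error w.r.t. b
        w -= learning_rate * grad_w          # 3. adjust weights to reduce the loss
        b -= learning_rate * grad_b

print(f"w={w:.2f}, b={b:.2f}")
```

With 256 examples and a batch size of 32, each epoch consists of 8 training steps, and the learned w and b should end up close to the values used to generate the data.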
After defining a model, you can add a loss function and optimizer with the model’s compile method:
model.compile(
optimizer="adam",
loss="mae",
)
Overfitting and Underfitting
A model’s capacity refers to the size and complexity of the patterns it is able to learn. For neural networks, this will largely be determined by how many neurons it has and how they are connected together. If it appears that your network is underfitting the data, you should try increasing its capacity.
You can increase the capacity of a network either by making it wider (more units to existing layers) or by making it deeper (adding more layers). Wider networks have an easier time learning more linear relationships, while deeper networks prefer more nonlinear ones. Which is better just depends on the dataset.
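One rough way to compare the two options is to count trainable parameters. The helper below, dense_param_count, is a hypothetical function written for this note, and the layer sizes are invented for illustration. Parameter count is only a proxy for capacity, not a measure of it.

```python
def dense_param_count(input_dim, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = input_dim
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus one bias per unit
        fan_in = units                   # this layer's outputs feed the next layer
    return total

baseline = dense_param_count(3, [16, 1])    # one hidden layer of 16 units
wider = dense_param_count(3, [32, 1])       # more units in the existing layer
deeper = dense_param_count(3, [16, 16, 1])  # an additional hidden layer
print(baseline, wider, deeper)  # 81 161 353
```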
Other Kinds of Layers Besides Dense Layers
The dropout layer can help correct overfitting.
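Overfitting is caused by the network learning spurious patterns in the training data. To recognize these patterns, a network will often rely on very specific combinations of weights, a kind of “conspiracy” of weights. To break up these conspiracies, we randomly drop out some fraction of a layer’s input units every step of training, making it much harder for the network to learn those spurious patterns in the training data. Instead, it has to search for broad, general patterns, whose weight patterns tend to be more robust.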
In Keras, the dropout rate argument rate defines what fraction of the input units to shut off. Put the Dropout layer just before the layer you want the dropout applied to:
keras.Sequential([
# ...
layers.Dropout(rate=0.3), # apply 30% dropout to the next layer
layers.Dense(16),
# ...
])
When adding dropout, you may need to increase the number of units in your Dense layers.
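To see what a dropout step does to a layer's inputs, here is a NumPy sketch. It assumes the common "inverted dropout" scheme, where the units that are kept are scaled up by 1 / (1 - rate) during training. The function name dropout and the input array are invented for illustration.

```python
import numpy as np

def dropout(inputs, rate, rng):
    """Zero a random fraction `rate` of the inputs and rescale the rest (training only)."""
    keep_mask = rng.random(inputs.shape) >= rate  # True for units that are kept
    return inputs * keep_mask / (1.0 - rate)      # rescale so the expected sum is unchanged

rng = np.random.default_rng(0)
activations = np.ones((1, 10000))
dropped = dropout(activations, rate=0.3, rng=rng)

print("fraction zeroed:", np.mean(dropped == 0))
print("mean after dropout:", dropped.mean())
```

Roughly 30% of the units are zeroed on each call, and a different random subset is chosen every time, which is what keeps the network from relying on any particular combination of inputs.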
Classification
Accuracy is one of the many metrics in use for measuring success on a classification problem. Accuracy is the ratio of correct predictions to total predictions: accuracy = number_correct / total. A model that always predicted correctly would have an accuracy score of 1.0. All else being equal, accuracy is a reasonable metric to use whenever the classes in the dataset occur with about the same frequency.
The problem with accuracy (and most other classification metrics) is that it can’t be used as a loss function. SGD needs a loss function that changes smoothly, but accuracy, being a ratio of counts, changes in “jumps”. So, we have to choose a substitute to act as the loss function. This substitute is the cross-entropy function.
Now, recall that the loss function defines the objective of the network during training. With regression, our goal was to minimize the distance between the expected outcome and the predicted outcome. We chose MAE to measure this distance.
For classification, what we want instead is a distance between probabilities, and this is what cross-entropy provides. Cross-entropy is a sort of measure for the distance from one probability distribution to another.
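The difference between the two measures shows up in a small example. The sketch below uses the binary form of cross-entropy, with labels and predicted probabilities invented for illustration. Both sets of predictions classify every example correctly, so accuracy cannot tell them apart, while cross-entropy rewards the more confident set.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum((p > threshold) == bool(t) for t, p in zip(y_true, y_prob))
    return correct / len(y_true)

def binary_cross_entropy(y_true, y_prob):
    """Average negative log-probability assigned to the true class."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for t, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

y_true = [1, 0, 1, 1]
probs_a = [0.6, 0.4, 0.6, 0.6]  # correct but not very confident
probs_b = [0.7, 0.3, 0.7, 0.7]  # correct and slightly more confident

print(accuracy(y_true, probs_a), accuracy(y_true, probs_b))
print(binary_cross_entropy(y_true, probs_a), binary_cross_entropy(y_true, probs_b))
```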
Reinforcement Learning
This idea of using reward to track the performance of an agent is a core idea in the field of reinforcement learning. Once we define the problem in this way, we can use any of a variety of reinforcement learning algorithms to produce an agent.
Computer Vision
A convnet used for image classification consists of two parts: a convolutional base and a dense head.
The base is used to extract the features from an image. It is formed primarily of layers performing the convolution operation, but often includes other kinds of layers as well.
The head is used to determine the class of the image. It is formed primarily of dense layers, but might include other layers like dropout.
The goal of the network during training is to learn two things:
- which features to extract from an image (base),
- which class goes with what features (head).
These days, convnets are rarely trained from scratch. More often, we reuse the base of a pretrained model. To the pretrained base we then attach an untrained head. In other words, we reuse the part of a network that has already learned to (1) extract features, and attach to it some fresh layers to learn to (2) classify.
The most commonly used dataset for pretraining is ImageNet, a large dataset of many kinds of natural images. Keras includes a variety of models pretrained on ImageNet in its applications module.
Feature Extraction
The feature extraction performed by the base consists of three basic operations:
- Filter an image for a particular feature (convolution)
- Detect that feature within the filtered image (ReLU)
- Condense the image to enhance the features (maximum pooling)
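The three operations above can be sketched in NumPy on a tiny image. The image and kernel values are invented for illustration, and the helpers conv2d_valid and max_pool2d are simplified stand-ins written for this note (single channel, stride 1, no padding), not the Keras layers.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1), taking weighted sums."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Invented 6x6 image: a bright horizontal bar on a dark background
image = np.zeros((6, 6))
image[2, :] = 1.0

# Positive weights along the middle row, so the kernel responds to horizontal lines
kernel = np.array([[-1, -1, -1],
                   [ 2,  2,  2],
                   [-1, -1, -1]])

filtered = conv2d_valid(image, kernel)  # Filter
detected = relu(filtered)               # Detect
condensed = max_pool2d(detected)        # Condense
print(filtered)
print(condensed)
```

The filtered map is positive only where the kernel's middle row lines up with the bar, ReLU zeroes out the negative responses on either side, and pooling shrinks the 4x4 map to 2x2 while keeping the strong response.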
The weights a convnet learns during training are primarily contained in its convolutional layers. These weights we call kernels. We can represent them as small arrays:
A kernel operates by scanning over an image and producing a weighted sum of pixel values. In this way, a kernel will act sort of like a polarized lens, emphasizing or deemphasizing certain patterns of information.
A kernel acts as a kind of lens.
Kernels define how a convolutional layer is connected to the layer that follows. The kernel above will connect each neuron in the output to nine neurons in the input. By setting the dimensions of the kernels with kernel_size, you are telling the convnet how to form these connections. Most often, a kernel will have odd-numbered dimensions – like kernel_size=(3, 3) or (5, 5) – so that a single pixel sits at the center, but this is not a requirement.
The kernels in a convolutional layer determine what kinds of features it creates. During training, a convnet tries to learn what features it needs to solve the classification problem. This means finding the best values for its kernels.
The activations in the network we call feature maps. They are what result when we apply a filter to an image; they contain the visual features the kernel extracts. Here are a few kernels pictured with feature maps they produced.
Kernels and features.
From the pattern of numbers in the kernel, you can tell the kinds of feature maps it creates. Generally, what a convolution accentuates in its inputs will match the shape of the positive numbers in the kernel. The left and middle kernels above will both filter for horizontal shapes.
With the filters parameter, you tell the convolutional layer how many feature maps you want it to create as output.
Notice that after applying the ReLU function (Detect) the feature map ends up with a lot of “dead space,” that is, large areas containing only 0’s (the black areas in the image). Having to carry these 0 activations through the entire network would increase the size of the model without adding much useful information. Instead, we would like to condense the feature map to retain only the most useful part – the feature itself.
This in fact is what maximum pooling does. Max pooling takes a patch of activations in the original feature map and replaces them with the maximum activation in that patch.
The pooling step increases the proportion of active pixels to zero pixels.
In fact, the zero-pixels carry positional information. The blank space still positions the feature within the image. When MaxPool2D removes some of these pixels, it removes some of the positional information in the feature map. This gives a convnet a property called translation invariance. This means that a convnet with maximum pooling will tend not to distinguish features by their location in the image.
We mentioned in the previous exercise that average pooling has largely been superseded by maximum pooling within the convolutional base. There is, however, a kind of average pooling that is still widely used in the head of a convnet. This is global average pooling. A GlobalAvgPool2D layer is often used as an alternative to some or all of the hidden Dense layers in the head of the network.
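In terms of array operations, global average pooling reduces each feature map to a single number by averaging over its height and width. The NumPy sketch below assumes a channels-last layout (batch, height, width, channels) and uses random feature maps invented for illustration.

```python
import numpy as np

# Invented feature maps: a batch of 2 images, 4x4 spatial size, 3 channels
rng = np.random.default_rng(0)
feature_maps = rng.random((2, 4, 4, 3))

# Average over the height and width axes, leaving one number per channel
pooled = feature_maps.mean(axis=(1, 2))
print(pooled.shape)  # (2, 3)
```

Each image is now described by one value per feature map, which is why this step can stand in for flattening followed by hidden dense layers.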
There are two additional parameters affecting both convolution and pooling layers – these are the strides of the window and whether to use padding at the image edges. The strides parameter says how far the window should move at each step, and the padding parameter describes how we handle the pixels at the edges of the input.
Increasing the stride means that we miss out on potentially valuable information in our summary. Maximum pooling layers, however, will almost always have stride values greater than 1, like (2, 2) or (3, 3), but not larger than the window itself.
What the convolution does with these boundary values is determined by its padding parameter. In TensorFlow, you have two choices: either padding='same' or padding='valid'. There are trade-offs with each.
When we set padding='valid', the convolution window will stay entirely inside the input. The drawback is that the output shrinks (loses pixels), and shrinks more for larger kernels. This will limit the number of layers the network can contain, especially when inputs are small in size.
The alternative is to use padding='same'. The trick here is to pad the input with 0’s around its borders, using just enough 0’s to make the size of the output the same as the size of the input. This can, however, have the effect of diluting the influence of pixels at the borders. The animation below shows a sliding window with 'same' padding.
The VGG model we’ve been looking at uses 'same' padding for all of its convolutional layers. Most modern convnets will use some combination of the two.
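The effect of stride and padding on output size can be worked out with a little arithmetic. The helper conv_output_size below is a hypothetical function written for this note. It computes the size along one dimension using the usual rules for 'valid' and 'same' padding, and the input sizes are invented for illustration.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one dimension for a sliding window."""
    if padding == "valid":
        # The window must fit entirely inside the input
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # Zero-padding is added so that only the stride shrinks the output
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")

print(conv_output_size(28, 3, padding="valid"))            # 26
print(conv_output_size(28, 5, padding="valid"))            # 24
print(conv_output_size(28, 3, padding="same"))             # 28
print(conv_output_size(28, 2, stride=2, padding="valid"))  # 14
```

With 'valid' padding the output shrinks more for the larger kernel, while 'same' padding with a stride of 1 preserves the input size. The last line corresponds to a 2x2 pooling window with stride 2, which halves the size.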
A single round of feature extraction can only extract relatively simple features from an image, things like simple lines or contrasts. These are too simple to solve most classification problems. Instead, convnets will repeat this extraction over and over, so that the features become more complex and refined as they travel deeper into the network.
It does this by passing them through long chains of convolutional blocks which perform this extraction.
These convolutional blocks are stacks of Conv2D and MaxPool2D layers, whose role in feature extraction we learned about in the last few lessons.
Each block represents a round of extraction, and by composing these blocks the convnet can combine and recombine the features produced, growing them and shaping them to better fit the problem at hand. The deep structure of modern convnets is what allows this sophisticated feature engineering and has been largely responsible for their superior performance.