Introduction

Yet it’s a success that followed a long preamble that includes recent advances in three key areas: hardware, particularly GPUs (ideally suited to the vector and matrix based mathematics usually required in machine learning); data, due to the accessibility of larger and larger datasets; and algorithms and techniques, as deep learning research breakthroughs like those described in Krizhevsky, Sutskever and Hinton’s landmark paper began to demonstrate best-of-breed results on benchmark challenges.

So it’s not just hype, and as IT engineers it’s worth our while to gain a better understanding of it. But the field can seem rather daunting to a newcomer due to all the math, statistics and algorithms involved. Even popular online courses don’t really dispel this impression, because they are typically aimed at aspiring data scientists rather than engineers.

So what to do if you’re a software developer or devops engineer who wants to get an understanding of machine learning technologies and processes without going down too many rabbit holes?

The good news is that, in many cases, it’s not as hard as it seems.

Before we get to the good stuff, it’s worth taking a quick look at some of the common machine learning areas. If this is already familiar territory for you, feel free to skip to the next section.

Machine Learning Areas

Machine learning can be divided into two broad areas: supervised learning and unsupervised learning. Unsupervised learning involves learning without the guidance of known labels or targets. A common unsupervised learning approach is clustering.
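To make clustering a little more concrete, here is a minimal sketch using Scikit Learn’s KMeans. The points and the choice of two clusters are made up for illustration. No labels are supplied; the algorithm simply groups points that lie close together.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up points forming two obvious groups (illustrative data only)
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# No labels are passed in: KMeans assigns each point to one of two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

# The cluster numbers are arbitrary; what matters is which points share one
print(cluster_ids)
```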

Supervised learning comprises the majority of machine learning approaches. In this case we usually already have labels assigned to the data. Algorithmically we can think of supervised learning as involving input variables (x) that produce an output variable (y) via a specific function:

y = f(x)

Classification is a type of supervised learning task. In this task, there are discrete known categories to which a data instance can belong. The well known MNIST challenge is an image classification problem of this kind, because the algorithm has to decide which of the ten numerical digit categories (0 to 9) a particular image belongs to.

Not all supervised learning problems aim to predict discrete categories. When the target is a continuous value, regression would be a more suitable approach.
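As a rough sketch of the difference, the snippet below fits a classifier and a regressor to the same made-up inputs. The data, and the choice of DecisionTreeClassifier and LinearRegression, are assumptions made purely for illustration; any Scikit Learn estimator of the right kind would follow the same fit and predict pattern.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Made-up feature values (x); each inner list is one data instance
x = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]

# Classification: the target (y) is a discrete category
y_category = [0, 0, 0, 1, 1, 1]
classifier = DecisionTreeClassifier(random_state=0).fit(x, y_category)
predicted_category = classifier.predict([[2.5], [10.5]])

# Regression: the target (y) is a continuous value, here y = 2x + 1
y_value = [3.0, 5.0, 7.0, 21.0, 23.0, 25.0]
regressor = LinearRegression().fit(x, y_value)
predicted_value = regressor.predict([[4.0]])

print(predicted_category, predicted_value)
```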

Deep Learning is a subset of machine learning that also encompasses both supervised and unsupervised learning. It is built on neural networks and typically involves feature learning. Deep Learning and its associated frameworks are more difficult to get to grips with than what we will demonstrate today.

The problem we will discuss in this blog post involves looking at the supervised learning task of classification.

Process Flow

Classification requires us to train the algorithm on a set of data to create a model that allows us to predict which category a novel data instance belongs to. This process flow pattern is broadly similar across many different machine learning domains and tasks, not just classification.

[Figure: Training]

[Figure: Prediction]

Overview

In this blog post I will show you how to use familiar data and tools, plus a couple that may be unfamiliar, to solve a machine learning classification problem. Specifically, we will attempt to classify a set of logs by their source type. Classifying logs is not necessarily a problem any of us regularly need to solve, as specific log entries can usually be traced to a source log quite easily. However, since logs are familiar and readily available data for most of us, they should make it convenient to reason about the problem (other commonly used datasets include the 80-year old Iris dataset, or the MNIST database already mentioned).

To solve this problem we will write a Python script that builds a classification model from an existing set of logs, and then predict the log types of a new set of logs.

We will use the Python machine learning library Scikit Learn as our Machine Learning framework. Python has been the data scientist’s friend for a long time, with excellent math, science, machine learning, data frames and graph support in the shape of Numpy, Scipy, Scikit Learn, Pandas, Matplotlib and many others.

Why don’t we use TensorFlow, PyTorch or Theano, you may ask, as they are also Python frameworks? As explained by Sebastian Raschka and others, these frameworks are particularly well suited to Deep Learning architectures and are able to utilise GPUs for better performance. However, not all machine learning problems need Deep Learning. Additionally, GPU optimisation requires, of course, access to GPUs, while many machine learning tasks run perfectly well on CPUs. These frameworks also allow you to build your own algorithm with lower level building blocks, which is usually unnecessary when the algorithm is already implemented in the library.

Scikit Learn, by comparison, offers a wide range of off-the-shelf algorithms, and works well on CPU. It is by no means the only option, but it is a popular and well supported choice that works great for exploratory work, prototyping, and more.

Step 1: Prepare the Data

So let’s start. First we need some data.

I am going to use system logs from my laptop, since those are readily available to me (and hopefully to many reading this article), but you can use any available logs: Apache, Tomcat, System, Docker. Whatever you have to hand.

To train our model we will use a set of training data, and to test how well the resulting model generalises to other data we will use a smaller set of test data. So let’s go ahead and create a local data directory for each dataset:

mkdir -p data/{train,test}/laptop

Next we’ll grab some logs larger than 10k (so that there is enough to train on). Each log is divided between training and test data at a ratio of roughly 9 to 1: the last tenth of the lines becomes test data, and the rest is used for training.

find /var/log -type f -size +10k -name "*.log" 2>/dev/null | while IFS= read -r log
do
  # Count the lines, then send the last tenth to test and the rest to train
  rows=$(wc -l "$log" | awk '{ print $1 }')
  head -n $((rows - rows / 10)) "$log" > data/train/laptop/"${log##*/}"
  tail -n $((rows / 10)) "$log" > data/test/laptop/"${log##*/}"
done
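For anyone who prefers to see the split arithmetic outside the shell, here is a small Python sketch of the same idea. The function name split_lines and the sample lines are hypothetical; it only mirrors the integer division used in the loop above and is not part of the script we build later.

```python
def split_lines(lines):
    """Mirror the shell loop: the last tenth (integer division) is test data."""
    test_count = len(lines) // 10
    train_count = len(lines) - test_count
    return lines[:train_count], lines[train_count:]

# 25 made-up lines: 25 // 10 = 2 test lines, leaving 23 for training
lines = ["line %d" % i for i in range(25)]
train_lines, test_lines = split_lines(lines)
print(len(train_lines), len(test_lines))
```

Note that a file with fewer than 10 lines ends up with no test lines at all, which is one reason for only picking up reasonably large logs.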

Let’s see how we did with that:

$ wc -l data/train/laptop/*.log

   15033 data/train/laptop/corecaptured.log
   28257 data/train/laptop/debuglog.log
     607 data/train/laptop/displaypolicyd.log
     258 data/train/laptop/displaypolicyd.stdout.log
    4401 data/train/laptop/fsck_hfs.log
     614 data/train/laptop/failfast.log
      44 data/train/laptop/httpd.log
  129561 data/train/laptop/install.log
   13671 data/train/laptop/system.log
   55905 data/train/laptop/trac.log
    3065 data/train/laptop/wifi.log
  251416 total

Not too bad, we can work with that.

Step 2: Prepare the Scripting Environment

Next, make sure you have a recent version of Python 2.7 installed, or see the download docs.

You will also need Python pip.

With those in place, install the Python libraries we are going to use.

pip install numpy scikit-learn

Step 3: Write the Script

While writing the script we will see that more than half of the script is bread and butter for any developer or devops engineer. Only a small core part of the script deals with the actual machine learning process, and what’s more, in many cases this core pattern is very similar amongst scripts solving very different machine learning problems.

So if you understand this script, chances are you will be able to identify key parts of many other machine learning scripts as well!

Pass input parameters

To begin with we are going to provide the script a couple of input parameters: our data directories. So far so simple:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train_data_dir', type=str, default='data/train/laptop',
                    help='data directory containing training logs')
parser.add_argument('--test_data_dir', type=str, default='data/test/laptop',
                    help='data directory containing test logs')
args = parser.parse_args()

Create data structures

Next we’re going to read the logs into two Python arrays (lists), wrapped by a dictionary: one dict for training logs and one for testing logs. Note that we could have used something like Pandas data frames or Numpy arrays for the same purpose, but this wouldn’t really add anything at this stage and we want to keep things lean and transparent.

I would like to be able to create the dictionaries by passing only the input directory of the data to a function:

train_log_collection = create_log_dict(args.train_data_dir)
test_log_collection = create_log_dict(args.test_data_dir)

The corresponding function needs to do three things:
1. glob the log files in the corresponding data directory
2. extract the text data from each logfile, line by line, and read it into an array
3. identify the log source type, for each line, and set that in a second array

We differentiate data from type by indexing each dictionary with corresponding keys: one called ‘data’ and one called ‘type’. Our function would then look something like the following:

import glob

def create_log_dict(logfile_path):
    log_collection = {'data': [], 'type': []}
    # Sort the file list so each log gets the same numeric type in train and test
    logfiles = sorted(glob.glob(logfile_path + "/*.log"))
    for log_type, logfile in enumerate(logfiles):
        with open(logfile, "r") as file_handle:
            # Read the file line by line, dropping empty lines
            filedata_array = [line for line in file_handle.read().split('\n') if line]
        # Add the log lines, plus one numeric log type per line
        log_collection['data'] += filedata_array
        log_collection['type'] += [log_type] * len(filedata_array)

    return log_collection

Believe it or not, we’re done with the hardest part: preparing our data and formulating the corresponding data structures!

Note that there are several additional steps we could have taken to prepare the data. We may decide to normalise our data by extracting a standard set of fields (eg. date, program type, message, etc.) and formatting the data in those fields to be the same (eg. date formats may be different across different log file types). We may also wish to get rid of obvious outliers that could skew the model, and remove entries with empty fields.
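As an illustration of what such field extraction might look like, here is a minimal sketch using a regular expression. The syslog-style layout, the pattern, the function name extract_fields and the sample line are all assumptions made for this example; real logs vary, and each log type would likely need its own pattern.

```python
import re

# Assumed layout: "Mon DD HH:MM:SS host program[pid]: message"
SYSLOG_PATTERN = re.compile(
    r'^(?P<date>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) '
    r'(?P<host>\S+) '
    r'(?P<program>[^\[:\s]+)(?:\[\d+\])?: '
    r'(?P<message>.*)$'
)

def extract_fields(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    match = SYSLOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Made-up log line for illustration
example = "Oct 20 20:20:15 mylaptop syslogd[45]: ASL Sender Statistics"
print(extract_fields(example))
print(extract_fields("not a syslog line"))
```

Lines that return None could then be dropped or handled separately, which is one way of removing entries with empty fields.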

In our case, largely for the sake of brevity, we’ve taken a punt that we can get results without spending more time on preparing the data (other than removing empty lines). However in a business scenario normalisation would be the norm.

Fit the Model

The rest turns out to be formulaic. We are going to create a model by using an algorithm to find a good fit for the training data, and then make new predictions on our test data.

But first, what is a model?

In everyday terms a model is a simplified version of reality. In machine learning this concept is no different: a model is a simplified version of a complex data reality. We create this model by trying to fit our data (the text log data) with a suitable algorithm capable of producing a result as close as possible to the real thing. The smaller the error between the data and the algorithm’s prediction, the better the model is likely to be. The process of finding this approximation is called ‘fitting’.

What constitutes a suitable algorithm and how the fitting process works is outside the scope of this blog. Most of that is hidden from view in the high level way Scikit Learn allows us to operate, and I will show you that Scikit Learn also makes it trivial to experiment with a range of algorithms and to choose the one that suits your purpose.

Train the Model

But first we need a training function.

Our training function (let’s call it ‘train’) requires three elements:
1. The algorithm; we’ll start by using Multinomial Naive Bayes, which is simple and fast
2. The feature data (x), which is the log data array in our log collection dictionary: train_log_collection['data']
3. The target data (y), which is the log type array in our log collection dictionary: train_log_collection['type']

Note: If you’re wondering why or how we chose Multinomial Naive Bayes, the short answer is that it doesn’t really matter at this stage. We’re simply picking an algorithm that we trust will give us results and that has been shown to be fast and convenient in the past. The longer answer is that Naive Bayes has a longstanding association with text categorisation (our use case), and the multinomial variation is often used for document classification. So we know that it is a reasonable choice.

If we do our job properly, we should be able to derive a model with the following function call:

model = train(algorithm, feature_data, target_data)

Which in our example translates to:

from sklearn import naive_bayes

model = train(naive_bayes.MultinomialNB(), train_log_collection['data'], train_log_collection['type'])

Let’s see how we go.

One thing we need to be mindful of is that our feature data (the log data) is text, whereas algorithms usually require numerical input data as a vector, matrix or tensor (an n-dimensional numerical array). So we will need a way to convert our text into an n-dimensional array of numbers that the algorithm can accept as input. Without digressing too much, Scikit Learn comes with a handy set of tools to do just that, extracting features from text and turning them into numbers, available from sklearn.feature_extraction.text.

For our use case, a common approach is to first convert the text documents into a matrix of token counts using CountVectorizer, and then to re-weight those counts with a term frequency transformer such as TfidfTransformer, which scales down tokens that appear in most lines and are therefore less informative. Feel free to dig into those, but the essential thing to know is that together they convert our text into a numerical matrix the algorithm can work with.
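To get a feel for what these two steps produce, here is a small sketch run on two made-up log lines (illustrative data only). It prints the token-to-column mapping, the raw count matrix, and the re-weighted matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two made-up log lines (illustrative only)
docs = ["wifi link up", "wifi link down link"]

# Step 1: one row per line, one column per token, values are token counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)   # token -> column index
print(counts.toarray())

# Step 2: re-weight the counts so tokens found in every line count for less
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray())
```

In the first row, "up" ends up with a higher weight than "wifi", because "wifi" appears in both lines and so says less about which line we are looking at.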

Finally we’ll chain these transformations, as well as the algorithm, together using Scikit Learn’s handy Pipeline utility.

When we put all of that together we have a function that looks as follows:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def train(algorithm, training_feature_data, training_target_data):
    model = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', algorithm)])
    model.fit(training_feature_data,training_target_data)
    return model
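Before pointing this at real logs, it can help to try the function on a handful of made-up lines. The sketch below repeats the train function so that it runs on its own; the toy log lines and the two numeric log types are invented for illustration.

```python
from sklearn import naive_bayes
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def train(algorithm, training_feature_data, training_target_data):
    model = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', algorithm)])
    model.fit(training_feature_data, training_target_data)
    return model

# Made-up log lines and numeric log types (0 = wifi log, 1 = install log)
toy_data = [
    "kernel: wifi link up on en0",
    "kernel: wifi link down on en0",
    "kernel: wifi scan started",
    "installer: package install started",
    "installer: package install finished",
    "installer: package verified",
]
toy_types = [0, 0, 0, 1, 1, 1]

toy_model = train(naive_bayes.MultinomialNB(), toy_data, toy_types)
toy_predictions = toy_model.predict(["wifi link lost", "package install failed"])
print(toy_predictions)
```

Tokens the model has never seen, such as "lost" and "failed" here, are simply ignored by the vectorizer, so the prediction rests on the tokens it does know.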

Making new predictions

Once we have a model we can start to make predictions on new data. Let’s give the model a random line from our test log data:

print(model.predict(test_log_collection['data'][321:322]))

Running that gives us:

[0]

Not particularly useful. Our log types were indexed as integers, and that line is predicted as type 0. Let’s make that a little clearer:

predicted_type = int(model.predict(test_log_collection['data'][321:322])[0])
print(test_log_collection['data'][321] + "\n" + sorted(glob.glob(args.train_data_dir + "/*.log"))[predicted_type])

That gives us the following output:

Oct 20 20:20:15 CCFile::captureLogRun Skipping current file Dir file [2017-10-20_20,20,15.846930]-AirPortBrcm4360_Logs-004.txt, Current File [2017-10-20_20,20,15.846930]-AirPortBrcm4360_Logs-004.txt
data/train/laptop/corecaptured.log

Ok, so that log line is predicted to be from corecaptured.log, and a quick check in the actual log file verifies that this is correct.

But what if I wanted to know the percentage of predictions that are correct, in other words how accurate my model is? We can use Numpy’s mean function to compare the predictions about our test logs with the actual, known log types:

import numpy as np

print(np.mean(model.predict(test_log_collection['data']) == test_log_collection['type']))

That gives us the following output:

0.986352045397

Our model has 98.64% accuracy. Not bad.
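If the mean-of-a-comparison trick looks a little opaque, this tiny worked example with made-up values shows what is going on. Comparing the predictions with the known types gives an array of booleans, and taking the mean of that array counts True as 1 and False as 0.

```python
import numpy as np

# Made-up predictions and known log types for four lines
predicted = np.array([0, 1, 1, 2])
actual = [0, 1, 2, 2]

# Element-wise comparison gives [True, True, False, True];
# the mean of booleans treats True as 1 and False as 0
matches = predicted == actual
accuracy = np.mean(matches)
print(accuracy)  # 3 of 4 correct -> 0.75
```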

Improving our script

Could we improve on this if we used a different algorithm?

To see what different algorithms give us, let’s quickly refactor our process with a couple of helper functions:

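def predict(model, new_docs):
    # Fraction of lines whose predicted type matches the known type
    predicted = model.predict(new_docs['data'])
    accuracy = np.mean(predicted == new_docs['type'])
    return accuracy

def report(clf_type, accuracy):
    # Algorithm name in bold, accuracy in green, then reset the terminal colours
    print("\033[1m" + clf_type + "\033[0m\033[92m")
    print("Accuracy: " + str(round(accuracy * 100, 2)) + "%\033[0m\n")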

Now we can simplify the core flow as follows:

algorithms = [ naive_bayes.MultinomialNB() ]
for algorithm in algorithms:
    model = train(algorithm, train_log_collection['data'], train_log_collection['type'])
    accuracy = predict(model,test_log_collection)
    report((str(algorithm).split('(')[0]),accuracy)

And all that’s left to do is to fill out our algorithms array with a selection with which we would like to experiment. We can choose classifiers from the Scikit Learn docs, but be aware that while most algorithms are fast, a few take a bit longer to run.

Let’s try out a few:

from sklearn import ensemble, linear_model, naive_bayes, svm, tree

algorithms = [
    linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None),
    naive_bayes.MultinomialNB(),
    tree.DecisionTreeClassifier(max_depth=1000),
    ensemble.ExtraTreesClassifier(),
    svm.LinearSVC(),
]

Running this list with our optimisations we get:

Training log collection => 250587 data entries
Testing log collection => 27843 data entries

SGDClassifier
Accuracy: 97.38%

MultinomialNB
Accuracy: 98.64%

DecisionTreeClassifier
Accuracy: 95.33%

ExtraTreesClassifier
Accuracy: 99.15%

LinearSVC
Accuracy: 99.17%

So among these, the LinearSVC algorithm gives us the best results for our particular use case. Note that the type of feature data and how the features relate to the target prediction will play a significant role in deciding which algorithm is best suited.

During test runs the neural network MLPClassifier() showed even better results, although it is very slow to run.

The code is available on Github. It includes one or two enhancements, such as saving the model to a file for speedy loading and prediction at a later date.
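As a rough idea of what saving a model can look like, here is a minimal sketch using Python’s standard pickle module. This is an assumption for illustration and not necessarily how the Github version does it; the file name log_classifier.pkl and the tiny training set are made up.

```python
import os
import pickle
import tempfile

from sklearn import naive_bayes
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Train a tiny stand-in model on made-up lines (illustrative data only)
model = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                  ('clf', naive_bayes.MultinomialNB())])
model.fit(["wifi link up", "wifi scan started",
           "package install started", "package verified"],
          [0, 0, 1, 1])

# Hypothetical file name; write the fitted pipeline to disk
model_path = os.path.join(tempfile.mkdtemp(), "log_classifier.pkl")
with open(model_path, "wb") as model_file:
    pickle.dump(model, model_file)

# Later: load the model and predict without retraining
with open(model_path, "rb") as model_file:
    restored_model = pickle.load(model_file)

print(restored_model.predict(["wifi link down"]))
```

Because the whole pipeline is saved, the vectorizer’s vocabulary travels with the classifier. As with any pickle file, only load models from sources you trust.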

Conclusion

In this blog post we have seen that machine learning tasks like classification can be applied without having to either build complex architectures or implement our own algorithms with low level building blocks. Instead we can use a framework like Scikit Learn that comes with many built-in algorithms to train a model, and run predictions.

Along the way we saw that a familiar programming environment like Python provides us with all the basic tools we need to prepare the data for our algorithm, train a model, and run our predictions. We’ve also seen that we don’t need huge amounts of data to implement a simple log classifier, but can leverage data readily available to us.

If prior to this you believed that machine learning can only be approached with a PhD in math and statistics, I hope that this post has gone a little way to help change that perception.

Check out the related vlog by Maartens Lourens here

This blog is written exclusively by the OpenCredo team. We do not accept external contributions.
