Monday, March 16, 2015

Machine Learning with MOSEK Fusion: linear soft-margin Support Vector Machine

Machine-Learning (ML)  has become a common widespread tool in many applications that affect our everyday life. In many cases, at the very core of these techniques there is an optimization problem. In this post we will show how one of the most common ML technique, the Support-Vector Machine (SVM) is defined as the solution of a conic optimization problem.

The aim of this post is two-fold:
  1. To show that simple SVM models can be efficiently solved with a commercial general-purpose solver as MOSEK
  2. That the MOSEK Fusion API yield  a very compact and expressive code.
We are not claiming that a general-purpose solver is the tool of choice for solving SVMs problems. There is a huge literature about specialized algorithms, and software packages (see for instance LibSVM) that exploit their peculiar structure. 

In words, the basic SVM model can be stated as:
"We are given a set of points $m$ in a $n$-dimensional space, partitioned in two groups. Find, if any, the separating hyperplane of the two subsets with the largest margin, i.e. as far as possible from the points."

In mathematical form, that means to determine an hypeplane $w^T x = b$ that separate  two sets of points leaving the largest margin possible. It can be proved that the margin is given by $2/\|w\|$.

Therefore, we need to solve the problem of maximizing  $2/\|w\|$ with respect of $w,b$ with the constraints that points of the same class must lie on the same side of the hyperplane. Denoting with $x_i\in\mathbb{R}^n$ the i-th observation and assuming that each point is given a label $y_i\in \{-1,+1\}$, it is easy to see that the separation is equivalent to:

     y_i (w^T x_i -b)\geq 1.

A simple example in two dimension is the following:

It is easy to show (see [1]) that the separating hyperplane is the solution of the following optimization problem:

\begin{array} {lll}
\min & \frac{1}{2} \| w \|^2  &\\
& y_i (w^T x_i -b ) \geq 1& i=1,\ldots,m

If a solution exists, $w,b$ define the separating hyperplane and the sign of $w^Tx -b$ can be used to decide the class in which a point $x$ falls.

To allow more flexibility the soft-margin SVM classifier is often used instead. It allows for violation of the classification. To this extent a non-negative slack variable is added to each linear constraint and penalized in the objective function.

\begin{array} {lll}
\min_{b,w} & \frac{1}{2} \| w \|^2  + C \sum_{i=1}^{m} \xi_i&\\
& y_i (w^T x_i -b ) \geq 1 - \xi_ i& i=1,\ldots,m\\
& \xi_i \geq 0 & i=1,\ldots,m

In matrix form we have 

\begin{array} {lll}
\min_{b, w, \xi} & \frac{1}{2} \| w \|^2  + C \mathbf{e}^T \xi\\
&    y \star ( X w - b \mathbf{e})  + \xi \geq \mathbf{e}\\
& \xi \geq 0

where $\star$ denotes the component-wise product, and $\mathbf{e}$ a vector with all components equal to one. The constant $C>0$ acts both as scaling factor and as weight. Varying $C$ yields different trade-off between accuracy and robustness.

Implementing the matrix formulation of the soft-margin SVM in Fusion is very easy. We only need to cast the problem in conic form, which in this case only involves converting the quadratic term of the objective function in a conic constraint:

\begin{array} {ll}
\min_{b, w, \xi,t} & t  + C \mathbf{e}^T \xi& \\
&    \xi + y \star ( X w - b \mathbf{e})  \geq \mathbf{e}\\
& (1,t,w) \in \mathcal{Q_r}^{n+2}\\
& \xi \geq 0

where $\mathcal{Q_r}$ denotes a rotated cone of dimension $n+2$.

In our Fusion implementation we will allow the user to specify a set of values for $C$, and we will solve a problem for each one. We will show Python code for sake of simplicity. However, much better performance can be achieved with the Java or C# API.

Let assume now we are given an array y of labels, a matrix X where input data are stored row-wise and a set of values C for  we want to test. The implementation in Fusion of our conic model is the following

The code is so expressive that it deserves very few comments:

  1.  A sequence of problem can be solved just changing the objective function, as in loop starting in line 19: in this most of the model is build only once.
  2. Expression in line 13 shows how a linear expression can be store and reused.
  3. The construct Variable.repeat(b,m)  logically repeats variable $b$, i.e. forms an array of the form $(b,b,b,\ldots,b)^T \in \mathbb{R}^m$.
  4. The solution is recovered by the level method: if an optimal solution hasn't been found, an exception is thrown.

Let's make few tests. We generate two sets of random two dimensional points each from a Gaussian distribution: the first set centered at $(1.0,1.0)$; the second at $(-1.0,-1.0)$. We experiment standard deviation $\sigma$ of $1/2$ and $1$.

With $\sigma=1/2$ we obtain a separable sets of points and for $C=1$ we have  the following

For $\sigma=1$ separability is usually not possible. Larger value of $C$ emphasize good classification, but reduces the margin and hence robustness. For instance, we may obtain the following:

The black and red lines correspond to $C=1$ and $C=2^{10}$.

MOSEK solves this problems in almost no time:

Further improvements can be obtained exploiting sparsity: in most cases input data are supposed to be sparse, i.e. most entries in the X matrix are zero. If this is the case, then Fusion, and therefore MOSEK, can exploit such a property. You only need to feed the function with a Fusion sparse matrix:

where Xi,Xj,Xv are the arrays that store the sparse representation of X in triplet form. This will save memory and reduce the computational burden.

So to summarize: you can easily build your own linear SVM classifier with MOSEK Fusion! Give it a try and play with it. Fusion demonstrates again its simplicity and power of expressiveness.

[1] Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine learning 20.3 (1995): 273-297.