Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Scala for Machine Learning, Second Edition

You're reading from   Scala for Machine Learning, Second Edition Build systems for data processing, machine learning, and deep learning

Arrow left icon
Product type Paperback
Published in Sep 2017
Publisher Packt
ISBN-13 9781787122383
Length 740 pages
Edition 2nd Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Patrick R. Nicolas Patrick R. Nicolas
Author Profile Icon Patrick R. Nicolas
Patrick R. Nicolas
Arrow right icon
View More author details
Toc

Table of Contents (21) Chapters Close

Preface 1. Getting Started 2. Data Pipelines FREE CHAPTER 3. Data Preprocessing 4. Unsupervised Learning 5. Dimension Reduction 6. Naïve Bayes Classifiers 7. Sequential Data Models 8. Monte Carlo Inference 9. Regression and Regularization 10. Multilayer Perceptron 11. Deep Learning 12. Kernel Models and SVM 13. Evolutionary Computing 14. Multiarmed Bandits 15. Reinforcement Learning 16. Parallelism in Scala and Akka 17. Apache Spark MLlib A. Basic Concepts B. References Index

Why Scala?

Like most functional languages, Scala provides developers and scientists with a toolbox to implement iterative computations that can be easily woven into a coherent dataflow. To some extent, Scala can be regarded as an extension of the popular map-reduce model for distributed computation of large amounts of data.

Note

Disclaimer

This section does not constitute a formal introduction or description of the features of Scala. It merely mentions some of its features that are valuable to machine learning practitioners. Experienced Scala developers may skip to the next section.

Among the capabilities of the language, the following features are deemed essential in machine learning and statistical analysis.

Scala as a functional language

There are many functional features in Scala which may unsettle software engineers with experience in object-oriented programming. This section deals specifically with monadic and functorial representations of data. Functors and monads are concepts defined in the field of mathematics known as category theory. Formerly:

  • A functor is a data type that defines how a transformation known as a map applies to it. Scala implements functors as type classes with a map method.
  • A monad is a wrapper around an existing data type. It applies a transformation to a data of wrapper type and returns a value of the same wrapper type. Scala implements monads as type classes with unit and flatMap methods. Monads extends functors in Scala.

Abstraction

Functors and monads are important concepts in functional programming.

Functors and monads are derived from category and group theory; they allow developers to create a high-level abstraction, as illustrated in the following Scala libraries:

In mathematics, a category M is a structure that is defined by the following:

  • Objects of some type {x e X, y Є Y, z Є Z, …}
  • Morphisms or maps applied to these objects x Є X, y Є Y, f: x -› y
  • Composition of morphisms f: x -› y, g: y -› z => g o f: x -› z

Covariant, contravariant functors, and bi-functors are well-understood concepts in algebraic topology that are related to manifold and vector bundles. They are commonly used in differential geometry for the generation of non-linear models.

Higher kinded types

Higher kinded types (HKTs) are abstractions of types. They generate a new type from existing types. Let's consider the following parameterized trait:

trait M[T] {  .  } 

A higher kinded type H over a trait M is defined as follows:

       trait H[M[_]]; class H[M[_]] 

Functors and monads are higher kinded types.

How are higher kinded types relevant to data analysis and machine learning?

Scientists define observations as sets or vectors of features.

Classification problems rely on the estimation of the similarity between vectors of observations. One technique consists of comparing two vectors by computing the normalized inner (or dot) product. A co-vector is defined as a linear map α of vector to the inner product (field).

Note

Inner product

M1: Definition of inner product <.> and co-vector α:

Higher kinded types

Let's define a vector as a constructor from any field, _ => Vector[_]. A co-vector is then defined as the mapping function of a vector to its field: Vector[_].

Let's then define a two-dimension (two types or fields) higher kinded structure, Hom, that can be defined as either a vector or a co-vector by fixing one of the two types:

type Hom[T] = {
  type Right[X] = Function1[X,T] // Co-vector
  type Left[X] = Function1[T,X]   // Vector
 }

Note

Tensors and manifolds

Vector and co-vector are classes of tensor (contravariant and covariant). Tensors (fields) are used in manifold learning non-linear models and in the generation of kernel functions. Manifolds are briefly introduced in the Manifolds section in. The topic of tensor fields in manifold learning is beyond the scope of this book.

The projections of the higher-kind Hom to Right or Left single parameter types are known as functors:

  • Covariant functor for the right projection
  • Contravariant functor for the left projection

Functors

A covariant functor is a mapping function, such as F: C => C, with the following properties:

  • If f: x -› y is a morphism on C then F(x) -› F(y) is also a morphism on C
  • If id: x -› x is the identity morphism on C then F(id) is also an identity morphism on C
  • If g: y -› z is also a morphism on C then F(g o f) = F(g) o F(f)

The definition of the covariant functor is F[U => V] := F[U] => F[V]. Its implementation in Scala is:

trait Functor[M[_]]{ def map[U,V](m: M[U])(f: U=>V): M[V] }

For example, let's consider an observation defined as an n dimension vector of type T, Obs[T]. The constructor for the observation can be represented as Function1[T,Obs]. Its functor, ObsFunctor, is implemented as:

trait ObsFunctor[T] extends Functor[(Hom[T])#Left] { self =>
  override def map[U,V](vu: Function1[T,U])(f: U =>V): 
    Function1[T,V] = f.compose(vu)
}

The functor is qualified as a covariant functor because the morphism is applied to the return type of the element of Obs, Function1[T, Obs]. The projection of the two parameters types Hom to a vector is implemented as (Hom[T])#Left.

A contravariant functor is a mapping function, F: C => C, with the following properties:

  • If f: x -› y is a morphism on C then F(x) -› F(y) is also a morphism on C
  • If id: x -› x is the identity morphism on C then F(id) is also an identity morphism on C
  • If g: y -› z is also a morphism on C then F(g o f) = F(f) o F(g)

The definition of the contravariant functor is F[U => V] := F[V] => F[U], as follows:

trait CoFunctor[M[_]]{ def map[U,V](m: M[U])(f: V=>U): M[V]}

Note that the input and output types in the morphism f are reversed from the definition of a covariant functor. The constructor for the co-vector can be represented as Function1[Obs,T]. Its functor, CoObsFunctor, is implemented as:

trait CoObsFunctor[T] extends CoFunctor[(Hom[T])#Right] {
  self =>
      override def map[U,V](vu: Function1[U,T])(f: V =>U): 
       Function1[V,T] = f.andThen(vu)
}

Monads

Monads are structures in algebraic topology related to category theory. Monads extend the concept of functors to allow composition known as the monadic composition of morphisms on a single type. They enable the chaining or weaving of computation into a sequence of steps sometimes known as a data pipeline. The collections bundled with the Scala standard library (List, Map…) are constructed as monads [1:1].

Monads provide the ability for those collections to do the following:

  • Create the collection
  • Transform the elements of the collection
  • Flatten nested collections

The following Scala definition of a monad as a trait illustrates the concept of a higher kinded Monad trait for type M:

trait Monad[M[_]] {
  def unit[T](a: T): M[T]   
  def map[U,V](m: M[U])(f U =>V): M[V]              
  def flatMap[U,V](m: M[U])(f: U =>M[V]): M[V] 
}

Monads are therefore critical in machine learning as they enable the composition of multiple data transformation functions into a sequence or workflow. This property is applicable to any type of complex scientific computation [1:2].

Note

Monadic composition of kernel functions

Monads are used in the composition of kernel functions in the Kernel functions monadic composition section in Chapter 12, Kernel Models and Support Vector Machines.

Scala as an object oriented language

Machine learning models are generated through sequences of tasks or dataflows that demand a modular design.

As an object-oriented programming language, Scala allows developers to do the following:

  • Define high-level component abstraction
  • Allow different developers to work concurrently on different components
  • Reuse code
  • Isolate functionality for easier debugging and testing (unit tests)

You may wonder how Scala fares as an object-oriented programming against Java.

Tip

Scala versus Java

Scala is the purest form of object oriented language than Java. It does not support static methods (static methods are methods of singletons) and primitive types.

One important facet of object oriented programming is the ability to change modules or implement functionality on the fly, without the need to recompile the client code. This technique is known as dependency injection. Scala supports dependency injection using a combination of abstract variables, self-referenced composition, and stackable traits [1:3]. One of the most commonly used dependency injection patterns, the cake pattern, is described in the Building workflows with mixins section in Chapter 2, Data Pipelines.

Scala as a scalable language

As seen previously, functors and monads enable the parallelization and chaining of data processing functions by leveraging the Scala higher-order methods. In terms of implementation, actors are one of the core elements that make Scala scalable. Actors provide Scala developers with a high level of abstraction to build scalable, distributed, and concurrent applications. Actors hide the nitty-gritty implementation of concurrency and the management of the underlying threads pool. Actors communicate through asynchronous immutable messages. A distributed computing Scala framework such as Akka or Apache Spark extends the capabilities of the Scala standard library to support computation on very large datasets. Akka and Apache Spark are described in detail in the last chapter of this book [1:4].

Concisely, a workflow is implemented as a sequence of activities or computational tasks. These tasks consist of higher-order Scala methods such as flatMap, map, fold, reduce, collect, join, or filter applied to a large collection of observations. Scala provides developers with the tools to partition datasets and execute the tasks through a cluster of actors. Scala also supports message dispatching and routing between local and remote actors. A developer may decide to deploy a workflow either locally or across multiple CPU cores and servers with very few code alterations.

The following figure visualizes the different elements of the definition and deployment of a workflow (or data pipeline):

Scala as a scalable language

Deployment of a workflow for model training as a distributed computation

In the preceding diagram, a controller, that is, the master node, manages the sequence of tasks 1 to 4 in a similar way to a scheduler. These tasks are actually executed over multiple worker nodes, and are implemented by actors. The master node or actor exchanges messages with the workers to manage the state of the execution of the workflow, as well as its reliability, as illustrated in the Scalability with actors section of Chapter 16, Parallelism with Scala and Akka. The high availability of these tasks is maintained through a hierarchy of supervising actors.

Tip

Domain-specific languages (DSLs)

Scala embeds DSLs natively. DSLs are syntactic layers built on top of Scala native libraries. DSLs allow software developers to abstract computation in terms that are easily understood by scientists. A notorious application of DSLs is the definition of the emulation of the syntax use in the MATLAB program, familiar to most data scientists.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image