Data Engineering with Scala and Spark

Scala Essentials for Data Engineers

Welcome to the world of data engineering with Scala. But why Scala? The following are some of the reasons for learning Scala:

Scala provides type safety
Big corporations such as Netflix and Airbnb have a lot of data pipelines written in Scala
Scala is native to Spark
Scala allows data engineers to adopt a software engineering mindset

Scala is a high-level general-purpose programming language that runs on a standard Java platform. It was created by Martin Odersky in 2001. The name Scala stands for scalable language, and it provides excellent support for both object-oriented and functional programming styles.

This chapter is meant as a quick introduction to concepts that the subsequent chapters build upon. Specifically, this chapter covers the following topics:

Understanding functional programming
Understanding objects, classes, and traits
Higher-order functions (HOFs)
Examples of HOFs from the Scala collection library
Understanding polymorphic functions
Variance
Option types
Collections
Pattern matching
Implicits in Scala

Understanding functional programming

Functional programming is based on the principle that programs are constructed using only pure functions. A pure function does not have any side effects and only returns a result. Some examples of side effects are modifying a variable, modifying a data structure in place, and performing I/O. We can think of a pure function as just like a regular algebraic function.

An example of a pure function is the length function on a string object. It only returns the length of the string and does nothing else, such as mutating a variable. Similarly, an integer addition function that takes two integers and returns an integer is a pure function.

Two important aspects of functional programming are referential transparency (RT) and the substitution model. An expression is referentially transparent if all of its occurrences can be substituted by the result of the expression without altering the meaning of the program.

In the following example, Example 1.1, we set x and then use it to set r1 and r2, both of which have the same value:

scala> val x: String = "hello"
x: String = hello
scala> val r1 = x + " world!"
r1: String = hello world!
scala> val r2 = x + " world!"
r2: String = hello world!

Example 1.1

Now, if we replace x with the expression referenced by x, r1 and r2 will be the same. In other words, the expression hello is referentially transparent.

Example 1.2 shows the output from a Scala interpreter:

scala> val r1 = "hello" + " world!"
r1: String = hello world!
scala> val r2 = "hello" + " world!"
r2: String = hello world!

Example 1.2

Let’s now look at the following example, Example 1.3, where x is an instance of StringBuilder instead of String:

scala> val x = new StringBuilder("who")
x: StringBuilder = who
scala> val y = x.append(" am i?")
y: StringBuilder = who am i?
scala> val r1 = y.toString
r1: String = who am i?
scala> val r2 = y.toString
r2: String = who am i?

Example 1.3

If we substitute y with the expression it refers to (val y = x.append(" am i?")), r1 and r2 will no longer be equal:

scala> val x = new StringBuilder("who")
x: StringBuilder = who
scala> val r1 = x.append(" am i?").toString
r1: String = who am i?
scala> val r2 = x.append(" am i?").toString
r2: String = who am i? am i?

Example 1.4

So, the expression x.append(" am i?") is not referentially transparent.

One of the advantages of the functional programming style is it allows you to apply local reasoning without having to worry about whether it updates any globally accessible mutable state. Also, since no variable in the global scope is updated, it considerably simplifies building a multi-threaded application.

Another advantage is pure functions are also easier to test as they do not depend on any state apart from the inputs supplied, and they generate the same output for the same input values.

We won’t delve deep into functional programming as it is outside of the scope of this book. Please refer to the Further reading section for additional material on functional programming. In the rest of this chapter, we will provide a high-level tour of some of the important language features that the subsequent chapters build upon.

In this section, we looked at a very high-level introduction to functional programming. Starting with the next section, we will look at Scala language features that enable both functional and object-oriented programming styles.

Understanding objects, classes, and traits

In this section, we are going to look at classes, traits, and objects. If you have used Java before, then some of the topics covered in this section will look familiar. However, there are several differences too. For example, Scala provides singleton objects, which automatically create a class and a single instance of that class in one go. Another example is Scala has case classes, which provide great support for pattern matching, allow you to create instances without the new keyword, and provide a default toString implementation that is quite handy when printing to the console.

We will first look at classes, followed by objects, and then wrap this section up with a quick tour of traits.

Classes

A class is a blueprint for objects, which are instances of that class. For example, we can create a Point class using the following code:

class Point(val x: Int, val y: Int) {
  def add(that: Point): Point = new Point(x + that.x, y + that.y)
  override def toString: String = s"($x, $y)"
}

Example 1.5

The Point class has four members—two immutable variables, x and y, as well as two methods, add and toString. We can create instances of the Point class as follows:

scala> val p1 = new Point(1,1)
p1: Point = (1, 1)
scala> val p2 = new Point(2,3)
p2: Point = (2, 3)

Example 1.6

We can then create a new instance, p3, by adding p1 and p2, as follows:

scala> val p3 = p1 add p2
p3: Point = (3, 4)

Example 1.7

Scala supports the infix notation, characterized by the placement of operators between operands, and automatically converts p1 add p2 to p1.add(p2). Another way to define the Point class is using a case class, as shown here:

case class Point(x: Int, y: Int) {
  def add(that: Point): Point = new Point(x + that.x, y + that.y)
}

Example 1.8

A case class automatically adds a factory method with the name of the class, which enables us to leave out the new keyword when creating an instance. A factory method is used to create instances of a class without requiring us to explicitly call the constructor method. Refer to the following example:

scala> val p1 = Point(1,1)
p1: Point = Point(1,1)
scala> val p2 = Point(2,3)
p2: Point = Point(2,3)

Example 1.9

The compiler also adds default implementations of various methods such as toString and hashCode, which the regular class definition lacks. So, we did not have to override the toString method, as was done earlier, and yet both p1 and p2 were printed neatly on the console (Example 1.9).

All arguments in the parameter list of a case class automatically get a val prefix, which makes them parametric fields. A parametric field is a shorthand that defines a parameter and a field with the same name.

To better understand the difference, let’s look at the following example:

scala> case class Point1(x: Int, y: Int) //x and y are parametric fields
defined class Point1
scala> class Point2(x: Int, y: Int) //x and y are regular parameters
defined class Point2
scala> val p1 = Point1(1, 2)
p1: Point1 = Point1(1,2)
scala> val p2 = new Point2(3, 4)
p2: Point2 = Point2@203ced18

Example 1.10

If we now try to access p1.x, it will work because x is a parametric field, whereas trying to access p2.x will result in an error. Example 1.11 illustrates this:

scala> println(p1.x)
1
scala> println(p2.x)
<console>:13: error: value x is not a member of Point2
       println(p2.x)
                  ^

Example 1.11

Trying to access p2.x will result in a compile error, value x is not a member of Point2. Case classes also have excellent support for pattern matching, as we will see in the Understanding pattern matching section.

Scala also provides an abstract class, which, unlike a regular class, can contain abstract methods. For example, we can define the following hierarchy:

abstract class Animal
abstract class Pet extends Animal {
  def name: String
}
class Dog(val name: String) extends Pet {
  override def toString = s"Dog($name)"
}
scala> val pluto = new Dog("Pluto")
pluto: Dog = Dog(Pluto)

Example 1.12

Animal is the base class. Pet extends Animal and declares an abstract method, name. Dog extends Pet and uses a parametric field, name (it is both a parameter as well as a field). Because Scala uses the same namespace for fields and methods, this allows the field name in the Dog class to provide a concrete implementation of the abstract method name in Pet.

Object

Unlike Java, Scala does not support static members in classes; instead, it has singleton objects. A singleton object is defined using the object keyword, as shown here:

class Point(val x: Int, val y: Int) {
  // new keyword is not required to create a Point object
  // apply method from companion object is invoked
  def add(that: Point): Point = Point(x + that.x, y + that.y)
  override def toString: String = s"($x, $y)"
}
object Point {
  def apply(x: Int, y: Int) = new Point(x, y)
}

Example 1.13

In this example, the Point singleton object shares the same name with the class and is called that class’s companion object. The class is called the companion class of the singleton object. For an object to qualify as a companion object of a given class, it needs to be in the same source file as the class itself.

Please note that the add method does not use the new keyword on the right-hand side. Point(x1, y1) is de-sugared into Point.apply(x1, y1), which returns a Point instance.

Singleton objects are also used to write an entrypoint for Scala applications. One option is to provide an explicit main method within the singleton object, as shown here:

object SampleScalaApplication {
  def main(args: Array[String]): Unit = {
    println(s"This is a sample Scala application")
  }
}

Example 1.14

The other option is to extend the App trait, which provides a main method implementation. We will cover traits in the next section. You can also refer to the Further reading section (the third point) for more information:

 object SampleScalaApplication extends App {
  println(s"This is a sample Scala application")
}

Example 1.15

Trait

Scala also has traits, which are used to define rich interfaces as well as stackable modifications. You can read more stackable modifications in the Further reading section (the fourth point) Unlike class inheritance, where each class inherits from just one super class, a class can mix in any number of traits. A trait can have abstract as well as concrete members. Here is a simplified example of the Ordered trait from the Scala standard library:

trait Ordered[T] {
  // compares receiver (this) with argument of the same type
  def compare(that: T): Int
  def <(that: T): Boolean = (this compare that) < 0
  def >(that: T): Boolean = (this compare that) > 0
  def <=(that: T): Boolean = (this compare that) <= 0
  def >=(that: T): Boolean = (this compare that) >= 0
}

Example 1.16

The Ordered trait takes a type parameter, T, and has an abstract method, compare. All of the other methods are defined in terms of that method. A class can add the functionalities defined by <, >, and so on, just by defining the compare method. The compare method should return a negative integer if the receiver is less than the argument, positive if the receiver is greater than the argument, and 0 if both objects are the same.

Going back to our Point example, we can define a rule to say that a point, p1, is greater than p2 if the distance of p1 from the origin is greater than that of p2:

case class Point(x: Int, y: Int) extends Ordered[Point] {
  def add(that: Point): Point = new Point(x + that.x, y + that.y)
  def compare(that: Point) = (x ^ 2 + y ^ 2) ^ 1 / 2 - (that.x ^ 2 + that.y ^ 2) ^ 1 / 2
}

Example 1.17

With the definition of compare now in place, we can perform a comparison between two arbitrary points, as follows:

scala> val p1 = Point(1,1)
p1: Point = Point(1,1)
scala> val p2 = Point(2,2)
p2: Point = Point(2,2)
scala> println(s"p1 is greater than p2: ${p1 > p2}")
p1 is greater than p2: false
example 1.18

In this section, we looked at objects, classes, and traits. In the next section, we are going to look at HOFs.

Working with higher-order functions (HOFs)

In Scala, functions are first-class citizens, which means function values can be assigned to variables, passed to functions as arguments, or returned by a function as a value. HOFs take one or more functions as arguments or return a function as a value.

A method can also be passed as an argument to an HOF because the Scala compiler will coerce a method into a function of the required type. For example, let’s define a function literal and a method, both of which take a pair of integers, perform an operation, and then return an integer:

//function literal
val add: (Int, Int) => Int = (x, y) => x + y
//a method
def multiply(x: Int, y: Int): Int = x * y

Example 1.19

Let’s now define a method that takes two integer arguments and performs an operation, op, on them:

def op(x: Int, y: Int) (f: (Int, Int) => Int): Int = f(x,y)

Example 1.20

We can pass any function (or method) of type (Int, Int) => Int to op, as the following example illustrates:

scala> op(1,2)(add)
res15: Int = 3
scala> op(2,3)(multiply)
res16: Int = 6

Example 1.21

This ability to pass functions as parameters is extremely powerful as it allows us to write generic code that can execute arbitrary user-supplied functions. In fact, many of the methods defined in the Scala collection library require functions as arguments, as we will see in the next section.

Examples of HOFs from the Scala collection library

Scala collections provide transformers that take a base collection, run some transformations over each of the collection’s elements, and return a new collection. For example, we can transform a list of integers by doubling each of its elements using the map method, which we will cover in a bit:

scala> List(1,2,3,4).map(_ * 2)
res17: List[Int] = List(2, 4, 6, 8)

Example 1.22

A traversable trait, which is a base trait for all kinds of Scala collections, implements behaviors common to all collections, in terms of a foreach method, with the following signature:

def foreach[U](f: A => U): Unit

Example 1.23

The argument f is a function of type A => U, which is shorthand for Function1[A,U], and thus foreach is an HOF. This is an abstract method that needs to be implemented by all classes that mix in Traversable. The return type is Unit, which means this method does not return any meaningful value and is primarily used for side effects.

Here is an example that prints the elements of a List:

scala> /** let's start with a foreach call that prints the numbers in a list
     |   * List(1,2,3,4).foreach((i: Int) => println(i))
     |   * we can skip the type argument and let Scala infer it
     |   * List(1,2,3,4).foreach( i => println(i))
     |   * Scala provides a shorthand to replace arguments using _
     |   * if the arguments are used only once on the right side
     |   * List(1,2,3,4).foreach(println(_))
     |   * finally Scala allows to leave the argument altogether
     |   * if there is only one argument used on the right side
     |   */
     | List(1,2,3,4).foreach(println)
1
2
3
4

Example 1.24

For the rest of the examples, we will continue to use the List collection type, but they are available for other types of collections, such as Array, Map, and Set.

map is similar to foreach, but instead of returning a unit, it returns a collection by applying the function f to each element of the base collection. Here is the signature for List[A]:

final def map[B](f: (A) ⇒ B): List[B]

Example 1.25

Using the list from the previous example, if we want to double each of the elements in the list, but return a list of Doubles instead of Ints, it can be achieved by using the following:

scala> List(1,2,3,4).map(_ * 2.0)
res22: List[Double] = List(2.0, 4.0, 6.0, 8.0)

Example 1.26

The preceding expression returns a list of Double and can be chained with foreach to print the values contained in the list:

scala> List(1,2,3,4).map(_ * 2.0).foreach(println)
2.0
4.0
6.0
8.0

Example 1.27

A close cousin of map is flatMap, which comprises of two parts—map and flatten. Before looking into flatMap, let’s look at flatten:

//converts a list of traversable collections into a list
//formed by the elements of the traversable collections
def flatten[B]: List[B]

Example 1.28

As the name suggests, it flattens the inner collections:

scala> List(Set(1,2,3), Set(4,5,6)).flatten
res24: List[Int] = List(1, 2, 3, 4, 5, 6)

Example 1.29

Now that we have seen what flatten does, let’s go back to flatMap.

Let’s say that for each element of List(1,2,3,4), we want to create List of elements from 0 to that number (both inclusive) and then combine all of those individual lists into a single list. Our first pass at it would look like the following:

scala> List(1,2,3,4).map(0 to _).flatten
res25: List[Int] = List(0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4)

Example 1.30

With flatMap, we can achieve the same result in one step:

scala> List(1,2,3,4).flatMap(0 to _)
res26: List[Int] = List(0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4)

Example 1.31

Scala collections also provide filter, which accepts a function that returns a Boolean as an argument, which is then used to filter elements of a given collection:

def filter(p: (A) ⇒ Boolean): List[A]

Example 1.32

For example, to filter all of the even integers from List of numbers from 1 to 100, try the following:

scala> List.tabulate(100)(_ + 1).filter(_ % 2 == 0)
res27: List[Int] = List(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100)

Example 1.33

There is also withFilter, which provides performance benefits over filter through the lazy evaluation of intermediate collections. It is part of the TraversableLike trait, with the FilterMonadic trait providing the abstract definition:

trait FilterMonadic[+A, +Repr] extends Any {
  //includes map, flatMap and foreach but are skipped here
  def withFilter(p: A => Boolean): FilterMonadic[A, Repr]
}

Example 1.34

TraversableLike defines the withFilter method through a member class, WithFilter, that extends FilterMonadic:

def withFilter(p: A => Boolean): FilterMonadic[A, Repr] = new WithFilter(p)
class WithFilter(p: A => Boolean) extends FilterMonadic[A, Repr] {
  // implementation of map, flatMap and foreach skipped here
  def withFilter(q: A => Boolean): WithFilter = new WithFilter(x =>
  p(x) && q(x)
  )
}

Example 1.35

Please note that withFilter returns an object of type FilterMonadic, which only has map, flatMap, foreach, and withFilter. These are the only methods that can be chained after a call to withFilter. For example, the following will not compile:

List.tabulate(50)(_ + 1).withFilter(_ % 2 == 0).forall(_ % 2 == 0)

Example 1.36

It is quite common to have a sequence of flatMap, filter, and map chained together and Scala provides syntactic sugar to support that through for comprehensions. To see it in action, let’s consider the following Person class and its instances:

case class Person(firstName: String, isFemale: Boolean, children: Person*)
val bob = Person("Bob", false)
val jennette = Person("Jennette", true)
val laura = Person("Laura", true)
val jean = Person("Jean", true, bob, laura)
val persons = List(bob, jennette, laura, jean)

Example 1.37

Person* represents a variable argument of type Person. A variable argument of type T needs to be the last argument in a class definition or method signature and accepts zero, one, or more instances of type T.

Now say we want to get pairs of mother and child, which would be (Jean, Bob) and (Jean, Laura). Using flatMap, filter, and map we can write it as follows:

scala> persons.filter(_.isFemale).flatMap(p => p.children.map(c => (p.firstName, c.firstName)))
res32: List[(String, String)] = List((Jean,Bob), (Jean,Laura))

Example 1.38

The preceding expression does its job, but it is not quite easy to understand what is happening. This is where for comprehension comes to the rescue:

scala> for {
     |   p <- persons
     |   if p.isFemale
     |   c <- p.children
     | } yield (p.firstName, c.firstName)
res33: List[(String, String)] = List((Jean,Bob), (Jean,Laura))

Example 1.39

It is much easier to understand what this snippet of code does. Behind the scenes, the Scala compiler will convert this expression into the first one (the only difference being filter will be replaced with withFilter).

Scala also provides methods to combine the elements of a collection using the fold and reduce families of functions. The primary difference between the two can be understood by comparing the signatures of foldLeft and reduceLeft:

def foldLeft[B](z: B)(op: (B, A) ⇒ B): B
def reduceLeft[A1 >: A](op: (A1, A1) ⇒ A1): A1

Example 1.40

Both of these methods take a binary operator to combine the elements from left to right. However, foldLeft takes a zero-argument, z, of type B (this value is returned if List is empty), and the output type can differ from the types of the elements in List. On the other hand, reduceLeft requires A1 to be a supertype of A (>: signifies a lower bound). So, we can sum up List[Int] and return the value as Double using foldLeft, as follows:

scala> List(1,2,3,4).foldLeft[Double](0) ( _ + _ )
res34: Double = 10.0

Example 1.41

We cannot do the same with reduceLeft (since Double is not a supertype of Int). Trying to do so will raise a compile-time error of type arguments [Double] do not conform to method reduce's type parameter bounds [A1 >: Int]:

scala> List(1,2,3,4).reduce[Double] ( _ + _ )
<console>:12: error: type arguments [Double] do not conform to method reduce's type parameter bounds [A1 >: Int]
       List(1,2,3,4).reduce[Double] ( _ + _ )
                           ^

Example 1.42

foldRight and reduceRight combine the elements of a collection from right to left. There is also fold and reduce, and for both, the order in which the elements are combined is unspecified and may be nondeterministic.

In this section, we have seen several examples of HOFs from the Scala collection library. By now, you should have noticed that each of these functions uses type parameters. These are called polymorphic functions, which is what we will cover next.

Variance

As mentioned earlier, functions are first-class objects in Scala. Scala automatically converts function literals into objects of the FunctionN type (N = 0 to 22). For example, consider the following anonymous function:

val f: Int => Any = (x: Int) => x

Example 1.45

This function will be converted automatically to the following:

val f = new Function1[Int, Any] {def apply(x: Int) = x}

Example 1.46

Please note that the preceding syntax represents an object of an anonymous class that extends Function1[Int, Any] and implements its abstract apply method. In other words, it is equivalent to the following:

class AnonymousClass extends Function1[Int, Any] {
  def apply(x: Int): Any = x
}
val f = new AnonymousClass

Example 1.47

If we refer to the type signature of the Function1 trait, we would see the following:

Function1[-T1, +T2]

Example 1.48

T1 represents the argument type and T2 represents the return type. The type variance of T1 is contravariant and that of T2 is covariant. In general, covariance designed by + means if a class or trait is covariant in its type parameter T, that is, C[+T], then C[T1] and C[T2] will adhere to the subtyping relationship between T1 and T2. For example, since Any is a supertype of Int, C[Any] will be a supertype of C[Int].

The order is reversed for contravariance. So, if we have C[-T], then C[Int] will be a supertype of C[Any].

Since we have Function1[-T1, +R], that would then mean type Function1[Int, Any] will be a supertype of, say, Function1[Any, String].

To see it in action, let’s define a method that takes a function of type Int => Any and returns Unit:

def caller(op: Int => Any): Unit = List
  .tabulate(5)(i => i + 1)
  .foreach(i => print(s"$i "))

Example 1.49

Let’s now define two functions:

scala> val f1: Int => Any = (x: Int) => x
f1: Int => Any = $Lambda$9151/1234201645@34f561c8
scala> val f2 : Any => String = (x: Any) => x.toString
f2: Any => String = $Lambda$9152/1734317897@699fe6f6

Example 1.50

A function (or method) with a parameter of type T can be invoked with an argument that is either of type T or its subtype. And since Int => Any is a supertype of Any => String, we should be able to pass both of these functions as arguments. As can be seen, both of them indeed work:

scala> caller(f1)
1 2 3 4 5
scala> caller(f2)
1 2 3 4 5

Example 1.51

Option type

Scala’s option type represents optional values. These values can be of two forms: Some(x), where x is the actual value, or None, which represents a missing value. Many of the Scala collection library methods return a value of the Option[T] type. The following are a few examples:

scala> List(1, 2, 3, 4).headOption
res45: Option[Int] = Some(1)
scala> List(1, 2, 3, 4).lastOption
res46: Option[Int] = Some(4)
scala> List("hello,", "world").find(_ == "world")
res47: Option[String] = Some(world)
scala> Map(1 -> "a", 2 -> "b").get(3)
res48: Option[String] = None

Example 1.52

Option also has a rich API and provides many of the functions from the collection library API through an implicit conversion function, option2Iterable, in the companion object. The following are a few examples of methods supported by the Option type:

scala> Some("hello, world!").headOption
res49: Option[String] = Some(hello, world!)
scala> None.getOrElse("Empty")
res50: String = Empty
scala> Some("hello, world!").map(_.replace("!", ".."))
res51: Option[String] = Some(hello, world..)
scala> Some(List.tabulate(5)(_ + 1)).flatMap(_.headOption)
res52: Option[Int] = Some(1)

Example 1.53

Collections

Scala comes with a powerful collection library. Collections are classified into mutable and immutable collections. A mutable collection can be updated in place, whereas an immutable collection never changes. When we add, remove, or update elements of an immutable collection, a new collection is created and returned, keeping the old collection unchanged.

All collection classes are found in the scala.collection package or one of its subpackages: mutable, immutable, and generic. However, for most of our programming needs, we refer to collections in either the mutable or immutable package.

A collection in the scala.collection.immutable package is guaranteed to be immutable and will never change after it is created. So, we will not have to make any defensive copies of an immutable collection, since accessing a collection multiple times will always yield the same set of elements.

On the other hand, collections in the scala.collection.mutable package provide methods that can update a collection in place. Since these collections are mutable, we need to defend against any inadvertent update, p, by other parts of the code base.

By default, Scala picks immutable collections. This easy access is provided through the Predef object, which is implicitly imported into every Scala source file. Refer to the following example:

object Predef {
  type Set[A] = immutable.Set[A]
  type Map[A, +B] = immutable.Map[A, B]
  val Map = immutable.Map
  val Set = immutable.Set
  // ...
}

Example 1.54

The Traversable trait is the base trait for all of the collection types. This is followed by Iterable, which is divided into three subtypes: Seq, Set, and Map. Both Set and Map provide sorted and unsorted variants. Seq, on the other hand, has IndexedSeq and LinearSeq. There is quite a bit of similarity among all these classes. For instance, an instance of any collection can be created by the same uniform syntax, writing the collection class name followed by its elements:

Traversable(1, 2, 3)
Map("x" -> 24, "y" -> 25, "z" -> 26)
Set("red", "green", "blue")
SortedSet("hello", "world")
IndexedSeq(1.0, 2.0)
LinearSeq(a, b, c)

Example 1.55

The following is the hierarchy for scala.collection.immutable collections taken from the docs.scala-lang.org website.

Figure 1.1 – Scala collection hierarchy

The Scala collection library is very rich and has various collection types suited to specific programming needs. If you want to delve deep into the Scala collection library, please refer to the Further reading section (the fifth point).

In this section, we looked at the Scala collection hierarchy. In the next section, we will gain a high-level understanding of pattern matching.

Understanding pattern matching

Scala has excellent support for pattern matching. The most prominent use is the match expression, which takes the following form:

selector match { alternatives }

selector is the expression that the alternatives will be tried against. Each alternative starts with the case keyword and includes a pattern, an arrow symbol =>, and one or more expressions, which will be evaluated if the pattern matches. The patterns can be of various types, such as the following:

Wildcard patterns
Constant patterns
Variable patterns
Constructor patterns
Sequence patterns
Tuple patterns
Typed patterns

Before going through each of these pattern types, let’s define our own custom List:

trait List[+A]
case class Cons[+A](head: A, tail: List[A]) extends List[A]
case object Nil extends List[Nothing]
object List {
  def apply[A](as: A*): List[A] = if (as.isEmpty) Nil else Cons(as.head, apply(as.tail: _*))
}

Example 1.56

Wildcard patterns

The wildcard pattern (_) matches any object and is used as a default, catch-all alternative. Consider the following example:

scala> def emptyList[A](l: List[A]): Boolean = l match {
     |   case Nil => true
     |   case _   => false
     | }
emptyList: [A](l: List[A])Boolean
scala> emptyList(List(1, 2))
res8: Boolean = false

Example 1.57

A wildcard can also be used to ignore parts of an object that we do not care about. Refer to the following code:

scala> def threeElements[A](l: List[A]): Boolean = l match {
     |   case Cons(_, Cons(_, Cons(_, Nil))) => true
     |   case _                            => false
     | }
threeElements: [A](l: List[A])Boolean
scala> threeElements(List(true, false))
res11: Boolean = false
scala> threeElements(Nil)
res12: Boolean = false
scala> threeElements(List(1, 2, 3))
res13: Boolean = true
scala> threeElements(List("a", "b", "c", "d"))
res14: Boolean = false

Example 1.58

In the preceding example, the threeElements method checks whether a given list has exactly three elements. The values themselves are not needed and are thus discarded in the pattern match.

Constant patterns

A constant pattern matches only itself. Any literal can be used as a constant – 1, true, and hi are all constant patterns. Any val or singleton object can also be used as a constant. The emptyList method from the previous example uses Nil to check whether the list is empty.

Variable patterns

Like a wildcard, a variable pattern matches any object and is bound to it. We can then use this variable to refer to the object:

scala> val ints = List(1, 2, 3, 4)
ints: List[Int] = Cons(1,Cons(2,Cons(3,Cons(4,Nil))))
scala> ints match {
     |   case Cons(_, Cons(_, Cons(_, Nil))) => println("A three element list")
     |   case l => println(s"$l is not a three element list")
     | }
Cons(1,Cons(2,Cons(3,Cons(4,Nil)))) is not a three element list

Example 1.59

In the preceding example, l is bound to the entire list, which then is printed to the console.

Constructor patterns

A constructor pattern looks like Cons(_, Cons(_, Cons(_, Nil))). It consists of the name of a case class (Cons), followed by a number of patterns in parentheses. These extra patterns can themselves be constructor patterns, and we can use them to check arbitrarily deep into an object. In this case, checks are performed at four levels.

Sequence patterns

Scala allows us to match against sequence types such as Seq, List, and Array among others. It looks similar to a constructor pattern. Refer to the following:

scala> def thirdElement[A](s: Seq[A]): Option[A] = s match {
     |   case Seq(_, _, a, _*) => Some(a)
     |   case _            => None
     | }
thirdElement: [A](s: Seq[A])Option[A]
scala> val intSeq = Seq(1, 2, 3, 4)
intSeq: Seq[Int] = List(1, 2, 3, 4)
scala> thirdElement(intSeq)
res16: Option[Int] = Some(3)
scala> thirdElement(Seq.empty[String])
res17: Option[String] = None

Example 1.60

As the example illustrates, thirdElement returns a value of type Option[A]. If a sequence has three or more elements, it will return the third element, whereas for any sequence with less than three elements, it will return None. Seq(_, _, a, _*) binds a to the third element if present. The _* pattern matches any number of elements.

Tuple patterns

We can pattern match against tuples too:

scala> val tuple3 = (1, 2, 3)
tuple3: (Int, Int, Int) = (1,2,3)
scala> def printTuple(a: Any): Unit = a match {
     |   case (a, b, c) => println(s"Tuple has $a, $b, $c")
     |   case _     =>
     | }
printTuple: (a: Any)Unit
scala> printTuple(tuple3)
Tuple has 1, 2, 3

Example 1.61

Running the preceding program will print Tuple has 1, 2, 3 to the console.

Typed patterns

A typed pattern allows us to check types in the pattern match and can be used for type tests and type casts:

scala> def getLength(a: Any): Int =
     |   a match {
     |     case s: String    => s.length
     |     case l: List[_]   => l.length //this is List from Scala collection library
     |     case m: Map[_, _] => m.size
     |     case _            => -1
     |   }
getLength: (a: Any)Int
scala> getLength("hello, world")
res3: Int = 12
scala> getLength(List(1, 2, 3, 4))
res4: Int = 4
scala> getLength(Map.empty[Int, String])
res5: Int = 0

Example 1.62

Please note that the argument a of type Any does not support methods such as length or size in the result expression. Scala automatically applies a type test and a type cast to match the target type. For example, case s: String => s.length is equivalent to the following snippet:

if (s.isInstanceOf[String]) {
  val x = s.asInstanceOf[String]
  x.length
}

Example 1.63

One important thing to note, though, is that Scala does not maintain type arguments during runtime. So, there is no way to check whether list has all integer elements or not. For example, the following will print A list of String to the console. The compiler will emit a warning to alert about the runtime behavior. Arrays are the only exception because the element type is stored with the array value:

scala> List.fill(5)(0) match {
     |   case _: List[String] => println("A list of String")
     |   case _           =>
     | }
<console>:13: warning: fruitless type test: a value of type List[Int] cannot also be a List[String] (the underlying of List[String]) (but still might match its erasure)
         case _: List[String] => println("A list of String")
                 ^
A list of String

Example 1.64

Implicits in Scala

Scala provides implicit conversions and parameters. Implicit conversion to an expected type is the first place the compiler uses implicits. For example, the following works:

scala> val d: Double = 2
d: Double = 2.0

Example 1.65

This works because of the following implicit method definition in the Int companion object (it was part of Predef prior to 2.10.x):

implicit def int2double(x: Int): Double = x.toDouble

Example 1.66

Another application of implicit conversion is the receiver of a method call. For example, let’s define a Rational class:

scala> class Rational(n: Int, d: Int) extends Ordered[Rational] {
     |
     |   require(d != 0)
     |   private val g = gcd(n.abs, d.abs)
     |   private def gcd(a: Int, b: Int): Int = if (b == 0) a else gcd(b, a % b)
     |   val numer = n / g
     |   val denom = d / g
     |   def this(n: Int) = this(n, 1)
     |   def +(that: Rational) = new Rational(
     |   this.numer * that.numer + this.denom * that.denom,
     |   this.denom * that.denom
     |   )
     |   def compare(that: Rational) = (this.numer * that.numer - this.denom * that.denom)
     |   override def toString = if (denom == 1) numer.toString else s"$numer/$denom"
     | }
defined class Rational

Example 1.67

Then declare a variable of the Rational type:

scala> val r1 = new Rational(1)
r1: Rational = 1
scala> 1 + r1
<console>:14: error: overloaded method value + with alternatives:
  (x: Double)Double <and>
  (x: Float)Float <and>
  (x: Long)Long <and>
  (x: Int)Int <and>
  (x: Char)Int <and>
  (x: Short)Int <and>
  (x: Byte)Int <and>
  (x: String)String
cannot be applied to (Rational)
       1 + r1
         ^

Example 1.68

If we try to add r1 to 1, we will get a compile-time error. The reason is the + method in Int does not support an argument of type Rational. In order to make it work, we can create an implicit conversion from Int to Rational:

scala> implicit def intToRational(n: Int): Rational = new Rational(n)
intToRational: (n: Int)Rational
scala> val r1 = new Rational(1)
r1: Rational = 1
scala> 1 + r1
res11: Rational = 2

Example 1.69

Om S Mar 25, 2024

In "Data Engineering with Scala and Spark," you'll embark on a journey to enhance your data engineering skills using Scala and functional programming techniques. The book focuses on creating continuous and scheduled pipelines for data ingestion, transformation, and aggregation.Key Features:Use Scala to transform data reliably.Learn to build streaming and batch-processing pipelines with clear explanations.Implement CI/CD best practices and test-driven development (TDD).The book covers essential topics like setting up development environments, working with Spark APIs (DataFrame, Dataset, and Spark SQL), data profiling, quality assurance, and pipeline orchestration. It also includes insights into performance tuning and best practices for building robust data pipelines.

Amazon Verified review

Zheng Zhu Feb 24, 2024

"Data Engineering with Scala and Spark" is a fantastic survey of the key concepts and practices in modern data engineering with Apache Spark and data lake architectures. I'm a data professional in the software industry and have been working with Apache Spark for close to a decade now, which is even prior to cloud data lakes and platforms like Databricks becoming mainstream. This book does a great job of establishing the foundational concepts with Scala and Spark in its first few chapters, which gives the reader the necessary tools to experiment and extend their knowledge. The progression of the book is easy to follow, which goes toward advanced transformations, data quality, and finally to best practice data engineering patterns. I very much respect its coverage of Spark with the Scala language, as it continues to be the native programming language of Spark itself, and one that has the deepest level of integration and best performance characteristics when it comes to data engineering.One concept I really appreciate from the author in this book is its coverage, albeit somewhat brief, of Test Driven Development and CI/CD. The data engineering industry, in my opinion, has yet to fully adopt and institute the degree of rigor and engineering disciplines that are now pervasive with general software engineering in both backend and frontend settings. As a result, data pipelines of any real complexity for large organizations eventually become very brittle, difficult to manage, and costly to operate. This book plants a great seed in the mind of its readers that these concepts around unit and integration testing via CI/CD with data pipelines are best practices for data engineering and a necessary knowledge area for data engineers in our current environment. I would loved to have seen some concrete samples of full integration tests that tests the logic of Spark transformations, which is an essential practice that typical Spark engineers lack familiarity with.In the concluding parts of the book, the author covers areas on orchestration, performance tuning, and end-to-end pipelines for both batch and streaming modalities. These are deep and advanced concepts, and there certainly can be full books written on each of these topics just by themselves. I like the broad coverage of several orchestration frameworks, giving the users an unbiased perspective on how tools like Airflow, Databricks Workflows, and ADF can be used with Spark. I also support the judicious coverage of some of the key concepts in Spark performance tuning, including data skew, partitioning, and right-sizing compute, which are generally the most important concepts to understand when tuning pipelines.Overall, I recommend this book for readers seeking to gain a deeper level of understanding of what data engineering is about and how to best achieve that with Apache Spark, in addition to the current set of companion platforms and tooling in the data engineering ecosystem. The reader should expect to be able to construct and support cloud-based or local data pipelines from various source modalities with Apache Spark in an end-to-end fashion, which I think makes this book a worthwhile journey.

Loni Apr 04, 2024

"Data Engineering with Scala and Spark" offers a comprehensive guide to navigating the complexities of Apache Spark and modern data engineering practices. From fundamental concepts to advanced optimization techniques, each chapter provides clear explanations and practical insights for building efficient data pipelines. With a focus on real-world applications and best practices, this book is essential reading for data engineers and professionals seeking to harness the full potential of Apache Spark in their projects.

H2N Apr 01, 2024

A good resource for who looking to master Scala, Spark, and cloud computing for data engineering. The book covers essential concepts and best practices, it guides readers through setting up environments, developing pipelines, and applying test-driven development and CI/CD and also advanced topics like data transformation, quality checks, and performance tuning with practical examples. Overall, it's a highly valuable resource for anyone aspiring to excel in data engineering.

fernando Feb 09, 2024

This is a book for a newbie. If you have experience you won’t learn much from it.