Recently we shared an introduction to machine learning. While making machines learn from data is fun, data from real-world scenarios often gets out of hand if you try to apply traditional machine-learning techniques on a single computer. To actually use machine learning with big data, it's crucial to learn how to deal with data that is too big to store or process on one machine.
Today we will discuss fundamental concepts for working with big data using distributed computing, then introduce the tools you need to build machine learning models. We'll start with some naive methods of solving problems, which are meant only as examples. As we move forward, we will make things more realistic.
Understanding the idea behind MapReduce
MapReduce is a technique for splitting a data set into parts and handing each part to a different agent. An agent here means a single computer. The idea is to run operations on the parts of your data independently, then combine the results once the operations are done to find the insights you need.
Imagine that you run the security department of a large shopping mall and you need to know how many people entered the mall for a particular day. You ask all the guards at different gates to take a count of every person coming in the mall. When the gates close at the end of the day, you, as the supervisor, go to each guard to get their count. Now you have the total number of people who came to the mall that day.
MapReduce is very similar — the idea is to break down the computing problem and give it to multiple workers (i.e. computers). When each computer is done computing its part, all the parts are reduced into a single result by a master (i.e. another computer managing the workers).
Example: finding the average age for a table of names
Let’s now look at an example from a data engineer’s perspective. Imagine we have a table where every row contains a person’s details — phone number, email address, country, birth date, etc. Now we want to find the average age for each name in our table.
We used this technique to analyze voter roll data for our collaboration with The Hindu. Check out the first part and second part of the story.
We can map over our table to extract each person's name and age, and reduce the output to the average age for each name.
Here are the steps that our algorithm has to follow to get the results we want:
- Take the name and the age of every person out of the table.
- Count the number of times each name comes up, and sum the corresponding ages for each name.
- Divide the summed ages for each name by the number of people with that name to get the average age.
Assuming that we want this algorithm to run on 5 computers, we will split our table into five parts and give each part to one computer. Each computer will create a new table with three columns: name, sum of ages, and count. It will then go through each row in its part of the original table. If it hasn't seen the name in that row before, it will create a new entry for that name in the new table; if it has, it will add the age to that name's sum and increment its count. Once every computer has finished computing its part, the computers will combine their tables, adding up the sums and counts for each name. Finally, dividing each name's summed ages by its count gives the average age for that name.
Note: This method would be a naive implementation of the MapReduce approach. It’s only meant as an introductory example.
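To make the idea concrete, here is a minimal single-machine sketch of this naive approach in Scala. The `Person` case class, the toy `people` list, and the five-way split are all made up for illustration; a real MapReduce framework would handle the splitting and the shuffling of data between machines for you.

```scala
case class Person(name: String, age: Int)

// Toy data set standing in for our big table.
val people = List(
  Person("Asha", 34), Person("Ravi", 28), Person("Asha", 30),
  Person("Meera", 45), Person("Ravi", 32), Person("Asha", 26)
)

// "Map" phase: split the table into parts and let each part build
// its own table of name -> (sum of ages, count).
val partials: List[Map[String, (Int, Int)]] =
  people.grouped(math.max(1, people.size / 5)).toList.map { part =>
    part.groupBy(_.name).map { case (name, rows) =>
      (name, (rows.map(_.age).sum, rows.size))
    }
  }

// "Reduce" phase: merge the partial tables by adding sums and counts per name.
val merged: Map[String, (Int, Int)] =
  partials.flatten.groupBy(_._1).map { case (name, entries) =>
    (name, entries.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2)))
  }

// Divide each name's summed ages by its count to get the average age.
val averages = merged.map { case (name, (sum, count)) => (name, sum.toDouble / count) }
averages.foreach { case (name, avg) => println(s"$name: $avg") }
```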
Example: the obligatory word count example
In our second example, let's count the number of times each word appears in a file. The problem is that our file is so huge that it cannot be stored in a single computer's memory. However, while the file may be huge, the words it contains may be highly redundant, so it is quite possible that the table we build of words and their counts will fit on a single machine.
Let’s assume that we are splitting our huge file (line by line) among 20 computers (i.e. workers), which are then supposed to count the number of times each word comes up in the file.
Each of our 20 workers will now take its share of lines and split each line by spaces to get a list of the words it contains. Now we have one big table, split across multiple workers, with only one column: a word.
Now each worker will create a new table with two columns — the word and its count — and go through our first table one row at a time. The worker will increment the count of the word in the new table by one if it already exists in the table, or put in a new word with a count of one if it does not.
Once all the workers are finished doing their part, a master will come into the picture. It will reduce the tables produced by all the workers into a single table containing each word and its count.
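To see what each worker and the master are doing, here is a rough single-machine sketch in plain Scala; the sample `lines` and the 20-way split are stand-ins for the real distributed file.

```scala
// Stand-in for the huge file, already split line by line.
val lines = List(
  "the quick brown fox",
  "the lazy dog",
  "the quick dog"
)

// Each "worker" takes a slice of the lines and builds its own word -> count table.
val workerTables: List[Map[String, Int]] =
  lines.grouped(math.max(1, lines.size / 20)).toList.map { slice =>
    slice.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => (w, ws.size) }
  }

// The "master" reduces all the worker tables into a single table.
val wordCounts: Map[String, Int] =
  workerTables.flatten.groupBy(_._1).map { case (w, entries) => (w, entries.map(_._2).sum) }

wordCounts.foreach { case (w, c) => println(s"$w: $c") }
```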
If all of this feels daunting, don’t panic. Things will get clearer once we start making these things work in practical programming environments.
Note: If you think that the resulting table at the end of the whole MapReduce process will still be too big to store on one computer, don't worry. We cover a solution to that problem in the next section.
Introduction to Hadoop and Spark
Companies like Google, which had been working with big data for a long time, were among the first to start thinking about distributing computing problems across multiple hardware resources. In 2003, Google published a research paper describing the Google File System, a system for storing files across multiple storage devices in a data center by splitting them into parts that are stored on different devices. These part files were also replicated, so that if one storage device crashed, the lost data could be recovered from a replica.
The concept and the algorithms were quickly picked up by the community, and soon the Hadoop Distributed File System (HDFS) was produced. HDFS was an open-source implementation of the design behind Google's distributed file system, enabling anyone to store their files across different storage devices, usually spread across a data center.
However, merely storing data across various nodes was not enough for the software industry. This was a time when we were already hitting data processing limits, and the open-source world needed a way to perform complex computations on data spread across multiple storage devices. Thus, alongside HDFS, an implementation of MapReduce was introduced.
Hadoop's MapReduce was implemented in Java, and a Java API was made available. Developers could now write their map and reduce functions, select the source of data from their HDFS installation, and run MapReduce jobs to get the insights they needed from the data.
HDFS and MapReduce dominated the big data market for a long time, until Matei Zaharia, a Ph.D. student at the University of California, Berkeley, published his research project called Spark.
Backed by the Apache Software Foundation, Spark soon became one of the top Apache projects and gained massive traction in the market. Spark claimed a 10 to 100 times speedup over Hadoop's MapReduce ecosystem, achieved through many optimizations of the MapReduce design. Written in Scala, a modern hybrid (functional and object-oriented) programming language, Spark tries to minimize disk operations and keep data in memory as much as possible.
While Hadoop provided an API for Java, Spark came with an API for Scala (which was far more user friendly), a Java API that was still superior to Hadoop's, and an almost-as-good Python API. Python was already a programming language loved by data scientists, while Scala, thanks to its functional properties, gave developers the flexibility to work out solutions quickly. Furthermore, Spark's speed made interactive data analysis possible, since people no longer had to wait hours for their programs to finish executing.
In this article, we will explore Apache Spark and learn how to get basic utility out of it.
Setting up your development environment
As mentioned above, Spark is written in Scala and hence runs on the Java runtime platform (the JVM). The first step is to download and install the dependencies required for our environment.
Here is a list of things that you need to download and install:
- Java Development Kit (8) — we will use Oracle’s JDK for this tutorial
- Simple Build Tool (latest) — SBT automatically downloads Scala
- Apache Spark (version 2.0.0) — the one with the latest Hadoop binaries
You may have noticed that the Spark code contains references to a lot of Hadoop classes. This is because Spark uses Hadoop in some ways; for example, Hadoop's Distributed File System can be used as a source of input data in Spark. This makes working with the two technologies together even more convenient.
If this is the first time you are using Scala, you might not be familiar with the Simple Build Tool (SBT). SBT is a build and integration tool for JVM-based programming languages. It allows you to structure your projects, build them, fetch dependencies, and do many of the other things you would expect from a good build tool. While Spark itself uses Maven for production builds, SBT is easier to use for standalone Spark programs.
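For reference, a minimal `build.sbt` for a standalone Spark 2.0.0 application might look something like the sketch below; the project name, version, and Scala version are placeholders you would adapt to your own setup.

```scala
// build.sbt (a minimal sketch for a standalone Spark application)
name := "spark-tutorial"

version := "0.1.0"

// Spark 2.0.0 is built against Scala 2.11.
scalaVersion := "2.11.8"

// Pull in Spark core, SQL (DataFrames), and MLlib as dependencies.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0",
  "org.apache.spark" %% "spark-sql"   % "2.0.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.0"
)
```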
Working with Apache Spark
There are many ways to use Apache Spark in your projects, but the easiest is to launch the interactive Scala console that ships with Spark. To do that, simply navigate to the directory where you extracted Spark and run the shell from the `bin` directory (for example, `./bin/spark-shell`).
This will launch an interactive Scala shell for Spark. It will also create an object named `sc`, which stands for Spark Context. The Spark Context (`sc`) is the object used to access the Spark cluster.
Alternatively, you can use the Python shell, PySpark. While Python is a great programming language for exploratory data analysis, we will be using Scala for the purposes of this tutorial.
Another great way is to write stand-alone Spark applications by including Spark as a dependency in your SBT file.
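A bare-bones standalone application might look like the following sketch; the object name `SimpleSparkApp`, the `local[*]` master setting, and the file path are illustrative placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleSparkApp {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark locally using all cores; on a real cluster you
    // would normally pass the master URL via spark-submit instead.
    val conf = new SparkConf().setAppName("SimpleSparkApp").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A trivial job: count the lines of a text file (the path is a placeholder).
    val lines = sc.textFile("words.txt")
    println(s"Line count: ${lines.count()}")

    sc.stop()
  }
}
```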
Understanding RDDs
The basic data structure of Spark is the RDD, which stands for Resilient Distributed Dataset. As the name suggests, an RDD is resilient and distributed by nature: if a compute node crashes, Spark can recompute the lost data on another machine by replaying the same set of operations that produced it.
You can think of an RDD as a list that is processed in parallel instead of sequentially. If you have ever programmed in a functional programming language, this will all feel very natural to you.
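To get a feel for the API, here is a tiny example you could type into the Spark shell; the numbers are arbitrary.

```scala
// Create an RDD from a local collection; Spark splits it into partitions
// that can be processed in parallel.
val numbers = sc.parallelize(1 to 100)

// Transformations like filter and map are evaluated lazily...
val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

// ...and actions like reduce trigger the actual computation.
println(evenSquares.reduce(_ + _))
```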
Word count example (using Spark)
Now that we have the basics clear, let’s try to implement our distributed word count algorithm on Spark. Since both Spark and Scala have functional semantics, we can easily do this in just three lines of code.
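Here is a sketch of what those three lines could look like, assuming the input file is called `words.txt`:

```scala
// Read the file, split lines into words, pair each word with 1, and sum the 1s per word.
val textFile = sc.textFile("words.txt")
val counts = textFile.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)
```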
While the above code looks suspiciously short, I assure you it does exactly what we expect it to do. Let's now go a step further and try to understand what the above code does.
The first thing we do is read the text file that we want to process by calling the `textFile` method on `sc`. Next, we split each line by spaces to extract the words. This is done using the `flatMap` method, which maps over every single line in the file to extract its words and then flattens the output into a single list of words.
Once we have all the words, we map each of them into a key-value tuple containing the word and the integer value 1. These tuples are then reduced by key (i.e. the word), summing the left and right operands into one. This way, we sum up all the 1s for each word, as described in the algorithm earlier in the article.
At the end, we get the result by calling the `collect` method on the output and printing the count for each word, line by line.
Working with DataFrames
Apart from lower-level RDDs, Spark also provides DataFrames as a higher-level way of accessing structured data.
DataFrames, at the time of this writing, are an abstraction over RDDs for tabular data. They come with an API for executing SQL against these tables, and they also provide higher-level helper functions for working with the data.
DataFrames are supposed to store tabular data, so it makes sense to use tabular data processing and querying techniques to work with them. One of the great advantages of Spark DataFrames is that you can use SQL syntax to work with the data in the frames. Let's register our frame as a table so we can use SQL to query the data.
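With Spark 2.0, loading the data and registering it as a table might look like this; the file name `titanic.csv` and the view name `titanic` are assumptions:

```scala
// Read the Titanic CSV into a DataFrame, letting Spark infer the column types.
val titanic = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("titanic.csv")

// Register the DataFrame as a temporary view so that we can query it with SQL.
titanic.createOrReplaceTempView("titanic")
```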
Once a frame is registered as a table, you can use SQL to run your queries. Let’s use SQL to draw some basic insights from the data, trying to stick to things that we did in our previous Titanic blog.
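For instance, a couple of simple queries in that spirit, assuming the usual Titanic column names (`Sex`, `Pclass`, `Survived`):

```scala
// How many passengers survived, split by gender?
spark.sql(
  "SELECT Sex, SUM(Survived) AS survived, COUNT(*) AS total FROM titanic GROUP BY Sex"
).show()

// Survival counts by passenger class.
spark.sql(
  "SELECT Pclass, SUM(Survived) AS survived, COUNT(*) AS total FROM titanic GROUP BY Pclass ORDER BY Pclass"
).show()
```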
You might have noticed that we call the `show` method on each query result. `show` is a quick and handy way to look at the top rows of your data without going through all of it.
To get a more comprehensive understanding of the DataFrames API, you should refer to the official Spark documentation.
Machine learning for big data, using the Titanic data set
Machine learning in a distributed environment can be tricky if you are coming from a data science background, where specialized tools are used for this purpose. It is important that, while writing our Machine Learning algorithms, we optimize them to take advantage of the performance benefits offered by a distributed system.
While trying to understand machine learning on Spark, we will use the machine learning library that ships with Spark. Spark's default package ships with two modules: `spark.ml` and `spark.mllib`.
For the purpose of this article, we will be using the Random Forest algorithm to train a machine learning model to make predictions.
By now you have learned how to load a data frame into Spark and run some basic analytics on it using SQL. However, machines cannot understand the data the way we do, so we need to convert it into a format Spark can use to build a machine learning model. This requires some basic data cleaning and transformation. Once we have that, all we need to do is call some library functions to build a model, test it, and then evaluate its accuracy.
Let's first go through the code and its comments, and then walk through the functionality in detail. The code is as follows:
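Below is a sketch of such a program, assuming the Titanic CSV from our earlier post with its usual column names (`Survived`, `Pclass`, `Sex`, `Age`, `Fare`). The case class names `Initial` and `Final`, the `autobot` helper, the male/female encoding, and the 70/30 split are illustrative choices rather than the only way to do it.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.Dataset
import spark.implicits._

// Load the Titanic CSV as a DataFrame (path and column names are assumptions).
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("titanic.csv")

// Case classes describing the initial (possibly null) and final (clean, numeric) row shapes.
case class Initial(Survived: Option[Int], Pclass: Option[Int], Sex: Option[String],
                   Age: Option[Double], Fare: Option[Double])
case class Final(survived: Double, pclass: Double, sex: Double, age: Double, fare: Double)

// The assembler packs the numeric feature columns into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("pclass", "sex", "age", "fare"))
  .setOutputCol("features")

// "autobot" cleans the data: drop rows with nulls and turn strings into numbers.
def autobot(ds: Dataset[Initial]): Dataset[Final] =
  ds.filter(r => r.Survived.isDefined && r.Pclass.isDefined && r.Sex.isDefined &&
                 r.Age.isDefined && r.Fare.isDefined)
    .map(r => Final(
      survived = r.Survived.get.toDouble,
      pclass   = r.Pclass.get.toDouble,
      sex      = if (r.Sex.get == "male") 0.0 else 1.0,
      age      = r.Age.get,
      fare     = r.Fare.get))

// Clean the data, assemble the feature vector, and rename "survived" to "label".
val cleaned = autobot(raw.select("Survived", "Pclass", "Sex", "Age", "Fare").as[Initial])
val data = assembler.transform(cleaned)
  .withColumnRenamed("survived", "label")
  .select("label", "features")

// Index labels and features over the whole data set so no label or category is missed.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets.
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Converter that turns indexed predictions back into the original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// The Random Forest classifier itself.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Chain everything into a pipeline and train the model on the training data.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
val model = pipeline.fit(trainingData)

// Make predictions on the test set and evaluate the model's accuracy.
val predictions = model.transform(testData)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
println(s"Accuracy: ${evaluator.evaluate(predictions)}")
```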
The first few steps are easy — we import the functions and classes we will need in our program, and load the CSV file as a DataFrame in Spark.
What you see next is interesting. In the second step, we define two Scala case classes called `Initial` and `Final`, representing the schema of the data frame before and after the cleaning process. While there are many ways to do this, it's interesting how you can use Scala data types to apply methods over a data set in Spark.
Next, we define an `assembler`, a utility provided by Spark's ML library for converting the numerical values in a DataFrame into a machine-understandable vector. A vector is a linear representation of the data across different dimensions. We will later use this assembler to extract the numerical values from our data and convert them into a vector. Before we can do that, however, we need to remove the null values and convert any string values in our data into numbers, which is exactly what our `autobot` function does.
In the next two steps, we run the `autobot` function and apply our assembler to its output to produce clean data, from which we then drop the extra columns and rename the "survived" column to "label", since that is what our machine learning model will use as the label.
Now we define a label indexer and a feature indexer, set their input and output column names, and fit them on the entire data set. The reason we fit on the entire data set, and not just the training set, is that we want the index to see all the labels. While splitting the data into training and test parts, we might lose certain labels or feature categories in one part even though they exist in the other. Fitting on the entire data set lets us define metadata for all the features and labels, so that we can use the index when building the machine learning model.
Once we have the indexers, we create a Random Forest model and set the names of the input and output columns it will use for training and for predicting labels. Just before this, we also define a label converter so that, using the index we created in the last step, we can map the indexed predictions back to the original labels.
That, so far, was the difficult part. In the next steps, we build a pipeline that tells Spark the steps it must follow to create a Random Forest model from the training data, and run the pipeline to extract a model out of it.
Evaluating and testing the model
Once we have trained a model, the next step is to test it and evaluate its accuracy. A model can be used to transform a test data frame to produce predictions, as done in the next steps of the code above. The remaining question is how to easily measure the accuracy of those predictions.
Fortunately, Spark comes with a concept called evaluators, which let us score the model's predictions on the test data and report the model's accuracy. In the last two steps of our example, we create an evaluator and print its output to the screen to get the accuracy.
Now that we have a model and its accuracy, we can test as many inputs against this model as we want. What would be nice here is a way to dig deeper inside our model and understand how exactly the decisions are being made. Not all machine learning and classification algorithms produce a model that is human readable; but, in our case, we can dig deeper because we are using the Random Forest classification algorithm.
All we need to do is cast our model so Scala can identify the Random Forest stage, and then call the `toDebugString` method on it to print the model's decision-making to the console.
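Assuming the pipeline sketched above, where the Random Forest is the third stage, that could look like this:

```scala
// The Random Forest is the third stage of the pipeline (index 2),
// so cast that stage to get access to toDebugString.
val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println("Learned classification forest:\n" + rfModel.toDebugString)
```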
Note that we used Random Forest only for demonstration purposes. The accuracy of the predictions will, of course, depend on the model used. The Random Forest classification algorithm is just one of many algorithms available for training a model: Spark provides many more algorithms and techniques for machine learning, classification, and regression. If you feel confident with Spark, you can even implement your own algorithms using the low-level Spark API.
The future of Spark
Along with other developments in the distributed systems community, Spark is growing very quickly and trying to solve many problems along the way.
There are still many things for Spark to work on. As of this writing, the Spark community is working on integrating Project Tungsten with Spark's internals to optimize the execution plans of Spark jobs. Beyond that, the community is working on making the project easier to use and more accessible in general.
It looks like support for more machine learning techniques and better support for neural networks is the natural progression for Apache Spark. But only time will tell!
Where to go from here
While this tutorial was supposed to give you a quick introduction to Spark and distributed systems, there are still many things that you can learn about this domain. A great place to start is Stanford’s Mining Massive Datasets Project, which comes with a free ebook and an online course to learn about machine learning on distributed systems.
If you want to get involved with Spark, it is recommended that you know Scala to be able to understand how some of the core Spark modules work. Fortunately, there is a great free online program on Coursera, backed by the creators of Scala at EPFL.
If you want to learn more about Spark with the sole purpose of using it in your projects, Spark's official documentation is a great place to start. To start playing around with data in Spark, the UCI Machine Learning Repository is a great source of freely available data sets for machine learning.
In the end, the best way to learn is by trying. So download the latest edition of Spark, try it out, and explore the examples. All the best!