What does a Data Scientist do?
Data Scientists are changing the world, but what do they really do? At its core, a Data Scientist's job is to find correlations in data that can be used to predict outcomes. Most of their time is spent cleansing data and building models with their heavy math skills. The development and management of the cluster architecture is handled by the Data Engineer. If you like math and love data, then Data Scientist might be the right career path for you. Recently, Deep Learning has emerged as a hot field within the Data Science community. Let’s explore some of the basic terms around Deep Learning.
What is Deep Learning?
Must-Know Deep Learning Terms
So are you ready to take on the challenge of Deep Learning? Let’s start by learning the basic Deep Learning terms before we build our first model.
#1 Training
The easiest way to understand training in Deep Learning is to think of it as testing. In software development we talk about test environments and how you never code in production, right? In Deep Learning we refer to test as our training environment, where we allow our models to learn. If you were creating a model to identify breeds of dogs, the training phase is where you would feed the input layer millions and millions of images. During this phase both forward and backward propagation allow the model to be developed. Then it’s on to #2…
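To make the idea concrete, here is a minimal sketch of a training loop in PyTorch. The `train_loader` that feeds in image/label batches, the 64x64 image size, and the 120 breed classes are all assumptions for illustration, not a real pipeline.

```python
import torch
import torch.nn as nn

# Minimal training-loop sketch. `train_loader` (batches of 64x64 RGB images
# plus breed labels) and the 120 breed classes are assumed for illustration.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 120))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):                   # several passes over the training images
    for images, labels in train_loader:   # assumed DataLoader, not defined here
        optimizer.zero_grad()
        outputs = model(images)           # forward propagation
        loss = loss_fn(outputs, labels)   # how far off were we?
        loss.backward()                   # backward propagation
        optimizer.step()                  # adjust the weights and learn
```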
#2 Inference
If the training environment is basically test/development, then inference is production. After building our model and throwing terabytes or petabytes of data at it to get it as accurate as we can, it’s time to put it into production. Typically in software development our production environment is larger than test/development. However, in Deep Learning it’s the inverse, because these models are typically deployed on edge devices. One of the largest markets for Deep Learning has been autonomous vehicles, with the goal of deploying these models in vehicles around the planet. While most Data Engineers would love to ride around in a mobile data center, it’s just not practical.
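As a rough illustration, here is what inference might look like in PyTorch once training is done. The saved file name and the `preprocess()` helper are hypothetical.

```python
import torch

# Inference sketch. Assumes the trained model was saved earlier with
# torch.save(model, 'dog_breeds.pt'); `preprocess()` is a hypothetical helper
# that turns a photo into a (1, 3, 64, 64) tensor.
model = torch.load('dog_breeds.pt')
model.eval()                               # switch off training-only behaviour
with torch.no_grad():                      # no backward propagation in production
    scores = model(preprocess('photo_of_a_corgi.jpg'))
    predicted_breed = scores.argmax(dim=1)  # the breed with the highest score
```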
#3 Overfitting
Data Scientists and Machine Learning Engineers can get so involved in solving a particular problem that the model they create will only solve that particular data set. When a model follows a particular data set too closely, the model is overfitted to the data. The problem is all the more common because as Engineers we already know what we are looking for when we train the models. Typically overfitting can be attributed to making models more complex than necessary. One way to combat overfitting is to never train on the test set. No, seriously, never train on the test set.
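One simple habit that helps is to split the data up front and keep the test set out of training entirely. Here is a small sketch using scikit-learn's `train_test_split`; the `images` and `labels` arrays are assumed to already exist.

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set and never train on it.
# `images` and `labels` are assumed to be arrays we already have.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, random_state=42)

# model.fit(X_train, y_train)    # train only on the training split
# model.score(X_test, y_test)    # the test split is used once, for evaluation only
```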
#4 Supervised Learning
Data is the most valuable resource, behind people, for building amazing Deep Learning models. We can train on data in two ways in Deep Learning. The first is Supervised Learning. In Supervised Learning we have labeled data sets where we understand the outcome. Back to our dog breed detector: we have millions of labeled images of different dog breeds to feed into our input layer. Most Deep Learning training is done with Supervised Learning. Labeled data sets are hard to gather and take a lot of the Data Science team’s time. Right now data wrangling is still something we have to spend the majority of our time doing.
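To picture what labeled data looks like, here is a toy example for the dog breed detector; the file names and breeds are made up.

```python
# Labeled data: every image is paired with the breed we already know is correct.
labeled_data = [
    ("images/001.jpg", "corgi"),
    ("images/002.jpg", "beagle"),
    ("images/003.jpg", "husky"),
]
# During supervised training the model sees the image *and* the label,
# so it can measure how far its prediction is from the known answer.
```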
#5 Unsupervised Learning
The second form of learning in Deep Learning is Unsupervised Learning. In Unsupervised Learning we don’t have answers or labeled data sets. In our dog breed application we would feed in the images without label sets identifying the breeds. If Supervised Learning is costly because labeled data is hard to find, then Unsupervised Learning is the easier form. So why not use only Unsupervised Learning? The reason is simple…we just aren’t quite there from a technology perspective. Back in July I spoke with Wayne Thompson, SAS Chief Data Scientist, about when we will achieve Unsupervised Learning. He believes we are still 5 years out from a significant breakthrough in Unsupervised Learning.
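As a rough sketch of the unsupervised idea, here is k-means clustering grouping unlabeled feature vectors with scikit-learn; the random `features` array is just a stand-in for vectors extracted from the dog images.

```python
import numpy as np
from sklearn.cluster import KMeans

# 1,000 unlabeled images represented by 64 made-up feature values each.
features = np.random.rand(1000, 64)

# Group similar images into 10 clusters without ever seeing a breed label.
clusters = KMeans(n_clusters=10).fit_predict(features)
# The model finds structure on its own, but it never learns breed names.
```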
#6 TensorFlow
TensorFlow is the most popular Deep Learning framework right now. The Google Brain team released TensorFlow to the open source community in 2015. TensorFlow is a Deep Learning framework that packages together the execution engines and libraries required to run Neural Networks. Both CPU and GPU processing can be run with TensorFlow, but the GPU is the chip of choice in Deep Learning.
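For a feel of the framework, here is a minimal sketch of defining and compiling a small classifier with TensorFlow's Keras API; the 64x64 image size and 120 breed classes are assumptions carried over from the earlier examples.

```python
import tensorflow as tf

# A tiny classifier for hypothetical 64x64 RGB dog images and 120 breeds.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 3)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(120, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=5)   # training data assumed
```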
#7 Caffe
Caffe is an open source, highly scalable Deep Learning framework. Both Python and C++ are supported as first-class languages in Caffe. Caffe originated at UC Berkeley, while its successor Caffe2 was developed and is still heavily supported by Facebook. In fact, a huge announcement was released in May 2018 about the merging of both PyTorch and Caffe2 into the same codebase. Although Caffe is widely popular in Deep Learning, it still lags behind TensorFlow in adoption, users, and documentation. Still, Machine Learning Engineers should follow this framework closely.
#8 Learning Rate
Learning Rate is a parameter that controls the size of the steps taken while searching for the minimum of the loss function. In Deep Learning the learning rate is one of the most important tools for calculating the weights for the features in your models. Using a lower learning rate generally provides more accurate results but takes a lot more time, because it slows down the steps taken to find the minimal loss. If you were walking on a balance beam, you could take smaller steps to ensure foot placement, but that also increases your time on the balance beam. It’s the same concept with the learning rate: we just take a longer time to find our minimal loss.
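Here is a toy sketch of the learning rate in action, using plain gradient descent on a simple made-up loss, (w - 3)^2, whose minimum sits at w = 3. A smaller learning rate takes smaller steps and therefore needs more of them.

```python
# Gradient descent on the toy loss (w - 3)^2.
def gradient(w):
    return 2 * (w - 3)        # derivative of (w - 3)^2

learning_rate = 0.1           # smaller values take smaller, slower steps
w = 0.0
for step in range(50):
    w = w - learning_rate * gradient(w)   # step toward the minimal loss
print(w)                      # close to 3.0; try learning_rate = 0.001 to watch it crawl
```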
#9 Embarrassingly Parallel
Embarrassingly Parallel is a commonly used term in High Performance Computing for problems that can easily be parallelized. Basically, Embarrassingly Parallel means that a problem can be split into many, many independent parts and computed separately. An example of Embarrassingly Parallel work would be how each image in our dog breed application could be processed independently.
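As a sketch of the idea, Python's multiprocessing module can spread independent image classifications across processes; `classify_breed` and the file names here are hypothetical.

```python
from multiprocessing import Pool

# Each image can be classified on its own, so the work splits cleanly.
def classify_breed(image_path):
    # ...run the model on one image (omitted)...
    return image_path, "corgi"

if __name__ == "__main__":
    image_paths = [f"images/{i:03d}.jpg" for i in range(1000)]
    with Pool(processes=8) as pool:
        results = pool.map(classify_breed, image_paths)  # images handled independently
```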
#10 Neural Network
A Neural Network is a process that attempts to mimic the way our brains compute. Neural Networks, often referred to as Artificial Neural Networks, are key to Deep Learning. When I first heard about Neural Networks I imagined multiple servers all connected together in the data center. I was wrong! Neural Networks exist at the software and mathematical layer. It’s how the data is processed and guided through the layers in the Neural Network. See #17 Layers.
#11 PyTorch
PyTorch is an open source Machine Learning & Deep Learning framework (sound familiar?). Facebook’s AI group originally developed and released PyTorch for GPU-accelerated workloads. Recently it was announced that PyTorch and Caffe2 would merge their two code bases together. It’s still a popular framework worth following closely. Both Caffe2 & PyTorch are heavily used at Facebook.
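As a quick sketch of those GPU-accelerated workloads, here is a tensor multiply in PyTorch that runs on the GPU when one is available and falls back to the CPU otherwise.

```python
import torch

# Use the GPU if PyTorch can see one, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.rand(1000, 1000, device=device)
b = torch.rand(1000, 1000, device=device)
c = a @ b                      # the matrix multiply runs on the GPU when present
print(c.device)
```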
#12 CNN
A Convolutional Neural Network (CNN) is a type of Neural Network typically used for computer vision. CNNs use feed-forward processing inspired by the way the human visual system works, which makes them optimal for working with images like those in our dog breed application. The CNN is the most popular type of Neural Network because of its ease of use. Images are broken down pixel by pixel for processing by a CNN.
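Here is a minimal sketch of a small CNN in PyTorch, sized for hypothetical 64x64 RGB dog images and 120 breeds.

```python
import torch.nn as nn

# A small CNN: convolution layers scan the pixels with small filters,
# pooling layers shrink the image, and the final layer scores each breed.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn simple edges and textures
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 64x64 -> 32x32
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # learn larger shapes (ears, snouts)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 120),                 # a score for each of the 120 breeds
)
```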
#13 RNN
Recurrent Neural Networks (RNN) differ from Convolutional Neural Networks in that they contain a recurring loop. The key to RNNs is the feedback loop, which carries what the network has seen in previous steps forward toward the desired outcome. During training the feedback loop helps train the model based on previous runs and the desired outcome. RNNs are primarily used with time series data because of their ability to loop through a sequence.
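Here is a minimal sketch of a recurrent network over time series data in PyTorch; the sequence shapes are made up for illustration.

```python
import torch
import torch.nn as nn

# An RNN loops over the time steps, feeding each step's hidden state forward.
rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 1)

series = torch.rand(8, 30, 1)         # 8 sequences, 30 time steps, 1 value per step
outputs, last_hidden = rnn(series)    # loops over the 30 steps internally
prediction = readout(outputs[:, -1])  # predict the next value from the final step
```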
#14 Forward Propagation
In Deep Learning, forward propagation is the process of weighting each feature to test the output. Data moves through the neural network in the forward propagation phase. In our dog breed example, take a feature like tail length and assign it a certain weight for how much it matters in determining the dog breed. After assigning the feature’s weight, we then calculate whether the assumption was correct.
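To keep the math visible, here is a toy forward pass for a single neuron; the feature values and weights are made up.

```python
import math

# Made-up feature values and weights for one neuron in the dog breed model.
features = {"tail_length": 0.8, "ear_size": 0.3}
weights  = {"tail_length": 0.5, "ear_size": 1.2}   # how much each feature matters
bias = -0.4

# Moving forward: multiply each feature by its weight, sum, then squash to 0..1.
weighted_sum = sum(features[name] * weights[name] for name in features) + bias
output = 1 / (1 + math.exp(-weighted_sum))          # sigmoid activation
print(output)   # the model's guess, to be compared with the true answer
```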
#15 Back Propagation
Back propagation, or backward propagation, in training is moving backward through the neural network. This allows us to review how badly we missed our target. Essentially, we used the wrong weight values and the output was wrong. Remember the forward propagation phase is about testing and assigning weights, so the back propagation phase works out why we missed and corrects the weights.
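And here is the companion toy sketch of the backward step: measure how badly we missed, then nudge the weight in the direction that reduces the error. A single weight and a squared error keep the arithmetic readable; the numbers are made up.

```python
# One feature value, one weight, and the answer we wanted.
x, target = 0.8, 1.0
w = 0.5                       # the weight chosen during forward propagation
learning_rate = 0.1

for step in range(20):
    prediction = w * x                     # forward propagation
    error = prediction - target           # how badly did we miss?
    gradient = 2 * error * x               # derivative of (prediction - target)^2 w.r.t. w
    w = w - learning_rate * gradient       # move backward and correct the weight
print(w * x)                  # now close to the target of 1.0
```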
#16 Convergence
Convergence is the process of moving closer to the correct target or output. Essentially, convergence is where we find the best solution for our problem. As a Neural Network continues to run through multiple iterations, the results will begin to converge as they reach the target. However, when results take a long time to converge it’s often called poor convergence.
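A common, simple way to watch for convergence is to stop once the loss barely changes between iterations; the loss values below are made up.

```python
# Stop when the change in loss between iterations drops below a tolerance.
losses = [2.0, 1.2, 0.7, 0.45, 0.41, 0.405, 0.404]   # made-up training losses
tolerance = 0.01

for i in range(1, len(losses)):
    if abs(losses[i] - losses[i - 1]) < tolerance:
        print(f"Converged after {i} iterations at loss {losses[i]}")
        break
```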
#17 Layers
Neural Networks are composed of three distinct layers: input, hidden, and output. The first layer is the input layer, which is our data. In our dog breed application all the images, both with and without a dog, are our input layer. Next is the hidden layer, where the features of the data are given weights. For example, features of our dogs like ears, fur, and color are experimented with at different weights in the hidden layer. The hidden layer is also where Deep Learning gets its name, because the hidden layers can go deep. The final layer is the output layer, where we find out if the model was correct or not. Back to our dog breed application: did the model predict the dog breed or not? Understanding these 3 layers of a Neural Network is essential for Data Scientists using Neural Networks.
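Putting the three layers together, here is a minimal sketch in PyTorch, again sized for hypothetical 64x64 RGB images and 120 breeds.

```python
import torch.nn as nn

# Input, hidden, and output layers for the dog breed example.
network = nn.Sequential(
    nn.Flatten(),                 # input layer: the raw image pixels
    nn.Linear(3 * 64 * 64, 256),  # hidden layer: features get their weights here
    nn.ReLU(),
    nn.Linear(256, 120),          # output layer: a score for each breed
)
# Stacking more hidden layers is what makes the network "deep".
```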