Data Scientists Who Code
Data Science is a hot career field in Data Analytics. On data teams with Data Engineers, how much coding is expected from the Data Scientist?
The role of the Data Scientist includes finding features or correlations in data that might predict outcomes. Those predictions then become data models that are tested multiple times. After those data models reach a high confidence level, they are then automated in applications. In this episode of Big Data Big Questions let’s find out just how much coding a Data Scientist can be expected to do.
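To give a feel for the kind of code that workflow involves, here is a minimal sketch in plain Python. It is only an illustration, not a method described in the episode: the numbers are made up, and the names `pearson`, `fit_line`, `hours_studied`, and `exam_score` are hypothetical. It checks how strongly one feature correlates with an outcome, fits a simple line as a stand-in for the "data model", and tests that model on rows it has not seen.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Made-up example data (an assumption for illustration only).
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_score = [52, 55, 61, 64, 70, 73, 79, 82]

# Step 1: look for a feature that correlates with the outcome.
r = pearson(hours_studied, exam_score)

# Step 2: turn it into a simple model, fitted on the first six rows.
slope, intercept = fit_line(hours_studied[:6], exam_score[:6])

# Step 3: test the model on the rows that were held back.
errors = [
    abs((slope * x + intercept) - y)
    for x, y in zip(hours_studied[6:], exam_score[6:])
]

print(f"correlation: {r:.3f}")
print(f"model: score ~ {slope:.2f} * hours + {intercept:.2f}")
print(f"mean absolute error on held-out rows: {sum(errors) / len(errors):.2f}")
```

In practice this step would more likely be done in R, MATLAB, or a Python library than by hand, and getting the result running reliably in production is a separate job.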
Transcript – Do Data Scientists Code?
Hi folks. Thomas Henson here with thomashenson.com. Today is another episode of Big Data Big Questions. Today, I’m still here in the gym answering your questions, recording between mail trucks coming by. Crazy.
Today’s question comes in from a viewer. Thanks for watching. If you have any questions, make sure you put them in the comments section here below, and also subscribe, make sure you hit that bell so you get notifications. Today’s question comes in around data scientists, and it’s specifically, “Do data scientists code?” We talked about the roles of a data engineer. We’ve talked about the roles of a machine learning engineer and even data scientist, but where is that fine line between how much data scientists code? We’re going to talk about that, and talk about some of the tools that they use, and then try not to say, “Depends.”
I know, I know. You’re like, “Man. He’s going to say, ‘depends.'” But, I’m going to try not to say that. Does a data scientist code? The answer is yes. Data scientists, for the most part, they’re able to code. The tools that they use, how much are they coding, that’s really going to be dependent on — didn’t say depends, dependent — the role that they’re in. If they have a data engineer or a machine learning engineer, that can help them put their code in production and finalize some of the things that they’re doing.
I will say, I’ve worked on a team before where, from a data scientist perspective, they were primarily using MATLAB. They were using MATLAB or Excel, and then when we pushed it to Hadoop, and you’ve heard me talk about it many times on here, that was a little bit harder, because at first we were doing that in a solo environment on their machine, and then we went to distributed algorithms. At the time we were using Mahout. Going from Excel or MATLAB to Mahout, we really had to change and tune a lot of different things. That’s where my expertise as a data engineer and developer was able to help out and keep running the same job a hundred times.
Things have changed with TensorFlow and a lot of other tools since that time. Yes, the data scientists will code. It’s going to depend on what they’re going to use. Some of the common tools, like I said, MATLAB, even Excel, dependent on what you’re working on. If it’s a big data project, you might not be using some of the other larger tools. Then, you also have R. You have Python, which we’ve talked about and we love here. We also have Scala or skuh-la. I always say it wrong. There are many different tools that data scientists use whenever they’re coding.
That doesn’t mean that they’re doing all the coding. This is back to the dividing lines of the roles and where they are. Whenever we’re talking about standing up the environment, pushing things out to production, and even doing some of the heavier lifting, like I will say operationalizing of the code, getting it ready, past the trendy phase on some of the other pieces, where you’re really bringing it in to, “Hey, it’s going to support this dashboard,” it’s going to do this piece, or even some of the ETL jobs, that’s still going to come to us as the machine learning engineers or the data engineers, just depending on the role. Still not saying depends. It’s depending on that role.
For the most part, to answer the question simply, yes. Your data scientist is going to code. How much of that code is really in-depth, if you think about it, are they doing the job of coding? I would say no. That would just be what I’ve seen. If you think about some of the reasons that we have the tools that we have now, like why was Pig created? Why do we have a Python API in Spark, when we could do Spark in Java, right?
Why do we have a Scala API in Spark? It’s because of the fact that we want to use a higher-level language, so that data analysts, data scientists, they can run their code, and they can do it without having to worry about Java and some of the other components there. Yes, data scientists code. How much do they code? It’s going to depend on their partner machine learning or data engineer. Your data scientist is not going to replace the data engineer or machine learning engineer. We’re all on the same team, here. We’re not trying to compete back and forth, and if I had to choose a side, I’d say machine learning engineers and data engineers. But, I’m very biased.
That’s all for today. Hope you enjoyed this episode. If you have any questions, throw them in the comments section here below, and make sure you hit subscribe, and ring that bell, and I will see you again on the next episode of Big Data Big Questions.