How to Become a Terrific Data Scientist (+Engineer) Without Coding

Aug. 23, 2021

If you have dreams of becoming a data scientist or a data engineer, you probably see a black screen full of code in that dream. Polishing your coding skills is the popular advice you'll get on this journey. Yet, surprisingly, the job itself has little to do with programming.

Data science is the process of making sense of a raw collection of records. A programming language is only a tool. It's like a container for cooking your meals. But the container itself is not the meal.

Some people lose interest in data science because they aren't good at programming. They can't get their heads around even an intuitive language such as Python, while for others it comes naturally. But these aren't inabilities; they are simply different abilities.

This story will change your perspective. Even if you can't, or don't want to, program, you can become an exceptional data scientist. Critical thinking and some data literacy will even make you capable of managing a data project.

Today, we have technologies that require no coding skills to start doing data science. They even have some advantages over writing code. Because of their intuitive nature and fewer dependencies, I'd suggest them to everyone aspiring to become a data scientist.

We'll discuss the KNIME Analytics Platform in this post. It requires nothing more than common sense to make sense of data. Another popular alternative is RapidMiner. Both have been around for a while, and many companies use them in production too. Yet, in my opinion, they are still underrated.

Before moving further, let's first make you a data scientist and a data engineer.

Kickstart data science with KNIME without coding a single line.

You can download and install KNIME on your computer like any other application. The software is free and open-source. You can use it to build data pipelines, wrangle data, train machine learning models, and serve real-time predictions. That's pretty much the work of most data scientists and data engineers.

Let's suppose you're creating a customer segmentation engine for a retail chain. You receive data from two different systems. One is a table containing the customer's demographic information, and the other is about their buying pattern. Your task is to update the cluster representation every day as you receive new data.

Typical workflow of a data science project with data engineering.

The first part of this workflow is an ETL pipeline. It is the data engineering part of our example. We read data from the different sources (Extract), join and filter them (Transform), and save the result for future reference (Load).

In the second part, we create a K-Means clustering engine. It is the data science part of our example. It reads data from the saved path, performs clustering, and outputs a table containing the cluster label of every customer.

What do you need to know about KNIME's Interface?

The interface has lots of incredible features. Yet, for this introductory exercise, we're interested in only two components: the node repository at the bottom-left corner and the workflow editor at the center. The description widget on the right side of the editor is helpful too.

KNIME interface.

The engineering team behind KNIME has done a fantastic job. They have created nodes for almost every activity a data scientist would perform. We can search for any node from the node repository.

You can drag any of those nodes into the editor. Double-click a node, and you get a configuration window. In this window, you can adjust all the settings the node needs to function.

You can pull up instant documentation for any node by clicking on it. It explains all the input requirements and what the node will return.

Reading data from data sources—extract.

There are several ways you can extract data from sources in KNIME. You can read from files, query a database, call a REST endpoint, etc.

In this example, we read a couple of CSV files from the local filesystem. You can search for the CSV reader node in the node repository and drag it to the editor.

As you drag it into the editor, you may notice a red traffic light below the node. It means we haven't configured it yet. Double-click on it and configure it to read from a file path.

Configuring KNIME to read CSV from local filesystem.

The indicator turns yellow now, which means the node is ready to run. Right-click on the node and select execute. The indicator then turns green: the node execution was successful.

You can see the results by right-clicking and selecting the last element in the list. In KNIME, these few options at the end are always the outputs of that node. The CSV reader node outputs only one item: the table it read.

In this example, I'm reading two CSVs. You can download them from this Git repository.
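If you're curious what the CSV reader node does under the hood, here's a rough pandas equivalent. The column names and values below are hypothetical stand-ins for the two files from the repository; in practice you'd point `pd.read_csv` at the actual file paths.

```python
import pandas as pd
from io import StringIO

# Hypothetical samples of the two source tables. With real files you'd call
# pd.read_csv("demographics.csv") and pd.read_csv("purchases.csv") instead.
demographics_csv = StringIO("customer_id,age\n1,34\n2,52\n3,29\n")
purchases_csv = StringIO("customer_id,visit_count\n1,5\n2,1\n3,12\n")

demographics = pd.read_csv(demographics_csv)  # Extract: table 1
purchases = pd.read_csv(purchases_csv)        # Extract: table 2

print(demographics.shape)  # (3, 2)
```

Each `read_csv` call plays the role of one CSV reader node: one input file in, one table out.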

Performing joins, filtering, etc.—transform.

KNIME has intuitive nodes for every kind of data wrangling task. In this example, we're using two of them: the joiner and the row filter. In real projects, you may also have to perform binning, normalization, removing duplicates and nulls, and so on.

Transforming a variable and then aggregating it is a common type of task you'd perform. It's the same idea behind the popular map-reduce pattern.

All of them are nodes in KNIME.
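To make the transform-then-aggregate pattern concrete, here's a minimal pandas sketch. The column names and the tax rate are made up for illustration:

```python
import pandas as pd

# Hypothetical transaction data: one row per store visit.
visits = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 10.0, 5.0, 15.0, 25.0],
})

# "Map": derive a new column from each row.
visits["amount_with_tax"] = visits["amount"] * 1.1

# "Reduce": aggregate the derived column per customer.
per_customer = visits.groupby("customer_id")["amount_with_tax"].sum()
```

In KNIME, each of those two steps would be its own node (e.g., a math/formula node followed by a group-by node).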

I pulled the joiner node from the repository. Using the mouse, I connected the CSV nodes' outputs (right) to the joiner node's inputs (left). You can configure this node by selecting the columns of each table to join on.

Joining tables without coding in KNIME.

Unlike the CSV reader node, the Joiner node has three outputs. If you hover over them, a tooltip explains what each of them is. The first one is the join result. We don't use the second (left unmatched) and the third (right unmatched) in our example.

Next, let's pull the row filter node and connect it to the output of the joiner node. We can configure it to remove one-time purchasers by setting the lower bound of the visit_count variable to 2.
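For reference, the joiner and row filter steps map closely onto a pandas merge and a boolean filter. The data below is a hypothetical stand-in for the two tables the CSV reader nodes produce:

```python
import pandas as pd

# Hypothetical stand-ins for the two source tables.
demographics = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 52, 29],
})
purchases = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "visit_count": [5, 1, 12],
})

# Joiner node: inner join on the customer_id column.
joined = demographics.merge(purchases, on="customer_id", how="inner")

# Row filter node: keep customers with visit_count >= 2,
# i.e. drop one-time purchasers.
filtered = joined[joined["visit_count"] >= 2]
```

The joiner's unmatched-rows outputs would correspond to the rows an outer merge flags as left-only or right-only.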

Filtering tables without any programming language in KNIME.

Saving the output—load.

The final part of an ETL pipeline is loading the data into persistent storage. To keep the example simple, we write it to a CSV file. In real-world projects, you may have to load it into a database or a data warehouse instead. Don't worry; KNIME covers those situations too.

I grabbed the CSV writer node and configured it much the same way we did the CSV reader.
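The load step is a one-liner in code too. A sketch, with a hypothetical filtered table and a temp-directory path standing in for your real output location:

```python
import os
import tempfile
import pandas as pd

# Hypothetical result of the Transform step.
filtered = pd.DataFrame({
    "customer_id": [1, 3],
    "age": [34, 29],
    "visit_count": [5, 12],
})

# CSV writer node: persist the transformed table (the Load step).
out_path = os.path.join(tempfile.gettempdir(), "customers_clean.csv")
filtered.to_csv(out_path, index=False)
```

Swapping the target from a file to a database would mean swapping `to_csv` for something like `to_sql`; in KNIME, you'd swap the writer node instead.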

Saving CSV using KNIME without coding.

This last part concludes the ETL pipeline, a crucial task of a data engineer. Find some job descriptions on LinkedIn and see for yourself.

Performing machine learning tasks without coding.

Now that the data is clean, we're ready to build exciting things. In this example, we've taken a customer segmentation problem, and we'll solve it with the K-Means clustering algorithm. Likewise, you can run almost any machine learning algorithm in KNIME without writing a single line of code.

K-Means creates customer groups based on similarities in their attributes. Besides which attributes to use, we can also specify how many groups we need.

Let's pull the k-Means node from the repository and connect it to the output of the row filter node. We can configure it to group customers into four clusters using their age and visit count.

Configuring the k-Means node in KNIME.

After executing the node, you can inspect the outputs. You get the cluster label for every customer and summaries for each cluster.
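For comparison, here is roughly what the k-Means node computes, sketched with scikit-learn. The customer attributes below are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer attributes: one row per customer, [age, visit_count].
X = np.array([
    [25, 2], [27, 3], [24, 4],   # young, occasional visitors
    [26, 20], [28, 25],          # young, frequent visitors
    [55, 2], [60, 3],            # older, occasional visitors
    [58, 22], [62, 30],          # older, frequent visitors
])

# k-Means node: four clusters on age and visit count.
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

labels = model.labels_            # one cluster label per customer
centers = model.cluster_centers_  # per-cluster summaries (the centroids)
```

The `labels` array corresponds to the labeled table KNIME outputs, and `centers` to the cluster summaries.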

I chose K-Means in this example for its simplicity. Most machine learning applications involve several other tasks; retraining a model is a critical one too.

The KNIME YouTube channel has lots of insightful videos to aid you in your data science journey.

Visualizing your analysis in KNIME.

The final part of most data science projects is visualizing insights. Business Intelligence (BI) platforms such as Tableau specialize in this area, and you can connect KNIME with them for advanced analytics. Yet, the platform itself supports basic visualizations. BI platforms are fantastic for a wider audience, but KNIME's visualization nodes are sufficient for data scientists.

We'll use a scatterplot node to create a chart between the two variables we used for clustering. But before that, let's put a color manager node in the workflow.

Unlike in other visualization tools, in KNIME we need to color our records before plotting them.

You can select the colors and the variable to use for coloring. In this example, though, we're good with the defaults: the color manager picks the cluster labels automatically, and the default colors work well too.

Creating a scatter plot in KNIME.

We can now add the scatterplot node to the workflow. Let's configure it to use age on the x-axis and visit count on the y-axis. Also, be sure you tick the 'create image at output' checkbox.

You can now execute the scatterplot node and pull up the image output. You can use an image writer node to export the result to a file too.
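The color manager plus scatterplot combination is roughly equivalent to a colored scatter in matplotlib. A sketch with invented cluster data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical clustered customers: age, visit_count, and cluster label.
ages = [25, 27, 26, 55, 58, 62]
visit_counts = [2, 3, 20, 2, 22, 30]
cluster_labels = [0, 0, 1, 2, 3, 3]

fig, ax = plt.subplots()
# One color per cluster label, like KNIME's color manager + scatterplot.
ax.scatter(ages, visit_counts, c=cluster_labels, cmap="tab10")
ax.set_xlabel("age")
ax.set_ylabel("visit_count")
fig.savefig("clusters.png")  # like ticking 'create image at output'
```

Passing the labels to `c=` does the coloring step that the color manager node handles in KNIME.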

Here's how the final workflow looks if you need a reference.

Codeless data pipeline --- KNIME workflow.

Final thoughts

Excellent. We've built an entire data pipeline without a single line of code. It covers ETL, a critical job of a data engineer. Also, we've built machine learning models and visualized their output too.


The idea that programming is essential for data science is a myth. The two are related, but they don't depend on one another.

I don't advocate avoiding programming altogether. At some point, you'll need it. For instance, a recent discovery in data science may not be available in KNIME yet. The platform is best at performing what's already established and popular.

Also, if you are a data science researcher, KNIME is of little use. You need to build your own algorithms with your own lines of code.

For this reason, KNIME itself offers the flexibility to program. The standard installation already has Java and JavaScript nodes for it. You can also extend it to use Python and other languages.

The point is, you don't need to code every time.

How we work

Readers support The Analytics Club. We earn through display ads. Also, when you buy something we recommend, we may get an affiliate commission. But it never affects your price or what we pick.

Connect with us