{"id":279,"date":"2021-08-23T00:00:00","date_gmt":"2021-08-23T00:00:00","guid":{"rendered":"https:\/\/tac.debuzzify.com\/?p=279"},"modified":"2023-06-27T06:29:54","modified_gmt":"2023-06-27T06:29:54","slug":"become-a-data-scientist-or-data-engineer-without-coding-skills","status":"publish","type":"post","link":"https:\/\/www.the-analytics.club\/become-a-data-scientist-or-data-engineer-without-coding-skills\/","title":{"rendered":"How to Become a Terrific Data Scientist (+Engineer) Without Coding"},"content":{"rendered":"\n\n\n

If you have dreams of becoming a data scientist or a data engineer, you’d probably see a black screen full of code in that dream. Polishing your coding skills is probably the most popular advice you’ll get on this journey. Yet, surprisingly, data science has little to do with programming.

> Data science is the process of making sense of a raw collection of records. A programming language is only a tool. It’s like a container for cooking your meals, but the container itself is not the meal.

Some people lose interest in data science because they aren’t good at programming. They can’t get their heads around even an intuitive language such as Python, while for others it comes naturally. But these aren’t inabilities; they are simply different abilities.

This story will change your perspective. Even if you can’t or don’t want to program, you can become an exceptional data scientist. Critical thinking and some data literacy will make you even more capable of managing a data project.

Today we have technologies that require no coding skills to start doing data science, and they offer several benefits that code-first approaches don’t. Because of their intuitive nature and fewer dependencies, I’d suggest them to everyone aspiring to become a data scientist.

We’ll discuss the KNIME analytics platform in this post. It requires nothing more than common sense to make sense of data. Another popular alternative is RapidMiner. Both have been around for a while, and many companies use them in production too. Yet, in my opinion, they are still underrated.

Before moving further, let’s first make you a data scientist and a data engineer.

## Kickstart data science with KNIME without coding a single line.

You can download and install KNIME on your computer like any other application. The software is free and open-source. You can use it to build data pipelines, wrangle data, train machine learning models, and make real-time predictions. That’s pretty much the work of most data scientists and data engineers.

Let’s suppose you’re creating a customer segmentation engine for a retail chain. You receive data from two different systems: one is a table containing the customers’ demographic information, and the other describes their buying pattern. Your task is to update the cluster representation every day as you receive new data.

[Figure: Example customer-segmentation workflow]

The first part of this is an ETL, the data engineering part of our example. We read data from the different sources (extract), join and filter them (transform), and save the result for future reference (load).

In the second part, we create a K-Means clustering engine, the data science part of our example. It reads data from the saved path, performs clustering, and outputs a table with a cluster label for every customer.

### What do you need to know about KNIME’s interface?

The interface has lots of incredible features, but for this introductory exercise we’re interested in only two components: the node repository at the bottom-left corner and the workflow editor at the center. The description widget on the right side of the editor is helpful too.


The engineering team behind KNIME has done a fantastic job. They have created nodes for almost every activity a data scientist would perform. You can search for any node in the node repository.

You can drag any of those nodes into the editor. Double-click a node and you get a configuration window, where you can adjust all the settings the node needs to function.

You can pull up instant documentation for any node by clicking on it. It explains all the input requirements and what the node will return.

### Reading data from data sources—extract.

There are several ways you can extract data from sources in KNIME. You can read from files, query a database, call a REST endpoint, etc.

In this example, we read a couple of CSV files from the local filesystem. You can search for the CSV reader node in the node repository and drag it into the editor.

As you drag it into the editor, you’ll see a red traffic light below the node. It means we haven’t configured it yet. Double-click the node and configure it to read from a file path.

[Figure: KNIME’s CSV reader configuration window]

The indicator turns yellow now, meaning the node is ready to run. Right-click the node and select Execute. Once the light turns green, the node execution was successful.

You can see the results by right-clicking the node and selecting the last element in the list. In KNIME, these few options at the end are always the outputs of that node. The CSV reader node outputs only one item—the file table itself.

In this example, I’m reading two CSVs. You can download them from this Git repository.
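If you’re curious what the CSV reader node does behind the scenes, here’s a minimal pandas equivalent. The file names are placeholders of my own, not the actual ones from the repository:

```python
import pandas as pd

# Hypothetical file names standing in for the two CSVs in the repository.
customers = pd.read_csv("customers.csv")   # demographic information
purchases = pd.read_csv("purchases.csv")   # buying pattern

print(customers.head())
print(purchases.head())
```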

### Performing joins, filtering, etc.—transform.

KNIME has intuitive nodes for every kind of data-wrangling task. In this example, we’re using two of them—joins and row filters. In real projects, you may also have to perform binning, normalization, removal of duplicates and nulls, and so on.

Transforming a variable and then aggregating it is another common task you’d perform. This pattern is popularly known as the map-reduce operation.

All of them are nodes in KNIME.
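For illustration only (this is not how KNIME implements it), here’s what the transform-and-aggregate pattern looks like in pandas, with made-up columns:

```python
import pandas as pd

# A toy purchases table; the columns are made up for this illustration.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [10.0, 25.0, 8.0, 15.0, 5.0, 30.0],
})

# "Map" step: transform a variable, e.g. apply a 10% discount.
df["discounted"] = df["amount"] * 0.9

# "Reduce" step: aggregate the transformed variable per customer.
summary = df.groupby("customer_id")["discounted"].agg(["sum", "mean", "count"])
print(summary)
```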

I pulled the Joiner node from the repository and, using the mouse, connected the CSV nodes’ outputs (right) to the Joiner node’s inputs (left). You can configure this node by selecting the columns of each table to join on.

[Figure: Joining the two tables with the Joiner node]

Unlike the CSV reader node, the Joiner node has three outputs. If you hover over them, a tooltip explains what each of them is. The first one is the join result; we don’t use the second (left unmatched) and the third (right unmatched) in our example.

Next, let’s pull the row filter node and connect it to the Joiner node’s output. We can configure it to remove one-time customers by setting the lower bound of the visit_count variable to 2.

[Figure: Filtering rows on visit_count]

### Saving the output—load.

The final part of an ETL pipeline is to load the data into persistent storage. To keep the example simple, we write it to a CSV. In real-world projects, you may have to load it into a database or a data warehouse instead; don’t worry, KNIME has nodes for those situations too.

I grabbed the CSV writer node and configured it much the same way we did the CSV reader.

[Figure: Writing the result with the CSV writer node]

This last part concludes the ETL pipeline, a crucial task for a data engineer. Find some job descriptions on LinkedIn and see for yourself.

### Performing machine learning tasks without coding.

Now we have the data clean and ready to build exciting things. In this example, we’ve taken a market segmentation problem, and we’ll use the K-Means clustering algorithm to solve it. Likewise, you can run almost any machine learning algorithm in KNIME without writing a single line of code.

K-Means creates customer groups based on similarities in their attributes. Besides which attributes to use, we can also specify how many groups we need.

Let’s pull the k-Means node from the repository and connect it with the output of the row filter node. We can configure it to group customers into four clusters using their age and visit count.

[Figure: K-Means node configuration]

After executing the node, you can inspect the outputs. You get the cluster label for every customer and summaries for each cluster.
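For reference, here’s roughly the same clustering step in Python with scikit-learn. The two feature columns come from our example; the specific parameters, such as the fixed random seed, are my additions:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Read the table the ETL stage saved.
data = pd.read_csv("clean_customers.csv")
features = data[["age", "visit_count"]]

# Four clusters, mirroring the KNIME node's configuration.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
data["cluster"] = kmeans.fit_predict(features)

# A label for every customer, plus per-cluster summaries.
print(data.head())
print(data.groupby("cluster")[["age", "visit_count"]].mean())
```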

I chose K-Means for this example because of its simplicity. Most machine learning applications involve several other tasks; retraining a model is a critical one too.

KNIME’s YouTube channel has lots of insightful videos to aid your data science journey.

### Visualizing your analysis in KNIME.

The final part of most data science projects is to visualize insights. Business Intelligence (BI) platforms such as Tableau specialize in this area, and you can connect KNIME with them for advanced analytics. Yet the platform itself supports basic visualizations. BI platforms are fantastic for a wider audience, but KNIME’s visualization nodes are often sufficient for data scientists.

We’ll use a scatterplot node to create a chart of the two variables we used for clustering. But before that, let’s put a color manager node in the workflow.

Unlike in other visualization tools, in KNIME we need to color our records before plotting them.

You can select the colors and the variable that drives the coloring. In this example, though, we’re good with the defaults: the color manager picks the cluster labels automatically, and the default palette works well too.

[Figure: Configuring the color manager node]

We can now add the scatterplot node to the workflow. Let’s configure it to use age on the x-axis and visit count on the y-axis. Also, be sure to tick the ‘create image at output’ checkbox.

You can now execute the scatterplot node and pull up the image output. You can also use an image writer node to export the result to a file.
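Outside KNIME, the same colored scatterplot takes a few lines of matplotlib; `data` is the clustered table from the scikit-learn sketch above:

```python
import matplotlib.pyplot as plt

# Color each customer by its cluster label, mirroring the color manager node.
plt.scatter(data["age"], data["visit_count"], c=data["cluster"], cmap="tab10")
plt.xlabel("age")
plt.ylabel("visit_count")
plt.title("Customer segments")

# Image writer equivalent: export the chart to a file.
plt.savefig("segments.png")
```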

Here’s how the final workflow looks if you need a reference.

[Figure: The final KNIME workflow]

## Final thoughts

Excellent. We’ve built an entire data pipeline without a single line of code. It covers ETL, a critical job of a data engineer, and we’ve built a machine learning model and visualized its output too.


“Programming is essential for data science” is a myth. The two are related, but they don’t depend on one another.

I don’t advocate avoiding programming altogether; at some point, you’ll need it. For instance, a recent advance in data science may not be available in KNIME yet. The platform is good at performing what’s already established and popular.

Also, if you are a data science researcher, KNIME is of limited use. You need to build your own algorithms with your own lines of code.

For this reason, KNIME itself offers the flexibility to program. The standard installation already has Java and JavaScript nodes for it, and you can also extend it to use Python and other languages.
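As a sketch of what that extension looks like: in the classic Python Script node, the input table arrives as a pandas DataFrame and you hand a DataFrame back. The exact variable names depend on your KNIME version, and the age-bucketing column is a made-up example:

```python
# Runs inside a KNIME Python Script (legacy) node, not as a standalone script.
import pandas as pd

df = input_table.copy()  # the node injects the input table as a DataFrame

# A made-up custom transformation KNIME has no dedicated node for.
df["age_bucket"] = pd.cut(df["age"], bins=[0, 30, 50, 120])

output_table = df        # the node reads this variable back as its output
```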

The point is, you don’t need to code every time.



> Thanks for the read, friend. It seems you and I have lots of common interests. Say hi to me on LinkedIn, Twitter, and Medium.

> Not a Medium member yet? Please use this link to become a member, because I earn a commission for referring you, at no extra cost to you.
