Every profession has its own toolbox, full of items that are essential to the work. Painters have their brushes and canvas. Bakers have mixers, pans, and ovens. Trades workers have actual toolboxes. And those in a more corporate environment will have a suite of hardware and software necessary to complete the task at hand. Data scientists tend to fall into this last bucket.

I’m of the opinion that some of the most exciting tools in a data scientist’s toolbox are the visualization libraries available for use. …


Clustering is one of the most popular types of unsupervised machine learning. Clustering techniques allow us to group data objects into similar classes in such a way that items within a group share similar characteristics, while items in different groups are not similar at all.

It could be said that K-means clustering is the most popular non-hierarchical clustering method available to data scientists today.

In K-means, the number of clusters, K, is fixed in advance (this is the part that makes it a non-hierarchical algorithm). A seed is selected for each of the K clusters, and each data object (row) in the set is assigned to the cluster whose seed it is closest to. Then, for each cluster, a new mean is calculated from the objects assigned to it, and all objects are once again distributed across the K clusters. …
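The assign-then-recalculate loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the article's own code: the function name `k_means`, the sample points, the random choice of seeds, and the stopping rule (stop when no object changes cluster) are all assumptions made for the example.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster 2-D points into k groups with the assign/update loop."""
    rng = random.Random(seed)
    # Seeds: k distinct data objects picked at random (one common choice).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the means are stable
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments


# Hypothetical data: two visually obvious groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

In practice a library implementation would usually be preferred; the sketch is only meant to make the two alternating steps visible.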


Doing more work with less effort is the name of the game in coding. For data scientists, this is a huge advantage of using tools like Python: they help extract as much information from data as possible in an efficient way.

Data exploration is one of the early steps of the data science process. During this step, it is important to gain an understanding of the structure and content of the information you are working with. There are functions like .describe() and .info() …


The comparison between the way things are and the way things ought to be is one that is made frequently. Good ice cream should be inexpensive, if not a free, public good, but oftentimes it is quite expensive. Exercise should be something we all strive for — it makes us feel good and there are health benefits; however, it is often very difficult to find the motivation to make it to the gym. The conflict between an ideal and an observed reality is present in everyday life, and statistics is no exception. …


Many things in life come in a variety of shapes, sizes, flavors, etc. It is this variety that is said to be “the spice of life”. Unfortunately, data scientists often have to save the variety for after hours and get the data they are working with to become rather similar.

When working with real-world data sets, it is common to find yourself working with a mix of continuous variables that span a wide range of values. If you are looking at a data set of students’ academic scores, the range of points for each assignment, quiz, or test might vary: a quiz might be scored out of 20 points while a test might be scored out of 100 points. …
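To make the quiz-versus-test mismatch concrete, here is a small sketch that puts both sets of scores on a common 0 to 1 scale. The excerpt does not name a technique, so the choice of min-max scaling is an assumption, as are the function name `min_max_scale` and the sample scores.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical scores: a quiz out of 20 points and a test out of 100 points.
quiz_scores = [12, 15, 20, 8]
test_scores = [70, 85, 100, 55]

scaled_quiz = min_max_scale(quiz_scores)
scaled_test = min_max_scale(test_scores)
print(scaled_quiz)
print(scaled_test)
```

After scaling, both variables occupy the same range, so neither dominates simply because it was recorded in larger units.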


“SQL is an important tool for any data scientist” — is my entry for the understatement of the year.

As information jockeys, data scientists use SQL to query the data they will need for analysis from established databases. Depending on the demands of your position as a data scientist, you might even be asked to help construct a database from your employer’s data sources, which SQL is also helpful for. …


“There are two kinds of people in this world…”

Are words we’ve all heard before, typically from an elder relative trying to make sense of some bigger picture for you. As cliché as the statement is, demographic information (identifiers which tell us something about the members of a population) is often split into just a few distinct buckets, allowing populations to be segmented. Gender, car ownership, job industry, being a parent or not, current employment status: these are just a few examples of the myriad data points that can be used to segment groups of people. Choices people make over the course of their lives will deposit them into a number of buckets defined by the data scientist interested in understanding their patterns and habits in the hopes of retaining them as a customer, delivering them relevant product ads, or targeting them with the most beneficial information. …


Once you have a good understanding of the data you have to work with, you next need to decide what question you aim to answer with it. Understanding the problem at hand is part of the Business Understanding step in the Data Science Process.

A business question with a data solution can often be posed as a hypothesis. For example: “Is there a difference in the customer conversion rate between our old website design and a proposed new layout?” Having a hypothesis to test is a must before statistical testing can occur.
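As an illustration, the conversion-rate question above could be checked with a two-proportion z-test. The excerpt does not say which test to use, so this choice is an assumption, and the function name `two_proportion_z_test` and the visitor and conversion counts below are made up for the sketch.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both designs convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: old design (A) versus proposed layout (B).
z, p_value = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

A small p-value would suggest the two designs convert at different rates; the significance threshold should be chosen before looking at the data.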

Two types of hypotheses are exploratory and confirmatory. As the names might suggest, exploratory hypotheses seek to uncover the “why” and dig into the data, while confirmatory hypotheses are more applicable when you have a pretty good idea of what is going on with the data and need evidence to support your thinking. It is important to decide a priori which of these categories each of your hypotheses belongs to. It has been argued that limiting exploratory hypothesis testing can help to increase certainty in results. …


It is widely reported that over 80% of a data scientist’s time is spent cleaning and engineering data. Great effort is put into preparing the information that will feed a data scientist’s models. Including irrelevant information or messy data in the modeling cycle can lead to models that are inaccurate or show false insights.

As such, one of the first steps of data cleaning is understanding what features or attributes will be available for the work and what type of attribute they will be. Data can be measured on four main scales: Interval, Ratio, Nominal, and Ordinal. …


Why it is important to know what your end goal is…

When I set out on a two-month journey to complete my capstone project for Flatiron School, effectively ending the first chapter of my story to become a data scientist, I knew that I wanted to have something more tangible than a business presentation to show for the work. It is my belief that the best data scientists are those who can put the information they are able to wrangle into the hands of a client or concerned party and allow them to make their own discoveries. …

Zach Zazueta
