Nitendra Gautam

Introduction to Machine Learning

Machine Learning is a field of computer science that uses statistical techniques to train a computer system to learn from data and act on what it has learned. Machine Learning applications are not explicitly programmed; they learn from data. These applications infer patterns and relationships between the variables in a dataset, and then use this knowledge to predict outcomes for the target dataset.

Common terms used in Machine Learning

This section explains the most common terminology used in the context of Machine Learning.


Features

A feature or an independent variable represents an attribute or a property of an observation. Features are also known as dimensions. In a tabular dataset, a row represents an observation and a column represents a feature.

For example, consider the dataset below, which includes fields such as age, profession, city and income.

age	profession	city	income
25	Accountant	Dallas	100000
30	Teacher	Atlanta	60000
35	Doctor	Houston	15000

Each field in this dataset is a feature in the context of machine learning, and each row is an observation. Because features are also called dimensions, a dataset with high dimensionality is one with a large number of features.

There are two types of features: categorical and numerical.
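As a minimal sketch, the example table can be held in plain Python to make the row/column vocabulary concrete. The variable names (`columns`, `rows`) are illustrative choices, not part of any library.

```python
# The example table: each tuple is one observation (row), and each
# position in the tuple corresponds to one feature (column).
columns = ["age", "profession", "city", "income"]
rows = [
    (25, "Accountant", "Dallas", 100000),
    (30, "Teacher", "Atlanta", 60000),
    (35, "Doctor", "Houston", 15000),
]

num_observations = len(rows)  # number of rows
num_features = len(columns)   # number of columns, i.e. the dimensionality

# Pull out a single feature as a list of values, one per observation.
age_index = columns.index("age")
ages = [row[age_index] for row in rows]

print(num_observations, num_features, ages)
```

Here the dataset has three observations and four dimensions.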

Categorical Features/variables

A categorical feature or variable is a descriptive feature. It can take on one of a fixed number of discrete values. It represents a qualitative value, which is a name or a label. The values of a categorical feature have no ordering.

Some examples are below.

  • Profession (Banker, Teacher, Waitress, Accountant, etc.)
  • Country (Nepal, USA, China, etc.)

Numerical Features/variables

It is a quantitative variable that can take on any numerical value. It describes a measurable quantity as a number. The values in a numerical feature have mathematical ordering.

Numerical features can be further classified into discrete and continuous features. A discrete numerical feature can take on only certain values. A continuous numerical feature can take on any value within a finite or infinite interval.

Some examples are given below.

  • Discrete (number of bedrooms in a home or hotel)
  • Continuous (temperature, income)
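The distinction between the two feature types can be sketched with a rough heuristic that looks only at the Python type of the values. This is an assumption made for illustration: real data often stores categories as numbers (zip codes, for instance), so a type check alone is not a reliable rule. The function name `feature_kind` is hypothetical.

```python
def feature_kind(values):
    """Return 'numerical' if every value is an int or float, else 'categorical'."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numerical"
    return "categorical"


# The example dataset from earlier, stored column by column.
dataset = {
    "age": [25, 30, 35],
    "profession": ["Accountant", "Teacher", "Doctor"],
    "city": ["Dallas", "Atlanta", "Houston"],
    "income": [100000, 60000, 15000],
}

kinds = {name: feature_kind(values) for name, values in dataset.items()}
print(kinds)
```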

Label/Dependent Variable

A label or a dependent variable is the variable that a machine learning algorithm learns to predict.

It can be classified into two categories: categorical and numerical.

  • Categorical: It represents a class or category. If we are developing a Machine Learning application that classifies news articles, the categorical label can be politics, business, sports or any other news section.

  • Numerical: It represents a numerical dependent variable. If we are developing an application for the housing market, one numerical dependent variable can be the house price.
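To illustrate how a label relates to features, the sketch below separates one column from the rest. It assumes, purely for illustration, that `income` is the value to predict in the earlier example table; the helper name `split_features_and_label` is hypothetical.

```python
def split_features_and_label(observations, label_name):
    """Split each observation (a dict) into its features and its label value."""
    features = []
    labels = []
    for obs in observations:
        labels.append(obs[label_name])
        # Everything except the label column is treated as a feature.
        features.append({k: v for k, v in obs.items() if k != label_name})
    return features, labels


observations = [
    {"age": 25, "profession": "Accountant", "city": "Dallas", "income": 100000},
    {"age": 30, "profession": "Teacher", "city": "Atlanta", "income": 60000},
    {"age": 35, "profession": "Doctor", "city": "Houston", "income": 15000},
]

# Assumption: income is the numerical label; the other fields are features.
X, y = split_features_and_label(observations, "income")
print(X[0], y[0])
```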


Model

A model is a mathematical relationship between the dependent and independent variables, used to capture patterns within a dataset. Once a machine learning (ML) model is developed, it can be used for prediction given some input parameters.

Given the values of the independent variables, it can calculate or predict the value of the dependent variable. An ML algorithm trains a model with data so that the model can predict the label for any new observation.
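One of the simplest possible models is a straight line, y = slope * x + intercept, fitted by ordinary least squares. The sketch below is only one example of a model, chosen for brevity; the house sizes and prices are made-up numbers for illustration, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: house size (feature) and house price (label).
sizes = [1000, 1500, 2000, 2500]
prices = [200000, 300000, 400000, 500000]

model = fit_line(sizes, prices)       # training: learn slope and intercept
new_price = predict(model, 1800)      # prediction for an unseen observation
print(model, new_price)
```

Training produces the parameters (here the slope and intercept), and prediction applies them to an observation the algorithm has not seen.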

Training Datasets

It is the data used by an ML algorithm to train a mathematical model. It is either historical data or another dataset whose values are already known.

Training data can be classified into two categories: labeled and unlabeled.

A labeled dataset has a label for each observation.

An unlabeled dataset does not have a column that can be used as a label.

Test Datasets

A test dataset is used for evaluating the predictive performance of a model.

An ML model should not be tested with the data it was trained on. Generally, 80% of the total dataset is used for training a model and the remaining 20% is used as the test dataset.
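An 80/20 split can be sketched with the standard library alone. The function below is a hand-written illustration (the name `split_train_test`, the default `seed`, and the rounding rule are assumptions); it shuffles first so that the test portion is not just the last rows of the file.

```python
import random


def split_train_test(observations, test_fraction=0.2, seed=42):
    """Shuffle a copy of the observations and split it into train and test lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(observations)          # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


data = list(range(10))  # stand-in for ten observations
train, test = split_train_test(data)
print(len(train), len(test))
```

Every observation ends up in exactly one of the two lists, so the model is never evaluated on data it was trained on.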

Machine Learning Uses

Machine Learning is used in many applications, including:

  • Video Game development
  • Driverless Car
  • Spam Filtering in Email
  • Medical Diagnosis
  • Image Recognition
  • Fraud Detection
  • Voice recognition
  • Movie, music, book, clothes and other online recommendations


Mohammed Guller, Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large Scale Data Analysis, Apress, Berkeley, CA, 2015.
