Basic Statistical programming using R

Hello Data Experts,

Let me continue from my last blog, “Dataset using R”, where we discussed how to load a dataset from a CSV file and perform basic operations on it.

Given that R is a statistical language, it is important for us to unleash its power for statistical analysis. This blog will help you learn statistical calculations such as Mean, Median, Mode, Variance, Standard Deviation and a few others. Let us move forward and understand how to get these values.

Before we delve into statistical programming, let us first understand why we need to calculate Mean, Median, Mode, Variance and Standard Deviation.

Mean represents the average value, which can be influenced by outliers (unusually high or low values).

Median reflects the middle value and is much less likely to be influenced by outliers.

Mode is the most frequently occurring value in the dataset.

Standard Deviation measures the level of variation around the mean. It is commonly quoted in multiples: 1 Sigma, 2 Sigma, 3 Sigma, and so on up to 6 Sigma. Healthcare and airline industries should strive for 6 Sigma, because failing to meet that standard can have a huge impact on human life or cause a very severe disaster for the ecosystem.
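
To make the sigma levels a little more concrete, here is a minimal sketch that computes the share of values lying within k standard deviations of the mean. It assumes the data follow a normal distribution, which is an assumption made purely for illustration, and the variable names are my own.

```r
# Share of a normal distribution that lies within k standard deviations of the mean.
k <- 1:6
coverage <- pnorm(k) - pnorm(-k)

# Pair each sigma level with its coverage, expressed as a percentage.
sigma_table <- data.frame(sigma = k, percent_within = round(100 * coverage, 5))
print(sigma_table)
```

Under this assumption roughly 68% of values fall within 1 sigma, 95% within 2 sigma and 99.7% within 3 sigma.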


To keep this session simple, I will populate an object CarsMileage with a set of values and then derive statistical values from it. Let us populate CarsMileage with 20 sample values representing the mileage recorded over the last 20 weeks.

CarsMileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 12, 14, 12, 11.5, 12.5, 13.5, 15, 10.5, 15, 12, 14, 14)

Let us calculate the statistical values:



Minimum value is the smallest of all numbers in the dataset

min(CarsMileage)

This will result in the minimum value i.e., 10


Maximum value is the largest of all numbers in the dataset

max(CarsMileage)

This will result in the maximum value i.e., 15


Mean value is the average of all values

mean(CarsMileage)

This will result in the mean value i.e., 12.8


Median value is the middle value of the dataset. The data is sorted first, and then the middle value is taken. With an even number of data points the Median is the average of the two middle values, whereas with an odd number of data points it is the single middle value.

median(CarsMileage)

Median value of this dataset will be 12.5
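
As a small sketch of the steps described above, the median can be reproduced by hand, and a single extreme reading shows how differently the mean and the median react to an outlier. The extra value 60 is a made-up reading added only for illustration, and the variable names are my own.

```r
CarsMileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 12, 14,
                 12, 11.5, 12.5, 13.5, 15, 10.5, 15, 12, 14, 14)

# Median by hand: sort, then average the two middle values (n is even here).
sorted <- sort(CarsMileage)
n <- length(sorted)
manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

# Append one hypothetical extreme reading and see how each measure shifts.
with_outlier <- c(CarsMileage, 60)
mean_shift <- mean(with_outlier) - mean(CarsMileage)
median_shift <- median(with_outlier) - median(CarsMileage)

print(c(manual_median = manual_median,
        mean_shift = mean_shift,
        median_shift = median_shift))
```

Here the mean moves by more than 2 units, while the median does not move at all.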


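Mode is the value that occurs most often in the dataset. Note, however, that the mode() function in R does not calculate this statistic; it returns the storage mode (data type) of the object.

mode(CarsMileage)

The result for this dataset is “numeric”, which is the data type of CarsMileage and not its statistical mode.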
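
Base R has no built-in function for the statistical mode, so here is one possible sketch that uses table() to count occurrences. The name stat_mode is my own choice and not an R function.

```r
CarsMileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 12, 14,
                 12, 11.5, 12.5, 13.5, 15, 10.5, 15, 12, 14, 14)

# Count how often each distinct value occurs.
counts <- table(CarsMileage)

# Keep every value that reaches the highest count (there can be ties).
stat_mode <- as.numeric(names(counts)[counts == max(counts)])
print(stat_mode)
```

For this dataset the most frequent value is 12, which appears five times.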


Standard Deviation of the dataset

sd(CarsMileage)

Standard Deviation value of this dataset is 1.499123


Variance of this dataset

var(CarsMileage)

Variance value of this dataset is 2.247368
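
To see where these two numbers come from, here is a short sketch that rebuilds them by hand. R's var() and sd() use the sample formulas, which divide by n - 1; the variable names below are my own.

```r
CarsMileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 12, 14,
                 12, 11.5, 12.5, 13.5, 15, 10.5, 15, 12, 14, 14)

n <- length(CarsMileage)

# Distance of every value from the mean.
deviations <- CarsMileage - mean(CarsMileage)

# Sample variance: sum of squared deviations divided by (n - 1).
manual_var <- sum(deviations^2) / (n - 1)

# Standard deviation is the square root of the variance.
manual_sd <- sqrt(manual_var)

print(c(manual_var = manual_var, manual_sd = manual_sd))
```

Both values match the output of var(CarsMileage) and sd(CarsMileage).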


Let us find the probability of a mileage of 10 or less, assuming the mileage follows a normal distribution with the mean and standard deviation of our sample. We will get to the concept of the standard normal distribution later.

pnorm(10, mean(CarsMileage), sd(CarsMileage))
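
As a rough sketch of what this call does, pnorm() returns the cumulative probability P(X <= 10). The same number comes out if we first convert 10 to a z-score and then use the standard normal distribution. The variable names are my own.

```r
CarsMileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 12, 14,
                 12, 11.5, 12.5, 13.5, 15, 10.5, 15, 12, 14, 14)

m <- mean(CarsMileage)
s <- sd(CarsMileage)

# z-score: how many standard deviations the value 10 lies from the mean.
z <- (10 - m) / s

# Cumulative probability computed directly and via the standard normal.
p_direct <- pnorm(10, mean = m, sd = s)
p_standard <- pnorm(z)

print(c(z = z, p_direct = p_direct, p_standard = p_standard))
```

The probability is small (about 3%), because 10 lies almost two standard deviations below the mean.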

Similarly, one may want to check whether the values follow a normal distribution. A normal Q-Q plot does this: qqnorm() plots the sample quantiles against the theoretical normal quantiles and qqline() adds a reference line, so points lying close to the line suggest approximately normal data. A related idea is standardization. What does a standardized value mean? When a dataset has several variables (columns) measured in different units, statistical calculations tend to be dominated by the variables with the largest numeric values. Car mileage, for example, depends on variables such as the speed, weight, engine capacity and size of the car. Speed is typically a two-digit number, whereas capacity is a four-digit number, so capacity would dominate the outcome. To bring parity and get a reliable outcome, we should first bring all variables to the same scale. We will discuss in detail later why and how this affects the outcome.

qqnorm(CarsMileage)

qqline(CarsMileage)
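
Here is a minimal sketch of standardization on our single variable: each value is converted to a z-score by subtracting the mean and dividing by the standard deviation. Base R's scale() does the same thing with its default arguments. The variable names are my own.

```r
CarsMileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 12, 14,
                 12, 11.5, 12.5, 13.5, 15, 10.5, 15, 12, 14, 14)

# Standardize by hand: subtract the mean, divide by the standard deviation.
standardized <- (CarsMileage - mean(CarsMileage)) / sd(CarsMileage)

# scale() centers and scales the same way; it returns a one-column matrix.
scaled <- as.numeric(scale(CarsMileage))

# After standardization the mean is 0 and the standard deviation is 1.
print(round(c(mean = mean(standardized), sd = sd(standardized)), 6))
```

Once every variable is on this common scale, no single variable dominates simply because its numbers are larger.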


If we would like a summary of the key statistical values in one shot, it is very easy to achieve in R.

summary(CarsMileage)

This will give us the following details:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   10.0    12.0    12.5    12.8    14.0    15.0
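
The quartiles in this summary can also be requested on their own. The sketch below uses quantile() with its default settings, and IQR() for the spread between the 1st and 3rd quartiles; the variable names are my own.

```r
CarsMileage <- c(12, 14, 12.5, 13.5, 15, 10, 11, 12, 12, 14,
                 12, 11.5, 12.5, 13.5, 15, 10.5, 15, 12, 14, 14)

# 1st quartile, median and 3rd quartile in a single call.
quartiles <- quantile(CarsMileage, probs = c(0.25, 0.5, 0.75))

# Interquartile range: 3rd quartile minus 1st quartile.
iqr <- IQR(CarsMileage)

print(quartiles)
print(iqr)
```

The quartiles come out as 12, 12.5 and 14, matching the summary above, and the interquartile range is 2.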

I hope this first glimpse of R's statistical power has been helpful and has shown that calculations we used to do by hand in school can be performed in R with a single command. Now that we have the key statistical functions handy, we will explore more in the coming blogs, so stay tuned. In my next blog, I will cover “Graphical representation of statistical values using R Studio”.

Thank you for taking the time to go through this blog. I hope it helped you grasp the basics of statistical computing with R. Kindly share your opinion, and please do not forget to suggest what you would like to hear from me in future blogs.

Thank you…

Outstanding Outliers:: “AG”.
