Dataset Programming using R

Hello Data Experts,

Let me continue from my last blog “Basic R Programming” where we discussed Basic, Advanced and Relational Operators.


Let us take a step forward and understand how to load datasets. First, let me define a .csv file with the base data that we will use throughout this session. Copy the data below, paste it into Notepad, and save the file as “Plasma.csv”.


Number of times Pregnant, Plasma glucose concentration, Diastolic blood pressure, Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0
1,89,66,23
0,137,40,35
5,116,74,0
3,78,50,32
10,115,0,0
2,197,70,45
8,125,96,0
4,110,92,0
10,168,74,0
10,139,80,0
1,189,60,23
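
If you would rather not go through Notepad, the same file can also be created from R itself. The lines below are only a sketch: the names csv_lines, csv_path and check are mine, the file is written to R's temporary folder (an assumption, so that nothing in your own folders is touched), and the spaces after the commas in the header are dropped.

```r
# The header and the 14 data rows listed above
csv_lines <- c(
  "Number of times Pregnant,Plasma glucose concentration,Diastolic blood pressure,Triceps Skin fold thickness",
  "6,148,72,35", "1,85,66,29", "8,183,64,0", "1,89,66,23",
  "0,137,40,35", "5,116,74,0", "3,78,50,32", "10,115,0,0",
  "2,197,70,45", "8,125,96,0", "4,110,92,0", "10,168,74,0",
  "10,139,80,0", "1,189,60,23"
)

# Hypothetical location: a file named Plasma.csv inside the temporary folder
csv_path <- file.path(tempdir(), "Plasma.csv")
writeLines(csv_lines, csv_path)

# Read the file back to confirm its shape: 14 rows and 4 columns
check <- read.csv(csv_path)
dim(check)
```

To keep the file for the rest of this session, replace csv_path with a path inside your own working folder.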


Let us find out the working directory that R Studio uses by default by executing the command below.

 getwd()

If the result of the above command does not match the folder where you saved the file, set the working directory by executing one of the commands below.
setwd("C:/ABC/XYZ")

or

setwd("C:\\ABC\\XYZ")

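As a best practice, one should set the working directory explicitly at the start of every session. Note that setwd() needs the folder path as its argument; calling it with empty parentheses raises an error.

setwd("C:/ABC/XYZ")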

Kindly note that Windows file paths use “\” by default, whereas R expects “/”. Inside an R string a single “\” starts an escape sequence, so the other way to write the same path is to double every backslash as “\\”.
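
One way to avoid typing slashes by hand is file.path(), which joins folder names with “/” on every platform. The sketch below is an illustration under my own assumptions: the folder name plasma_demo and the variables old_wd, data_dir and moved are hypothetical, and the folder is created inside R's temporary directory.

```r
# Remember where we started so the session can be restored afterwards
old_wd <- getwd()

# file.path() joins the pieces with "/" regardless of the operating system
data_dir <- file.path(tempdir(), "plasma_demo")   # hypothetical folder
dir.create(data_dir, showWarnings = FALSE)

# Move into the new folder and record the name of the folder we are now in
setwd(data_dir)
moved <- basename(getwd())

# Go back to the original working directory
setwd(old_wd)
```

Saving the old directory first is a small habit that makes it easy to undo a setwd() call later in the session.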

To load the data from a CSV file, we execute the “read.csv()” command. Let us load the Plasma.csv data into an object: create an object named “Plasma” and assign to it the result of read.csv() with the file path.

Plasma <- read.csv("Plasma.csv")

In the Global Environment window, we can notice that a new row gets added for Plasma, showing its number of Observations and Variables.
Observations are the same as the number of rows.
Variables are the same as the number of columns.
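
The same two numbers can be checked from the console as well. This is a small sketch that assumes only the first three rows of the data, typed inline through the text argument of read.csv() so that it runs without the file; the variable name shape is mine.

```r
# First three rows of the Plasma data, supplied inline instead of from a file
Plasma <- read.csv(text = "Number of times Pregnant,Plasma glucose concentration,Diastolic blood pressure,Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0")

# dim() returns the observations (rows) first, then the variables (columns)
shape <- dim(Plasma)
shape
```

With the full Plasma.csv file the first number would be 14 instead of 3.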

Now, to see the data that got loaded into the Plasma object, we can execute the View command.

View(Plasma)

The same outcome can be achieved more quickly by clicking on that row in the Global Environment window; the data opens in the data viewer window on another tab. Once the data is loaded into the object, one can get the list of columns by executing either of the commands below:

names(Plasma)
colnames(Plasma)

Output will be as below
"Number.of.times.Pregnant" "Plasma.glucose.concentration" "Diastolic.blood.pressure"
"Triceps.Skin.fold.thickness"

Similarly, to get the names of the rows, one can execute the command below.
rownames(Plasma)

Output will be as below:
"1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"

To get the structural details of the loaded data, one can execute the command below. It shows the number of observations and variables in the dataset along with the data type of each variable.
str(Plasma)

Output will be as shown below:
'data.frame': 14 obs. of 4 variables:
 $ Number.of.times.Pregnant    : int 6 1 8 1 0 5 3 10 2 8 ...
 $ Plasma.glucose.concentration: int 148 85 183 89 137 116 78 115 197 125 ...
 $ Diastolic.blood.pressure    : int 72 66 64 66 40 74 50 0 70 96 ...
 $ Triceps.Skin.fold.thickness : int 35 29 0 23 35 0 32 0 45 0 ...

Now that we have the dataset loaded, let us retrieve the values from it as required.

To get the data for the second row, along with the column headers, execute the command below.
Plasma[2,]

To get the data for the second column, execute the command below.
Plasma[,2]

We can also pass a single index without a comma, as in the command below. Since no row or column is stated explicitly, R treats the number as a column number and returns that column for all rows. The result is a one-column data frame rather than a plain vector.
Plasma[2]
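
The two ways of asking for the second column look alike but return different kinds of objects, which matters once you pass the result to other functions. Here is a short sketch that assumes only the first three rows of the data, typed inline; the variable names col_vec, col_df, col_vec2 and col_keep are mine.

```r
# First three rows of the Plasma data, supplied inline instead of from a file
Plasma <- read.csv(text = "Number of times Pregnant,Plasma glucose concentration,Diastolic blood pressure,Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0")

col_vec  <- Plasma[, 2]    # with the comma: a plain integer vector
col_df   <- Plasma[2]      # without the comma: a one-column data frame
col_vec2 <- Plasma[[2]]    # double brackets: the same vector as Plasma[, 2]

# drop = FALSE keeps the data frame shape even with the comma form
col_keep <- Plasma[, 2, drop = FALSE]

class(col_vec)
class(col_df)
```

A function such as sum() works directly on the vector forms, which is why the comma form is used in the sum() example later in this blog.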

To retrieve a specific value from the dataset when we know its coordinates, specify the row number followed by the column number.
Plasma[2,3]

To retrieve the data for the first 2 rows, but only the values from the third column, execute the command below.
Plasma[1:2,3]

We can also combine functions such as sum() with the retrieved data, as listed below.
sum(Plasma[1:2,3])

It is not always easy to retrieve data based on coordinates; in that case we should be able to get values based on column and row names. This is how we can reference a column by name to retrieve all of its values:
Plasma$Number.of.times.Pregnant
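
Besides the $ form, base R offers a few other ways to reference data by name, including by row name. The sketch below assumes only the first three rows of the data, typed inline; the variable names by_dollar, by_bracket, by_comma and one_value are mine.

```r
# First three rows of the Plasma data, supplied inline instead of from a file
Plasma <- read.csv(text = "Number of times Pregnant,Plasma glucose concentration,Diastolic blood pressure,Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0")

# Whole column by name: these three forms return the same integer vector
by_dollar  <- Plasma$Diastolic.blood.pressure
by_bracket <- Plasma[["Diastolic.blood.pressure"]]
by_comma   <- Plasma[, "Diastolic.blood.pressure"]

# Single value: row name "2" (a character string) plus a column name
one_value <- Plasma["2", "Diastolic.blood.pressure"]
one_value   # 66, the blood pressure in the second row
```

The quoted forms are handy when the column name is stored in a variable, which the $ form cannot use.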

I am sure everyone is now comfortable with these very basic operations on a dataset, so let us do some more advanced ones. If we need to add a new calculated column, it is pretty easy to do so. This is how we can add a new calculated column:
Plasma$NewCol <- 1 + Plasma$Number.of.times.Pregnant

We can always retrieve a subset of the dataset by specifying rows and columns as below. The outcome will be all rows with only the first 4 columns.

Plasma[,1:4]

To retrieve the number of rows in the dataset, we can take the length of any one of its columns by executing the following command.

length(Plasma$Number.of.times.Pregnant)
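
A word of caution: length() gives the row count only when it is applied to a single column. Applied to the whole data frame it counts the columns instead. The sketch below shows the difference, assuming only the first three rows of the data typed inline; the variable names are mine.

```r
# First three rows of the Plasma data, supplied inline instead of from a file
Plasma <- read.csv(text = "Number of times Pregnant,Plasma glucose concentration,Diastolic blood pressure,Triceps Skin fold thickness
6,148,72,35
1,85,66,29
8,183,64,0")

rows_from_column <- length(Plasma$Number.of.times.Pregnant)  # 3: values in one column
rows_direct      <- nrow(Plasma)                             # 3: rows in the data frame
columns_instead  <- length(Plasma)                           # 4: this counts columns
```

For that reason nrow() is usually the safer choice when the row count is what you want.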

Datasets can be filtered based on certain criteria as shown below

Plasma[Plasma$Number.of.times.Pregnant == 10 & Plasma$Diastolic.blood.pressure > 0,]
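
To see what the filter is doing, it helps to look at the condition on its own: it is simply a TRUE/FALSE value per row. The sketch below assumes four rows picked from the data above and typed inline; the names keep, by_index and by_subset are mine, and subset() is shown as an alternative from base R, not as the method used in this blog.

```r
# Four rows picked from the Plasma data, supplied inline instead of from a file
Plasma <- read.csv(text = "Number of times Pregnant,Plasma glucose concentration,Diastolic blood pressure,Triceps Skin fold thickness
10,115,0,0
10,168,74,0
10,139,80,0
1,189,60,23")

# The condition evaluates to one TRUE or FALSE per row
keep <- Plasma$Number.of.times.Pregnant == 10 &
  Plasma$Diastolic.blood.pressure > 0

# Rows where keep is TRUE, all columns
by_index <- Plasma[keep, ]

# subset() applies the same condition without repeating "Plasma$"
by_subset <- subset(Plasma, Number.of.times.Pregnant == 10 &
                      Diastolic.blood.pressure > 0)
```

One difference to be aware of: if a column contains missing values (NA), subset() leaves those rows out, whereas the bracket form returns them as rows of NA.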

I hope this blog helped you understand how to load a dataset, view it, retrieve data and extract rows and columns using R commands. Now that we have understood both Data Operations and Dataset related commands in R, we should be able to move forward with some advanced topics in R programming. In my next blog, I will cover “First level statistical programming commands using R Studio”.

Thank you for continuing with me and reading through this blog. I hope it was insightful. Kindly share your valuable opinion, and please do not forget to suggest what you would like to understand and hear from me in my future blogs.

Thank you…
Outstanding Outliers:: “AG”.
