Sunday, 20 March 2016

Exploring India's Inflation

In this post we will build a basic exploratory plot using ggplot. The data was obtained from the World Bank. You can download it from here.

The data has only two columns: the year and the inflation in that year. Plotting it as is gives this plot:

Listing for the above plot:
inflation = read.csv('india_inflation.csv')
plot(inflation, type = "l")


Though this plot conveys all the information required, we can improve it in several ways. The first thing we notice is that there are certain peaks in the plot. These inflation values are very high or very low, and that should be highlighted. So let's sort the inflation values into categories: Low, Normal and High.


inflation = read.csv('india_inflation.csv')
inf.mean = mean(inflation$Inflation)
inf.sd = sd(inflation$Inflation)
high = inf.mean + inf.sd
low = inf.mean - inf.sd

# Categorize the data: start with NORMAL, then overwrite the outliers
inflation$Category = "NORMAL"
inflation$Category[inflation$Inflation > high] = "HIGH"
inflation$Category[inflation$Inflation < low] = "LOW"


This code calculates the mean and standard deviation of inflation. Any value more than one standard deviation above or below the mean is labeled HIGH or LOW accordingly. These labels can be used to color the graph.
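The same labeling can also be done in one step with base R's cut(). This is only a sketch of an alternative, not the listing used for the plot in this post, and the sample values are made up so that it runs on its own. One small difference: a value exactly equal to the lower threshold lands in LOW here.

```r
# Hypothetical sample values, used only so the sketch runs on its own;
# they are not the World Bank figures.
inflation = data.frame(
    Year = 2001:2008,
    Inflation = c(4, 5, 6, 5, 4, 12, 5, -1)
)

inf.mean = mean(inflation$Inflation)
inf.sd = sd(inflation$Inflation)

# cut() assigns each value to one of three intervals:
# (-Inf, mean - sd], (mean - sd, mean + sd], (mean + sd, Inf]
inflation$Category = cut(
    inflation$Inflation,
    breaks = c(-Inf, inf.mean - inf.sd, inf.mean + inf.sd, Inf),
    labels = c("LOW", "NORMAL", "HIGH")
)

# Count how many years fall into each category
table(inflation$Category)
```

Note that cut() returns a factor whose levels are ordered LOW, NORMAL, HIGH rather than alphabetically, so any color scale that relies on level order would need its values named or reordered.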


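library(ggplot2)

g = ggplot(inflation, aes(Year, Inflation)) +
    geom_line() +
    # Horizontal reference line at the mean; fixed settings go outside aes()
    # (on ggplot2 older than 3.4.0 use size = 1 instead of linewidth = 1)
    geom_hline(yintercept = inf.mean, color = "gold", linewidth = 1) +
    geom_point(aes(color = Category), size = 4) +
    # Name the colors so they do not depend on the order of the categories
    scale_color_manual(values = c(HIGH = "firebrick4", LOW = "firebrick4", NORMAL = "forestgreen")) +
    theme_bw() + theme(legend.position = "none")
g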



Tuesday, 19 January 2016

My Journey with plotting in R - Basic Plotting


Today, I plotted my first plot in R. I felt lost at first, but things became clear after trying a few examples. I tried the built-in data sets in R, but in this post I am going to explain the Nile data set, which can be found here.
CSV Description.

The CSV file contains the data (100 records) and the doc file contains the description of the data.

So let's begin..

Before we plot anything, let's try to understand the data. The head function shows the first few records of the data.

df = read.csv(file = 'Nile.csv', header = TRUE)
head(df)
  X time Nile
1 1 1871 1120
2 2 1872 1160
3 3 1873  963
4 4 1874 1210
5 5 1875 1160
6 6 1876 1160

time is the year of the observation and Nile is the water flow in that year.
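If you do not have the CSV at hand, the same columns can be rebuilt from the Nile time series that ships with R. This is a sketch that assumes the CSV mirrors that built-in series, which the first rows shown by head suggest.

```r
# Nile is part of base R (datasets package): a yearly time series.
# Rebuild a data frame with the same columns as the CSV.
df = data.frame(
    X = seq_along(Nile),             # running row number
    time = as.numeric(time(Nile)),   # year of each observation
    Nile = as.numeric(Nile)          # flow in that year
)
head(df)
```

From here on, df can be used in place of the data frame returned by read.csv.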

plot(df$time, df$Nile, xlab = 'Year', ylab = 'Flow', type = 'l')


With the above line, the following plot is generated:

The plot function is a versatile plotting tool and can draw many types of plots, which we will see in a bit. The above plot looks good, but it would be better if we added the average water flow in the Nile.

meanWater = mean(df$Nile)
abline(h = meanWater)
 
Looks good, but what if we could add minimum and maximum thresholds beyond which the flow in the Nile could cause trouble for nearby inhabitants? Let's use one standard deviation on either side of the mean as those thresholds. Let's also draw the average flow as a green line and the thresholds as red lines, since they mark the limits beyond which problems may arise.
sdWater = sd(df$Nile)
abline(h = meanWater, col = 'green')
abline(h = meanWater + sdWater, col = 'red')
abline(h = meanWater - sdWater, col = 'red')

And the final plot looks something like this:





I think we can now understand the plot more clearly. There was a very dry year around 1915, and we may conclude that there were floods around 1880 and 1895.
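Instead of reading the years off the plot, we can also list them. This is a small sketch that uses R's built-in Nile series as a stand-in for Nile.csv (an assumption; with the CSV loaded you would skip the first step). The names dryYears and wetYears are only illustrative.

```r
# Assumption: the built-in Nile series stands in for Nile.csv
df = data.frame(time = as.numeric(time(Nile)), Nile = as.numeric(Nile))

meanWater = mean(df$Nile)
sdWater = sd(df$Nile)

# Years whose flow lies more than one standard deviation from the mean
dryYears = df$time[df$Nile < meanWater - sdWater]
wetYears = df$time[df$Nile > meanWater + sdWater]

dryYears
wetYears
```

These are exactly the years where the line crosses outside the two red lines in the plot.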

Here's the complete script:

# Load the data and draw the yearly flow as a line
df = read.csv(file = 'Nile.csv', header = TRUE)
plot(df$time, df$Nile, xlab = 'Year', ylab = 'Water in Nile', type = 'l')
# Mean flow in green, one standard deviation either side in red
meanWater = mean(df$Nile)
sdWater = sd(df$Nile)
abline(h = meanWater, col = 'green')
abline(h = meanWater + sdWater, col = 'red')
abline(h = meanWater - sdWater, col = 'red')


Find out about the types of plots in my next blog post. ;)