How to get started with
R for Data Journalism?

An exploratory approach to using RStudio and its packages for data journalism

PART 1: WHY/HOW TO USE RSTUDIO FOR DATA JOURNALISM?



Whether you are new to data journalism or a full-blown pro in computer-assisted reporting, you have probably noticed that the data stories we tell keep getting more sophisticated. Why? Possibly because there is so much more opportunity to do so, because more and increasingly complex data is available to work with, or simply because competition in data journalism keeps growing and everyone is keen to present something better than before.

Whatever the reason, to meet the demand of working with new data we increasingly have the option to apply new data visualisation techniques for the web, such as D3 or other JavaScript libraries. But what about the analysis part? Tools like Matlab and SPSS are powerful, but they can become extremely expensive and may simply not offer the right environment for the fast-paced world of today's data journalism. In addition, many tools, including some of the newcomers (e.g. Tableau Software), either offer too little for high-level analysis work or, if they do offer it, simply won't pay off. So what's the alternative?

R

I usually don't know much about the interesting bits and bytes in the data until I have gone through a base-level analysis and found outliers or some weird or questionable trends. Yes, MS Excel can help, but how about something that doesn't crash all the time and give you a headache, can handle much more data, and provides a sophisticated environment to scrape, analyse and output data?

One of the cheapest and, in my opinion, most elegant ways for people like data journalists to discover data is using R. R is a programming language and software environment for statistical computing and graphics, widely used among statisticians and data miners for developing statistical software and doing data analysis. That's the official labelling. Here we will misuse it for data journalism.


Interest in "RStudio", and "Data Science" follows a similar trend, and a comparable upwards curve: Since 2011 on Google Search, did the area of data science and number crunching become trendy?


I found it liberating to have such a powerful and well-documented open source solution at hand. This short tutorial and resource blog should give people like me a quick intro into the opportunities of R.

I regret that I have not yet had the chance to use R to its full extent. Now that I do many more data stories for clients, I thought I might as well write a small tutorial on how to use R on some financial stock data - and, maybe more importantly for the online journalists in the virtual room, how to use it to publish interactive charts on the web. For the latter we will use plotly, but more on that in the course of the tutorial.

How to set up? Download RStudio

Every good tutorial starts with a detailed description of how to download the software, the requirements and other boring setup stuff. Well, I never promised you that this is a good tutorial in the first place, so we will skip over this section rather quickly. If you haven't installed R yet, get it from the open source CRAN project (cran.r-project); then get the latest version of RStudio from Rstudio.com and download it for your system. Find a tutorial on how to install RStudio here as well. If you wonder what the difference is: R is the programming language for statistical computing and graphics, while RStudio is a free and open source integrated development environment (IDE) for R.

At the end, I will post some tutorials, guides, additional resources, lists and all kinds of stuff that may help data journalists get going. Also, I must warn you: the R community, forums and documentation are rather extensive. I felt quite overwhelmed at first. But have no fear - the community welcomes newbies like you and will quickly help you learn the dark arts of R. So, let's get started with some data.


PART 2: GET STARTED WITH R FOR YOUR JOURNALISM PRACTICE

Using R to analyse financial stock data



Let's start off with a simple data project. Assume you work as a data journalist and have been asked to find stories in financial stock data. How daring and exciting, you might think (or you think this day can't get any worse). Either way, here is one way to do it. With this example we will go through the basics of RStudio, judge stock market trends and trends for industries and sectors, and eventually apply one way (out of many) to evaluate individual stocks and whether they may be undervalued or overvalued.

If you wonder whether this sounds a lot like the example in chapter 4 of the book Practical Data Science Cookbook, then (a) you are right, and (b) you are probably too knowledgeable for this basic tutorial anyway.

Here is what we do: first we load the data into RStudio, then we do some basic analysis, then some more, then some plotting on the web, and then some more analysis.

Well, that actually sounds a lot like Excel (yes, that annoying program). Let me take you on an adventurous tour and show you why R is so different and powerful. Maybe you will get enough of a taste of it to start seeing Excel in a new light. Below are some more reasons why Excel has its limits on fun.

R's main data type is the data frame, which is ideal for working with structured data. CSV is one of the most common file types and a format you will often encounter in your data journalism. We start off with a .csv file, which you can also download using the button.
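If you have never met a data frame before, here is a minimal sketch (with made-up values, purely for illustration) of what one looks like and how to inspect it:

# a tiny, made-up data frame: each column is a variable, each row an observation
example_df <- data.frame(Ticker = c("AAA", "BBB", "CCC"),
                         Price  = c(12.5, 103.2, 48.7),
                         Sector = c("Healthcare", "Technology", "Utilities"),
                         stringsAsFactors = FALSE)
str(example_df)    # column types and dimensions
head(example_df)   # the first rows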

We will first read in stock market data, summarise it, clean and investigate it, create relative valuations for the stocks, then screen them and analyse historical prices (by reading in additional data from Yahoo Finance). Lastly, we will plot the charts interactively on the web.

Read the data into RStudio

newData <- read.table('http://www.win-vector.com/dfiles/car.data.csv', sep=',', header=T)

Alternatively, especially for large data, you can use the fread command (as in newData <- fread('file path')) from the data.table package, which allows (very) fast importing of large structured datasets into R as well as fast aggregation of large data, including very fast ordered joins.

You can load data in from a URL (and also unzip it) and store the result in a new data frame object; in our case it will be called finviz. If the csv file has a header line, header=T tells R to expect a header row that defines the column names, and sep=',' specifies the column or field separator as a comma.
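If you want to try the fread route mentioned above, a minimal sketch (assuming financialdata.csv, the file we use later in this tutorial, is already sitting in your working directory) looks like this:

# install.packages("data.table")   # only needed once
library(data.table)

# fread() guesses the separator and column types and is much faster on big files
finviz_dt <- fread("financialdata.csv")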

Download the data files for this example:

Download the data here to follow along (it's an example from the book "Practical Data Science Cookbook"). On top of this example, we will build interactive charts for the web with the plotly and ggplot2 packages.

What if your data for your DDJ project doesn't come in a CSV file?

Sometimes the data journalist doesn't find the data in a convenient CSV format. For these cases, the following R packages (with information on how to use them) cover the different data formats you might encounter; a short sketch for two common cases follows the list:

  • Reading-Excel-Spreadsheets into R
  • R for MongoDB
  • R for JavaScript Object Notation
  • R for Extensible Markup Language
  • R for Structured Query Language
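As a minimal sketch for two of the most common cases (the file names below are placeholders, and readxl/jsonlite are just one package choice per format - the links above may point to alternatives that work just as well):

# Excel spreadsheets via the readxl package
# install.packages("readxl")
library(readxl)
spreadsheet <- read_excel("my_spreadsheet.xlsx", sheet = 1)

# JSON via the jsonlite package
# install.packages("jsonlite")
library(jsonlite)
parsed <- fromJSON("my_data.json")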

The first step is to install the set of packages we want to use. Make sure you install all of them by highlighting the lines of code below: install.packages("your-package-name") installs a package and library(your-package-name) activates it in RStudio (Ctrl+Enter, or Cmd+Enter on a Mac, runs the highlighted lines). With no red error messages, you should be good to go.

Before we start, make sure you have loaded all the packages we need. You do this by running the following R commands:

install.packages("XML")
								library(XML)
								install.packages("ggplot2")
								library(ggplot2)
								install.packages("plyr")
								library(plyr)
								install.packages("reshape2")
								library(reshape2)
								install.packages("zoo")
								library(zoo)
								install.packages("plotly")
								library(plotly)

First, set your working directory via setwd('Your/R/working/directory') - this will allow you to read financialdata.csv directly into R once you have saved it there. Next, read in the file. As it is a csv, we can use finviz <- read.csv('financialdata.csv', stringsAsFactors=F). What you get is hopefully no error message. For more info on read.csv, type ?read.table

setwd('/Users/BH/Desktop/R')
finviz <- read.csv("financialdata.csv", stringsAsFactors=F)

Please note: originally the cookbook's recipe prescribed downloading the data directly from a site called FinViz, but these not-so-friendly finance people had the great idea of charging people to download the latest publicly available stock price data in csv format (Arrrrr!). If you are still keen, go to this YouTube link and watch how to pull the data anyway by scraping it into Excel (someone should write a script for R too :-)
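If you do want to attempt it in R rather than Excel, a very rough sketch with the rvest package (not part of the install list above) might look like the following - the screener URL is an assumption, FinViz's page layout and terms of use can change at any time, and the free page exposes fewer columns than the paid CSV export:

# install.packages("rvest")
library(rvest)

# hypothetical screener URL -- check the site and its terms before scraping
page   <- read_html("https://finviz.com/screener.ashx?v=152")
tables <- html_table(page)   # a list of every table found on the page
length(tables)               # inspect which list element holds the screener rows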

Once you have the data, let's get a feel for it with the following commands:

summary(finviz)
head(finviz)

summary() is a generic function that invokes particular methods depending on the class of its first argument. In our case, for numeric columns, the output tells us the min, max, mean, median, and 1st and 3rd quartile values. To be honest, for some character fields it is just not useful, but it gives you a great start to see which values appear most often and provides the modern data journalist with an idea of where to sniff for a good story.

head() returns the start of the dataset (a vector, matrix, table, data frame or function) and shows us the structure of the data in the first 6 rows. In our case we have missing data, which R replaces with NA. We also see percentage signs and other stray characters in the numeric columns. Via the commands below we remove them and set columns 7 to 68 (yes, that's how many columns this dataset has) to be numeric.

Next we need a cleaning function (for more on gsub, type ?gsub):

clean_numeric <- function(s){
  s <- gsub("%|\\$|,|\\)|\\(", "", s)
  s <- as.numeric(s)
}

We apply it to the data (for more on cbind, search ?cbind). The function is applied to columns 7 through 68.

finviz <- cbind(finviz[,1:6],apply(finviz[,7:68], 2, clean_numeric))

Next, let's get a feel for some of the distributions. I quickly calculated the distributions for institutional ownership, profit margins, how many shares were issued (outstanding shares), how the stock price is distributed, and the return on equity (ROE). The ROE, an important measure of a corporation's profitability, tells us how efficient companies are at generating profits (whoever can make more money with as little invested equity as possible is more efficient).

According to Investopedia, the profit margin is part of a category of profitability ratios calculated as net income divided by revenue, or net profits divided by sales. Net income or net profit is determined by subtracting all of a company's expenses, including operating costs, material costs and tax costs, from its total revenue. Profit margins are expressed as a percentage and measure how much out of every dollar of sales our listed companies actually keep in earnings. (A 20% profit margin means the company has a net income of $0.20 for each dollar of total revenue earned.)
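As a quick worked example (the figures below are invented purely for illustration, not taken from the dataset), both ratios boil down to simple division in R:

net_income <- 2000000     # net profit in dollars (invented figure)
revenue    <- 10000000    # total revenue in dollars (invented figure)
equity     <- 8000000     # shareholders' equity in dollars (invented figure)

profit_margin <- net_income / revenue * 100   # 20: the company keeps $0.20 of every dollar of sales
roe           <- net_income / equity * 100    # 25: profit generated per dollar of invested equity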

hist(finviz$Institutional.Ownership[finviz$Institutional.Ownership<100], breaks=10, main="Institutional.Ownership", xlab="Institutional.Ownership", ylab="in %")

hist(finviz$Profit.Margin[finviz$Profit.Margin>-100], breaks=10, main="Profit Margin", xlab="Profit.Margin", ylab="in %")

hist(finviz$Shares.Outstanding[finviz$Shares.Outstanding<200], breaks=100, main="Shares.Outstanding", xlab="This is the total number of stock shares the company has issued", ylab="in #")

hist(finviz$Price[finviz$Price<150], breaks=100, main="Price Distribution", xlab="Price", ylab="in $")

hist(finviz$Return.on.Equity[finviz$Return.on.Equity>0 & finviz$Return.on.Equity<50], breaks=10, main="Return on Equity", xlab="Return.on.Equity", ylab="in $")


Next, we split our data into subsets and compute summary statistics for them. The aggregate command (?aggregate for more info) returns the result in a convenient form. Via the commands below, we create a sector-specific summary of profit margins, which gives us an average for each sector.

sector_avg_profitmargins <- aggregate(Profit.Margin~Sector,data=finviz,FUN="mean")
colnames(sector_avg_profitmargins)[2] <- "Sector_Avg_Profitmargins"

Now, to the fun part: let's visualise it with the help of the ggplot2 package and Plotly.

How to Plot on the web?

To get started, you need a free account on plotly.com. Go to Plotly and sign up with your email and a password. Once logged in, go to this link to access your plotly_username and plotly_api_key. This page describes everything again in greater detail. Type the following commands and run them one by one:

Sys.setenv("plotly_username"="benheubl")
										
										Sys.setenv("plotly_api_key"="xxx")library(ggplot2)
										set.seed(1234)

										library(plotly)

Now that you are set, make sure you also paste both Sys.setenv lines, with your details, into your .Rprofile file. You can find it by opening your terminal (on a Mac) and pasting in the following commands to find and open your .Rprofile (more info on StackOverflow here). Use the command line:

touch ~/.Rprofile
open ~/.Rprofile


Now, with the previous steps in mind, execute the command below. For more info on ggplot, run ?ggplot

vizprofitmargins <- ggplot(sector_avg_profitmargins, aes(x=Sector, y=Sector_Avg_Profitmargins, fill=Sector)) +
  geom_bar(stat="identity") + ggtitle("Profit Margins Across Industries") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
(viz1 <- ggplotly(vizprofitmargins))

After running it, your browser should open a new tab with an interactive plot. For an iframe, click on "show permalink", copy it, and follow the instructions here so that your editor can use it in the news/data journalism website CMS. Pretty neat, don't you think?


Visually investigating and cleaning data with R

As we can see, the healthcare sector's profit margins stand out, and my first assumption is that outliers exist that skew the graph (something every journalist should first assume, regardless). To dig deeper, we want to create a summary of the average profit margins by industry (note: not by sector).

industry_avg_profitmargin <-
  aggregate(Profit.Margin~Sector+Industry,data=finviz,FUN="mean")
industry_avg_profitmargin <-
  industry_avg_profitmargin[order(industry_avg_profitmargin$Sector,industry_avg_profitmargin$Industry),]
colnames(industry_avg_profitmargin)[3] <- "Industry_Avg_Profitmargin"

You can inspect the resulting industry_avg_profitmargin table via View(industry_avg_profitmargin).


Next, we create a subset. subset(data) returns subsets of vectors, matrices or data frames which meet certain conditions. Our goal is to isolate the industries of the healthcare sector only. For this we run:

industry_chart <- subset(industry_avg_profitmargin,Sector=="Healthcare")

Let's output it again via plotly. The result will hopefully offer a more truthful picture.

p <- ggplot(industry_chart, aes(x=Industry, y=Industry_Avg_Profitmargin, fill=Industry)) +
  geom_bar(stat="identity") + theme(legend.position="none") + ggtitle("Industry Avg Profit Margins") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
(viz2 <- ggplotly(p))

Profit margins of drug delivery companies confuse and skew the picture.

<iframe width="100%" height="700px" frameborder="0" seamless="seamless" scrolling="no" src="https://plot.ly/~benheubl2/252/industry-avg-profit-margins/"> </iframe>

As the profit margins of drug delivery companies are so far outside a reasonable range, we will look at the companies causing the concern by creating another subset for drug delivery companies.

company_chart <- subset(finviz,Industry=="Drug Delivery")

Now we see the 13 companies that find themselves in the Drug Delivery industry category. And again, we visualise it via ggplot2 with the help of plotly:

p <- ggplot(company_chart, aes(x=Company, y=Profit.Margin, fill=Company)) +
  geom_bar(stat="identity") + theme(legend.position="none") +
  ggtitle("Company Avg Profit Margins for Drug Delivery Companies") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
(viz3 <- ggplotly(p))

Emisphere Technologies is the company whose profile skewed the distribution in our first chart, "Industry Avg Profit Margins".


Poor EMIS (the ticker symbol for Emisphere Technologies) has had a bad run since 2011.


In order to remove this disturbing profile for a more reasonable analysis, we put our data into a new subset, this time without the ticker for Emisphere Technologies, Inc. I quickly googled the ticker symbol on Finance.yahoo.com to confirm it is "EMIS", as in the company_chart subset.

finviz <- subset(finviz, Ticker!="EMIS")

Let's run the plot for "Industry Avg Profit Margins" again - first we need to re-run the code for sector_avg_profitmargins:

sector_avg_profitmargins <- aggregate(Profit.Margin~Sector,data=finviz,FUN="mean")
colnames(sector_avg_profitmargins)[2] <- "Sector_Avg_Profitmargins"

Now, let's draw the plot again:

p <- ggplot(sector_avg_profitmargins, aes(x=Sector, y=Sector_Avg_Profitmargins, fill=Sector)) +
  geom_bar(stat="identity") + ggtitle("Profit Margins Across Industries") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

(viz4 <- ggplotly(p))

This time we see an adjusted and perhaps more reasonable graphic, which of course does not exclude other outliers (other than Emisphere Technologies):

Where should I invest? A Valuation model in 3 steps with R

Next we will build a stock valuation model using R:

1. We calculate the sector averages for the fields used in relative stock valuation (more info on Relative Valuation)
2. We calculate the industry-level averages the same way
3. We compare each stock's statistics to these averages to arrive at an index value per stock that tells us whether it might be undervalued

First thing is to "melt" calculate averages in multiple columns, making the average columns contain the averages for each company sector. The melt function is a generic melt function, and convert an object into a molten data frame.

sector_avg <- melt(finviz, id="Sector")
sector_avg <- subset(sector_avg,variable%in%c("Price","P.E","PEG","P.S","P.B"))


To allow the grouping later, all column-heading variables are now listed vertically alongside their corresponding values. We remove the rows that have no value (via na.omit) and reformat the data so the values are numeric (via as.numeric).

sector_avg <- (na.omit(sector_avg))
sector_avg$value <- as.numeric(sector_avg$value)

Now we can compute the sector averages via dcast (more on ?dcast). It basically casts the columns back horizontally onto the sector_avg data frame and shows the averages by sector (values which, once merged back into finviz, simply repeat across companies in the same sector). Then we set the column names for columns 2 to 6 accordingly.

sector_avg <- dcast(sector_avg, Sector~variable, mean)
colnames(sector_avg)[2:6] <- c("SAvgPE","SAvgPEG","SAvgPS","SAvgPB","SAvgPrice")

This way, we get 9 observations and 6 variables with some neat new column names.

Via the same set of commands, we create a data frame for the industries:

industry_avg <- melt(finviz, id=c("Sector","Industry"))
industry_avg <- subset(industry_avg,variable %in% c("Price","P.E","PEG","P.S","P.B"))
industry_avg <- (na.omit(industry_avg))
industry_avg$value <- as.numeric(industry_avg$value)
industry_avg <- dcast(industry_avg, Sector+Industry~variable, mean)
industry_avg <- (na.omit(industry_avg))
colnames(industry_avg)[3:7] <- c("IAvgPE","IAvgPEG","IAvgPS","IAvgPB","IAvgPrice")

As a result, this gives us 209 observations with 7 variables, again with neat new variable names. Next, this data is merged back onto the main finviz data set via:

finviz <- merge(finviz, sector_avg, by.x="Sector", by.y="Sector")
finviz <- merge(finviz, industry_avg, by.x=c("Sector","Industry"), by.y=c("Sector","Industry"))

The next thing in the recipe is to add another 10 columns that will allow us to rate whether a stock is under or above the average:

finviz$SPEUnder <- 0
finviz$SPEGUnder <- 0
finviz$SPSUnder <- 0
finviz$SPBUnder <- 0
finviz$SPriceUnder <- 0
finviz$IPEUnder <- 0
finviz$IPEGUnder <- 0
finviz$IPSUnder <- 0
finviz$IPBUnder <- 0
finviz$IPriceUnder <- 0

Then we flip each indicator to 1 wherever a stock's value lies below the corresponding sector or industry average:

finviz$SPEUnder[finviz$P.E<finviz$SAvgPE] <- 1
finviz$SPEGUnder[finviz$PEG<finviz$SAvgPEG] <- 1
finviz$SPSUnder[finviz$P.S<finviz$SAvgPS] <- 1
finviz$SPBUnder[finviz$P.B<finviz$SAvgPB] <- 1
finviz$SPriceUnder[finviz$Price<finviz$SAvgPrice] <- 1
finviz$IPEUnder[finviz$P.E<finviz$IAvgPE] <- 1
finviz$IPEGUnder[finviz$PEG<finviz$IAvgPEG] <- 1
finviz$IPSUnder[finviz$P.S<finviz$IAvgPS] <- 1
finviz$IPBUnder[finviz$P.B<finviz$IAvgPB] <- 1
finviz$IPriceUnder[finviz$Price<finviz$IAvgPrice] <- 1

Now, a 1 in one of the ten undervalued columns we just added indicates that the stock is below the average for that indicator in its sector or industry, while a 0 means it is above (it's a headfuck: undervaluation is great, as you can expect those stocks to rise again, so you can buy cheaply now and hopefully sell at a profit). Next comes another column, RelValIndex, which sums all the 1s across each row to output a final score from 0 (least undervalued) to 10 (most undervalued).

finviz$RelValIndex <- apply(finviz[79:88],1,sum)


Via a subset called potentially_undervalued, we pull the ratings together and ask for all the companies with a RelValIndex of 8 or more:

potentially_undervalued <- subset(finviz,RelValIndex>=8)[,c(4,5,89)]

With this, we get a data frame of 661 observations and our three column variables (Ticker, Company and RelValIndex), which narrows the field down to only a few hundred stocks we could throw our money at. Hurray! Note: please don't draw final investment conclusions from this basic relative valuation model. There are literally no limits to how complex and sophisticated you can make these stock valuation models.

Moving Averages with R & iChart.Finance.Yahoo

We will use moving averages to support our judgement as semi-professional analysts. First, we filter out the companies that meet certain criteria (in our opinion, the criteria for being a good investment choice, whatever that is).

target_stocks <- subset(finviz, Price>0 &
                          Price<200 &
                          Volume>1000 &
                          Country=="USA" &
                          EPS..ttm.>0 &
                          EPS.growth.next.year>0 &
                          EPS.growth.next.5.years>0 &
                          Total.Debt.Equity<1 & Beta<1.5 &
                          Institutional.Ownership<30 &
                          RelValIndex>8)

We set a counter for a for loop. Inside the loop we pull data from Yahoo Finance into RStudio and use the zoo package to compute moving averages.

counter <- 0

Below, we start the for loop and first read in the data via the iChart Finance Yahoo URL (which lets us download historical stock market data). Via a placeholder operation with paste0, we dynamically fill the ticker symbol into the URL.

Next, all rows with missing values are removed from the data frame (via na.omit). The code renames the last column, makes sure the Date column is formatted as a date R can work with, and adds the stock's symbol as the first column of the data frame.

>for (symbol in target_stocks$Ticker){
										  url <- paste0("http://ichart.finance.yahoo.com/table.csv?s=",symbol,"&a=08&b=7&c=1984&d=01&e=23&f=2014&g=d&ignore=.csv")

										stock <- read.csv(url)
										  stock <- na.omit(stock)
										  colnames(stock)[7] <- "AdjClose"
										  stock[,1] <- as.Date(stock[,1])
										  stock <- cbind(Symbol=symbol,stock)

										maxrow <- nrow(stock)-49
										  ma50 <- cbind(stock[1:maxrow,1:2],rollmean(stock$AdjClose,50,align="right"))
										  maxrow <- nrow(stock)-199
										  ma200 <- cbind(stock[1:maxrow,1:2],rollmean(stock$AdjClose,200,align="right"))

Now the moving averages are calculated, which will allow us to compare them with the daily stock prices we downloaded from the web. There is a 50-day moving average and a 200-day moving average:

  stock <- merge(stock,ma50,by.x=c("Symbol","Date"),by.y=c("Symbol","Date"),all.x=TRUE)
  colnames(stock)[9] <- "MovAvg50"
  stock <- merge(stock,ma200,by.x=c("Symbol","Date"),by.y=c("Symbol","Date"),all.x=TRUE)
  colnames(stock)[10] <- "MovAvg200"

price_chart is the result of a melting operation that combines the moving-average data frames with the data frame containing the historical stock prices:

price_chart <- melt(stock[,c(1,2,8,9,10)],id=c("Symbol","Date"))

Output via quick plot:

  qplot(Date, value, data=price_chart, geom="line", color=variable,
        main=paste(symbol,"Daily Stock Prices"),ylab="Price")
  ggsave(filename=paste0("stock_price_",counter,".png"))
  price_summary <- ddply(stock, "Symbol", summarise, open=Open[nrow(stock)],
                         high=max(High),low=min(Low),close=AdjClose[1])

We compile the prices and summaries for all symbols into single data frames:

  if(counter==0){
    stocks <- rbind(stock)
    price_summaries <- rbind(price_summary)
  }else{
    stocks <- rbind(stocks, stock)
    price_summaries <- rbind(price_summaries, price_summary)
  }

  # increment our counter by one
  counter <- counter+1
}

Next, we use ggplotly to plot prices over time for each symbol - for the stocks we filtered out as most interesting to us (i.e. those complying with the filter we set):

vizfive <- qplot(Date, AdjClose, data=stocks, geom="line", color=Symbol,
                 main="Daily Stock Prices")

ggplotly(vizfive)

The result for our stocks below gives a good idea of which stocks performed well and which didn't.

Then we create a summary for all the stocks we received:

summary <- melt(price_summaries,id="Symbol")
ggplot(summary, aes(x=variable, y=value, fill=Symbol)) +
  geom_bar(stat="identity") + facet_wrap(~Symbol)
ggplotly()

Depending on how many companies you have filtered out (in my case it's 12), we get an interactive chart that shows the summary data frame: when the "suitable" stocks opened and closed, and their high and low values.

With these R skills, the right attitude and curiosity as a financial data journalist, you should be able to reproduce similar models quickly and maybe impress your editor.

Now that you have the skills to teach your editor, you will most certainly be the star of the evening at any data science party.