Importing and exporting different data types
In R, we can read the files stored from outside the R environment. We can also write the data into files which can be stored and accessed by the operating system. In R, we can read and write different formats of files, such as CSV, Excel, TXT, and so on. In this section, we are going to discuss how to read and write different formats of files.
The required files should be present in the current directory to read them. Otherwise, the directory should be changed to the required destination.
The first step for reading/writing files is to know the working directory. You can find the path of the working directory by running the following code:
>print (getwd())
This will give the paths for the current working directory. If it is not your desired directory, then please set your own desired directory by using the following code:
>setwd("")
For instance, the following code makes the folder C:/Users
the working directory:
>setwd("C:/Users")
How to read and write a CSV format file
A CSV format file is a text file in which values are comma separated. Let us consider a CSV file with the following content from stock-market data:
To read the preceding file in R, first save this file in the working directory, and then read it (the name of the file is Sample.csv
) using the following code:
>data<-read.csv("Sample.csv")
>print(data)
When the preceding code gets executed, it will give the following output:
Date Open High Low Close Volume Adj.Close
1 14-10-2016 2139.68 2149.19 2132.98 2132.98 3228150000 2132.98
2 13-10-2016 2130.26 2138.19 2114.72 2132.55 3580450000 2132.55
3 12-10-2016 2137.67 2145.36 2132.77 2139.18 2977100000 2139.18
4 11-10-2016 2161.35 2161.56 2128.84 2136.73 3438270000 2136.73
5 10-10-2016 2160.39 2169.60 2160.39 2163.66 2916550000 2163.66
Read.csv
by default produces the file in DataFrame format; this can be checked by running the following code:
>print(is.data.frame(data))
Now, whatever analysis you want to do, you can perform it by applying various functions on the DataFrame in R, and once you have done the analysis, you can write your desired output file using the following code:
>write.csv(data,"result.csv")
>output <- read.csv("result.csv")
>print(output)
When the preceding code gets executed, it writes the output file in the working directory folder in CSV format.
Excel is the most common format of file for storing data, and it ends with extension .xls
or .xlsx
.
The xlsx
package will be used to read or write .xlsx
files in the R environment.
Installing the xlsx
package has dependency on Java, so Java needs to be installed on the system. The xlsx
package can be installed using the following command:
>install.packages("xlsx")
When the previous command gets executed, it will ask for the nearest CRAN mirror, which the user has to select to install the package. We can verify that the package has been installed or not by executing the following command:
>any(grepl("xlsx",installed.packages()))
If it has been installed successfully, it will show the following output:
[1] TRUE
Loading required package: rJava
Loading required package: methods
Loading required package: xlsxjars
We can load the xlsx
library by running the following script:
>library("xlsx")
Now let us save the previous sample file in .xlsx
format and read it in the R environment, which can be done by executing the following code:
>data <- read.xlsx("Sample.xlsx", sheetIndex = 1)
>print(data)
This gives a DataFrame output with the following content:
Date Open High Low Close Volume Adj.Close
1 2016-10-14 2139.68 2149.19 2132.98 2132.98 3228150000 2132.98
2 2016-10-13 2130.26 2138.19 2114.72 2132.55 3580450000 2132.55
3 2016-10-12 2137.67 2145.36 2132.77 2139.18 2977100000 2139.18
4 2016-10-11 2161.35 2161.56 2128.84 2136.73 3438270000 2136.73
5 2016-10-10 2160.39 2169.60 2160.39 2163.66 2916550000 2163.66
Similarly, you can write R files in .xlsx
format by executing the following code:
>output<-write.xlsx(data,"result.xlsx")
>output<- read.csv("result.csv")
>print(output)
Web data or online sources of data
The Web is one main source of data these days, and we want to directly bring the data from web form to the R environment. R supports this:
URL <- "http://ichart.finance.yahoo.com/table.csv?s=^GSPC"
snp <- as.data.frame(read.csv(URL))
head(snp)
When the preceding code is executed, it directly brings the data for the S&P500
index into R in DataFrame format. A portion of the data has been displayed by using the head()
function here:
Date Open High Low Close Volume Adj.Close
1 2016-10-14 2139.68 2149.19 2132.98 2132.98 3228150000 2132.98
2 2016-10-13 2130.26 2138.19 2114.72 2132.55 3580450000 2132.55
3 2016-10-12 2137.67 2145.36 2132.77 2139.18 2977100000 2139.18
4 2016-10-11 2161.35 2161.56 2128.84 2136.73 3438270000 2136.73
5 2016-10-10 2160.39 2169.60 2160.39 2163.66 2916550000 2163.66
6 2016-10-07 2164.19 2165.86 2144.85 2153.74 3619890000 2153.74
Similarly, if we execute the following code, it brings the DJI index data into the R environment: its sample is displayed here:
>URL <- "http://ichart.finance.yahoo.com/table.csv?s=^DJI"
>dji <- as.data.frame(read.csv(URL))
>head(dji)
This gives the following output:
Date Open High Low Close Volume Adj.Close
1 2016-10-14 18177.35 18261.11 18138.38 18138.38 87050000 18138.38
2 2016-10-13 18088.32 18137.70 17959.95 18098.94 83160000 18098.94
3 2016-10-12 18132.63 18193.96 18082.09 18144.20 72230000 18144.20
4 2016-10-11 18308.43 18312.33 18061.96 18128.66 88610000 18128.66
5 2016-10-10 18282.95 18399.96 18282.95 18329.04 72110000 18329.04
6 2016-10-07 18295.35 18319.73 18149.35 18240.49 82680000 18240.49
Please note that we will be mostly using the snp
and dji
indexes for example illustrations in the rest of the book and these will be referred to as snp
and dji
.
A relational database stores data in normalized format, and to perform statistical analysis, we need to write complex and advance queries. But R can connect to various relational databases such as MySQL Oracle, and SQL Server, easily and convert the data tables into DataFrames. Once the data is in DataFrame format, doing statistical analysis is easy to perform using all the available functions and packages.
In this section, we will take the example of MySQL as reference.
R has a built-in package, RMySQL
, which provides connectivity with the database; it can be installed using the following command:
>install.packages("RMySQL")
Once the package is installed, we can create a connection object to create a connection with the database. It takes username, password, database name, and localhost name as input. We can give our inputs and use the following command to connect with the required database:
>mysqlconnection = dbConnect(MySQL(), user = '...', password = '...', dbname = '..',host = '.....')
When the database is connected, we can list the table that is present in the database by executing the following command:
>dbListTables(mysqlconnection)
We can query the database using the function dbSendQuery()
, and the result is returned to R by using function fetch()
. Then the output is stored in DataFrame format:
>result = dbSendQuery(mysqlconnection, "select * from <table name>")
>data.frame = fetch(result)
>print(data.fame)
When the previous code gets executed, it returns the required output.
We can query with a filter clause, update rows in database tables, insert data into a database table, create tables, drop tables, and so on by sending queries through dbSendQuery()
.