Data is at the core of any analysis. In this chapter, we will talk about various ways to import data into R.
4.1 Packages with built-in data sets
Many R packages ship with built-in data sets. To access them, load the package and refer to each data set by name. For example, we can access the ExerciseHours data set from Lock5withR:
library(Lock5withR)
head(ExerciseHours)
  Year Gender Hand Exercise TV Pulse Pierces
1    4      M    l       15  5    57       0
2    2      M    l       20 14    70       0
3    3      F    r        2  3    70       2
4    1      F    l       10  5    66       3
5    1      M    r        8  2    62       0
6    1      M    r       14 14    62       0
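If you are not sure which data sets a package provides, data(package = ...) lists them. The sketch below uses base R's datasets package instead of Lock5withR, purely as an assumption for illustration, because it is installed with every copy of R. The variable name available is ours.

```r
# List the data sets bundled with a package.
# "datasets" is used here only because it is always installed with R;
# substitute any installed package name.
available <- data(package = "datasets")$results[, c("Item", "Title")]
head(available)

# A data set from a loaded package can then be used directly by name.
head(mtcars)
str(mtcars)
```

The same call with package = "Lock5withR" should list ExerciseHours, provided that package is installed.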
4.2 Data from the web
4.2.1 Read in from URL
Some of the data you find online might be in files ending with ‘.csv’ or ‘.xls’. Text files such as ‘.csv’ can be read directly from their URL (Excel files usually need to be downloaded first). For example,
# Read in the iris data; the file has no header row
X <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header = FALSE)
head(X)
Note that this is not a very stable way of reading data: if the website is restructured or the file is moved, the URL will stop working and the read will fail.
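One way to reduce this fragility is to keep a local copy of the file and download it only when the copy is missing. The following is a minimal sketch of that idea, not a prescribed method. The helper name read_csv_cached and the local file name are hypothetical.

```r
# Hypothetical helper: download the file once, then always read the local copy.
# Arguments in ... are passed on to read.csv() (e.g. header = FALSE).
read_csv_cached <- function(url, path, ...) {
  if (!file.exists(path)) {
    # mode = "wb" avoids line-ending corruption on Windows
    download.file(url, destfile = path, mode = "wb", quiet = TRUE)
  }
  read.csv(path, ...)
}

# Usage (the first call needs network access; later calls do not):
# X <- read_csv_cached(
#   "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
#   "iris.data",
#   header = FALSE
# )
# head(X)
```

With a local copy, your analysis keeps working even if the original URL later disappears, and you avoid downloading the file again on every run.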
4.2.2 API / R API package
Some data sources, for example the CDC, the Census Bureau, and Twitter, provide APIs for accessing their data. However, there is a learning curve to using these APIs. The guide "Best practices for API packages" will help you get a head start.
The other option is to find a package that handles the API calls for you. A well-built client will save you a lot of time in retrieving data and should be your first resort.
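To give a sense of what a client package does for you, here is a rough sketch of building an API request URL by hand in base R. The endpoint https://api.example.com/v1/data, the parameter names, and the helper build_query_url are all hypothetical. A real API documents its own endpoints and parameters, and often requires an API key.

```r
# Hypothetical helper: assemble a request URL from a base endpoint
# and a named list of query parameters.
build_query_url <- function(base_url, params) {
  # Percent-encode each value so spaces and special characters are safe
  encoded <- vapply(
    params,
    function(v) URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  query <- paste(names(params), encoded, sep = "=", collapse = "&")
  paste0(base_url, "?", query)
}

# The endpoint and parameters below are placeholders, not a real API.
url <- build_query_url(
  "https://api.example.com/v1/data",
  list(state = "New York", year = 2020)
)
print(url)

# Many APIs return JSON; with a real endpoint, a parser such as
# jsonlite::fromJSON(url) could read the response into R.
```

A client package wraps steps like these, along with authentication and response parsing, behind ordinary R functions.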
4.3 Web scraping
Web scraping is generally considered the last resort for obtaining data. That is why we cover it in the last section: you should always explore the options above before turning to web scraping. When scraping, you should
think about and investigate legal issues
think about ethical questions
limit bandwidth use
scrape only what you need
To start, you will need some background on the structure of an HTML page. On any webpage, you can right-click and choose "Inspect" to examine its structure. Also, as a sanity check, we recommend using the robotstxt package to see whether scraping is allowed on a given webpage. You can simply pass the URL to the function paths_allowed(). We will use the CRAN page for the forcats package for the scraping example.
Package rvest is widely used for web scraping: read_html() takes the target webpage URL, and html_table() extracts all table elements on the page.
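The steps described above can be sketched as follows, assuming the rvest and robotstxt packages are installed. The helper names scraping_allowed and scrape_tables are ours, and the URL is the standard CRAN landing page for forcats. This is an illustrative sketch, not a definitive workflow.

```r
# Hypothetical helper: ask robots.txt whether a URL may be scraped.
# Needs network access when called.
scraping_allowed <- function(url) {
  robotstxt::paths_allowed(paths = url)
}

# Hypothetical helper: return every <table> on a page as a list of data frames.
# `page` can be a URL, a file path, or a literal HTML string.
scrape_tables <- function(page) {
  html <- rvest::read_html(page)
  rvest::html_table(html)
}

# Usage (requires network access):
# forcats_url <- "https://CRAN.R-project.org/package=forcats"
# if (scraping_allowed(forcats_url)) {
#   tables <- scrape_tables(forcats_url)
#   tables[[1]]  # first table found on the page
# }
```

html_table() returns a list with one element per table, so you will usually need to pick out the table you want by position and then clean it up.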