In your life as a data scientist, you'll often be working with huge databases that contain tables with millions of rows. If you want to do some analyses on this data, it's possible that you only need a fraction of this data. In this case, it's a good idea to send SQL queries to your database, and only import the data you actually need into R.
dbGetQuery()?is what you need. As usual, you first pass the connection object to it. The second argument is an SQL query in the form of a character string. This example selects theagevariable from thepeopledataset where gender equals "male":
dbGetQuery(con, "SELECT age FROM people WHERE gender = 'male'")
Apart from checking equality, you can also check forless thanandgreater thanrelationships, with, just like in R.
# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Import post column of tweats where date is higher than '2015-09-21': latest
latest<-dbGetQuery(con,"select post from tweats where date>'2015-09-21'")
# Print latest
latest
# Create data frame specific
specific<-dbGetQuery(con,"select message from comments where tweat_id=77 and user_id>4")
# Print specific
specific
There are also dedicated SQL functions that you can use in theWHEREclause of an SQL query. For example,CHAR_LENGTH() returns the number of characters in a string.
Of course, SQL does not stop with the the three keywordsSELECT,FROMandWHERE. Another very often used keyword isJOIN, and more specificallyINNER JOIN. Take this call for example:
SELECT name, post
FROM users INNER JOIN tweats on users.id = user_id
WHERE date > "2015-09-19"
Here, theuserstable is joined with thetweatstable. This is possible because theidcolumn in theuserstable corresponds to theuser_idcolumn in thetweatstable. Also notice howname, from theuserstable, andpostanddate, from thetweatstable, can be referenced to without problems.
Can you predict the outcome of the following query?
SELECT post, message
FROM tweats INNER JOIN comments on tweats.id = tweat_id
WHERE tweat_id = 77
You've used?dbGetQuery()?multiple times now. This is a virtual function from the DBI package, but is actually implemented by the RMySQL package. Behind the scenes, the following steps are performed:
Sending the specified query with?dbSendQuery();
Fetching the result of executing the query on the database with?dbFetch();
Clearing the result with?dbClearResult().
# Send query to the database
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")
# Use dbFetch() twice
dbFetch(res, n = 2)
dbFetch(res)
# Clear res
dbClearResult(res)