How to scrape the web for the list of R release dates?
To celebrate the 20,000th question with the r-tag on Stack Overflow, please help me to extract the R release dates from the Wikipedia page.
My attempts:
    library(XML)
    x <- readHTMLTable("http://en.wikipedia.org/wiki/R_(programming_language)")
This doesn't work because the table is in fact a list, not an HTML table.
    library(httr)
    x <- GET("http://en.wikipedia.org/wiki/R_(programming_language)")
    text <- content(x, "parsed")
This extracts the text, but my XPath is rusty, so I couldn't extract the relevant release dates.
How can I do this?
PS. The Wikipedia page is the only source I could find, but please feel free to post a solution using a canonical source, if there is one.
Answer by Dirk Eddelbuettel for How to scrape the web for the list of R release dates?
Why don't you use the file dates on the canonical ftp archive in Vienna?
Edit: for example,
    lynx -dump http://cran.r-project.org/src/base/R-0/ | grep tgz | grep -v http
gets you a table you can parse from R, with file sizes as a bonus. Rinse and repeat for the R-1 and R-2 directories.
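A minimal sketch of the "parse from R" step, using base R only. It assumes each filtered line has the shape name, date, time, size; the real lynx output may carry extra leading columns, so treat the layout as an assumption. The sample lines mirror entries in the R-0 directory, and the column names are made up for illustration.

```r
# Assumed shape of the filtered listing: name, date, time, size per line.
# The real lynx output may differ (for example an extra icon placeholder).
dump <- c(
  "R-0.49.tgz     23-Apr-1997 14:53  959K",
  "R-0.50-a1.tgz  22-Jul-1997 16:44  1.0M",
  "R-0.60.0.tgz   04-Dec-1997 09:58  1.1M"
)

# Split on whitespace into a data frame; column names are illustrative.
releases <- read.table(
  text = dump,
  col.names = c("file", "date", "time", "size"),
  stringsAsFactors = FALSE
)

# Rebuild the date as ISO year-month-day by hand, so the result does not
# depend on the month names of the current locale.
d <- do.call(rbind, strsplit(releases$date, "-", fixed = TRUE))
releases$date <- as.Date(
  sprintf("%s-%02d-%s", d[, 3], match(d[, 2], month.abb), d[, 1])
)

releases
```

Looking the month up in `month.abb` sidesteps the `%b` format code, which only matches English abbreviations when the session runs in an English or C locale.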
Answer by Brandon Bertelsen for How to scrape the web for the list of R release dates?
Kind of cheating, now that Dirk has given us an easy table to scrape:
    library(XML)
    theurl <- "http://cran.r-project.org/src/base/R-0/"
    h <- htmlParse(theurl)
    h <- readHTMLTable(h)
    h <- h[[1]]
    h <- droplevels(h[-c(1,2,30),])
    levels(h$Name) <- gsub(".tgz","",levels(h$Name),fixed=TRUE)
    h
Gives us:
            Name     Last modified Size Description
    3     R-0.49 23-Apr-1997 14:53 959K
    4  R-0.50-a1 22-Jul-1997 16:44 1.0M
    5  R-0.50-a4 10-Sep-1997 14:31 1.0M
    6   R-0.60.0 04-Dec-1997 09:58 1.1M
    7   R-0.60.1 07-Dec-1997 02:59 1.1M
    8   R-0.61.0 22-Dec-1997 00:00 1.1M
    9   R-0.61.1 13-Jan-1998 00:00 1.1M
    10  R-0.61.2 18-Mar-1998 00:00 1.1M
    11  R-0.61.3 03-May-1998 00:00 1.1M
    12  R-0.62.0 15-Jun-1998 00:00 1.2M
    13  R-0.62.1 15-Jun-1998 00:00 1.2M
    14  R-0.62.2 10-Jul-1998 11:59 1.3M
    15  R-0.62.3 28-Aug-1998 11:01 1.3M
    16  R-0.62.4 24-Oct-1998 00:00 1.3M
    17  R-0.63.0 14-Nov-1998 04:57 1.5M
    18  R-0.63.1 05-Dec-1998 02:25 1.5M
    19  R-0.63.2 12-Jan-1999 02:21 1.5M
    20  R-0.63.3 06-Mar-1999 04:27 1.5M
    21  R-0.64.0 08-Apr-1999 01:48 1.5M
    22  R-0.64.1 08-May-1999 02:55 1.9M
    23  R-0.64.2 05-Jul-1999 21:15 1.9M
    24  R-0.65.0 28-Aug-1999 00:18 2.1M
    25
Answer by Andrie for How to scrape the web for the list of R release dates?
Edited to include R version 3.0.0 and above
Dirk Eddelbuettel provided the canonical link to the .0 releases of R.
Here is some code that collates the tables from the separate URLs, one for each major release, and then plots the release dates:
    library(XML)
    library(lattice)

    getRdates <- function(){
      url <- paste0("http://cran.r-project.org/src/base/R-", 0:3)
      x <- lapply(url, function(x)readHTMLTable(x, stringsAsFactors=FALSE)[[1]])
      x <- do.call(rbind, x)
      x <- x[grep("R-(.*)(\\.tar\\.gz|\\.tgz)", x$Name), c(-1, -5)]
      x$Release <- gsub("(R-.*)\\.(tar\\.gz|tgz)", "\\1", x$Name)
      x$Date <- as.POSIXct(x[["Last modified"]], format="%d-%b-%Y %H:%M")
      x$Release <- reorder(x$Release, x$Date)
      x
    }

    x <- getRdates()
    dotplot(Release~Date, data=x)
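To see what the two regular expressions in `getRdates()` are doing without hitting the network, here is a small offline sketch in base R. The input vector is illustrative: "Parent Directory" and "README" stand in for rows that are not release tarballs, and the patterns are anchored variants of the ones above rather than exact copies.

```r
# Illustrative entries from a directory listing; the non-tarball rows
# are placeholders for whatever else the listing contains.
name <- c("Parent Directory", "R-0.49.tgz", "R-3.0.0.tar.gz", "README")

# Keep only rows that look like release tarballs (.tgz or .tar.gz).
is_release <- grepl("^R-.*\\.(tar\\.gz|tgz)$", name)

# Strip the archive extension to get the release label.
release <- sub("\\.(tar\\.gz|tgz)$", "", name[is_release])

release
```

Anchoring with `^` and `$` is a design choice here: it keeps a file that merely contains "R-" somewhere in its name from slipping through the filter.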
Answer by Ari B. Friedman for How to scrape the web for the list of R release dates?
And building on the work already done:
    h <- getRdates()

    # Find version release rate
    library(plyr)
    h$Description <- NULL  # no-op if getRdates() has already dropped this column
    Version <- sub("^R-([0-9a-z.-]+)\\.t.*","\\1",h$Name)
    h$bigVersion <- as.numeric(sub("^([0-9])\\..+","\\1",Version))
    # ".*" rather than ".+" so that two-part versions such as 0.49 keep both digits
    h$smallVersion <- as.numeric(sub("^[0-9]\\.([0-9]+).*","\\1",Version))
    h$majorVersion <- as.numeric(paste(h$bigVersion,sprintf( "%02.0f", h$smallVersion ),sep="."))
    h <- ddply( h, .(bigVersion,majorVersion), function(x) {
      x$tinyVersion <- seq(nrow(x))
      x
    })

    # Plot
    plot( majorVersion~Date, data=h, pch=".",cex=3)
    abline(h=seq(1,2),col="red")

    library(lattice)
    print(xyplot( smallVersion~Date|bigVersion, data=h, pch=".",cex=3))
And comparing all together:
    h <- ddply( h, .(bigVersion), function(x) {
      x$bigElapsedTime <- x$Date - min(x$Date)
      x
    })
    png("c:/temp/Rplot3.png")
    plot( smallVersion~bigElapsedTime, data=h, pch=".",cex=3,col=h$bigVersion+1)
    dev.off()
    # How many minor releases per major release
    > table(rle(h$majorVersion)$lengths, substring(rle(h$majorVersion)$values,1,1))

        0 1 2
      1 1 0 0
      2 5 5 9
      3 1 1 6
      4 2 4 1
      5 1 0 0
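For anyone who wants to check the version-splitting step on its own, here is a rough base-R sketch that works on a handful of file names instead of the full scraped table. It is an alternative formulation (splitting on dots rather than the regular expressions above), the sample names are only illustrative, and pre-release names such as R-0.50-a1 would need extra handling that is not shown.

```r
# Small illustrative sample of tarball names.
files <- c("R-0.49.tgz", "R-0.60.0.tgz", "R-0.60.1.tgz", "R-3.0.0.tar.gz")

# Strip the "R-" prefix and the archive extension.
version <- sub("^R-(.*)\\.(tar\\.gz|tgz)$", "\\1", files)

# Split on dots and keep the first two components as numbers.
parts <- strsplit(version, ".", fixed = TRUE)
big   <- as.numeric(vapply(parts, `[`, character(1), 1))
small <- as.numeric(vapply(parts, `[`, character(1), 2))

# Count how many releases fall into each big.small series.
series <- paste(big, small, sep = ".")
counts <- table(series)

counts
```

Splitting on the literal dot avoids the backtracking pitfalls of a greedy pattern followed by `.+`, at the cost of not coping with suffixes like "-a1".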
Answer by RHertel for How to scrape the web for the list of R release dates?
It is no longer necessary to scrape Wikipedia for this.
    library(rversions)
    r_versions()