Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Saturday, April 30, 2016

How to scrape the web for the list of R release dates?


To celebrate the 20,000th question with the r tag on Stack Overflow, please help me extract the R release dates from the Wikipedia page.

My attempts:

    library(XML)
    x <- readHTMLTable("http://en.wikipedia.org/wiki/R_(programming_language)")

This doesn't work because the table is in fact a list, not an HTML table.

    library(httr)
    x <- GET("http://en.wikipedia.org/wiki/R_(programming_language)")
    text <- content(x, "parsed")

This extracts the text, but my XPath is rusty, so I couldn't pull out the relevant release dates.

How can I do this?


PS. The Wikipedia page is the only source I could find, but please feel free to post a solution using a canonical source, if there is one.
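For the XPath step the question mentions, here is a minimal sketch using the same XML package. The HTML snippet is made up for illustration and is not Wikipedia's actual markup; the `//ul/li` expression and the date pattern are assumptions that would need adjusting to the real page.

```r
library(XML)

# Hypothetical markup: a release list written as <ul>/<li> items.
html <- paste0(
  "<html><body><ul>",
  "<li>Version 0.49 (1997-04-23)</li>",
  "<li>Version 0.60 (1997-12-04)</li>",
  "</ul></body></html>"
)

# Parse the string and pull the text of every list item via XPath.
doc <- htmlParse(html, asText = TRUE)
items <- xpathSApply(doc, "//ul/li", xmlValue)

# Extract the ISO date inside the parentheses and convert it.
dates <- as.Date(sub(".*\\(([0-9]{4}-[0-9]{2}-[0-9]{2})\\).*", "\\1", items))
dates
```

On a real page the XPath usually needs to be narrowed, for example to the list that follows a particular heading.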

Answer by Dirk Eddelbuettel for How to scrape the web for the list of R release dates?


Why don't you use the file dates on the canonical ftp archive in Vienna?

Edit: For example,

    lynx -dump http://cran.r-project.org/src/base/R-0/ | grep tgz | grep -v http

gets you a table you can parse from R, with file sizes as a bonus. Rinse and repeat for the R-1 and R-2 directories.
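Parsing such a text table from R could look like the sketch below. The sample lines are hand-written in an assumed layout (name, date, time, size, separated by whitespace) using entries from the R-0 directory; the real dump may carry extra columns or link markers that need stripping first.

```r
# Sample lines in the assumed layout: name, date, time, size.
listing <- c(
  "R-0.49.tgz    23-Apr-1997 14:53  959K",
  "R-0.60.0.tgz  04-Dec-1997 09:58  1.1M",
  "R-0.99.0.tgz  07-Feb-2000 13:09  2.8M"
)

# Split each line on runs of whitespace into a character matrix.
fields <- do.call(rbind, strsplit(trimws(listing), "\\s+"))

# Convert dd-Mon-yyyy to a Date without relying on the locale's
# month names: look the abbreviation up in month.abb instead.
parts <- do.call(rbind, strsplit(fields[, 2], "-"))
dates <- as.Date(sprintf("%s-%02d-%s",
                         parts[, 3], match(parts[, 2], month.abb), parts[, 1]))

releases <- data.frame(
  Release = sub("\\.tgz$", "", fields[, 1]),
  Date    = dates,
  Size    = fields[, 4],
  stringsAsFactors = FALSE
)
releases
```

The lines could also be read from the command itself via `pipe()` or `system(..., intern = TRUE)`, assuming lynx is installed.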

Answer by Brandon Bertelsen for How to scrape the web for the list of R release dates?


Kind of cheating, now that Dirk has given us an easy table to scrape:

    library(XML)
    theurl <- "http://cran.r-project.org/src/base/R-0/"
    h <- htmlParse(theurl)
    h <- readHTMLTable(h)
    h <- h[[1]]
    h <- droplevels(h[-c(1,2,30),])
    levels(h$Name) <- gsub(".tgz","",levels(h$Name),fixed=TRUE)

    h

Gives us:

            Name     Last modified Size Description
    3     R-0.49 23-Apr-1997 14:53 959K
    4  R-0.50-a1 22-Jul-1997 16:44 1.0M
    5  R-0.50-a4 10-Sep-1997 14:31 1.0M
    6   R-0.60.0 04-Dec-1997 09:58 1.1M
    7   R-0.60.1 07-Dec-1997 02:59 1.1M
    8   R-0.61.0 22-Dec-1997 00:00 1.1M
    9   R-0.61.1 13-Jan-1998 00:00 1.1M
    10  R-0.61.2 18-Mar-1998 00:00 1.1M
    11  R-0.61.3 03-May-1998 00:00 1.1M
    12  R-0.62.0 15-Jun-1998 00:00 1.2M
    13  R-0.62.1 15-Jun-1998 00:00 1.2M
    14  R-0.62.2 10-Jul-1998 11:59 1.3M
    15  R-0.62.3 28-Aug-1998 11:01 1.3M
    16  R-0.62.4 24-Oct-1998 00:00 1.3M
    17  R-0.63.0 14-Nov-1998 04:57 1.5M
    18  R-0.63.1 05-Dec-1998 02:25 1.5M
    19  R-0.63.2 12-Jan-1999 02:21 1.5M
    20  R-0.63.3 06-Mar-1999 04:27 1.5M
    21  R-0.64.0 08-Apr-1999 01:48 1.5M
    22  R-0.64.1 08-May-1999 02:55 1.9M
    23  R-0.64.2 05-Jul-1999 21:15 1.9M
    24  R-0.65.0 28-Aug-1999 00:18 2.1M
    25  R-0.65.1 07-Oct-1999 01:46 2.2M
    26  R-0.90.0 22-Nov-1999 18:07 2.3M
    27  R-0.90.1 15-Dec-1999 14:05 2.4M
    28  R-0.99.0 07-Feb-2000 13:09 2.8M
    29 R-0.99.0a 09-Feb-2000 12:28 2.8M

Answer by Andrie for How to scrape the web for the list of R release dates?


Edited to include R version 3.0.0 and above

Dirk Eddelbuettel provided the canonical link to the .0 releases of R.

Here is some code that collates the tables from the separate URLs, one for each major release (R-0 through R-3), and then plots them:

    library(XML)
    library(lattice)

    getRdates <- function(){
      url <- paste0("http://cran.r-project.org/src/base/R-", 0:3)
      x <- lapply(url, function(x)readHTMLTable(x, stringsAsFactors=FALSE)[[1]])
      x <- do.call(rbind, x)
      x <- x[grep("R-(.*)(\\.tar\\.gz|\\.tgz)", x$Name), c(-1, -5)]
      x$Release <- gsub("(R-.*)\\.(tar\\.gz|tgz)", "\\1", x$Name)
      x$Date <- as.POSIXct(x[["Last modified"]], format="%d-%b-%Y %H:%M")
      x$Release <- reorder(x$Release, x$Date)
      x
    }

    x <- getRdates()
    dotplot(Release~Date, data=x)

[Figure: dot plot of Release against Date]
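The filename handling in `getRdates()` can be checked offline. This is a small sketch on hypothetical `Name` entries (the sample vector is an assumption, not a copy of the live listing), using the same two patterns as the function above.

```r
# Hypothetical entries from the Name column of a directory listing.
name <- c("Parent Directory", "R-0.49.tgz", "R-1.0.0.tgz",
          "R-3.0.0.tar.gz", "NEWS")

# Same pattern as getRdates(): keep only the source tarballs.
is_release <- grepl("R-(.*)(\\.tar\\.gz|\\.tgz)", name)

# Same substitution as getRdates(): strip the archive extension.
release <- gsub("(R-.*)\\.(tar\\.gz|tgz)", "\\1", name[is_release])
release
```

Note that `as.POSIXct(..., format="%d-%b-%Y %H:%M")` in `getRdates()` relies on `%b`, which matches month abbreviations in the current locale, so the date conversion may return `NA` under a non-English locale.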

Answer by Ari B. Friedman for How to scrape the web for the list of R release dates?


And building on the work already done:

    h <- getRdates()
    # Find version release rate
    library(plyr)
    h <- subset(h,select=c(-Description))
    Version <- sub("^R-([0-9a-z.-]+)\\.t.*","\\1",h$Name)
    h$bigVersion <- as.numeric(sub("^([0-9])\\..+","\\1",Version))
    h$smallVersion <- as.numeric(sub("^[0-9]\\.([0-9]+).+","\\1",Version))
    h$majorVersion <- as.numeric(paste(h$bigVersion,sprintf( "%02.0f", h$smallVersion ),sep="."))
    h <- ddply( h, .(bigVersion,majorVersion), function(x) {
        x$tinyVersion <- seq(nrow(x))
        x
    })

    # Plot
    plot( majorVersion~Date, data=h, pch=".",cex=3)
    abline(h=seq(1,2),col="red")

[Figure: rates]

    library(lattice)
    print(xyplot( smallVersion~Date|bigVersion, data=h, pch=".",cex=3))

[Figure: lattice]

And comparing all together:

    h <- ddply( h, .(bigVersion), function(x) {
       x$bigElapsedTime <- x$Date - min(x$Date)
       x
    })

    png("c:/temp/Rplot3.png")
    plot( smallVersion~bigElapsedTime, data=h, pch=".",cex=3,col=h$bigVersion+1)
    dev.off()

[Figure: all on one plot]

    # How many minor releases per major release
    > table(rle(h$majorVersion)$lengths, substring(rle(h$majorVersion)$values,1,1))

        0 1 2
      1 1 0 0
      2 5 5 9
      3 1 1 6
      4 2 4 1
      5 1 0 0

Answer by RHertel for How to scrape the web for the list of R release dates?


It is no longer necessary to scrape Wikipedia for this.

    library(rversions)
    r_versions()

