Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Friday, January 29, 2016

How can I scrape an HTML table to CSV?



The Problem

I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.

A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file?

My First Idea

Since I know jQuery, I thought I might use it to strip out the table formatting onscreen, insert commas and line breaks, and just copy the whole mess into Notepad and save it as a CSV. Any better ideas?

The Solution

Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.

Specifically, when I pasted into the spreadsheet, I had to select "Paste Special" and choose the format "text." Otherwise it tried to paste everything into a single cell, even if I highlighted the whole spreadsheet.

Answer by mkoeller for How can I scrape an HTML table to CSV?


  • Select the HTML table in your tool's UI and copy it to the clipboard (if that's possible).
  • Paste it into Excel.
  • Save as CSV file

However, this is a manual solution, not an automated one.

Answer by James Van Huis for How can I scrape an HTML table to CSV?


Quick and dirty:

Copy out of browser into Excel, save as CSV.

Better solution (for long term use):

Write a bit of code in the language of your choice that pulls the HTML down and scrapes out the bits that you want. You could probably layer all of the data operations (sorting, averaging, etc.) on top of the data retrieval. That way, you just run your code and get the actual report that you want.

It all depends on how often you will be performing this particular task.
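A minimal sketch of that longer-term approach, in Python 3 with only the standard library. The class name TableRows, the helper names, and the sample table are all made up for illustration; in real use the HTML would come from the tool's page (for instance via urllib.request.urlopen, assuming the page can be fetched without a login).

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the open cell, or None

    def _end_cell(self):
        # Store the open cell (if any) with its whitespace collapsed.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # tolerate a missing </td> before this cell
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr" and self._row is not None:
            self._end_cell()
            self.rows.append(self._row)
            self._row = None


def html_to_rows(html_text):
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical sample standing in for the tool's HTML output.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Ann</td><td>90</td></tr>
  <tr><td>Bob</td><td>70</td></tr>
</table>
"""

rows = html_to_rows(SAMPLE)
csv_text = rows_to_csv(rows)
# One of the "data operations": average the second column, skipping the header.
average = sum(float(r[1]) for r in rows[1:]) / len(rows[1:])
```

Here csv_text holds the table as CSV and average is the mean of the second column; sorting would be a sorted() call on rows[1:].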

Answer by Will Rickards for How can I scrape an HTML table to CSV?


Have you tried opening it with Excel? If you save a spreadsheet in Excel as HTML, you'll see the format Excel uses. A web app I wrote emits this HTML format so the user can export to Excel.

Answer by Gene T for How can I scrape an HTML table to CSV?


http://ouseful.wordpress.com/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/

http://groups.google.com/group/ruby-talk-google/browse_thread/thread/cfae0aa4b14e5560?hl=nn

Answer by andy for How can I scrape an HTML table to CSV?


If you're screen scraping and the table you're trying to convert has a given ID, you could always parse the HTML with a regex, along with some scripting to generate a CSV.
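A rough sketch of the regex route in Python 3, assuming the table has a known id and contains no nested tables. The function name and the sample page are invented for illustration. Regexes are brittle against real-world HTML, so treat this as a quick hack rather than a robust parser.

```python
import csv
import html
import io
import re

FLAGS = re.IGNORECASE | re.DOTALL


def table_by_id_to_csv(page, table_id):
    """Return the cells of <table id=table_id> as CSV text (regex-based)."""
    table_re = re.compile(
        r"<table\b[^>]*?\sid\s*=\s*[\"']?" + re.escape(table_id)
        + r"(?=[\"'\s>])[^>]*>(.*?)</table>",
        FLAGS,
    )
    match = table_re.search(page)
    if match is None:
        raise ValueError("no table with id %r" % table_id)

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in re.findall(r"<tr\b[^>]*>(.*?)</tr>", match.group(1), FLAGS):
        cells = re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", row, FLAGS)
        # Drop inner tags, decode entities, and collapse whitespace.
        cleaned = [
            " ".join(html.unescape(re.sub(r"<[^>]+>", "", c)).split())
            for c in cells
        ]
        writer.writerow(cleaned)
    return buf.getvalue()


# Hypothetical page with two tables; only the one with id "results" is wanted.
PAGE = (
    '<table id="other"><tr><td>skip me</td></tr></table>'
    '<table class="grid" id="results">'
    "<tr><th>Pair</th><th>Rate</th></tr>"
    "<tr><td><b>USD/AUD</b></td><td>1.5 &amp; up</td></tr>"
    "</table>"
)

csv_text = table_by_id_to_csv(PAGE, "results")
```

The csv module handles quoting, so cells that contain commas or quotes still produce valid CSV.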

Answer by Thorvaldur for How can I scrape an HTML table to CSV?


Using Python:

For example, imagine you want to scrape forex quotes in CSV form from a site like fxquotes.

then...

# Python 2, BeautifulSoup 3
from BeautifulSoup import BeautifulSoup
import urllib,string,csv,sys,os
from string import replace

date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1,cur2 = 'USD','AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 +'&exch2=' + cur1
fx_url = fx_url +'&expr=' + cur2 +  '&expr2=' + cur2 + fx_url_end

data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
# str() of the result list looks like '[<pre>...</pre>]'; strip that wrapper
data = str(soup.findAll('pre', limit=1))
data = replace(data,'[<pre>','')
data = replace(data,'</pre>]','')

file_location = '/Users/location_edit_this'
# os.path.join supplies the path separator if file_location lacks one
file_name = os.path.join(file_location, 'usd_aus.csv')
file = open(file_name,"w")
file.write(data)
file.close()


Edit: to get values from a table (example from palewire):

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)

table = soup.find("table", border=1)

for row in table.findAll('tr')[1:]:
    col = row.findAll('td')

    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']

    record = (rank, artist, album, cover_link)
    print "|".join(record)

Answer by Christian Payne for How can I scrape an HTML table to CSV?


Excel can open an HTTP page directly.

E.g.:

  1. Click File, Open

  2. Under filename, paste the URL, e.g. http://stackoverflow.com/questions/259091/how-can-i-scrape-an-html-table-to-csv

  3. Click ok

Excel does its best to convert the html to a table.

It's not the most elegant solution, but it does work!

Answer by dkretz for How can I scrape an HTML table to CSV?


Even easier (because it saves it for you for next time) ...

In Excel

Data/Import External Data/New Web Query

will take you to a URL prompt. Enter your URL, and it will mark the available tables on the page for import. Voila.

Answer by Juan A. Navarro for How can I scrape an HTML table to CSV?


This is my python version using the (currently) latest version of BeautifulSoup which can be obtained using, e.g.,

$ sudo easy_install beautifulsoup4  

The script reads HTML from the standard input, and outputs the text found in all tables in proper CSV format.

#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import re
import csv

def cell_text(cell):
    return " ".join(cell.stripped_strings)

soup = BeautifulSoup(sys.stdin.read())
output = csv.writer(sys.stdout)

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        col = map(cell_text, row.find_all(re.compile('t[dh]')))
        output.writerow(col)
    output.writerow([])

Answer by n8henrie for How can I scrape an HTML table to CSV?


Two ways come to mind (especially for those of us that don't have Excel):

  • Google Spreadsheets has an excellent importHTML function:
    • =importHTML("http://example.com/page/with/table", "table", index)
    • Index starts at 1
    • I recommend copying and pasting as values shortly after import
    • File -> Download as -> CSV
  • Python's superb Pandas library has handy read_html and to_csv functions
    • Here's a basic Python3 script that prompts for the URL, which table at that URL, and a filename for the CSV.
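A minimal sketch of what such a pandas script might look like (this is not the script referred to above). It assumes pandas and an HTML parser such as lxml are installed. The function name, its arguments, and the sample table are made up for illustration; read_html and to_csv are the pandas functions named in the list.

```python
import io

import pandas as pd


def html_table_to_csv(source, table_number, destination):
    """Save the table_number-th table (1-based) found in source as CSV.

    source may be a URL or a file-like object; destination may be a
    file name or a writable text buffer.
    """
    tables = pd.read_html(source)  # one DataFrame per <table> found
    frame = tables[table_number - 1]
    frame.to_csv(destination, index=False)
    return frame


# Hypothetical sample standing in for a real page.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Ann</td><td>90</td></tr>
  <tr><td>Bob</td><td>70</td></tr>
</table>
"""

out = io.StringIO()
frame = html_table_to_csv(io.StringIO(SAMPLE), 1, out)
csv_text = out.getvalue()
```

For an interactive version, the three arguments could come from input() prompts: the URL, the table number, and the CSV filename.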

Answer by Aviad for How can I scrape an HTML table to CSV?


Basic Python implementation using BeautifulSoup, also considering both rowspan and colspan:

# Python 2, BeautifulSoup 3
from BeautifulSoup import BeautifulSoup

def table2csv(html_txt):
    csvs = []
    soup = BeautifulSoup(html_txt)
    tables = soup.findAll('table')

    for table in tables:
        csv = ''
        rows = table.findAll('tr')
        row_spans = []   # remaining row counts of cells that span rows
        do_ident = False

        for tr in rows:
            cols = tr.findAll(['th','td'])

            for cell in cols:
                colspan = int(cell.get('colspan',1))
                rowspan = int(cell.get('rowspan',1))

                # indent past columns still occupied by cells from earlier rows
                if do_ident:
                    do_ident = False
                    csv += ','*(len(row_spans))

                if rowspan > 1: row_spans.append(rowspan)

                csv += '"{text}"'.format(text=cell.text) + ','*(colspan)

            if row_spans:
                for i in xrange(len(row_spans)-1,-1,-1):
                    row_spans[i] -= 1
                    # remove the expired span itself, not just the last entry
                    if row_spans[i] < 1: row_spans.pop(i)

            do_ident = True if row_spans else False

            csv += '\n'

        csvs.append(csv)
        #print csv

    return '\n\n'.join(csvs)

