UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is shown below:
agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
Here is a stack trace produced on SOME strings when the snippet above is run:
Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English.
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
Answer by agf for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
You need to read the Python Unicode HOWTO. This error is the very first example.
Basically, stop using str to convert from unicode to encoded text / bytes.
Instead, properly use .encode() to encode the string:
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
or work entirely in unicode.
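As a rough illustration of "working entirely in unicode", the sketch below decodes bytes once on input, handles only text in between, and encodes once on output. It is written to run on Python 3, where str is already Unicode text; the function name build_agent_info and the sample values are hypothetical, and UTF-8 is assumed as the page encoding.

```python
# Sketch only: decode once on input, work with text, encode once on output.
# The function name and sample values below are hypothetical.


def build_agent_info(agent_contact, agent_telno):
    """Join two text values and trim surrounding whitespace.

    Both arguments are expected to be Unicode text, not encoded bytes.
    """
    return u' '.join((agent_contact, agent_telno)).strip()


# Bytes as they might arrive from a page; \xc2\xa0 is U+00A0 in UTF-8.
raw_contact = b'Jane Smith'
raw_telno = b'020\xc2\xa07946 0000'

# Decode at the boundary, using the encoding the page actually declares
# (UTF-8 is an assumption here).
contact = raw_contact.decode('utf-8')
telno = raw_telno.decode('utf-8')

agent_info = build_agent_info(contact, telno)

# Encode only when bytes are needed, e.g. for a file or a socket.
agent_info_bytes = agent_info.encode('utf-8')
```

Keeping the decode and encode steps at the edges means no implicit ASCII conversion happens in the middle of the program.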
Answer by Andbdrew for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
This is a classic python unicode pain point! Consider the following:
a = u'bats\u00E0'
print a
=> batsà
All good so far, but if we call str(a), let's see what happens:
str(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
Oh dip, that's not gonna do anyone any good! To fix the error, encode the string explicitly with .encode and tell python what codec to use:
a.encode('utf-8')
=> 'bats\xc3\xa0'
print a.encode('utf-8')
=> batsà
Voil\u00E0!
The issue is that when you call str(), Python uses the default character encoding (ASCII, in Python 2) to try to encode the unicode object you gave it, which fails whenever the text contains a non-ASCII character. To fix the problem, you have to tell Python which encoding to use by calling .encode() with the name of a codec. Most of the time, you should be fine using utf-8.
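The difference between the implicit ASCII conversion and an explicit codec can be reproduced in a few lines. This is a minimal sketch that runs on Python 3: calling .encode('ascii') stands in for what str() attempts on a unicode object in Python 2, and the sample string is made up.

```python
# Sketch: the ASCII codec fails, the explicit UTF-8 codec succeeds.
text = u'agent\xa0name'  # hypothetical sample containing U+00A0

try:
    text.encode('ascii')  # what str(text) attempts implicitly on Python 2
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    failed_at = exc.start  # index of the first character that could not be encoded

utf8_bytes = text.encode('utf-8')  # explicit codec: U+00A0 becomes two bytes
```

The exception's start attribute corresponds to the "position" reported in the error message, which can help locate the offending character in scraped text.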
For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html
Answer by Phil LaNasa for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I've actually found that in most of my cases, just stripping out those characters is much simpler:
s = mystring.decode('ascii', 'ignore')
Answer by maxpolk for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. here LC_ALL is set to "C". Debian discourages setting it (see the Debian wiki on Locale).
$ echo $LANG
en_US.utf8
$ echo $LC_ALL
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
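To see which encoding Python actually picked up from the environment, something like the following may help. It is only a diagnostic sketch; the function name describe_output_encoding is hypothetical, while sys.stdout.encoding and locale.getpreferredencoding() are standard library features.

```python
import locale
import sys


def describe_output_encoding():
    """Return the encodings Python picked up from the environment."""
    return {
        # Encoding used when printing text to the terminal; may be None
        # when stdout is replaced by an object without this attribute.
        'stdout': getattr(sys.stdout, 'encoding', None),
        # Encoding implied by the locale settings (LANG, LC_ALL, LC_CTYPE).
        'preferred': locale.getpreferredencoding(False),
    }


info = describe_output_encoding()
print(info)
```

If either value names an ASCII codec while the text contains non-ASCII characters, the locale settings are a likely culprit.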
Answer by Max Korolevsky for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I found an elegant workaround that removes the offending symbols while keeping the string a string:
yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
It's important to note that using the ignore option is dangerous because it silently drops any Unicode (and internationalization) support from the code that uses it, as seen here:
>>> 'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
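If ASCII output really is required, one possible way to lose less information is to normalize the text before dropping characters. This is a different technique from the answer above, not part of it: the sketch uses unicodedata.normalize with the NFKD form, and the helper name to_ascii is hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Approximate text in ASCII, keeping base letters where possible."""
    # NFKD splits a letter such as o-umlaut into 'o' plus a combining mark
    # and maps U+00A0 to a plain space; 'ignore' then drops the marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


dropped = u'City: Malm\xf6'.encode('ascii', 'ignore').decode('ascii')
kept = to_ascii(u'City: Malm\xf6')
spaced = to_ascii(u'020\xa07946')
```

This is still lossy: characters without a decomposition (for example 'ø') are dropped just as before, so it is a mitigation rather than a substitute for proper Unicode handling.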
Answer by Animesh for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
For me, what worked was:
BeautifulSoup(html_text,from_encoding="utf-8")
Hope this helps someone.
Answer by kenorb for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
The problem is that you're trying to print a unicode character, but your terminal doesn't support it.
You can try installing the language-pack-en package to fix that:
sudo apt-get install language-pack-en
which provides English translation data updates for all supported packages (including Python). Install a different language pack if necessary (depending on which characters you're trying to print).
On some Linux distributions this is required to make sure that the default English locales are set up properly (so that Unicode characters can be handled by the shell/terminal). Sometimes it is easier to install the package than to configure the locale manually.
Then when writing the code, make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
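Note that the built-in open() only accepts an encoding argument on Python 3; on Python 2.7 the equivalent is io.open. A small round-trip sketch, using a temporary file and a made-up sample string, might look like this:

```python
import io
import os
import tempfile

text = u'voil\xe0 \u2122'  # hypothetical sample with non-ASCII characters

# Hypothetical scratch file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    # io.open accepts encoding= on both Python 2.7 and Python 3.
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)

    # Reading with the same encoding gives back the original text.
    with io.open(path, 'r', encoding='utf-8') as handle:
        round_tripped = handle.read()

    # Reading in binary mode shows the UTF-8 bytes stored on disk.
    with io.open(path, 'rb') as handle:
        raw_bytes = handle.read()
finally:
    os.remove(path)
```

Stating the encoding explicitly makes the result independent of the locale settings of the machine the script runs on.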
If you still have a problem, double-check your system configuration, such as:
- your locale file (/etc/default/locale), which should have e.g. LANG="en_US.UTF-8"
- the value of LANG/LC_CTYPE in your shell
Demonstrating the problem and solution in a fresh VM.
Initialize and provision the VM (e.g. using vagrant):
vagrant init ubuntu/vivid64; vagrant up; vagrant ssh
Printing Unicode characters (such as the trade mark sign, ™):
$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
Now installing language-pack-en:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
  language-pack-en-base
Generating locales...
  en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.
Now the problem is solved:
$ python -c 'print(u"\u2122");'
™
Answer by Parag Tyagi -morpheus- for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Simple helper functions found here.
def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')
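The helpers above rely on the Python 2 unicode type and on escape codecs. As an alternative sketch (not the original author's code), the pair below takes a different approach: it converts with an explicit codec, UTF-8 by assumption, and replaces undecodable bytes instead of escaping them. The names to_text and to_bytes are hypothetical.

```python
text_type = type(u'')  # unicode on Python 2, str on Python 3


def to_text(value, encoding='utf-8'):
    """Return value as Unicode text without raising on bad bytes."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # Undecodable bytes become U+FFFD instead of raising.
        return value.decode(encoding, 'replace')
    # Any other object (numbers, None, ...) is converted via its text form.
    return text_type(value)


def to_bytes(value, encoding='utf-8'):
    """Return value as encoded bytes."""
    if isinstance(value, bytes):
        return value
    return to_text(value).encode(encoding)
```

For example, to_text(b'caf\xc3\xa9') gives the text u'caf\xe9', and to_bytes(u'\xa0') gives b'\xc2\xa0'. Whether replacing bad bytes is acceptable depends on the data; raising may be preferable when silent corruption would be worse than a crash.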
Answer by pepoluan for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:
# 'value' contains the problematic data
unic = u''
unic += value
value = unic
I had this idea after reading Ned's presentation.
I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.
Answer by Andriy Ivaneyko for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Add the line below at the beginning of your script (or as the second line):
# -*- coding: utf-8 -*-
That's the declaration of the Python source code encoding. More info in PEP 263.
Answer by Ashwin for UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Well, I tried everything but it did not help. After googling around I found the following, and it helped. Python 2.7 is in use.
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')