Metahandler scraping non-ASCII

The_Silencer

Developer
My development add-on had been working great scraping the site and pulling metadata, until the site started allowing non-ASCII names and the metahandler began stopping at the first one. Here is an example of one of the names: 'Antikörper'.
Just to get my add-on working again, I started filtering with a pattern like '([a-zA-Z \d # $ . : / = * , - " " ! @ & ?]+)', trying to include as many of the special characters that appear in the movie list as I could. It scrapes the list and pulls metadata, but there has to be a better way of fixing the non-ASCII issue?

--UPDATE--
After spending the weekend working on this and asking questions of the awesome devs here at xbmchub, I was able to come up with a solution that works with my add-on. This is the code I used to set the default encoding to match the website I am scraping:

import sys
reload(sys)  # have to reload sys first, or setdefaultencoding is missing and you get an error
sys.setdefaultencoding('iso8859-1')  # iso8859-1 is the encoding of the website I am scraping

The Python documentation states this is not the best way to handle encoding, but it was easy and it does work for my situation. Any feedback or recommendations are welcome.
 

voinage

Banned
Find out what the website's encoding is, then use that to decode.

name.decode('whatever'). Other than that, I often have to escape special characters or language-specific ones.
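
A minimal sketch of that idea (Python 2, which these add-ons run on); the encoding here is just an example of what a site might declare:

raw = 'Antik\xf6rper'  # byte string as it comes off the wire (ö is \xf6 in iso8859-1)
name = raw.decode('iso8859-1')  # now a unicode object, safe to pass along to the metahandler
print name.encode('utf-8')  # Antikörper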
 

Eldorado

Moderator
Some might not have noticed, but when you use the t0mm0 common methods to grab URLs, they do try to find the encoding in the HTML and decode it.

I started doing what Bstrdsmkr does in 1Ch**nel and clean the HTML with HTMLParser, then encode to UTF-8.. it has solved all my problems so far.
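
Not his exact code, but a hedged sketch of that HTMLParser + UTF-8 cleanup (Python 2; the function name and the fallback encoding are my own choices):

import HTMLParser

def clean_html(raw, page_encoding='iso8859-1'):
    # decode the raw bytes with the page's encoding, unescape HTML
    # entities like &ouml;, then hand back plain UTF-8 bytes
    text = raw.decode(page_encoding, 'ignore')
    text = HTMLParser.HTMLParser().unescape(text)
    return text.encode('utf-8')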
 

voinage

Banned
I looked at that when I was playing with SQL.
Got tired of using a text factory, annoying blah.

Why not add that to t0mm0.common / addon.common?
It's bloody useful; that way common can import HTMLParser and the like, and it can just be called simply.

clear(annoying_text,encode) & clear(annoying_text,decode) or textzap() LOL
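
Purely hypothetical, but a helper along those lines in a common module might look something like this (the name and signature are invented here, riffing on the joke above):

import HTMLParser

def clear(annoying_text, encoding='utf-8'):
    # normalise everything to unicode, unescape entities, return UTF-8 bytes,
    # so callers never juggle bytes vs unicode themselves
    if isinstance(annoying_text, str):
        annoying_text = annoying_text.decode(encoding, 'ignore')
    annoying_text = HTMLParser.HTMLParser().unescape(annoying_text)
    return annoying_text.encode('utf-8')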
 

Eldorado

Moderator
It's a good idea!

I was thinking of expanding the whole get-html portion of it.. so that we can do more than just get the HTML of a URL.

Just wasn't sure if HTML parsing and then encoding to UTF-8 should be forced, or optional? I always start to think... 'could there be a scenario where someone would not want it done like that?'... and then I drive myself crazy trying to make a decision :)
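
One way to dodge that decision (my sketch, not anything in the actual t0mm0 API) is a keyword argument that defaults the cleanup on but lets callers opt out:

def get_html(url, clean=True, encoding='iso8859-1'):
    raw = fetch_url(url)  # stand-in for the existing url-grabbing call
    if clean:
        # hypothetical clean_html as sketched earlier in the thread
        return clean_html(raw, encoding)
    return raw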
 

voinage

Banned
Yep, you should expand that, brother.

HTML parse it, then give them the option to specify the encoding: utf-8, ascii, iso, latin-1.

The default encoding for Python is ascii.
This helps: sys.setdefaultencoding('utf-8'), then Python always tries for utf-8.
 

Bstrdsmkr

New member
I'd say default it to utf-8 with an option to pass in an encoding. The "right" thing to do is to keep everything in unicode until a special case comes along, but unfortunately the special cases aren't so special =(
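
A minimal sketch of that decode-early, encode-late pattern (Python 2; the names here are placeholders):

import urllib2

def fetch(url, encoding='utf-8'):
    # bytes in: decode at the boundary, defaulting to utf-8 but
    # overridable when a site declares something else
    raw = urllib2.urlopen(url).read()
    return raw.decode(encoding, 'ignore')

# work on unicode objects everywhere in the middle, and only encode
# back to bytes at the very edge, e.g. text.encode('utf-8')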