CSS Selector/HTML Parser in XBMC

Hamid_PaK

Developer
Apr 16, 2015
Hi Guys,

I've been working on a script to parse HTML in XBMC. As far as I can tell, there is no straightforward way to parse HTML and query the DOM in XBMC (because XBMC uses Python 2.6). So I wrote a class called HTMLParserEx, based on Python's HTMLParser, to parse the HTML. I also wrote a second, very simple class that works like a basic CSS selector.
BTW, I tried the HTML parser on a few samples of bad HTML markup and it works. However, if the HTML is really messy, it doesn't parse properly because of the underlying XML parser in Python. The CSS selector is still useful, though.
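
To illustrate the general idea (this is not the HTMLParserEx code itself, which is in the pastebin link further down), here is a minimal sketch of turning HTMLParser events into an ElementTree. The class name TreeParser, the synthetic 'document' root, and the way unclosed tags are handled are all assumptions made for this example.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2
from xml.etree.ElementTree import TreeBuilder

# HTML elements that never take a closing tag.
VOID_TAGS = frozenset(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                       'input', 'link', 'meta', 'param', 'source', 'track',
                       'wbr'])


class TreeParser(HTMLParser):
    """Hypothetical stand-in for HTMLParserEx: builds an ElementTree."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.builder = TreeBuilder()
        self.open_tags = []                  # tags still waiting for a close
        self.builder.start('document', {})   # synthetic root (assumption)

    def _attrs(self, attrs):
        # Attributes without a value (e.g. <input disabled>) become ''.
        return dict((k, v or '') for k, v in attrs)

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, self._attrs(attrs))
        if tag in VOID_TAGS:
            self.builder.end(tag)            # close void elements at once
        else:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.builder.start(tag, self._attrs(attrs))
        self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return                           # stray closing tag: ignore it
        while self.open_tags:                # also closes unclosed children
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def close(self):
        HTMLParser.close(self)
        while self.open_tags:                # close whatever is still open
            self.builder.end(self.open_tags.pop())
        self.builder.end('document')
        return self.builder.close()


parser = TreeParser()
parser.feed('<div id="a"><p>Hi<br>there<img src="x.png"></div>')
root = parser.close()
```

In this sketch the missing closing tag for the p element is recovered when the div closes, which is the kind of mildly bad markup that can be tolerated. Badly interleaved tags will still produce a tree that does not match what a browser would build.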

Supported CSS selectors:
Code:
#id           : div#this_is_the_id or simply #the-ID
.class        : a.this_is_the_class or div.the-class
[attr=""]     : [key*=part-of-value] or [key=exact-value] or [key^=start-of-value] or [key]
tag           : div or img
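
As a rough illustration of how the selector forms listed above could be tested against a single element, here is a minimal sketch. The function name matches, the regular expression, and the restriction to one id, one class and one attribute test per selector are assumptions for this example and may differ from what the actual class does.

```python
import re
from xml.etree.ElementTree import Element

# tag, #id, .class and one [key], [key=v], [key*=v] or [key^=v] test.
SELECTOR_RE = re.compile(
    r'^(?P<tag>[\w-]+)?'
    r'(?:#(?P<id>[\w-]+))?'
    r'(?:\.(?P<cls>[\w-]+))?'
    r'(?:\[(?P<key>[\w-]+)(?:(?P<op>[*^]?=)(?P<val>[^\]]*))?\])?$'
)


def matches(element, selector):
    """Return True if `element` satisfies the simple `selector`."""
    m = SELECTOR_RE.match(selector)
    if m is None:
        raise ValueError('unsupported selector: %r' % selector)
    if m.group('tag') and element.tag != m.group('tag'):
        return False
    if m.group('id') and element.get('id') != m.group('id'):
        return False
    if m.group('cls') and m.group('cls') not in element.get('class', '').split():
        return False
    key = m.group('key')
    if key:
        actual = element.get(key)
        if actual is None:
            return False                     # attribute must be present
        op = m.group('op')
        expected = (m.group('val') or '').strip('"\'')
        if op == '=':
            return actual == expected        # exact value
        if op == '*=':
            return expected in actual        # substring of the value
        if op == '^=':
            return actual.startswith(expected)
    return True                              # bare [key] or no attribute test


link = Element('a', {'href': 'forumdisplay.php?fid=3', 'class': 'nav big'})
print(matches(link, 'a[href*=forumdisplay.php?]'))
```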
Refining the filter (a second find only searches within the elements selected by the first):
Code:
g = cssSelector(root)
g.find('a[href*=forumdisplay.php?]')
g.find('img')
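
The narrowing behaviour can be sketched as follows. This is only an illustration: the Selection class is a hypothetical stand-in for cssSelector, and it takes a plain predicate function instead of a selector string to keep the example short.

```python
from xml.etree.ElementTree import fromstring


class Selection(object):
    """Hypothetical stand-in for cssSelector with a narrowing `selected` list."""

    def __init__(self, root):
        self.selected = [root]

    def find(self, predicate):
        found = []
        for scope in self.selected:
            # iter() needs Python 2.7+; on Python 2.6 use getiterator().
            for el in scope.iter():
                if el is not scope and predicate(el) and el not in found:
                    found.append(el)
        self.selected = found                # the next find starts from here
        return self


root = fromstring(
    '<div>'
    '<a href="forumdisplay.php?fid=1"><img src="a.png"/></a>'
    '<img src="b.png"/>'
    '</div>'
)
g = Selection(root)
g.find(lambda el: el.tag == 'a' and 'forumdisplay.php?' in el.get('href', ''))
g.find(lambda el: el.tag == 'img')
print([el.get('src') for el in g.selected])  # ['a.png']
```

Only the image inside the matching link survives the second find; the image that sits outside the link is never visited.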
Here is the code:
http://pastebin.com/JFgRTN0r

Here is sample code that uses it. It also compares the class against an equivalent regular-expression approach.

Code:
import re
import time

import HTMLParserEx

htmlSource = ''

timer = time.time()

# NOTE: httpConn is not defined in this snippet; it is assumed to be an
# HTTP helper provided elsewhere.
with httpConn() as conn:
  htmlSource = conn.request('http://forum.kodi.tv/index.php').read()

# All timings below are printed in milliseconds.
print 'DOWNLOAD TIME:', str(round(( time.time() - timer ) * 1000 ))
timer = time.time()

# Parse the page into a DOM tree.
parser = HTMLParserEx.HTMLParserEx()
parser.feed( htmlSource )
root = parser.close()

print 'PARSE TIME:', str(round(( time.time() - timer ) * 1000 ))
timer = time.time()

# Select every link whose href contains "forumdisplay.php?".
g = HTMLParserEx.cssSelector(root)
g.find('a[href*=forumdisplay.php?]')

print 'DOM CSS SELECTOR FILTER TIME:', str(round(( time.time() - timer ) * 1000 ))
timer = time.time()

# The same extraction with a regular expression, for comparison.
a = re.compile(r'(<a\b[^h]*href="forumdisplay\.php\?[^>]*>(.*?)</a>)', re.DOTALL | re.IGNORECASE).findall( htmlSource )

print 'DOM REGEX FILTER TIME:', str(round(( time.time() - timer ) * 1000 ))

# Print each element found by the CSS selector.
for idx, el in enumerate( g.selected ):
  print str( idx ) + " " + str( el.tag ) + ' ' + el.text + ' ' + g.toString( el )
Results (in milliseconds):
Code:
DOWNLOAD TIME: 534.0
PARSE TIME: 36.0
DOM CSS SELECTOR FILTER TIME: 4.0
DOM REGEX FILTER TIME: 1.0
I hope it's useful. Please share your thoughts.
 