No scrape

The_Silencer

Developer
Jan 29, 2013
Something I have been struggling with is skipping the parts I do not want to scrape. I use .+? for that, and it works great in something like:
<span class=".+?" id=".+?" title=".+?">(.+?)</span>. However, using the same logic to shorten a longer stretch of HTML does not work. If I have:

<span class=".+?" id=".+?" title=".+?">(.+?)</span></td>\n <td class="linkTdQuality">(.+?)</td>\n <td class="linkTdVideo">\n <div class=".+?">\n <div class=".+?"> </div>\n <div class="rating-imdb" style=".+?"> </div>\n </div>\n <div class="linkVotes">(.+?)</div>

Some of the mumbo jumbo between <span class and <div class="linkVotes"> changes between the hosters, so I try to use:
<span class=".+?" id=".+?" title=".+?">(.+?)</span></td>.+?<div class="linkVotes">(.+?)</div>

I get no results when I try to broaden the no-scrape area. Is there something else I should be using?
 
Hopefully you've found an answer in the past 2 months. Hopefully some new devs can find this useful at least.

Btw, it also helps to use * in place of + when there's a chance that there might be nothing there at times. For example, sometimes there will be a space before the end of a bracket (="" >), sometimes not (="">), and sometimes a space and a slash (="" />), which matters in cases where you want data captured from both sides of it. There's also \s and \s+ for blank spacing (spaces, tabs and line breaks). [0-9A-Za-z]+ can often be used for finding video IDs, [0-9A-Za-z\-]+ to include -'s and [0-9A-Za-z\_]+ to include underscores. I'm really not very good with the | (pipe), so I'll refrain from confusing users on that.

Also, in re.search or re.compile (and the other re functions that take a flags argument) you can end the call with
, re.IGNORECASE | re.MULTILINE | re.DOTALL)
to get some different outcomes. re.DOTALL in particular lets . match line breaks, which a .+? needs in order to skip over several lines of HTML. It's best to do some trial and error runs to get used to the flags, as they can also slow things down.

You can also put one set of brackets ( ) inside another ( ), like
<a href="(http://www.([0-9A-Za-z\.]+)/.+?)" />
which would return the whole URL as the first group (from the first opening bracket) and the domain as the second group (from the second opening bracket).

It's also good to research which symbols do and don't need a \ before them to be treated as THEMSELVES, as a SPECIAL COMMAND, or as something else. That can depend on whether the symbol is outside a () or [], inside a (), inside a [], or inside a () inside a () when using a | (pipe). For that kind of stuff I'd suggest reading http://www.regular-expressions.info/reference.html
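To make the flag and nested-bracket points concrete, here is a rough sketch in Python. The HTML snippet and the example.com link are made up for illustration (modelled on the markup in the question, not taken from any real page), so treat it as a starting point for your own trial and error rather than a drop-in fix.

```python
import re

# Hypothetical sample modelled on the markup in the question; real pages will differ.
html = (
    '<span class="a" id="b" title="c">HostOne</span></td>\n'
    ' <td class="linkTdQuality">DVD</td>\n'
    ' <td class="linkTdVideo">\n'
    ' <div class="x">\n'
    ' <div class="y"> </div>\n'
    ' <div class="rating-imdb" style="z"> </div>\n'
    ' </div>\n'
    ' <div class="linkVotes">12 votes</div>'
)

# The broadened pattern from the question: one .+? is meant to skip the middle lines.
pattern = (
    r'<span class=".+?" id=".+?" title=".+?">(.+?)</span></td>'
    r'.+?<div class="linkVotes">(.+?)</div>'
)

# Without re.DOTALL the . stops at the first \n, so nothing is found.
without_dotall = re.findall(pattern, html)

# With re.DOTALL the . also matches \n, so the .+? can cross the line breaks.
with_dotall = re.findall(pattern, html, re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # [('HostOne', '12 votes')]

# Nested brackets: group 1 is the outer ( ), group 2 is the inner ( ).
link = '<a href="http://www.example.com/video/abc123" />'  # made-up link
nested = re.search(r'<a href="(http://www.([0-9A-Za-z\.]+)/.+?)" />', link)

print(nested.group(1))  # http://www.example.com/video/abc123
print(nested.group(2))  # example.com
```

Groups are numbered by the position of their opening bracket, which is why the whole URL comes back before the domain.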

For a quick and easy cheat sheet:
. = any character (except a line break, \n, unless re.DOTALL is used)
+ = 1 or more
* = 0 or more
\s = white space (spaces, tabs and line breaks)
\n = line break (next line)
\d = a digit (0-9)
\D = any character that is NOT a digit (letters, spaces, symbols)
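
A few of the entries above can be tried out like this. It is only a sketch: the <img> tags and the 'ep12 s03' string are invented test input, and the /? (an optional slash) is an extra piece not covered in the sheet.

```python
import re

# Made-up tags: no space, a space, and a space plus slash before the closing bracket.
tags = [
    '<img id="abc-123_x">',
    '<img id="abc-123_x" >',
    '<img id="abc-123_x" />',
]

# \s* = zero or more whitespace characters; /? = an optional slash (not in the sheet above).
loose = re.compile(r'<img id="([0-9A-Za-z_\-]+)"\s*/?>')
ids = [loose.search(tag).group(1) for tag in tags]
print(ids)  # ['abc-123_x', 'abc-123_x', 'abc-123_x']

# \s+ = one or more whitespace characters, so the tag with no space is missed.
strict = re.compile(r'<img id="([0-9A-Za-z_\-]+)"\s+/?>')
strict_hits = [bool(strict.search(tag)) for tag in tags]
print(strict_hits)  # [False, True, True]

# \d matches digits only; \D matches everything that is not a digit.
digits = re.findall(r'\d+', 'ep12 s03')
non_digits = re.findall(r'\D+', 'ep12 s03')
print(digits)      # ['12', '03']
print(non_digits)  # ['ep', ' s']
```

Note that \D picks up the space as well as the letters, so it is not a "letters only" shortcut.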