1
laiwei 2012-03-01 16:11:52 +08:00 via Android
|
2
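flyphy OP @laiwei I don't really know Python; I only know PHP.
I was just thinking that a regex extracting the string between </span> and <span would be enough. Could you show me how to write the match? |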
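For reference, the regex idea asked about here could look roughly like the sketch below. It is written in Python only because that is the language already used in this thread. The helper name `text_between_spans` is made up for illustration, and the pattern assumes the message itself never contains a `<span`, so treat it as a fragile shortcut rather than a general HTML solution.

```python
import re

# Sample markup from this thread; the href values are elided in the original.
HTML = (
    '<div class="c"> <span class="cmt"><a href="...">游完1200才閃</a> 对 我 说:</span> '
    '你好,转发的赠书大概什么时候送到,上海的,谢谢 '
    '<span class="ct">2011-09-16 21:17:35</span> '
    '<a href="....." class="cc">回复他 </a> </div>'
)


def text_between_spans(html):
    """Return the text between the first </span> and the next <span, or None."""
    # Non-greedy group so the match stops at the first following <span.
    # re.S lets "." cross line breaks in case the markup is wrapped.
    match = re.search(r'</span>(.*?)<span', html, re.S)
    return match.group(1).strip() if match else None


print(text_between_spans(HTML))
```

The same pattern should carry over to PHP's preg_match with little change, since the non-greedy `(.*?)` syntax is shared.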
3
lcxz 2012-03-01 16:19:23 +08:00
Use a regular expression to strip the tags inside the div, and what's left is the content you want.
|
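One reading of this suggestion, sketched in Python: remove the child `<span>` and `<a>` elements together with their contents, then drop whatever tags remain. The function name `strip_child_elements` is hypothetical, and the sketch assumes the children are only `<span>` and `<a>` elements with no same-name nesting (a `<span>` inside a `<span>` would break it).

```python
import re

# Sample markup from this thread; the href values are elided in the original.
HTML = (
    '<div class="c"> <span class="cmt"><a href="...">游完1200才閃</a> 对 我 说:</span> '
    '你好,转发的赠书大概什么时候送到,上海的,谢谢 '
    '<span class="ct">2011-09-16 21:17:35</span> '
    '<a href="....." class="cc">回复他 </a> '
    '<a href="......." class="cc">共3条对话</a> </div>'
)


def strip_child_elements(div_html):
    """Remove <span>/<a> elements with their contents, then any leftover tags."""
    # \1 forces the closing tag to match the opening tag name (span or a).
    without_children = re.sub(
        r'<(span|a)\b[^>]*>.*?</\1>', '', div_html, flags=re.S
    )
    # Drop the remaining tags (here the surrounding <div> and </div>).
    return re.sub(r'<[^>]+>', '', without_children).strip()


print(strip_child_elements(HTML))
```

Stripping only the tags and keeping their contents would also leave the sender name, the timestamp, and the link text in the result, which is why the child elements are removed whole here.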
5
phus 2012-03-01 16:28:15 +08:00
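import lxml.etree

HTML = u'''\
<div class="c"> <span class="cmt"><a href="...">游完1200才閃</a> 对 我 说:</span> 你好,转发的赠书大概什么时候送到,上海的,谢谢 <span class="ct">2011-09-16 21:17:35</span> <a href="....." class="cc">回复他 </a> <a href="......." class="cc">共3条对话</a> </div>
'''


def main():
    # Parse with the lenient HTML parser, so the fragment need not be valid XML.
    tree = lxml.etree.fromstring(HTML, lxml.etree.HTMLParser())
    # text() selects only the text nodes that are direct children of the div,
    # which skips everything inside the nested <span> and <a> elements.
    print(''.join(x.strip() for x in tree.xpath('//div[@class="c"]/text()')))


if __name__ == '__main__':
    main()
|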
6
linlinqi 2012-03-01 16:32:55 +08:00
If you're using PHP, take a look at phpQuery: http://code.google.com/p/phpquery/
|
7
orzzzzz 2012-03-01 17:45:32 +08:00
In simpledom, how about calling find(".cmt") and then reading innerText?
|