A TextPatternParser I wrote: usable for scraping, no regex needed

★わ浪漫少帅 · posted 2012-8-22 10:10:36

PHP code:

<?php
// EXAMPLES
$htmlPage = 'tab.html';
$htmlPage = file_get_contents($htmlPage);
$htmlTitle = TextPatternParser::parseBetweenText($htmlPage, '<span class="switchBtn">', '</span>');
print_r($htmlTitle);

// The second example was cut off in the original post; only its beginning survives:
// $arrayOfImagesOnpage = TextPatternParser::parseBetweenText($htmlPage, 'class="miniature"><img src="', ...

class TextPatternParser {

    // $text = the string to parse values from
    // $beginText = the beginning value to search for; parseBetweenText looks between $beginText and $endText
    // $endText = the end value to look for
    // $removeSpace = remove tabs and newlines, collapse repeated whitespace, and trim each value
    // $removeHtmlTags = strip any HTML tags from the returned values
    // $firstResultOnlyNoArray = return only the first match as a string instead of an array of matches

    public static function parseBetweenText(
        $text,
        $beginText,
        $endText,
        $removeSpace = true,
        $removeHtmlTags = true,
        $firstResultOnlyNoArray = false
    ) {
        $results = array();
        // An empty marker cannot delimit anything (and could loop forever on PHP 8).
        if ($beginText === '' || $endText === '') {
            return $firstResultOnlyNoArray ? '' : $results;
        }
        $endPos = 0;
        while (true) {
            // Find the next begin marker (case-insensitive), starting at the previous end marker.
            $beginPos = stripos($text, $beginText, $endPos);
            if ($beginPos === false) break;
            $beginPos += strlen($beginText);
            // Find the end marker that follows it.
            $endPos = stripos($text, $endText, $beginPos);
            if ($endPos === false) break;
            $result = substr($text, $beginPos, $endPos - $beginPos);
            if ($removeSpace) {
                // Drop tabs and newlines, then collapse runs of whitespace into one space.
                $result = str_replace(array("\t", "\n"), '', $result);
                $result = preg_replace('/\s{2,}/', ' ', $result);
                $result = trim($result);
            }
            if ($removeHtmlTags) {
                $result = strip_tags($result);
            }
            if ($firstResultOnlyNoArray) return $result;
            if ($result !== '') $results[] = $result;
        }
        return ($firstResultOnlyNoArray && empty($results)) ? '' : $results;
    }

}
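
To try the idea without a tab.html file, here is a minimal self-contained sketch that runs the same stripos/substr scan against an inline HTML string. The function name extractBetween and the sample markup are made up for illustration, and the whitespace handling is simpler than in the class above.

```php
<?php
// Hypothetical helper (not from the original post): a condensed version of the
// same scan loop, returning every substring found between two markers.
function extractBetween($text, $beginText, $endText)
{
    $results = array();
    if ($beginText === '' || $endText === '') {
        return $results; // empty markers cannot delimit anything
    }
    $offset = 0;
    while (($begin = stripos($text, $beginText, $offset)) !== false) {
        $begin += strlen($beginText);
        $end = stripos($text, $endText, $begin);
        if ($end === false) {
            break;
        }
        // Strip tags and surrounding whitespace from the captured text.
        $value = trim(strip_tags(substr($text, $begin, $end - $begin)));
        if ($value !== '') {
            $results[] = $value;
        }
        $offset = $end;
    }
    return $results;
}

// Made-up sample markup standing in for tab.html.
$html = '<span class="switchBtn"> Tab <b>One</b> </span>'
      . '<div class="miniature"><img src="a.png"></div>'
      . '<span class="switchBtn">Tab Two</span>';

$titles = extractBetween($html, '<span class="switchBtn">', '</span>');
print_r($titles);
```

With this sample input the call collects the text of both switchBtn spans, and the same helper can pull an attribute value by using `<img src="` and `"` as the two markers.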


sdink · posted 2012-8-22 10:45:44

Search Baidu for phpQuery.

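★わ浪漫少帅 · posted 2012-8-22 13:30:07

sdink posted 2012-8-22 10:45
Search Baidu for phpQuery.

pq really is handy, but for simple jobs there is no need to pull it in. There are plenty of other open-source scraping tools too, such as Snoopy. The best choice is whichever one fits the task at hand.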

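sdink · posted 2012-8-23 08:36:26

★わ浪漫少帅 posted 2012-8-22 13:30
pq really is handy, but for simple jobs there is no need to pull it in. There are plenty of other open-source scraping tools too, such as Snoopy. The best choice ...

Snoopy still isn't as handy as phpQuery. phpQuery works like jQuery, and its selectors are very powerful. Sadly I don't have an example to show. Just my personal opinion.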

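★わ浪漫少帅 · posted 2012-8-23 10:54:11

sdink posted 2012-8-23 08:36
Snoopy still isn't as handy as phpQuery. phpQuery works like jQuery, and its selectors are very powerful. Sadly I don't have an example to show. Just my personal opinion. ...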

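If we are talking about power, pq certainly wins. I used it for a while and wrote quite a few extensions for it. It works by parsing DOM nodes, whereas DOMDocument is built specifically for handling HTML/XML. It provides a powerful XPath selector and many other HTML/XML functions, and there is also an htmlSQL class that lets you scrape using SQL syntax. htmlSQL uses Snoopy.class.php for its network operations but not for its string handling. I'm not debating which tool is more powerful here; I'm just sharing some source code for scraping, nothing more.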
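
For comparison, the DOM route mentioned here can be sketched with PHP's DOMDocument and DOMXPath classes. The sample markup is made up for illustration, and only the switchBtn class name is borrowed from the example at the top of the thread.

```php
<?php
// Made-up sample markup; only the switchBtn class name comes from the thread.
$html = '<div><span class="switchBtn">Tab One</span>'
      . '<span class="switchBtn">Tab Two</span>'
      . '<span class="other">Ignored</span></div>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$titles = array();
// Select every <span> whose class attribute is exactly "switchBtn".
foreach ($xpath->query('//span[@class="switchBtn"]') as $node) {
    $titles[] = trim($node->textContent);
}
print_r($titles);
```

The trade-off is the one discussed in this thread: the DOM approach survives markup changes around the target element, while the plain string scan needs no parser and is enough when the surrounding text is stable.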

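sdink · posted 2012-8-23 11:33:00

★わ浪漫少帅 posted 2012-8-23 10:54
If we are talking about power, pq certainly wins. I used it for a while and wrote quite a few extensions for it. It works by parsing DOM nodes, whereas DOMDocum ...

Thanks for sharing.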

futi · posted 2013-8-2 14:26:13

simplehtmldom is pretty good too.

smartweb · posted 2014-5-2 11:48:06

I have only ever used Snoopy. Looks like I need to go and learn the others.
