Search HTML for 2 phrases (ignoring all tags) and strip everything else
Hello ?????!
random code random codeLorem ipsum.
';
Then I have two sentences stored in variables:
$begin = 'Hello ?????!'; $end = 'Lorem ipsum.';
I want to search $html for these two sentences, and strip everything before and after them. So $html will become:
$html = 'Hello ?????! random code random code
Lorem ipsum.';
How can I achieve this? Note that the $begin and $end variables do not have HTML tags, but the sentences in $html very likely do have tags, as shown above.
Maybe a regex approach?
What I've tried so far
- A strpos() approach. The problem is that $html contains tags in the sentences, making the $begin and $end sentences not match. I can strip_tags($html) before running strpos(), but then I will obviously end up with $html without the tags.
- Search part of variable, like Hello, but that's never safe and will give many matches.
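The trade-off described in the list above can be reproduced in a few lines. This is only a minimal sketch: the sample markup and the phrase Hello World! are assumptions made for illustration, not the asker's real data.

```php
<?php
// Hypothetical sample: the tags inside the sentences are an assumption.
$html  = '<div><p>H<b>ello</b> World!</p><p>random code</p><p>Lorem <i>ipsum</i>.</p></div>';
$begin = 'Hello World!';
$end   = 'Lorem ipsum.';

// Searching the raw HTML fails because tags interrupt the sentence.
$rawPos = strpos($html, $begin);

// Searching the stripped text works, but the offsets (and the result)
// now refer to the tag-free string, so the markup is lost.
$text     = strip_tags($html);
$textFrom = strpos($text, $begin);
$textTo   = strpos($text, $end) + strlen($end);
$plain    = substr($text, $textFrom, $textTo - $textFrom);

var_dump($rawPos);
echo $plain, "\n";
```

The second search finds both sentences, which is why several answers below keep the stripped text for searching and translate the offsets back to the original HTML afterwards.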
Answer by Druzion for Search HTML for 2 phrases (ignoring all tags) and strip everything else
You could try this RegEx:
(.*?)                      # Data before sentences (to be removed)
(                          # Capture Both sentences and text in between
  H.*?e.*?l.*?l.*?o.*?\s   # Hello[space]
  (<.*?>)*                 # Optional Opening Tag(s)
  ?.*??.*??.*??.*??.*?     # ?????
  (<\/.*?>)*               # Optional Closing Tag(s)
  (.*?)                    # Optional Data in between sentences
  (<.*?>)*                 # Optional Opening Tag(s)
  L.*?o.*?r.*?e.*?m.*?\s   # Lorem[space]
  (<.*?>)*                 # Optional Opening Tag(s)
  i.*?p.*?s.*?u.*?m.*?     # ipsum
)
(.*)                       # Data after sentences (to be removed)
Substituting with the 2nd Capture Group.
The Regex can be shortened to:
(.*?)(H.*?e.*?l.*?l.*?o.*?\s(<.*?>)*?.*??.*??.*??.*??.*?(<\/.*?>)*(.*?)(<.*?>)*L.*?o.*?r.*?e.*?m.*?\s(<.*?>)*i.*?p.*?s.*?u.*?m.*?)(.*)
Answer by Tim007 for Search HTML for 2 phrases (ignoring all tags) and strip everything else
Just for fun
).*/s";
$str = " \n \n
Hello Moto!\n random code\n random code\n
Lorem ipsum.
\n \n \n ";
$subst = "$1";
$result = preg_replace($re, $subst, $str);
echo $result."\n";
?>
Input
$begin = 'Hello Moto!'; $end = 'Lorem ipsum.';
Hello Moto! random code random code
Lorem ipsum.
Output
Hello Moto! random code random code
Lorem ipsum.
Answer by Paul for Search HTML for 2 phrases (ignoring all tags) and strip everything else
This might by far not be the optimal solution, but I love cracking my head about such "riddles", so here's my approach.
Hello Lydia!
random code random code Lorem ipsum.
';
$begin = 'Hello Lydia!';
$end = 'Lorem ipsum.';

$begin_chars = str_split($begin);
$end_chars = str_split($end);

$begin_re = '';
$end_re = '';

foreach ($begin_chars as $c) {
    if ($c == ' ') {
        $begin_re .= '(\s|(<[a-z/]+>))+';
    } else {
        $begin_re .= $c . '(<[a-z/]+>)?';
    }
}

foreach ($end_chars as $c) {
    if ($c == ' ') {
        $end_re .= '(\s|(<[a-z/]+>))+';
    } else {
        $end_re .= $c . '(<[a-z/]+>)?';
    }
}

$re = '~(.*)((' . $begin_re . ')(.*)(' . $end_re . '))(.*)~ms';

$result = preg_match( $re, $subject , $matches );
$start_tag = preg_match( '~(<[a-z/]+>)$~', $matches[1] , $stmatches );

echo $stmatches[1] . $matches[2];
This outputs:
Hello Lydia!
random code random code Lorem ipsum.
This is matching this case, but I think it would require some more logic to escape regex special chars like periods.
In general, what this snippet does:
- Splitting the strings into arrays, each array value representing a single character. This needs to be done because Hello needs to match a Hello that has tags between its letters as well.
- To do that, for the regex part an additional (<[a-z/]+>)? is inserted after each character, with a special case for the space character.
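The two steps above, combined with the escaping that the answer says is still missing, could look roughly like this. It is a sketch under assumptions: tagTolerantPattern() is a hypothetical helper name, the sample markup is invented, and the byte-wise str_split() only suits ASCII phrases.

```php
<?php
// Build a pattern that tolerates tags between the characters of $phrase.
// tagTolerantPattern() is a hypothetical helper, not part of the answer above.
function tagTolerantPattern(string $phrase, string $delimiter = '~'): string
{
    $tag   = '(?:<[^<>]+>)*';                 // zero or more tags between characters
    $parts = [];
    foreach (str_split($phrase) as $c) {      // byte-wise split: ASCII phrases only
        $parts[] = ($c === ' ')
            ? '(?:\s|<[^<>]+>)+'              // a space may be whitespace and/or tags
            : preg_quote($c, $delimiter);     // escape ".", "?", "!" and similar
    }
    return implode($tag, $parts);
}

// Hypothetical sample markup.
$subject = '<div><p>H<b>ello</b> Lydia!</p><p>random code</p><p>Lorem <i>ipsum</i>.</p></div>';

$re = '~' . tagTolerantPattern('Hello Lydia!') . '.*?' . tagTolerantPattern('Lorem ipsum.') . '~s';

if (preg_match($re, $subject, $m)) {
    echo $m[0], "\n";
}
```

Running every character through preg_quote() means a period in the phrase only matches a literal period, which addresses the escaping concern raised above.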
Answer by Wiktor Stribiżew for Search HTML for 2 phrases (ignoring all tags) and strip everything else
Here is a short, yet - I believe - working solution based on a lazy dot matching regex (that can be improved by creating a longer, unrolled regex, but should be enough unless you have really large chunks of text).
$html = "\n\nH
ello ? ????!\nrandom code\nrandom code\nLorem ipsum.
\n\n ";
$begin = 'Hello ?????!';
$end = 'Lorem ipsum.';

$begin = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $begin);
$end = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $end);

$begin_arr = preg_split('~(?=\X)~u', $begin, -1, PREG_SPLIT_NO_EMPTY);
$end_arr = preg_split('~(?=\X)~u', $end, -1, PREG_SPLIT_NO_EMPTY);

$reg = "(?s)(?:<[^<>]+>)?(?:&#?\\w+;)*\\s*"
    . implode("", array_map(function($x, $k) use ($begin_arr) {
        return ($k < count($begin_arr) - 1
            ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\\w+;))*"
            : preg_quote($x, "~"));
    }, $begin_arr, array_keys($begin_arr)))
    . "(.*?)"
    . implode("", array_map(function($x, $k) use ($end_arr) {
        return ($k < count($end_arr) - 1
            ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\\w+;))*"
            : preg_quote($x, "~"));
    }, $end_arr, array_keys($end_arr)));

echo $reg .PHP_EOL;
preg_match('~' . $reg . '~u', $html, $m);
print_r($m[0]);
See the IDEONE demo
Algorithm:
- Create a dynamic regex pattern by splitting the delimiter strings into single graphemes (since these can be Unicode characters, I suggest using preg_split('~(?=\X)~u', $begin, -1, PREG_SPLIT_NO_EMPTY), as in the code above) and imploding back by adding an optional tag matching pattern (?:<[^<>]+>)?.
- Then, (?s) enables a DOTALL mode where . matches any character including a newline, and .*? will match 0+ characters from the leading to trailing delimiter.
Regex details:
- '~(?=\X)~u' matches the location before each grapheme; PREG_SPLIT_NO_EMPTY discards the empty piece at the start of the string
- (sample final regex)
(?s)(?:<[^<>]+>)?(?:&#?\w+;)*\s*H(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*?(?:\s*(?:<[^<>]+>|&#?\w+;))*?(?:\s*(?:<[^<>]+>|&#?\w+;))*?(?:\s*(?:<[^<>]+>|&#?\w+;))*?(?:\s*(?:<[^<>]+>|&#?\w+;))*?(?:\s*(?:<[^<>]+>|&#?\w+;))*\!(?:\s*(?:<[^<>]+>|&#?\w+;))*
+ (.*?)
+ L(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))*r(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*i(?:\s*(?:<[^<>]+>|&#?\w+;))*p(?:\s*(?:<[^<>]+>|&#?\w+;))*s(?:\s*(?:<[^<>]+>|&#?\w+;))*u(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))*\.
- the leading and trailing delimiters with optional subpatterns for tag matching and a (.*?) (capturing might not be necessary) inside.
- The ~u modifier is necessary since Unicode strings are to be processed.
- UPDATE: To account for 1+ spaces, any whitespace in the begin and end patterns can be replaced with a \s+ subpattern to match any kind of 1+ whitespace characters in the input string.
- UPDATE 2: The auxiliary $begin = preg_replace('~\s+~u', ' ', $begin); and $end = preg_replace('~\s+~u', ' ', $end); are necessary to account for 1+ whitespace in the input string.
- To account for HTML entities, add another subpattern to the optional parts: &#?\\w+; which will also match entities such as &nbsp; and &#123;. It is also prepended with \s* to match optional whitespace, and quantified with * (can be zero or more).
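The whitespace and entity handling from the update notes can be folded into one small helper. The following is a sketch, not the answer's code: delimiterPattern() is a hypothetical name, the sample markup is invented, and graphemes are collected with preg_match_all() and \X rather than with the preg_split() call used above.

```php
<?php
// delimiterPattern() is a hypothetical helper name used only for this sketch.
function delimiterPattern(string $phrase): string
{
    // Collapse runs of whitespace and trim, so "Hello  World" equals "Hello World".
    $phrase = trim(preg_replace('~\s+~u', ' ', $phrase));

    // Collect graphemes so that multi-byte characters stay intact.
    preg_match_all('~\X~u', $phrase, $m);

    // Between two characters, allow tags or entities, each optionally
    // preceded by whitespace.
    $gap = '(?:\s*(?:<[^<>]+>|&#?\w+;))*';

    // A space in the phrase matches 1+ whitespace characters in the input.
    $parts = array_map(
        fn ($c) => $c === ' ' ? '\s+' : preg_quote($c, '~'),
        $m[0]
    );

    return implode($gap, $parts);
}

// Hypothetical sample: extra spaces, an entity inside a word, and inline tags.
$html  = "<p>H<b>ello</b>   W&#8203;orld!</p>\n<p>random code</p>\n<p>Lorem <i>ipsum</i>.</p>";
$begin = 'Hello  World!';
$end   = 'Lorem ipsum.';

$reg = '~' . delimiterPattern($begin) . '(.*?)' . delimiterPattern($end) . '~su';

if (preg_match($reg, $html, $match)) {
    echo $match[0], "\n";
}
```

An entity that stands in for a space (for example a non-breaking space between the two words) is not covered here, because the space still requires at least one real whitespace character.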
Answer by v7d8dpo4 for Search HTML for 2 phrases (ignoring all tags) and strip everything else
How about this?
$escape=array('\\'=>1,'^'=>1,'?'=>1,'+'=>1,'*'=>1,'{'=>1,'}'=>1,'('=>1,')'=>1,'['=>1,']'=>1,'|'=>1,'.'=>1,'$'=>1,'+'=>1,'/'=>1);
$pattern='/';
for($i=0;isset($begin[$i]);$i++){
    if(ord($c=$begin[$i])<0x80||ord($c)>0xbf){
        if(isset($escape[$c]))
            $pattern.="([ \t\r\n\v\f]*<\\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*\\$c";
        else
            $pattern.="([ \t\r\n\v\f]*<\\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*$c";
    }
    else
        $pattern.=$c;
}
$pattern.="(.|\n|\r)*";
for($i=0;isset($end[$i]);$i++){
    if(ord($c=$end[$i])<0x80||ord($c)>0xbf){
        if(isset($escape[$c]))
            $pattern.="([ \t\r\n\v\f]*<\\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*\\$c";
        else
            $pattern.="([ \t\r\n\v\f]*<\\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*$c";
    }
    else
        $pattern.=$c;
}
$pattern[17]='?';
$pattern.='(<\\/?[a-zA-Z]+>)?/';
preg_match($pattern,$html,$a);
$match=$a[0];
Answer by Dávid Horváth for Search HTML for 2 phrases (ignoring all tags) and strip everything else
I really wanted to write a regex solution, but others have preceded me with some nice and complex ones. So, here is a non-regex solution.
Short explanation: The major problem is keeping HTML tags. We could easily search text, if HTML tags were stripped. So: strip these! We can easily search in the stripped content, and produce a substring we want to cut (see 'Begin-End' part). Then, try to cut this substring from the HTML while keeping the tags.
Advantages:
- Searching is easy and independent from HTML, you can search with regex too if you need
- Requirements are scalable: you can easily add full multibyte support, support for entities and white-space collapse, and so on
- Relatively fast (though a direct regex may be faster)
- Does not touch original HTML, and adaptable to other markup languages
A static utility class for this scenario:
Explanation comments will be added later!
class HtmlExtractUtil
{
    const FAKE_MARKUP = '<>';
    const MARKUP_PATTERN = '#<[^>]+>#u';

    static public function extractBetween($html, $startTextToFind, $endTextToFind)
    {
        $strippedHtml = preg_replace(self::MARKUP_PATTERN, '', $html);
        $startPos = strpos($strippedHtml, $startTextToFind);
        $lastPos = strrpos($strippedHtml, $endTextToFind);
        if ($startPos === false || $lastPos === false) {
            return "";
        }
        $endPos = $lastPos + strlen($endTextToFind);
        if ($endPos <= $startPos) {
            return "";
        }
        return self::extractSubstring($html, $startPos, $endPos);
    }

    static public function extractSubstring($html, $startPos, $endPos)
    {
        preg_match_all(self::MARKUP_PATTERN, $html, $matches, PREG_OFFSET_CAPTURE);
        $start = -1;
        $end = -1;
        $previousEnd = 0;
        $stripPos = 0;
        $matchArray = $matches[0];
        $matchArray[] = [self::FAKE_MARKUP, strlen($html)];
        foreach ($matchArray as $match) {
            $diff = $previousEnd - $stripPos;
            $textLength = $match[1] - $previousEnd;
            if ($start == (-1)) {
                if ($startPos >= $stripPos && $startPos < $stripPos + $textLength) {
                    $start = $startPos + $diff;
                }
            }
            if ($end == (-1)) {
                if ($endPos > $stripPos && $endPos <= $stripPos +
Usage:
$html = ' Any string before
Hello ?????!
random code random code Lorem ipsum.
Any string after
';
$startTextToFind = 'Hello ?????!';
$endTextToFind = 'Lorem ipsum.';
$extractedText = HtmlExtractUtil::extractBetween($html, $startTextToFind, $endTextToFind);
header("Content-type: text/plain; charset=utf-8");
echo $extractedText . "\n";
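The strip-then-map idea from the short explanation can also be sketched on its own, independently of the class above. This is a minimal sketch under assumptions: extractBetweenSketch() is a hypothetical name, the sample markup is invented, and offsets are byte offsets, so it is not multibyte-aware.

```php
<?php
// Sketch: find the phrases in the tag-free text, then map the text offsets
// back to offsets in the original HTML. extractBetweenSketch() is a
// hypothetical name, not the utility class from this answer.
function extractBetweenSketch(string $html, string $begin, string $end): string
{
    $tagPattern = '#<[^>]+>#';

    // 1. Search in the stripped text.
    $text = preg_replace($tagPattern, '', $html);
    $from = strpos($text, $begin);
    $last = strrpos($text, $end);
    if ($from === false || $last === false) {
        return '';
    }
    $to = $last + strlen($end);
    if ($to <= $from) {
        return '';
    }

    // 2. Split the HTML into text chunks, each with its offset in the HTML.
    //    Concatenating the chunks gives exactly the stripped text.
    $chunks = preg_split($tagPattern, $html, -1, PREG_SPLIT_OFFSET_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // 3. Walk the chunks, tracking how much stripped text has been consumed.
    $start = $stop = 0;
    $startFound = false;
    $textPos = 0;
    foreach ($chunks as [$chunk, $htmlOffset]) {
        $len = strlen($chunk);
        if (!$startFound && $from < $textPos + $len) {
            $start = $htmlOffset + ($from - $textPos);   // first byte of the match
            $startFound = true;
        }
        if ($to <= $textPos + $len) {
            $stop = $htmlOffset + ($to - $textPos);      // one past the last byte
            break;
        }
        $textPos += $len;
    }

    return substr($html, $start, $stop - $start);
}

// Hypothetical sample markup.
$html = '<div>Any string before</div><p>H<b>ello</b> World!</p>'
      . '<p>random code</p><p>Lorem <i>ipsum</i>.</p><div>Any string after</div>';

echo extractBetweenSketch($html, 'Hello World!', 'Lorem ipsum.'), "\n";
```

The result is cut exactly at the first and last matched characters, so a tag that opens before the first phrase or closes after the last one is not included; balancing those tags would need extra logic.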
Answer by Steve Chambers for Search HTML for 2 phrases (ignoring all tags) and strip everything else
PHP solution:
$html = ' Hello ?????!
random code random code Lorem ipsum.
';
$begin = 'Hello ?????!';
$end = 'Lorem ipsum.';

$matchHtmlTag = '(?:<.*?>)?';
$matchAllNonGreedy = '(?:.|\r?\n)*?';
$matchUnescapedCharNotAtEnd = '([^\\\\](?!$)|\\.(?!$))';

$matchBeginWithTags = preg_replace(
    $matchUnescapedCharNotAtEnd, '$0' . $matchHtmlTag, preg_quote($begin));
$matchEndWithTags = preg_replace(
    $matchUnescapedCharNotAtEnd, '$0' . $matchHtmlTag, preg_quote($end));

$pattern = '/' . $matchBeginWithTags . $matchAllNonGreedy . $matchEndWithTags . '/';
preg_match($pattern, $html, $matches);
$html = $matches[0];
Generated regex ($pattern):
H(?:<.*?>)?e(?:<.*?>)?l(?:<.*?>)?l(?:<.*?>)?o(?:<.*?>)? (?:<.*?>)??(?:<.*?>)??(?:<.*?>)??(?:<.*?>)??(?:<.*?>)??(?:<.*?>)?!(?:.|\r?\n)*?L(?:<.*?>)?o(?:<.*?>)?r(?:<.*?>)?e(?:<.*?>)?m(?:<.*?>)? (?:<.*?>)?i(?:<.*?>)?p(?:<.*?>)?s(?:<.*?>)?u(?:<.*?>)?m(?:<.*?>)?\.
Answer by trincot for Search HTML for 2 phrases (ignoring all tags) and strip everything else
Regular expressions have their limitations when it comes to parsing HTML. Like many have done before me, I will refer to this famous answer.
Potential Problems when relying on Regular Expressions
For instance, imagine this tag appears in the HTML before the part that must be extracted:
This comes before the match
Many regexp solutions will stumble over this, and return a string that starts in the middle of this opening p tag.
Or consider a comment inside the HTML section that has to be matched:
Or, some loose less-than and greater-than signs appear (let's say in a comment, or attribute value):
What will those regexes do with that?
These are just examples... there are countless other situations that pose problems to regular expression based solutions.
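One such situation is easy to reproduce. The snippet below is a sketch with an invented input: a greater-than sign inside an attribute value is enough to make a common tag pattern cut the tag in half.

```php
<?php
// Hypothetical input: an attribute value that contains a ">" character.
$html = '<p title="a > b">Hello</p>';

// A typical tag pattern stops at the first ">", which sits inside the attribute.
$stripped = preg_replace('~<[^<>]+>~', '', $html);

var_dump($stripped);   // string(9) " b">Hello"

// A DOM parser treats the attribute as a unit (only if ext-dom is available).
if (class_exists('DOMDocument')) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    var_dump($dom->textContent);
}
```

The leftover b"> is part of the attribute, not of the visible text, so any offset computed from that stripped string is already wrong.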
There are more reliable ways to parse HTML.
Load the HTML into a DOM
I will suggest here a solution based on the DOMDocument interface, using this algorithm:
- Get the text content of the HTML document and identify the two offsets where both substrings (begin/end) are located.
- Then go through the DOM text nodes, keeping track of the offsets where these nodes fit in. In the nodes where either of the two bounding offsets is crossed, a predefined delimiter (|) is inserted. That delimiter should not be present in the HTML string. Therefore it is doubled (||, ||||, ...) until that condition is met.
- Finally, split the HTML representation by this delimiter and extract the middle part as the result.
Here is the code:
function extractBetween($html, $begin, $end) {
    $dom = new DOMDocument();
    // Load HTML in DOM, making sure it supports UTF-8; double HTML tags are no problem
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>' . $html);
    // Get complete text content
    $text = $dom->textContent;
    // Get positions of the beginning/ending text; exit if not found.
    if (($from = strpos($text, $begin)) === false) return false;
    if (($to = strpos($text, $end, $from + strlen($begin))) === false) return false;
    $to += strlen($end);
    // Define a non-occurring delimiter by repeating `|` enough times:
    for ($delim = '|'; strpos($html, $delim) !== false; $delim .= $delim);
    // Use XPath to traverse the DOM
    $xpath = new DOMXPath($dom);
    // Go through the text nodes keeping track of total text length.
    // When exceeding one of the two offsets, inject a delimiter at that position.
    $pos = 0;
    foreach($xpath->evaluate("//text()") as $node) {
        // Add length of node's text content to total length
        $newpos = $pos + strlen($node->nodeValue);
        while ($newpos > $from || ($from === $to && $newpos === $from)) {
            // The beginning/ending text starts/ends somewhere in this text node.
            // Inject the delimiter at that position:
            $node->nodeValue = substr_replace($node->nodeValue, $delim, $from - $pos, 0);
            // If a delimiter was inserted at both beginning and ending texts,
            // then get the HTML and return the part between the delimiters
            if ($from === $to) return explode($delim, $dom->saveHTML())[1];
            // Delimiter was inserted at beginning text. Now search for ending text
            $from = $to;
        }
        $pos = $newpos;
    }
}
You would call it like this:
// Sample input data
$html = ' This comes before the match
Hey! Hello ?????!
random code random code Lorem ipsum. la la la
This comes after the match
';
$begin = 'Hello ?????!';
$end = 'Lorem ipsum.';
// Call
$html = extractBetween($html, $begin, $end);
// Output result
echo $html;
Output:
Hello ?????! random code random code
Lorem ipsum.
You'll find this code is also easier to maintain than regex alternatives.
See it run on eval.in.
Answer by Quasimodo's clone for Search HTML for 2 phrases (ignoring all tags) and strip everything else
There are several different approaches to do a content search on HTML source. They all have advantages and disadvantages. If unknown structure in the code is an issue, the safest way would be to use an XML parser; however, those are complex and therefore rather slow.
Regular expressions are designed for text processing. Although regexp is not the quickest thing due to overhead, the preg_ functions are a reasonable compromise to keep code small and concise without paying too much in performance, if and only if you prevent patterns from becoming too complex.
Analysis of HTML structures is doable with recursive regular expressions. Since they slow down the processing and are hard to debug, I prefer to code the base logic in PHP and utilize preg_ functions to do smaller quick tasks.
Here is a solution in OOP, a tiny class intended to process many searches on the same HTML source. It is already an approach to handle extended similar problems like adding preceding and succeeding content until the next tag boundary. It does not claim to be a perfect solution yet, but it is easily extendable.
The logic is: pay some runtime for initialization to store tag positions relative to plain text, strip tags, and store the strings between <...> and the sums of their lengths as well. Then, on each content search, match the needles against the plain content. Locate the start/end position in the HTML source by binary search.
Binary search works like this: a sorted list is required. You store the index of the first element and of the last element + 1. Calculate the average by an addition and an integer division by 2. Division and floor are done efficiently by a right bit shift. If the found value is too low, set the lower index variable to the current index, else the greater one. Stop when the index difference is 1. If you search for an exact value, break early when the element is found. 0,(14+1) => 7 ; 7,15 => 11 ; 7,11 => 9 ; 7,9 => 8 ; 8-7 = diff. 1. Instead of 15 iterations only 4 are done. The longer the list, the greater the saving, since the number of iterations grows only logarithmically.
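As a standalone illustration of that description, here is a minimal sketch. The function name lastIndexNotGreater() and the sample positions are assumptions; the class below keeps its two indexes in an array keyed by true/false instead of two variables.

```php
<?php
// Return the index of the last element in the sorted $list that is <= $value.
// Assumes $list is sorted ascending and that $list[0] <= $value.
// lastIndexNotGreater() is a hypothetical name used only for this sketch.
function lastIndexNotGreater(array $list, int $value): int
{
    $low  = 0;               // invariant: $list[$low] <= $value
    $high = count($list);    // index of the last element + 1
    while ($high - $low > 1) {
        $mid = ($low + $high) >> 1;   // integer average via right bit shift
        if ($list[$mid] <= $value) {
            $low = $mid;              // found value is not too high: raise the lower index
        } else {
            $high = $mid;             // found value is too high: lower the upper index
        }
    }
    return $low;
}

// Hypothetical tag positions in the plain content, sorted ascending.
$positions = [0, 3, 8, 15, 21, 30];

echo lastIndexNotGreater($positions, 16), "\n";   // 3 (the element 15)
```

Applied to the tag list, the returned index identifies the last tag that starts at or before a given plain-text position, which is what is needed to translate that position back into the HTML source.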
PHP class:
set_html($html);
    }

    public function set_html($html)
    {
        $this->html = $html;
        $regexp = '~<.*?>~su';
        preg_match_all($regexp, $html, $this->tags, PREG_PATTERN_ORDER | PREG_OFFSET_CAPTURE);
        $this->tags = $this->tags[0];

        # we use exact the same algorithm to strip html
        $this->heystack = preg_replace($regexp, '', $html);

        # convert positions to plain content
        $sum_length = 0;
        foreach($this->tags as &$tag)
        {
            $tag['pos_in_content'] = $tag[1] - $sum_length;
            $tag['sum_length' ] = $sum_length += strlen($tag[0]);
        }

        # zero length dummy tags to mark start/end position of strings not beginning/ending with a tag
        array_unshift($this->tags , [0 => '', 1 => 0, 'pos_in_content' => 0, 'sum_length' => 0 ]);
        array_push ($this->tags , [0 => '', 1 => strlen($html)-1]);
    }

    public function translate_pos_plain2html($content_position)
    {
        # binary search
        $idx = [true => 0, false => count($this->tags)-1];
        while(1 < $idx[false] - $idx[true])
        {
            $i = ($idx[true] + $idx[false]) >>1; // integer half of both array indexes
            $idx[$this->tags[$i]['pos_in_content'] <= $content_position] = $i; // hold one index less and the other greater
        }
        $this->current_tag_idx = $idx[true];
        return $this->tags[$this->current_tag_idx]['sum_length'] + $content_position;
    }

    public function &find_content($needle_start, $needle_end = '', $result_modifiers = self::RESULT_NO_MODIFICATION)
    {
        $needle_start = preg_quote($needle_start, '~');
        $needle_end = '' == $needle_end ? '' : preg_quote($needle_end , '~');

        if((self::MATCH_BLANK_MULTIPLE | self::MATCH_BLANK_AS_WHITESPACE) & $result_modifiers)
        {
            $replacement = self::MATCH_BLANK_AS_WHITESPACE & $result_modifiers ? '\s' : ' ';
            if(self::MATCH_BLANK_MULTIPLE & $result_modifiers)
            {
                $replacement .= '+';
                $multiplier = '+';
            }
            else
                $multiplier = '';
            $repl_pattern = "~ $multiplier~";
            $needle_start = preg_replace($repl_pattern, $replacement, $needle_start);
            $needle_end = preg_replace($repl_pattern, $replacement, $needle_end);
        }

        $icase = self::MATCH_CASE_INSENSITIVE & $result_modifiers ? 'i' : '';
        $search_pattern = "~{$needle_start}.*?{$needle_end}~su$icase";
        preg_match_all($search_pattern, $this->heystack, $matches, PREG_PATTERN_ORDER | PREG_OFFSET_CAPTURE);

        foreach($matches[0] as &$match)
        {
            $pre = $post = '';

            $pos_start = $this->translate_pos_plain2html($match[1]);
            if(self::RESULT_PREPEND_TAG_CONTENT & $result_modifiers)
                $pos_start = $this->tags[$this->current_tag_idx][1]
                    +( self::RESULT_PREPEND_TAG & $result_modifiers ? 0 : strlen ($this->tags[$this->current_tag_idx][0]) );
            elseif(self::RESULT_PREPEND_TAG & $result_modifiers)
                $pre = $this->tags[$this->current_tag_idx][0];

            $pos_end = $this->translate_pos_plain2html($match[1] + strlen($match[0]));
            if(self::RESULT_APPEND_TAG_CONTENT & $result_modifiers)
            {
                $next_tag = $this->tags[$this->current_tag_idx+1];
                $pos_end = $next_tag[1]
                    +( self::RESULT_APPEND_TAG & $result_modifiers ? strlen ($next_tag[0]) : 0);
            }
            elseif(self::RESULT_APPEND_TAG & $result_modifiers)
                $post = $this->tags[$this->current_tag_idx+1][0];

            $match = $pre . substr($this->html, $pos_start, $pos_end - $pos_start) . $post;
        };
        return $matches[0];
    }
}
Some test case:
$html_source = get($_POST['html'], <<< ___ He said: "Hello ?????!"
random code random code Lorem ipsum. foo bar