search_simplify

Versions
5 – 8 search_simplify($text)

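Simplifies a string according to the search indexing rules: entities are decoded, the text is lowercased and passed to preprocessors, numeric groups are joined, and punctuation is either removed or turned into word boundaries.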

Parameters

$text Text to simplify.

Return value

Simplified text.

See also

hook_search_preprocess()

7 functions call search_simplify()

SearchQuery::parseSearchExpression in modules/search/search.extender.inc
Parses the search query into SQL conditions.
SearchSimplifyTestCase::testSearchSimplifyPunctuation in modules/search/search.test
Tests that search_simplify() does the right thing with punctuation.
SearchSimplifyTestCase::testSearchSimplifyUnicode in modules/search/search.test
Tests that all Unicode characters simplify correctly.
SearchTokenizerTestCase::testNoTokenizer in modules/search/search.test
Verifies that strings of non-CJK characters are not tokenized.
SearchTokenizerTestCase::testTokenizer in modules/search/search.test
Verifies that strings of CJK characters are tokenized.
search_index_split in modules/search/search.module
Simplifies and splits a string into tokens for indexing.
search_simplify_excerpt_match in modules/search/search.module
Finds words in the original text that matched via search_simplify().

Code

modules/search/search.module, line 411

<?php
function search_simplify($text) {
  // Decode entities to UTF-8
  $text = decode_entities($text);

  // Lowercase
  $text = drupal_strtolower($text);

  // Call an external processor for word handling.
  search_invoke_preprocess($text);

  // Simple CJK handling
  if (variable_get('overlap_cjk', TRUE)) {
    $text = preg_replace_callback('/[' . PREG_CLASS_CJK . ']+/u', 'search_expand_cjk', $text);
  }

  // To improve searching for numerical data such as dates, IP addresses
  // or version numbers, we consider a group of numerical characters
  // separated only by punctuation characters to be one piece.
  // This also means that searching for e.g. '20/03/1984' also returns
  // results with '20-03-1984' in them.
  // Readable regexp: ([number]+)[punctuation]+(?=[number])
  $text = preg_replace('/([' . PREG_CLASS_NUMBERS . ']+)[' . PREG_CLASS_PUNCTUATION . ']+(?=[' . PREG_CLASS_NUMBERS . '])/u', '\1', $text);

  // Multiple dot and dash groups are word boundaries and replaced with space.
  // No need to use the unicode modifier here because 0-127 ASCII characters
  // can't match higher UTF-8 characters, as the leftmost bit of those is 1.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);

  // The dot, underscore and dash are simply removed. This allows meaningful
  // search behavior with acronyms and URLs. See unicode note directly above.
  $text = preg_replace('/[._-]+/', '', $text);

  // With the exception of the rules above, we consider all punctuation,
  // marks, spacers, etc, to be a word boundary.
  $text = preg_replace('/[' . PREG_CLASS_UNICODE_WORD_BOUNDARY . ']+/u', ' ', $text);

  // Truncate each word to 50 characters.
  $words = explode(' ', $text);
  array_walk($words, '_search_index_truncate');
  $text = implode(' ', $words);

  return $text;
}
?>
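
The order of the rules above can be tried outside Drupal with a rough, ASCII-only sketch. The function name example_simplify_ascii() is hypothetical, and the character classes are assumptions standing in for the PREG_CLASS_* constants: [0-9] for numbers, [[:punct:]] for punctuation (broader than Drupal's class), and [^a-z0-9] for word boundaries. Entity decoding, hook_search_preprocess() and CJK handling are left out, so this is an approximation and not the function's real behavior on Unicode text.

```php
<?php
// Hypothetical ASCII-only approximation of the steps in search_simplify().
// Assumptions: no entity decoding, no preprocess hooks, no CJK expansion,
// and plain ASCII classes instead of Drupal's PREG_CLASS_* constants.
function example_simplify_ascii($text) {
  // Lowercase (ASCII only; Drupal uses drupal_strtolower()).
  $text = strtolower($text);

  // Join digit groups separated only by punctuation,
  // so '20/03/1984' and '20-03-1984' index identically.
  $text = preg_replace('/([0-9]+)[[:punct:]]+(?=[0-9])/', '\1', $text);

  // Runs of two or more dots or dashes become a word boundary.
  $text = preg_replace('/[.-]{2,}/', ' ', $text);

  // Remaining single dots, underscores and dashes are removed.
  $text = preg_replace('/[._-]+/', '', $text);

  // Every other run of non-alphanumeric characters becomes one space.
  $text = preg_replace('/[^a-z0-9]+/', ' ', $text);

  // Cap each word at 50 bytes (stand-in for the per-word truncation step).
  $words = array_map(function ($word) {
    return substr($word, 0, 50);
  }, explode(' ', $text));

  return implode(' ', $words);
}
```

Under these assumptions, '20/03/1984' and '20-03-1984' both reduce to '20031984', 'U.S.A.' reduces to 'usa', and 'wait...what' becomes 'wait what'.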