Question

I'm trying to find near-duplicate values in a set of fields so that an administrator can clean them up.

There are two criteria that I am matching on:

One string is wholly contained within the other, and is at least 1/4 of its length
The strings have an edit distance of less than 5% of the total length of the two strings

The pseudo-PHP code:

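foreach($values as $value){
  $matches = array();
  foreach($values as $match){
    if(
      (
        // $value is the shorter string: contained in $match and at least 1/4 of its length
        $value['length'] < $match['length']
        &&
        $value['length'] * 4 > $match['length']
        &&
        stripos($match['value'], $value['value']) !== false
      )
      ||
      (
        // $match is the shorter string: contained in $value and at least 1/4 of its length
        $match['length'] < $value['length']
        &&
        $match['length'] * 4 > $value['length']
        &&
        stripos($value['value'], $match['value']) !== false
      )
      ||
      (
        // lengths differ by less than 5% of the total, so the edit distance may qualify
        abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
        &&
        0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
        &&
        $match['changes'] * 20 <= ($value['length'] + $match['length'])
      )
    ){
      // store a copy; a reference to the loop variable would be overwritten on the next iteration
      $matches[] = $match;
    }
  }
  // output matches for current outer loop value
}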

I've tried to reduce calls to the comparatively expensive stripos and levenshtein functions where possible, which has reduced the execution time quite a bit. However, as an O(n^2) operation this just doesn't scale to the larger sets of values and it seems that a significant amount of the processing time is spent simply iterating through the arrays.

Some characteristics of a few of the value sets being processed:

Total   | Strings      | # of matches per string |          |
Strings | With Matches | Average | Median |  Max | Time (s) |
--------+--------------+---------+--------+------+----------+
    844 |          413 |     1.8 |      1 |   58 |    140   |
    593 |          156 |     1.2 |      1 |    5 |     62   | 
    272 |          168 |     3.2 |      2 |   26 |     10   |
    157 |           47 |     1.5 |      1 |    4 |      3.2 |
    106 |           48 |     1.8 |      1 |    8 |      1.3 |
     62 |           47 |     2.9 |      2 |   16 |      0.4 |

What else can I do to reduce the time needed to check the criteria and, more importantly, is there any way to reduce the number of criteria checks required (for example, by pre-processing the input values), given that the selectivity is so low?

Edit: Implemented solution

// $values is ordered from shortest to longest string length
$values_count = count($values); // saves a ton of time, especially on linux
$matches = array();
for($vid = 0; $vid < $values_count; $vid++){
  $value = $values[$vid];
  for($mid = $vid+1; $mid < $values_count; $mid++){ // only check against longer strings
    $match = $values[$mid];
    if(
      (
        $value['length'] * 4 > $match['length']
        &&
        stripos($match['value'], $value['value']) !== false
      )
      ||
      (
        ($match['length'] - $value['length']) * 20 < ($value['length'] + $match['length'])
        &&
        0 < ($changes = levenshtein($value['value'], $match['value']))
        &&
        $changes * 20 <= ($value['length'] + $match['length'])
      )
    ){
      // store match in both directions
      $matches[$vid][$mid] = true;
      $matches[$mid][$vid] = true;
    }
  }
}
// Sort outer array of matches alphabetically with uksort()
foreach($matches as $vid => $mids){
  // sort inner array of matches by usage count with uksort()
  // output matches
}

Answer 1

You could first order the strings by length (O(N log N) with a comparison sort) and then only check shorter strings as substrings of longer strings, and only run levenshtein on string pairs whose length difference is not too large.

You already perform these checks, but at the moment you do so for all N x N pairs. Preselecting by length first reduces the number of pairs that need to be checked at all. Avoid the N x N loop, even if it only contains tests that fail.
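The length preselection can be sketched as follows. This is a minimal illustration, not the asker's code: `findNearDuplicatePairs` is a hypothetical helper that takes a plain list of strings rather than the `['value' => ..., 'length' => ...]` records from the question, and it assumes PHP 7.4 or later. Because the list is sorted by length, the inner loop can stop as soon as the longer string reaches four times the length of the shorter one, since neither criterion can hold beyond that point.

```php
<?php
// Hypothetical helper, assuming a plain list of non-empty strings as input.
function findNearDuplicatePairs(array $strings): array
{
    // Sort shortest first so each string is only compared with longer-or-equal ones.
    usort($strings, fn(string $a, string $b): int => strlen($a) <=> strlen($b));
    $lengths = array_map('strlen', $strings);
    $count = count($strings);
    $pairs = [];

    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            // Lengths only grow with $j. Once the longer string is at least four
            // times the shorter one, the substring rule (1/4 of the length) and the
            // edit-distance rule (5% of the total) both fail for every later $j.
            if ($lengths[$i] * 4 <= $lengths[$j]) {
                break;
            }
            $total = $lengths[$i] + $lengths[$j];

            // Criterion 1: the shorter string is contained in the longer one.
            if (stripos($strings[$j], $strings[$i]) !== false) {
                $pairs[] = [$strings[$i], $strings[$j]];
                continue;
            }

            // Criterion 2: only call levenshtein() when the length difference
            // alone does not already rule the pair out.
            if (($lengths[$j] - $lengths[$i]) * 20 < $total) {
                $changes = levenshtein($strings[$i], $strings[$j]);
                if ($changes > 0 && $changes * 20 <= $total) {
                    $pairs[] = [$strings[$i], $strings[$j]];
                }
            }
        }
    }
    return $pairs;
}
```

The worst case is still quadratic when all strings have similar lengths, but for mixed lengths the `break` skips most of the inner loop without calling `stripos` or `levenshtein` at all.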

For substring matching you could improve further by creating an index of all smaller items and updating it as you process larger items. The index can form a tree structure branching on letters, where each word (string) forms a path from the root to a leaf. This way you can find out whether any of the words in the index occurs in the string you are matching. For each character in your match string, try to advance every active pointer in the tree index, and create a new pointer at the root of the index. If a pointer cannot advance to the next character in the index, you remove it. If any pointer reaches a leaf node, you've found a substring match. Implementing this is, I think, not difficult, but not trivial either.
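A minimal sketch of such an index, under a few assumptions: the class names `TrieNode` and `SubstringIndex` are hypothetical, matching is byte-based and case-insensitive in the same way as `stripos`, words are marked at the node where they end (so a word that is a prefix of another word is still found), and the 1/4-length rule is left to the caller. It assumes PHP 7.4 or later.

```php
<?php
// Hypothetical node type: one node per character on a path from the root.
final class TrieNode
{
    /** @var array<int|string, TrieNode> child nodes keyed by character */
    public array $children = [];
    /** @var string[] original words that end at this node */
    public array $words = [];
}

// Hypothetical index of the shorter strings seen so far.
final class SubstringIndex
{
    private TrieNode $root;

    public function __construct()
    {
        $this->root = new TrieNode();
    }

    // Insert a word as a path of characters starting at the root.
    public function add(string $word): void
    {
        if ($word === '') {
            return; // an empty string would match everything, so skip it
        }
        $node = $this->root;
        foreach (str_split(strtolower($word)) as $char) {
            if (!isset($node->children[$char])) {
                $node->children[$char] = new TrieNode();
            }
            $node = $node->children[$char];
        }
        $node->words[] = $word;
    }

    // Return every indexed word that occurs somewhere inside $text.
    public function findContained(string $text): array
    {
        if ($text === '') {
            return [];
        }
        $found = [];
        $active = []; // pointers into the trie, one per partial match in progress
        foreach (str_split(strtolower($text)) as $char) {
            $active[] = $this->root; // a match may start at this character
            $next = [];
            foreach ($active as $node) {
                $child = $node->children[$char] ?? null;
                if ($child === null) {
                    continue; // pointer cannot advance, so it is dropped
                }
                foreach ($child->words as $word) {
                    $found[$word] = $word; // reached the end of an indexed word
                }
                $next[] = $child;
            }
            $active = $next;
        }
        return array_values($found);
    }
}
```

With the strings sorted by length, each string would first be passed to `findContained()` and then added with `add()`, so it is only ever tested against shorter or equally long strings. The number of active pointers never exceeds the length of the longest indexed word.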

Answer 2

You can get an instant 100% improvement by tightening your inner loop. Aren't you getting duplicate matches in your results?

As a preprocessing step I'd go through the values and calculate character frequencies (assuming your character set is small, like a-z0-9, which I think is likely given that you're using stripos). Then, rather than comparing sequences (expensive), compare frequencies (cheap). This will give you false positives, which you can either live with or feed into the test you already have to weed them out.
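One way this prefilter might look is sketched below. The helper names are hypothetical, and the bound used is an assumption about how to compare the frequencies: an insertion or deletion changes the summed per-character count difference by 1 and a substitution by at most 2, so half of that sum (rounded up) is a lower bound on the edit distance. Counting on lowercased strings keeps the bound valid for the case-sensitive `levenshtein` and matches the case-insensitive `stripos`.

```php
<?php
// Hypothetical helper: per-byte character counts of the lowercased string.
// count_chars() in mode 0 returns an array of 256 counts indexed by byte value.
function charFrequencies(string $s): array
{
    return count_chars(strtolower($s), 0);
}

// Lower bound on the edit distance between the two strings the tables came from.
function editDistanceLowerBound(array $freqA, array $freqB): int
{
    $diff = 0;
    for ($byte = 0; $byte < 256; $byte++) {
        $diff += abs($freqA[$byte] - $freqB[$byte]);
    }
    // Round up: the distance is at least half the summed difference.
    return intdiv($diff + 1, 2);
}

// False means the needle cannot be a substring; true means "maybe" and the
// real stripos() check is still needed (false positives are possible).
function mayBeSubstring(array $needleFreq, array $haystackFreq): bool
{
    for ($byte = 0; $byte < 256; $byte++) {
        if ($needleFreq[$byte] > $haystackFreq[$byte]) {
            return false;
        }
    }
    return true;
}
```

The frequency tables would be computed once per string. A pair then only needs `levenshtein` when `editDistanceLowerBound(...) * 20` does not exceed the combined length, and only needs `stripos` when `mayBeSubstring(...)` returns true.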

Optimizing near-duplicate value search
