Question

I am looking to implement a simple forward indexer in PHP. Yes, I do understand that PHP is hardly the best tool for the task, but I want to do it anyway. The rationale behind it is simple: I want one, and in PHP.

Let us make a few basic assumptions:

The entire Interweb consists of about five thousand HTML and/or plain-text documents. Each document resides within a particular domain (UID). No other proprietary/arcane formats exist in our imaginary cavemanesque Interweb.
The result of our awesome PHP-based forward indexing algorithm should be along the lines of:

UID1 -> index.html -> helen, with, she, was, champion, freckles

UID1 -> hiperbağlarda -> chicken, farmers, go, home, eat, sheep

UID2 -> blah.html -> next, week, on, badgerwatch

UID2 -> gah.txt -> one, one, and, one, is, not, numberwang
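For illustration, the "UID -> document -> words" shape above could be held in a nested PHP array and printed one document per line. This is only a sketch: the function name `formatForwardIndex` and the sample words are placeholders invented for the example, not part of the question.

```php
<?php
// Hypothetical in-memory forward index: UID => document name => ordered word list.
// The words below are placeholders, not real document content.
$forwardIndex = [
    'UID1' => [
        'index.html' => ['alpha', 'beta', 'gamma'],
    ],
    'UID2' => [
        'blah.html' => ['delta', 'epsilon'],
        'gah.txt'   => ['zeta'],
    ],
];

// Render one "UID -> document -> word, word, ..." line per document.
function formatForwardIndex(array $index): array
{
    $lines = [];
    foreach ($index as $uid => $documents) {
        foreach ($documents as $name => $words) {
            $lines[] = $uid . ' -> ' . $name . ' -> ' . implode(', ', $words);
        }
    }
    return $lines;
}

echo implode(PHP_EOL, formatForwardIndex($forwardIndex)), PHP_EOL;
```

Keeping the word list as an ordered array preserves the order in which the words appear in the document, which a forward index needs.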

Ideally, I would love to see solutions that take into account, even at their most elementary, the concepts of tokenization/word boundary disambiguation/part-of-speech-tagging. Of course, I do realise this is wishful thinking, and therefore will humbly accept any worthy attempt at parsing said imaginary documents by:

Extracting the real textual content stuff within the document as a list of words in the order in which they are presented.
All the while, ignoring any garbage such as <script> and <html> tags to compute a list of UIDs (which could be, for instance, a domain) followed by document name (the resource within the domain) and finally the list of words for that document. I do realise that HTML tags play an important role in the semantic placement of text within a document, but at this stage I do not care.
Bear in mind that a solution that can build the list of words WHILE reading the document is cooler than one which needs to read in the whole document first.
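For the plain-text documents, building the word list while reading could look roughly like the generator below, which yields words one line at a time instead of loading the whole file. It is a sketch under assumptions: `streamWords` is a made-up name, the input is assumed to be UTF-8, and a "word" is assumed to be a run of letters, digits and hyphens.

```php
<?php
// Hypothetical helper: lazily yields the words of a plain-text stream, line by line.
function streamWords($handle): Generator
{
    while (($line = fgets($handle)) !== false) {
        // Split on anything that is not a letter, digit or hyphen.
        // preg_split returns false on invalid UTF-8; such lines are skipped.
        $parts = preg_split('/[^\p{L}\p{N}-]+/u', $line, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($parts ?: [] as $word) {
            yield $word;
        }
    }
}

// Demo on an in-memory stream; a real document would use fopen($path, 'rb').
$handle = fopen('php://memory', 'w+b');
fwrite($handle, "one one and one\nis not numberwang\n");
rewind($handle);
foreach (streamWords($handle) as $word) {
    echo $word, PHP_EOL;
}
fclose($handle);
```

Because a newline is itself a separator, no word can be cut in half at a line boundary, so only one line needs to be in memory at a time.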

At this stage, I do not care about the wheres or hows of storage. Even a rudimentary set of 'print' statements will suffice.

Thanks in advance, I hope this was clear enough.

Answer 1

Take a look at

http://simplehtmldom.sourceforge.net/

You can do something like

// Requires simple_html_dom.php from the library linked above.
require_once 'simple_html_dom.php';
$p = file_get_html('http://www.page.com/');
echo $p->find('body', 0)->plaintext;

That will give you all the text. To iterate over just the links:

foreach ($p->find("a") as $link)
{
    echo $link->innertext;
}

It is very useful and powerful. Check it out.

Answer 2

I'm not totally clear on what you're trying to do, but I think you can get a simple result fairly easily:

Run the page through Tidy (a good introduction) to make sure it is valid HTML.
Throw away everything before (and including) <body>.
Step through the document one character at a time.
1. If the character is a '<', do nothing with the following characters until you see a '>' (this skips the HTML).
2. If the character is a "word character" (alphanumeric, hyphen, possibly more), append it to the "current word".
3. If the character is a "non-word character" (punctuation, space, possibly more), add the "current word" to the word list in the forward index and clear the "current word".
Do the above until you hit </body>.

That's really about it. You might add some exceptions for handling things like <script> tags (you don't want JavaScript to be treated as words that should be indexed), but this should give you a basic forward index.
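The character-by-character steps above could be sketched as a small state machine that is fed the document in chunks, so the word list is built while reading. Everything here is an assumption made for illustration: the class name `WordExtractor`, the choice to treat bytes >= 0x80 as word characters (so UTF-8 letters survive), and the lowercasing. The Tidy step is left out, so the input is assumed to be reasonably well-formed, and HTML comments or entities are not handled. It needs PHP 7.4 or later for typed properties.

```php
<?php
// Hypothetical helper, not part of any library: extracts words from HTML fed in chunks.
final class WordExtractor
{
    private string $word = '';      // the "current word" being built
    private string $tag = '';       // text of the tag currently being skipped
    private bool $inTag = false;    // true between '<' and '>'
    private bool $inBody = false;   // true between <body> and </body>
    private string $skipUntil = ''; // "script" or "style" while inside such an element
    private array $words = [];

    public function feed(string $chunk): void
    {
        $length = strlen($chunk);
        for ($i = 0; $i < $length; $i++) {
            $c = $chunk[$i];
            if ($this->inTag) {
                if ($c === '>') {
                    $this->inTag = false;
                    $this->handleTag(strtolower($this->tag));
                } elseif ($c === '<') {
                    $this->tag = ''; // stray '<' (e.g. "a < b" in a script): restart
                } else {
                    $this->tag .= $c;
                }
            } elseif ($c === '<') {
                $this->flushWord();
                $this->inTag = true;
                $this->tag = '';
            } elseif ($this->skipUntil !== '' || !$this->inBody) {
                continue; // outside <body>, or inside <script>/<style>
            } elseif (ctype_alnum($c) || $c === '-' || ord($c) >= 0x80) {
                $this->word .= $c; // word character
            } else {
                $this->flushWord(); // non-word character ends the current word
            }
        }
    }

    public function finish(): array
    {
        $this->flushWord();
        return $this->words;
    }

    private function handleTag(string $tag): void
    {
        // Tag name: optional "/" plus letters/digits, e.g. "body" or "/script".
        if (!preg_match('~^/?[a-z][a-z0-9]*~', $tag, $m)) {
            return; // doctype, comment fragments and similar
        }
        $name = $m[0];
        if ($this->skipUntil !== '') {
            if ($name === '/' . $this->skipUntil) {
                $this->skipUntil = '';
            }
        } elseif ($name === 'script' || $name === 'style') {
            $this->skipUntil = $name;
        } elseif ($name === 'body') {
            $this->inBody = true;
        } elseif ($name === '/body') {
            $this->inBody = false;
        }
    }

    private function flushWord(): void
    {
        $word = trim($this->word, '-'); // drop free-standing hyphens
        if ($word !== '') {
            $this->words[] = strtolower($word);
        }
        $this->word = '';
    }
}

$extractor = new WordExtractor();
// The chunks deliberately split a tag and a word to show that state survives between calls.
$chunks = ['<html><head><title>x</title></head><bo', 'dy><p>Next we',
           'ek on <script>var a = "no";</scr', 'ipt><b>Badgerwatch</b></body></html>'];
foreach ($chunks as $chunk) {
    $extractor->feed($chunk);
}
print_r($extractor->finish());
```

With a real file, the chunks would come from a loop over `fread($handle, 8192)` until `feof($handle)`, and the returned list would be stored under the document's UID and name.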

How would one go about implementing a forward index in PHP?
