Question

I want to build a crawler in PHP that gives me a list of all the pages on a specific domain (starting from the main page: www.example.com).

How can I do this in PHP?

I don't know how to recursively find all the pages on a website, starting from a specific page and excluding external links.

Answer 1

For the general approach, check the answers to related questions on writing a crawler.

In PHP, you should be able to fetch a remote URL simply with file_get_contents(). You could then do a naive parse of the HTML, using a regular expression with preg_match() (or preg_match_all() to collect every match) to find the <a href=""> tags and pull the URLs out of them (see this question for some typical approaches).
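As a minimal sketch of the regex approach described above, the following pulls href values out of an HTML string. The helper name extractHrefs() and the pattern itself are assumptions made for illustration, not something from the linked question, and the pattern will miss unquoted attributes and other edge cases.

```php
<?php
// Naive, regex-based link extraction: a sketch only, not a real HTML parser.
// extractHrefs() is a hypothetical helper name, not a PHP built-in.
function extractHrefs(string $html): array
{
    // Capture the value of href="..." or href='...' inside <a ...> tags.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';
    if (!preg_match_all($pattern, $html, $matches)) {
        return [];
    }
    // Decode entities such as &amp; and drop duplicate links.
    $hrefs = array_map('html_entity_decode', $matches[2]);
    return array_values(array_unique($hrefs));
}
```

The HTML itself would typically come from something like file_get_contents('http://www.example.com/'), with the result passed to extractHrefs() once the fetch has been checked for failure.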

Also keep in mind that the URLs may be relative to the page you fetched. Once you have extracted the raw href attribute, you can use parse_url() to break it into its components and work out whether it is a URL you want to fetch.
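One way this could look in practice is sketched below: resolve the raw href against the page it came from, then compare hosts. The function names resolveUrl(), removeDotSegments() and isSameHost() are hypothetical, and the resolution logic is a simplified take on RFC 3986 rather than a complete implementation.

```php
<?php
// Simplified URL resolution (loosely after RFC 3986) plus a same-host check.
// resolveUrl(), removeDotSegments() and isSameHost() are hypothetical helpers.

function removeDotSegments(string $pathAndQuery): string
{
    [$path, $query] = array_pad(explode('?', $pathAndQuery, 2), 2, null);
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);            // step up one directory
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    $clean = '/' . ltrim(implode('/', $out), '/');
    return $query === null ? $clean : $clean . '?' . $query;
}

function resolveUrl(string $base, string $href): ?string
{
    // Drop the fragment: it points inside a page, not at another page.
    $href = preg_replace('/#.*$/s', '', trim($href));
    $b = parse_url($base);
    if ($href === '' || !is_array($b) || !isset($b['scheme'], $b['host'])) {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href;                               // already absolute
    }
    if (strpos($href, '//') === 0) {
        return $b['scheme'] . ':' . $href;          // protocol-relative
    }
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return null;                                // mailto:, javascript:, ...
    }
    $root = $b['scheme'] . '://' . $b['host']
          . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';
    if ($href[0] === '/') {
        $path = $href;                              // root-relative
    } elseif ($href[0] === '?') {
        $path = $basePath . $href;                  // same page, new query
    } else {
        // Relative to the directory part of the base path.
        $path = preg_replace('#/[^/]*$#', '/', $basePath) . $href;
    }
    return $root . removeDotSegments($path);
}

function isSameHost(string $url, string $host): bool
{
    $h = parse_url($url, PHP_URL_HOST);
    return is_string($h) && strcasecmp($h, $host) === 0;
}
```

A crawler would keep a link only when resolveUrl() returns a string and isSameHost() confirms it is still on the starting host, which is how external links get excluded.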

Although it is fast, a regex is not the best way to parse HTML. You could also try the DOM classes to parse the HTML you fetch, for example:

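// $content holds the HTML fetched earlier, e.g. with file_get_contents()
$dom = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();

$anchors = $dom->getElementsByTagName('a');

// $anchors->length is the number of <a> elements found
if ( $anchors->length > 0 ) {
    foreach ( $anchors as $anchor ) {
        if ( $anchor->hasAttribute('href') ) {
            $url = $anchor->getAttribute('href');

            // now figure out whether to process this
            // URL and add it to a list of URLs to be fetched
        }
    }
}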
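The comment inside the loop above leaves open how the list of URLs to be fetched is managed. A rough sketch of one option is a breadth-first queue with a "seen" set, restricted to the starting host. Everything named here (crawlSite(), absoluteUrl(), the $fetch callable, the $maxPages limit) is an assumption for illustration, and the URL handling is deliberately simplified: "../" segments are not normalized, and URLs that differ only by a trailing slash count as different pages.

```php
<?php
// Breadth-first crawl of a single host: a sketch, not a production crawler.
// crawlSite() and absoluteUrl() are hypothetical names. $fetch is any callable
// that returns a page's HTML as a string, or false on failure.

function absoluteUrl(string $base, string $href): ?string
{
    $href = preg_replace('/#.*$/s', '', trim($href));   // drop #fragment
    if ($href === '') {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href;                                   // already absolute
    }
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href) || strpos($href, '//') === 0) {
        return null;  // mailto:, javascript:, protocol-relative: ignored here
    }
    $b = parse_url($base);
    $root = $b['scheme'] . '://' . $b['host'];
    if ($href[0] === '/') {
        return $root . $href;                           // root-relative
    }
    // Simplification: relative to the base directory, no "../" handling.
    $dir = preg_replace('#/[^/]*$#', '/', $b['path'] ?? '/');
    return $root . $dir . $href;
}

function crawlSite(string $startUrl, callable $fetch, int $maxPages = 100): array
{
    $host = parse_url($startUrl, PHP_URL_HOST);
    if (!is_string($host)) {
        return [];
    }
    $queue = [$startUrl];            // URLs waiting to be fetched
    $seen  = [$startUrl => true];    // URLs already queued, to avoid loops
    $pages = [];                     // URLs fetched successfully

    libxml_use_internal_errors(true);   // tolerate invalid real-world HTML
    while ($queue && count($pages) < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if ($html === false || $html === '') {
            continue;                // unreachable or empty page
        }
        $pages[] = $url;

        $dom = new DOMDocument();
        $dom->loadHTML($html);
        libxml_clear_errors();
        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $next = absoluteUrl($url, $anchor->getAttribute('href'));
            if ($next === null || isset($seen[$next])) {
                continue;
            }
            // Exclude external links by comparing hosts.
            if (strcasecmp((string) parse_url($next, PHP_URL_HOST), $host) !== 0) {
                continue;
            }
            $seen[$next] = true;
            $queue[] = $next;
        }
    }
    return $pages;
}
```

For a live site, $fetch could be as simple as function ($url) { return @file_get_contents($url); }. Passing the fetcher in as a callable is a design choice that keeps the crawl logic easy to try against an in-memory array of pages.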

Finally, rather than writing it yourself, see also this question for other resources you could use.

Is there a good web crawler library available for PHP or Ruby?

Answer 2

Overview

Here are some notes on the basics of the crawler.

It is a console app - It doesn't need a rich interface, so I figured a console application would do. The output is an HTML file and the input (what site to view) comes from the app.config. Making a Windows app out of this seemed like overkill.
The crawler is designed to only crawl the site it originally targets. It would be easy to change that if you want to crawl more than just a single site, but crawling a single site is the goal of this little application.
Originally the crawler was just written to find bad links. Just for fun I also had it collect information on page and viewstate sizes. It will also list all non-HTML files and external URLs, just in case you care to see them.
The results are shown in a rather minimalistic HTML report. This report is automatically opened in Internet Explorer when the crawl is finished.
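The notes above mention separating pages on the target site from non-HTML files and external URLs. Since the question asks about PHP, here is a small sketch of that sorting step in PHP. classifyLink() is a hypothetical helper, it expects URLs that are already absolute, and the list of "page" extensions is an assumption that would need tuning for a real site.

```php
<?php
// Sort absolute URLs into the three buckets the report lists.
// classifyLink() is a hypothetical helper; the extension list is an
// assumption, not something taken from the crawler described above.
function classifyLink(string $url, string $siteHost): string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || strcasecmp($host, $siteHost) !== 0) {
        return 'external';           // different host: list it, do not crawl it
    }
    $path = (string) parse_url($url, PHP_URL_PATH);
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // No extension (e.g. "/docs/") is treated as a page.
    $pageExtensions = ['', 'html', 'htm', 'php', 'aspx'];
    return in_array($ext, $pageExtensions, true) ? 'page' : 'file';
}
```

Only links classified as 'page' would be queued for crawling; the other two kinds would simply be recorded for the report.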

Getting the Text from an HTML Page

The first crucial piece of building a crawler is the mechanism for going out and fetching the HTML off the web (or off your local machine, if you are running the site locally). Like so much else, the .NET framework has classes built in for doing exactly this.

    // Requires: using System.IO; using System.Net;
    private static string GetWebText(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "A .NET Web Crawler";

        // Dispose the response, stream, and reader once the text has been read.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

The HttpWebRequest class can be used to request any page from the internet. The response (retrieved through a call to GetResponse()) holds the data you want. Get the response stream, throw it in a StreamReader, and read the text to get your HTML. For reference: http://www.juicer.headrun.com
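Because the question is about PHP, a rough PHP counterpart of GetWebText() might look like the sketch below. The name fetchPageText(), the default User-Agent string, and the 10-second timeout are assumptions for illustration; fetching remote URLs this way also assumes allow_url_fopen is enabled.

```php
<?php
// A rough PHP counterpart of GetWebText(): fetch a URL as text while sending
// a custom User-Agent. fetchPageText() is a hypothetical name, and the
// 10-second timeout is an arbitrary assumption.
function fetchPageText(string $url, string $userAgent = 'A PHP Web Crawler'): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent,
            'timeout'         => 10,
            'follow_location' => 1,
        ],
    ]);
    // @ silences the warning file_get_contents() raises on failure;
    // the false return value is checked instead.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}
```

A call such as fetchPageText('http://www.example.com/') returns the page body as a string, or null when the request fails.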

Basic web crawling question: How do I create a list of all the pages on a website using PHP?
