Basic web mapper
Sometimes it is useful to have an automated tool to get the full web map of your site. Perhaps not your own web site, since you have already implemented some kind of automatic generation and notification to Google (have not yet?), but a client’s one.
There are a few tools to map an external web site, I tried some in my particular case. They were just adware, or demos, or they obscured the links in the final report… Yeah, of course, sometimes a $30 license is worth it, but you might not want to acquire a new piece of proprietary software every time you need a new feature, might you?
So I decided to write it myself in PHP, not for the money, but for the fun
The architecture
First of all, a bit of planning the application:

Web mapper work flow diagram
Web mapper work flow diagram (Visio).
Let’s take a look at this work flow diagram. First of all, the mapper reads the entry url address in search of links. The blue color represents the entry and exit points of the application.
Then it reads the HTML code, parses it in search for links, and if found, it runs a loop on each one of them. The green color means the actions taken within the loop. Basically, this involves saving the link into the database, and, if the address is not external (that means, it points to a page within the same domain name), then reading it and starting the process again for this new url.
This practice is commonly known as ‘recursion’, and is represented in orange. When a link within a page is followed, the application has just entered a new recursion level, then reads the HTML code and seeks for links, starting again a new loop. Think of it like a set of Russian dolls: every link of a page contains a page with more links, so there we start a new entire loop from an iteration of the previous one, and so on.
The recursion will end when a page has no links, or all its links are external. Then the application has found a leaf page, nothing to do deeper than that, and can return one level up. When all the leaf pages are read, this algorithm will return to the main entry level, when, after going through all links found, will finish.
This, depending on the web site structure, can take a few minutes to complete all the mapping, so that is better to use a fast and reliable programming language. In my case, I have used PHP for simplicity’s sake, since is the language I am using daily, but on a test server where I can increase execution time. For more intensive uses of this script, I would recommend to have it translated into another language, or to run it from console with the PHP interpreter, not from the web server itself.
In addition, this architecture is really basic, that means that a lot of things can be improved. For instance, a check has been added to avoid following links twice. This, however, is not so easy to implement, since inner links can follow different conventions (“www.domain.com”, “domain.com”, “sub.domain.com”, “/”, “./”). Thus, the code presented below should be used for academic and learning purposes, I can not guarantee that it is going to work perfectly at all.
The code
Let’s present the code now. It has been written for the Code Igniter framework, but can be easily ported to no matter what architecture or language. A few conventions:
- It is based on the class ‘spider’, whose main method ‘crawl()’ launches the initial entry point, and the inner and protected method ‘_crawl()’ implements the recursion.
- All inner and protected methods are preceded by an underscore (_).
- A few configuration attributes allow a sandbox definition, limiting the recursion levels and the domains to be crawled.
- As said before, it is based on Code Igniter, but in this case that is only to provide an easy interface with the database. You can just replace all instances of $this->ci->db with your own MySQL connection class.
Using the class is very simple:
1 2 3 4 5 6 7 8 9 10 | $spider = new spider(); $spider->crawl( array( 'entry' => 'http://jorgealbaladejo.com', 'domains' => array( 'jorgealbaladejo.com' , 'www.jorgealbaladejo.com', 'jorgealbaladejo.ch', 'jorgealbaladejo.es' ) ) ); $spider->printResults(); |
And finally, the core class code. It is commented so it may be easy to understand, eventually. However, comments, improvements and doubts are welcome, so if you find I am doing something not clearly enough, or you would have just done it in another way, please share your comments!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 | /** * Class spider * Creates a spider which crawls the internet * */ class spider { /** * @var Code Igniter object for database access * */ var $ci; /** * @var maximum recursivity depth * */ var $max_depth = 100; /** * @var default domains to restrict crawling * */ var $domains = array('jorgealbaladejo.com','vwww.jorgealbaladejo.com'); /** * @var indentation level * */ var $indent = 20; /** * @var internal links array * */ var $links = array(); /** * Constructor * */ public function spider() { $this->ci =& get_instance(); // load links in database to internal list $this->_loadLinks(); } /** * Entry point for recursive function crawl * * @param object $params * @param array $params['domains'] * @param string $params['entry'] * * @return boolean * */ public function crawl($params) { if (isset($params['domains'])) { $this->domains = $params['domains']; } if (isset($params['entry'])) { $this->_crawl($params['entry']); return true; } return false; } /** * Recursive crawling function * * @param string $page * @param int $parent [optional] * @param int $depth of current page [optional] * * @return void * */ protected function _crawl($page, $parent = 0, $depth = 0) { // vars $doc = ''; $title = ''; $desc = ''; $out = NULL; $links = array(); $link = ''; // // correct url $page = $this->_buildUrl($page,$parent); // avoid reading the same url twice if (array_key_exists($page,$this->links)) { return false; } // only read info for inner pages if (in_array($this->_getDomain($page),$this->domains)) { // read page content $doc = $this->_getHTTPRequest($page); // get page meta data list($title, $desc) = $this->_analyzePage($doc,$page); } // log into inner array $this->links[$page] = array( 'url' => $page , 'title' => $title , 'description' => $desc ); // write into database $this->ci->db->query('INSERT INTO web.map VALUES("",' . '"' . $page . '",' . '"' . $title . '",' . '"' . $desc . '",' . '"' . $parent . '") '); $parent = $this->ci->db->insert_id(); $this->_printLine($title,$desc,$page,$depth); // avoid recursion for external domains if (!in_array($this->_getDomain($page),$this->domains)) { return; } // now get links and launch recursively $links = $this->_getInnerLinks($doc); foreach($links as $link) { if ($depth < $this->max_depth) { $this->_crawl($link,$parent,$depth+1); } } return; } /** * Prints the current links array * * */ public function printResults() { ksort($this->links); foreach ($this->links AS $link) { $this->_printLine($link['title'],$link['description'],$link['url'],0); } } /** * Preloads the existent links in database * */ private function _loadLinks() { $result = $this->ci->db->query('SELECT url,title,description FROM web.map ORDER BY url ASC'); if ($result->num_rows()) { $results = $result->result(); foreach($results as $r) { $this->links[$r->url] = array( 'url' => $r->url , 'title' => $r->title , 'description' => $r->description ); } } } /** * Reads a document for html links * * @param string $doc [optional] * * @return array of links * */ private function _getInnerLinks($doc = '') { $return = array(); $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>"; $match = null; if(preg_match_all("/$regexp/siU", $doc, $matches, PREG_SET_ORDER)) { foreach($matches as $match) { $return[] = $match[2]; } } return $return; } /** * Prints a line for the current url * * @param int * */ private function _printLine($title = '', $desc = '', $page = '', $depth = 0) { // do not indent at this time $depth = 0; // echo '<p>' . '<h3 style="margin-left:' . ($depth*$this->indent) . 'px">' . stripslashes($title) . '</h3>' . '<span style="margin-left:' . ($depth*$this->indent) . 'px">' . substr(stripslashes($desc),0,100) . '</span>' . '<h4 style="margin-left:' . ($depth*$this->indent) . 'px">' . $page . '</h4>' . '</p>'; } /** * Analyzes a page to get meta information * * @param string $doc(cument) * @param string $page * * @return list($title,$description); * */ private function _analyzePage($doc, $page) { // $title = ''; $desc = ''; // if (eregi ("<title>(.*)</title>", $doc, $out)) { $title = addslashes($out[1]); if (strlen($out[1])) { $title = substr($title,0,50); } } $out = @get_meta_tags($page); if (isset($out['description'])) { $desc = addslashes($out['description']); } return array($title,$desc); } /** * Completes a page link with the domain if needed * * @param string $page * @param int $parent * * @return string corrected url * */ private function _buildUrl($page,$parent) { // prepare url if relative if (!eregi('http://',$page)) { // root relative path if (strpos($page,'/') == 0) { $page = 'http://' . $this->domains[0] . $page; } // page relative path else { if (strpos($page,'mailto:')<0 && strpos($page,'javascript:')<0) { $page = $this->_getUrlByID($parent) . '/' . $page; } } } // trim final slash $page = trim($page,'/'); return $page; } /** * Gets url for a given ID * * @param int url id * * @return string the url * */ private function _getUrlByID($id) { $results = $this->ci->db->query('SELECT url FROM web.map WHERE ID = "' . $id . '"'); if ($results->num_rows()) { $result = $results->result(); return $result[0]->url; } return ''; } /** * Gets domain name from a URL * * @param string url * * @return string domain name * */ private function _getDomain($url = '') { // get host name from URL preg_match('@^(?:http://)?([^/]+)@i', $url, $matches); if (isset($matches[1])) { return $matches[1]; } else { return NULL; } } /** * HTTP helper function.<br /> * Loads an http request and returns result. * * @param string $url to request * * @return string the result * */ private function _getHTTPRequest($url = '') { // vars $html = ''; $old = ''; $file = ''; // //configuration $timeout = 15; // // execution try { $file = @fopen($url, 'rb'); if ($file) { $html = stream_get_contents($file); fclose($file); } } catch(Exception $e) { $html = 'error'; } // return ($html); } } |



