<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Jorge Albaladejo &#187; Web Crawlers</title>
	<atom:link href="http://jorgealbaladejo.com/category/web-architecture/web-crawlers/feed/" rel="self" type="application/rss+xml" />
	<link>http://jorgealbaladejo.com</link>
	<description>Hard &#38; Soft design...</description>
	<lastBuildDate>Sat, 10 Dec 2011 16:23:02 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Basic web mapper</title>
		<link>http://jorgealbaladejo.com/2010/08/21/basic-web-mapper/</link>
		<comments>http://jorgealbaladejo.com/2010/08/21/basic-web-mapper/#comments</comments>
		<pubDate>Sat, 21 Aug 2010 08:00:15 +0000</pubDate>
		<dc:creator>Jorge Albaladejo</dc:creator>
				<category><![CDATA[Php]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Crawlers]]></category>
		<category><![CDATA[web crawler]]></category>
		<category><![CDATA[web robot]]></category>
		<category><![CDATA[web spider]]></category>

		<guid isPermaLink="false">http://jorgealbaladejo.com/?p=554</guid>
		<description><![CDATA[Sometimes it is useful to have an automated tool to get the full web map of your site. Perhaps not your own web site, since you have already implemented some kind of automatic generation and notification to Google (have not yet?), but a client&#8217;s one. There are a few tools to map an external web [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">Sometimes it is useful to have an automated tool to get the full web map of your site. Perhaps not your own web site, since you have already implemented some kind of automatic generation and notification to Google (have not yet?), but a client&#8217;s one.</p>
<p style="text-align: justify;">There are a few tools to map an external web site, I tried some in my particular case. They were just adware, or demos, or they obscured the links in the final report&#8230; Yeah, of course, sometimes a $30 license is worth it, but you might not want to acquire a new piece of proprietary software every time you need a new feature, might you?</p>
<p style="text-align: justify;">So I decided to write it myself in PHP, not for the money, but for the fun <img src='http://jorgealbaladejo.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<h3><span id="more-554"></span></h3>
<h3 style="margin-top: 30px">The architecture</h3>
<p>First of all, a bit of planning the application:</p>
<div id="attachment_557" class="wp-caption aligncenter" style="width: 465px"><img src="http://jorgealbaladejo.com/wp-content/uploads/2010/08/web_mapper.png" alt="Web mapper work flow" class="size-full wp-image-557" style="border: 5px solid white;" title="Web mapper work flow" width="455" height="942" /><p class="wp-caption-text">Web mapper work flow diagram</p></div>
<p style="text-align: center;"><a href="http://jorgealbaladejo.com/wp-content/uploads/2010/08/web_mapper.zip">Web mapper work flow diagram (Visio)</a>.</p>
<p style="text-align: justify;">&nbsp;</p>
<p style="text-align: justify;">Let&#8217;s take a look at this work flow diagram. First of all, the mapper reads the entry url address in search of links. The blue color represents the entry and exit points of the application.</p>
<p style="text-align: justify;">Then it reads the HTML code, parses it in search for links, and if found, it runs a loop on each one of them. The green color means the actions taken within the loop. Basically, this involves saving the link into the database, and, if the address is not external (that means, it points to a page within the same domain name), then reading it and starting the process again for this new url.</p>
<p style="text-align: justify;">This practice is commonly known as &#8216;recursion&#8217;, and is represented in orange. When a link within a page is followed, the application has just entered a new recursion level, then reads the  HTML code and seeks for links, starting again a new loop. Think of it like a set of <a href="http://en.wikipedia.org/wiki/Russian_dolls" title="Russian dolls" target="_blank" rel="external">Russian dolls</a>: every link of a page contains a page with more links, so there we start a new entire loop from an iteration of the previous one, and so on.</p>
<p style="text-align: justify;">The recursion will end when a page has no links, or all its links are external. Then the application has found a leaf page, nothing to do deeper than that, and can return one level up. When all the leaf pages are read, this algorithm will return to the main entry level, when, after going through all links found, will finish.</p>
<p style="text-align: justify;">This, depending on the web site structure, can take a few minutes to complete all the mapping, so that is better to use a fast and reliable programming language. In my case, I have used PHP for simplicity&#8217;s sake, since is the language I am using daily, but on a test server where I can increase execution time. For more intensive uses of this script, I would recommend to have it translated into another language, or to run it from console with the PHP interpreter, not from the web server itself.</p>
<p style="text-align: justify;">In addition, this architecture is really basic, that means that a lot of things can be improved. For instance, a check has been added to avoid following links twice. This, however, is not so easy to implement, since inner links can follow different conventions (&#8220;www.domain.com&#8221;, &#8220;domain.com&#8221;, &#8220;sub.domain.com&#8221;, &#8220;/&#8221;, &#8220;./&#8221;). Thus, the code presented below should be used for academic and learning purposes, I can not guarantee that it is going to work perfectly at all.</p>
<h3 style="margin-top: 30px">The code</h3>
<p style="text-align: justify;">Let&#8217;s present the code now. It has been written for the Code Igniter framework, but can be easily ported to no matter what architecture or language. A few conventions:</p>
<ul>
<li>It is based on the class &#8216;spider&#8217;, whose main method &#8216;crawl()&#8217; launches the initial entry point, and the inner and protected method &#8216;_crawl()&#8217; implements the recursion.</li>
<li>All inner and protected methods are preceded by an underscore (_).</li>
<li>A few configuration attributes allow a sandbox definition, limiting the recursion levels and the domains to be crawled.</li>
<li>As said before, it is based on Code Igniter, but in this case that is only to provide an easy interface with the database. You can just replace all instances of $this->ci->db with your own <a href="http://jorgealbaladejo.com/2007/05/24/clase-dbhandler-para-manejar-datos-de-una-base-de-datos/" title="Reading data from a MySQL database">MySQL connection class</a>.</li>
</ul>
<p style="text-align: justify;">Using the class is very simple:</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?" rel="external"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p554code3'); return false;">View Code</a> PHP</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p5543"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code" id="p554code3"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$spider</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> spider<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$spider</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">crawl</span><span style="color: #009900;">&#40;</span> <a href="http://www.php.net/array" rel="external"><span style="color: #990000;">array</span></a><span style="color: #009900;">&#40;</span>  <span style="color: #0000ff;">'entry'</span>		<span style="color: #339933;">=&gt;</span> <span style="color: #0000ff;">'http://jorgealbaladejo.com'</span><span style="color: #339933;">,</span>
			<span style="color: #0000ff;">'domains'</span>	<span style="color: #339933;">=&gt;</span> <a href="http://www.php.net/array" rel="external"><span style="color: #990000;">array</span></a><span style="color: #009900;">&#40;</span>	<span style="color: #0000ff;">'jorgealbaladejo.com'</span> <span style="color: #339933;">,</span>
							<span style="color: #0000ff;">'www.jorgealbaladejo.com'</span><span style="color: #339933;">,</span>
							<span style="color: #0000ff;">'jorgealbaladejo.ch'</span><span style="color: #339933;">,</span>
							<span style="color: #0000ff;">'jorgealbaladejo.es'</span>	<span style="color: #009900;">&#41;</span> 
			<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000088;">$spider</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">printResults</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></td></tr></table></div>

<p>&nbsp;</p>
<p>And finally, the core class code. It is commented so it may be easy to understand, eventually. However, comments, improvements and doubts are welcome, so if you find I am doing something not clearly enough, or you would have just done it in another way, please share your comments! <img src='http://jorgealbaladejo.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>&nbsp;</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?" rel="external"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p554code4'); return false;">View Code</a> PHP</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p5544"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
</pre></td><td class="code" id="p554code4"><pre class="php" style="font-family:monospace;"><span style="color: #009933; font-style: italic;">/**
 * Class spider
 * Creates a spider which crawls the internet
 * 
 */</span>
<span style="color: #000000; font-weight: bold;">class</span> spider
<span style="color: #009900;">&#123;</span>
	<span style="color: #009933; font-style: italic;">/**
	 * @var Code Igniter object for database access
	 * 
	 */</span>
	<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #000088;">$ci</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * @var maximum recursivity depth
	 * 
	 */</span>
	<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #000088;">$max_depth</span> 	<span style="color: #339933;">=</span> <span style="color: #cc66cc;">100</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * @var default domains to restrict crawling
	 * 
	 */</span>
	<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #000088;">$domains</span> 	<span style="color: #339933;">=</span> <a href="http://www.php.net/array" rel="external"><span style="color: #990000;">array</span></a><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'jorgealbaladejo.com'</span><span style="color: #339933;">,</span><span style="color: #0000ff;">'vwww.jorgealbaladejo.com'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * @var indentation level
	 * 
	 */</span>
	<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #000088;">$indent</span> 	<span style="color: #339933;">=</span> <span style="color: #cc66cc;">20</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * @var internal links array
	 * 
	 */</span>
	<span style="color: #000000; font-weight: bold;">var</span> <span style="color: #000088;">$links</span> 	<span style="color: #339933;">=</span> <a href="http://www.php.net/array" rel="external"><span style="color: #990000;">array</span></a><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * Constructor
	 * 
	 */</span>
	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">function</span> spider<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">ci</span> <span style="color: #339933;">=&amp;</span> get_instance<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>	
&nbsp;
		<span style="color: #666666; font-style: italic;">// load links in database to internal list</span>
		<span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_loadLinks<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>	
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * Entry point for recursive function crawl
	 * 
	 * @param object 	$params
	 * @param array 	$params['domains']
	 * @param string	$params['entry']
	 * 
	 * @return boolean
	 *  
	 */</span>
	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">function</span> crawl<span style="color: #009900;">&#40;</span><span style="color: #000088;">$params</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><a href="http://www.php.net/isset" rel="external"><span style="color: #990000;">isset</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$params</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'domains'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">domains</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$params</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'domains'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>	
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><a href="http://www.php.net/isset" rel="external"><span style="color: #990000;">isset</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$params</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'entry'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_crawl<span style="color: #009900;">&#40;</span><span style="color: #000088;">$params</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'entry'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>	
&nbsp;
			<span style="color: #b1b100;">return</span> <span style="color: #009900; font-weight: bold;">true</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #b1b100;">return</span> <span style="color: #009900; font-weight: bold;">false</span><span style="color: #339933;">;</span>		
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/** 
	 * Recursive crawling function
	 * 
	 * @param string	$page
	 * @param int 		$parent [optional]
	 * @param int 		$depth of current page [optional]
	 * 
	 * @return void
	 * 
	 */</span>
	<span style="color: #000000; font-weight: bold;">protected</span> <span style="color: #000000; font-weight: bold;">function</span> _crawl<span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #339933;">,</span> <span style="color: #000088;">$parent</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span> <span style="color: #000088;">$depth</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #666666; font-style: italic;">// vars</span>
		<span style="color: #000088;">$doc</span> 	<span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
		<span style="color: #000088;">$title</span> 	<span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
		<span style="color: #000088;">$desc</span>  	<span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
		<span style="color: #000088;">$out</span> 	<span style="color: #339933;">=</span> <span style="color: #009900; font-weight: bold;">NULL</span><span style="color: #339933;">;</span>
		<span style="color: #000088;">$links</span> 	<span style="color: #339933;">=</span> <a href="http://www.php.net/array" rel="external"><span style="color: #990000;">array</span></a><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000088;">$link</span> 	<span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
		<span style="color: #666666; font-style: italic;">//</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// correct url</span>
		<span style="color: #000088;">$page</span> 	<span style="color: #339933;">=</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_buildUrl<span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #339933;">,</span><span style="color: #000088;">$parent</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// avoid reading the same url twice</span>
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><a href="http://www.php.net/array_key_exists" rel="external"><span style="color: #990000;">array_key_exists</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #339933;">,</span><span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">links</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #b1b100;">return</span> <span style="color: #009900; font-weight: bold;">false</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// only read info for inner pages</span>
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><a href="http://www.php.net/in_array" rel="external"><span style="color: #990000;">in_array</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_getDomain<span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span><span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">domains</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>			
			<span style="color: #666666; font-style: italic;">// read page content</span>
			<span style="color: #000088;">$doc</span> 	<span style="color: #339933;">=</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_getHTTPRequest<span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
			<span style="color: #666666; font-style: italic;">// get page meta data</span>
			<a href="http://www.php.net/list" rel="external"><span style="color: #990000;">list</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$title</span><span style="color: #339933;">,</span> <span style="color: #000088;">$desc</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_analyzePage<span style="color: #009900;">&#40;</span><span style="color: #000088;">$doc</span><span style="color: #339933;">,</span><span style="color: #000088;">$page</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// log into inner array</span>
		<span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">links</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$page</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <a href="http://www.php.net/array" rel="external"><span style="color: #990000;">array</span></a><span style="color: #009900;">&#40;</span>	<span style="color: #0000ff;">'url'</span> 		<span style="color: #339933;">=&gt;</span> <span style="color: #000088;">$page</span> <span style="color: #339933;">,</span>
						<span style="color: #0000ff;">'title'</span>		<span style="color: #339933;">=&gt;</span> <span style="color: #000088;">$title</span> <span style="color: #339933;">,</span>
						<span style="color: #0000ff;">'description'</span> 	<span style="color: #339933;">=&gt;</span> <span style="color: #000088;">$desc</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// write into database		</span>
		<span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">ci</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">db</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">query</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'INSERT INTO web.map VALUES(&quot;&quot;,'</span> <span style="color: #339933;">.</span>
					<span style="color: #0000ff;">'&quot;'</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$page</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'&quot;,'</span> <span style="color: #339933;">.</span>
					<span style="color: #0000ff;">'&quot;'</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$title</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'&quot;,'</span> <span style="color: #339933;">.</span>
					<span style="color: #0000ff;">'&quot;'</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$desc</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'&quot;,'</span> <span style="color: #339933;">.</span>
					<span style="color: #0000ff;">'&quot;'</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$parent</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'&quot;) '</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #000088;">$parent</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">ci</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">db</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">insert_id</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>	
&nbsp;
		<span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_printLine<span style="color: #009900;">&#40;</span><span style="color: #000088;">$title</span><span style="color: #339933;">,</span><span style="color: #000088;">$desc</span><span style="color: #339933;">,</span><span style="color: #000088;">$page</span><span style="color: #339933;">,</span><span style="color: #000088;">$depth</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>	
&nbsp;
&nbsp;
		<span style="color: #666666; font-style: italic;">// avoid recursion for external domains</span>
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span><a href="http://www.php.net/in_array" rel="external"><span style="color: #990000;">in_array</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_getDomain<span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span><span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">domains</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #b1b100;">return</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// now get links and launch recursively</span>
		<span style="color: #000088;">$links</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_getInnerLinks<span style="color: #009900;">&#40;</span><span style="color: #000088;">$doc</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #b1b100;">foreach</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$links</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$link</span><span style="color: #009900;">&#41;</span> 
		<span style="color: #009900;">&#123;</span> 
			<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$depth</span> <span style="color: #339933;">&lt;</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">max_depth</span><span style="color: #009900;">&#41;</span>
			<span style="color: #009900;">&#123;</span>
				<span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_crawl<span style="color: #009900;">&#40;</span><span style="color: #000088;">$link</span><span style="color: #339933;">,</span><span style="color: #000088;">$parent</span><span style="color: #339933;">,</span><span style="color: #000088;">$depth</span><span style="color: #339933;">+</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>	
			<span style="color: #009900;">&#125;</span>				
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #b1b100;">return</span><span style="color: #339933;">;</span>		
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * Prints the current links array
	 * 
	 *  
	 */</span>
	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">function</span> printResults<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<a href="http://www.php.net/ksort" rel="external"><span style="color: #990000;">ksort</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">links</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">links</span> <span style="color: #b1b100;">AS</span> <span style="color: #000088;">$link</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_printLine<span style="color: #009900;">&#40;</span><span style="color: #000088;">$link</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'title'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span><span style="color: #000088;">$link</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'description'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span><span style="color: #000088;">$link</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'url'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * Preloads the existent links in database 
	 * 
	 */</span>
	<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">function</span> _loadLinks<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #000088;">$result</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">ci</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">db</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">query</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'SELECT url,title,description FROM web.map ORDER BY url ASC'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$result</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">num_rows</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #000088;">$results</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$result</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">result</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
			<span style="color: #b1b100;">foreach</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$results</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$r</span><span style="color: #009900;">&#41;</span>
			<span style="color: #009900;">&#123;</span>
				<span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">links</span><span style="color: #009900;">&#91;</span><span style="color: #000088;">$r</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">url</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <a href="http://www.php.net/array" rel="external"><span style="color: #990000;">array</span></a><span style="color: #009900;">&#40;</span>	<span style="color: #0000ff;">'url'</span> 		<span style="color: #339933;">=&gt;</span> <span style="color: #000088;">$r</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">url</span> <span style="color: #339933;">,</span>
								<span style="color: #0000ff;">'title'</span> 	<span style="color: #339933;">=&gt;</span> <span style="color: #000088;">$r</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">title</span> <span style="color: #339933;">,</span>
								<span style="color: #0000ff;">'description'</span> 	<span style="color: #339933;">=&gt;</span> <span style="color: #000088;">$r</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">description</span> <span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>		
			<span style="color: #009900;">&#125;</span>
		<span style="color: #009900;">&#125;</span>		 
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * Reads a document for html links
	 * 
	 * @param string $doc [optional]
	 * 
	 * @return array of links
	 *  
	 */</span>
	<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">function</span> _getInnerLinks<span style="color: #009900;">&#40;</span><span style="color: #000088;">$doc</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #000088;">$return</span> <span style="color: #339933;">=</span> <a href="http://www.php.net/array" rel="external"><span style="color: #990000;">array</span></a><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000088;">$regexp</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;&lt;a\s[^&gt;]*href=(<span style="color: #000099; font-weight: bold;">\&quot;</span>??)([^<span style="color: #000099; font-weight: bold;">\&quot;</span> &gt;]*?)<span style="color: #000099; font-weight: bold;">\\</span>1[^&gt;]*&gt;(.*)&lt;\/a&gt;&quot;</span><span style="color: #339933;">;</span>
		<span style="color: #000088;">$match</span>  <span style="color: #339933;">=</span> <span style="color: #009900; font-weight: bold;">null</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #b1b100;">if</span><span style="color: #009900;">&#40;</span><a href="http://www.php.net/preg_match_all" rel="external"><span style="color: #990000;">preg_match_all</span></a><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;/<span style="color: #006699; font-weight: bold;">$regexp</span>/siU&quot;</span><span style="color: #339933;">,</span> <span style="color: #000088;">$doc</span><span style="color: #339933;">,</span> <span style="color: #000088;">$matches</span><span style="color: #339933;">,</span> PREG_SET_ORDER<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> 
		<span style="color: #009900;">&#123;</span> 	
			<span style="color: #b1b100;">foreach</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$matches</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$match</span><span style="color: #009900;">&#41;</span> 
			<span style="color: #009900;">&#123;</span> 
				<span style="color: #000088;">$return</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$match</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span> 
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #b1b100;">return</span> <span style="color: #000088;">$return</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * Prints a line for the current url
	 * 
	 * @param int
	 * 
	 */</span>
	<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">function</span> _printLine<span style="color: #009900;">&#40;</span><span style="color: #000088;">$title</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">,</span> <span style="color: #000088;">$desc</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">,</span> <span style="color: #000088;">$page</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">,</span> <span style="color: #000088;">$depth</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #666666; font-style: italic;">// do not indent at this time</span>
		<span style="color: #000088;">$depth</span> <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span>
		<span style="color: #666666; font-style: italic;">//</span>
&nbsp;
		<span style="color: #b1b100;">echo</span> <span style="color: #0000ff;">'&lt;p&gt;'</span> <span style="color: #339933;">.</span>
				<span style="color: #0000ff;">'&lt;h3 style=&quot;margin-left:'</span> <span style="color: #339933;">.</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$depth</span><span style="color: #339933;">*</span><span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">indent</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'px&quot;&gt;'</span> <span style="color: #339933;">.</span> 
					<a href="http://www.php.net/stripslashes" rel="external"><span style="color: #990000;">stripslashes</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$title</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">.</span> 
				<span style="color: #0000ff;">'&lt;/h3&gt;'</span> <span style="color: #339933;">.</span> 
				<span style="color: #0000ff;">'&lt;span style=&quot;margin-left:'</span> <span style="color: #339933;">.</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$depth</span><span style="color: #339933;">*</span><span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">indent</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'px&quot;&gt;'</span> <span style="color: #339933;">.</span>
					<a href="http://www.php.net/substr" rel="external"><span style="color: #990000;">substr</span></a><span style="color: #009900;">&#40;</span><a href="http://www.php.net/stripslashes" rel="external"><span style="color: #990000;">stripslashes</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$desc</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">100</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">.</span>
				<span style="color: #0000ff;">'&lt;/span&gt;'</span> <span style="color: #339933;">.</span> 
				<span style="color: #0000ff;">'&lt;h4 style=&quot;margin-left:'</span> <span style="color: #339933;">.</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$depth</span><span style="color: #339933;">*</span><span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">indent</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'px&quot;&gt;'</span> <span style="color: #339933;">.</span> 
					<span style="color: #000088;">$page</span> <span style="color: #339933;">.</span> 
				<span style="color: #0000ff;">'&lt;/h4&gt;'</span> <span style="color: #339933;">.</span>
			<span style="color: #0000ff;">'&lt;/p&gt;'</span><span style="color: #339933;">;</span>	
	<span style="color: #009900;">&#125;</span>	
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * Analyzes a page to get meta information
	 * 
	 * @param string $doc(cument)
	 * @param string $page
	 * 
	 * @return list($title,$description);
	 *  
	 */</span>
	<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">function</span> _analyzePage<span style="color: #009900;">&#40;</span><span style="color: #000088;">$doc</span><span style="color: #339933;">,</span> <span style="color: #000088;">$page</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #666666; font-style: italic;">//</span>
		<span style="color: #000088;">$title</span> 	<span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
		<span style="color: #000088;">$desc</span> 	<span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
		<span style="color: #666666; font-style: italic;">//</span>
&nbsp;
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><a href="http://www.php.net/eregi" rel="external"><span style="color: #990000;">eregi</span></a> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;&lt;title&gt;(.*)&lt;/title&gt;&quot;</span><span style="color: #339933;">,</span> <span style="color: #000088;">$doc</span><span style="color: #339933;">,</span> <span style="color: #000088;">$out</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> 
		<span style="color: #009900;">&#123;</span>
			<span style="color: #000088;">$title</span> <span style="color: #339933;">=</span> <a href="http://www.php.net/addslashes" rel="external"><span style="color: #990000;">addslashes</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$out</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>	
&nbsp;
			<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><a href="http://www.php.net/strlen" rel="external"><span style="color: #990000;">strlen</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$out</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
			<span style="color: #009900;">&#123;</span>
				<span style="color: #000088;">$title</span> <span style="color: #339933;">=</span> <a href="http://www.php.net/substr" rel="external"><span style="color: #990000;">substr</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$title</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span><span style="color: #cc66cc;">50</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>	
			<span style="color: #009900;">&#125;</span>			
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #000088;">$out</span> 	<span style="color: #339933;">=</span> <span style="color: #339933;">@</span><a href="http://www.php.net/get_meta_tags" rel="external"><span style="color: #990000;">get_meta_tags</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><a href="http://www.php.net/isset" rel="external"><span style="color: #990000;">isset</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$out</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'description'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #000088;">$desc</span> 	<span style="color: #339933;">=</span> <a href="http://www.php.net/addslashes" rel="external"><span style="color: #990000;">addslashes</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$out</span><span style="color: #009900;">&#91;</span><span style="color: #0000ff;">'description'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #b1b100;">return</span> <a href="http://www.php.net/array" rel="external"><span style="color: #990000;">array</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$title</span><span style="color: #339933;">,</span><span style="color: #000088;">$desc</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * Completes a page link with the domain if needed
	 * 
	 * @param string $page
	 * @param int 	 $parent
	 * 
	 * @return string corrected url
	 *  
	 */</span>
	<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">function</span> _buildUrl<span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #339933;">,</span><span style="color: #000088;">$parent</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #666666; font-style: italic;">// prepare url if relative</span>
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span><a href="http://www.php.net/eregi" rel="external"><span style="color: #990000;">eregi</span></a><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'http://'</span><span style="color: #339933;">,</span><span style="color: #000088;">$page</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #666666; font-style: italic;">// root relative path</span>
			<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><a href="http://www.php.net/strpos" rel="external"><span style="color: #990000;">strpos</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #339933;">,</span><span style="color: #0000ff;">'/'</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">==</span> <span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span>
			<span style="color: #009900;">&#123;</span>
				<span style="color: #000088;">$page</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">'http://'</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">domains</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$page</span><span style="color: #339933;">;</span>	
			<span style="color: #009900;">&#125;</span>
			<span style="color: #666666; font-style: italic;">// page relative path</span>
			<span style="color: #b1b100;">else</span>
			<span style="color: #009900;">&#123;</span>
				<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><a href="http://www.php.net/strpos" rel="external"><span style="color: #990000;">strpos</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #339933;">,</span><span style="color: #0000ff;">'mailto:'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">&lt;</span><span style="color: #cc66cc;">0</span> <span style="color: #339933;">&amp;&amp;</span> <a href="http://www.php.net/strpos" rel="external"><span style="color: #990000;">strpos</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #339933;">,</span><span style="color: #0000ff;">'javascript:'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">&lt;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#41;</span>
				<span style="color: #009900;">&#123;</span>
					<span style="color: #000088;">$page</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span>_getUrlByID<span style="color: #009900;">&#40;</span><span style="color: #000088;">$parent</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'/'</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$page</span><span style="color: #339933;">;</span>	
				<span style="color: #009900;">&#125;</span>				
			<span style="color: #009900;">&#125;</span>			
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// trim final slash</span>
		<span style="color: #000088;">$page</span> <span style="color: #339933;">=</span> <a href="http://www.php.net/trim" rel="external"><span style="color: #990000;">trim</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$page</span><span style="color: #339933;">,</span><span style="color: #0000ff;">'/'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #b1b100;">return</span> <span style="color: #000088;">$page</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * Gets url for a given ID
	 * 
	 * @param int url id
	 * 
	 * @return string the url
	 *  
	 */</span>
	<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">function</span> _getUrlByID<span style="color: #009900;">&#40;</span><span style="color: #000088;">$id</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #000088;">$results</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$this</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">ci</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">db</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">query</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'SELECT url FROM web.map WHERE ID = &quot;'</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$id</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'&quot;'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$results</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">num_rows</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #000088;">$result</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$results</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">result</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
			<span style="color: #b1b100;">return</span> <span style="color: #000088;">$result</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">-&gt;</span><span style="color: #004000;">url</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #b1b100;">return</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * Gets domain name from a URL
	 * 
	 * @param string url
	 * 
	 * @return string domain name
	 * 
	 */</span>
	<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">function</span> _getDomain<span style="color: #009900;">&#40;</span><span style="color: #000088;">$url</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #666666; font-style: italic;">// get host name from URL</span>
		<a href="http://www.php.net/preg_match" rel="external"><span style="color: #990000;">preg_match</span></a><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'@^(?:http://)?([^/]+)@i'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$url</span><span style="color: #339933;">,</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><a href="http://www.php.net/isset" rel="external"><span style="color: #990000;">isset</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #b1b100;">return</span> <span style="color: #000088;">$matches</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>	
		<span style="color: #009900;">&#125;</span>
		<span style="color: #b1b100;">else</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #b1b100;">return</span> <span style="color: #009900; font-weight: bold;">NULL</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #009900;">&#125;</span>
&nbsp;
&nbsp;
	<span style="color: #009933; font-style: italic;">/**
	 * HTTP helper function.&lt;br /&gt;
	 * Loads an http request and returns result.
	 *
	 * @param string $url to request
	 *
	 * @return string the result
	 *
	 */</span>
	<span style="color: #000000; font-weight: bold;">private</span> <span style="color: #000000; font-weight: bold;">function</span> _getHTTPRequest<span style="color: #009900;">&#40;</span><span style="color: #000088;">$url</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		<span style="color: #666666; font-style: italic;">// vars</span>
		<span style="color: #000088;">$html</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
		<span style="color: #000088;">$old</span>  <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>
		<span style="color: #000088;">$file</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">;</span>	
		<span style="color: #666666; font-style: italic;">//</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">//configuration</span>
		<span style="color: #000088;">$timeout</span> 			<span style="color: #339933;">=</span> <span style="color: #cc66cc;">15</span><span style="color: #339933;">;</span>
		<span style="color: #666666; font-style: italic;">//</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">// execution</span>
		try
		<span style="color: #009900;">&#123;</span>
&nbsp;
  			<span style="color: #000088;">$file</span> <span style="color: #339933;">=</span> <span style="color: #339933;">@</span><a href="http://www.php.net/fopen" rel="external"><span style="color: #990000;">fopen</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$url</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'rb'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
			<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$file</span><span style="color: #009900;">&#41;</span>
			<span style="color: #009900;">&#123;</span>
				<span style="color: #000088;">$html</span> <span style="color: #339933;">=</span> <a href="http://www.php.net/stream_get_contents" rel="external"><span style="color: #990000;">stream_get_contents</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$file</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				<a href="http://www.php.net/fclose" rel="external"><span style="color: #990000;">fclose</span></a><span style="color: #009900;">&#40;</span><span style="color: #000088;">$file</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span>						
&nbsp;
		<span style="color: #009900;">&#125;</span>
		catch<span style="color: #009900;">&#40;</span>Exception <span style="color: #000088;">$e</span><span style="color: #009900;">&#41;</span>
		<span style="color: #009900;">&#123;</span>
			<span style="color: #000088;">$html</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">'error'</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
		<span style="color: #666666; font-style: italic;">//</span>
&nbsp;
		<span style="color: #b1b100;">return</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$html</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

]]></content:encoded>
			<wfw:commentRss>http://jorgealbaladejo.com/2010/08/21/basic-web-mapper/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Working with Web Robots (Crawlers)</title>
		<link>http://jorgealbaladejo.com/2007/11/06/comenzando-a-trabajar-con-web-robots-web-crawlers/</link>
		<comments>http://jorgealbaladejo.com/2007/11/06/comenzando-a-trabajar-con-web-robots-web-crawlers/#comments</comments>
		<pubDate>Tue, 06 Nov 2007 12:43:39 +0000</pubDate>
		<dc:creator>Jorge Albaladejo</dc:creator>
				<category><![CDATA[Web Architecture]]></category>
		<category><![CDATA[Web Crawlers]]></category>
		<category><![CDATA[crawler]]></category>
		<category><![CDATA[spider]]></category>
		<category><![CDATA[web robot]]></category>

		<guid isPermaLink="false">http://labs.abc-webs.net/2007/11/06/comenzando-a-trabajar-con-web-robots-web-crawlers/</guid>
		<description><![CDATA[Some useful information to start with when attempting to work with web spiders. Just to learn the basis, these links could be useful for those to begin dealing with this subject: http://www.robotstxt.org/ &#62; Basic information and links to known robots and open source projects http://www.robotstxt.org/wc/faq.html &#62; The FAQ http://en.wikipedia.org/wiki/Web_crawler &#62; Some additional information and links [...]]]></description>
			<content:encoded><![CDATA[<p>Some useful information to start with when attempting to work with web spiders. Just to learn the basis, these links could be useful for those to begin dealing with this subject:</p>
<ul>
<li><a href="http://www.robotstxt.org/" target="_blank" rel="external">http://www.robotstxt.org/</a> &gt; Basic information and links to known robots and open source projects</li>
<li><a href="http://www.robotstxt.org/wc/faq.html" target="_blank" rel="external">http://www.robotstxt.org/wc/faq.html</a> &gt; The FAQ</li>
<li><a href="http://en.wikipedia.org/wiki/Web_crawler" target="_blank" rel="external">http://en.wikipedia.org/wiki/Web_crawler</a> &gt; Some additional information and links to open source bots</li>
<li><a href="http://www.ficstar.com/web_grabber/web_crawler.html" target="_blank" rel="external">http://www.ficstar.com/web_grabber/web_crawler.html</a> &gt; A private crawler software</li>
<li><a href="http://www.newprosoft.com/web-spider.htm" target="_blank" rel="external">http://www.newprosoft.com/web-spider.htm</a> &gt; Another private crawler software with a competitive price</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://jorgealbaladejo.com/2007/11/06/comenzando-a-trabajar-con-web-robots-web-crawlers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

