Crawling<br>(gather data from web)
Basic challenge: there is no central index of URLs of interest. <br>
Secondary challenges:
Some websites return the same content as a new URL at each visit.<br>
Some pages never return status 'done' on access.<br>
Some websites are not intended to be crawled. <br>
Much web content is generated on-the-fly from databases, <br>which can be costly for the content provider, so excessive visits to a site are unwelcome. <br>
Some content has a short lifespan. <br>
Some regions and content providers have low bandwidth.
Crawler requirements:
Every page is visited eventually
Synonym URLs are disregarded
Significant or dynamic pages are visited sufficiently frequently<br>
The crawler isn't cycling indefinitely in a single web site (caught in a crawler trap)
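The requirement that synonym URLs be disregarded is typically met by canonicalising every URL before it enters the frontier, so that variants of the same address map to one form. A minimal sketch using the standard library; the particular normalisation rules here (lowercasing, dropping fragments, stripping trailing slashes) are illustrative assumptions, not a complete scheme:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalise_url(url):
    """Map synonym URLs to a single canonical form (assumed rules:
    lowercase scheme and host, drop the fragment, strip trailing slash)."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    path = path.rstrip("/") or "/"   # treat /a/ and /a as the same page
    return urlunsplit((scheme.lower(), netloc.lower(), path, query, ""))
```

A crawler would store canonical forms in its seen-set, so `HTTP://Example.com/a/` and `http://example.com/a#sec` are fetched only once.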
Parsing<br>(translate data into a canonical form)
The aim of parsing is to reduce a web page, or a query, to a sequence of tokens
Canonicalisation<br>(having a single representation of a text)
Tokenisation<br>(decomposing the larger document into smaller information-bearing units)
Normalisation<br>(a form which is generally representative of other instances of the same idea)
Tokenisation steps
Folding case
convert to lowercase
Stripping punctuation
remove non-alphabetic characters
Splitting into tokens
split on whitespace
Zoning
segment documents into discrete zones, e.g. title, anchor text, headings
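The three tokenisation steps above (fold case, strip punctuation, split) can be sketched as a single small pipeline; this is a toy illustration, not a production tokeniser:

```python
import re

def tokenise(text):
    """Reduce a text to a sequence of tokens."""
    text = text.lower()                    # folding case: convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)  # stripping punctuation: drop non-alphabetic characters
    return text.split()                    # splitting into tokens on whitespace
```

For example, `tokenise("The Cat, sat!")` yields `["the", "cat", "sat"]`.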
Indexing<br>(build an efficient data structure)
inverted index: <br>a collection of lists, one per term, recording the identifiers of the documents containing that term. <br>
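An inverted index as described, one posting list per term holding the identifiers of documents containing it, can be built in a few lines; the toy document collection is assumed for illustration:

```python
from collections import defaultdict

# toy collection: document identifier -> already-tokenised text
docs = {1: "the cat sat", 2: "the dog sat", 3: "a cat ran"}

def build_index(docs):
    """One sorted posting list per term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```

Here `build_index(docs)["cat"]` is `[1, 3]`: the documents that contain "cat".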
Querying<br>(process data in response to queries)
Boolean Querying
fetch each term
use intersection for AND
use union for OR
take complement for NOT
ignore within-document frequencies
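With posting lists held as sets, the Boolean operators above map directly onto set operations; a minimal sketch over an assumed toy index:

```python
# toy inverted index: term -> set of document identifiers
index = {"cat": {1, 3}, "sat": {1, 2}, "dog": {2}}
all_docs = {1, 2, 3}

def query_and(a, b):
    return index.get(a, set()) & index.get(b, set())  # intersection for AND

def query_or(a, b):
    return index.get(a, set()) | index.get(b, set())  # union for OR

def query_not(a):
    return all_docs - index.get(a, set())             # complement for NOT
```

So `cat AND sat` returns `{1}`, while `NOT dog` returns `{1, 3}`; within-document frequencies play no role.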
Ranked Querying
accumulator<br>(the partial sum of the dot product)
calculate Wq,t and fetch the inverted list for t
calculate Wd,t and set A=A+Wq,t*Wd,t
set A = A/Wd (normalise by the document length Wd)
identify greatest A and return corresponding documents
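The accumulator loop above can be sketched end-to-end. The postings, document lengths, and the particular TF and IDF formulas below are assumed for illustration; many weighting variants exist:

```python
import math

# toy postings: term -> {doc_id: f_dt, the within-document frequency}
postings = {"cat": {1: 2, 3: 1}, "sat": {1: 1, 2: 3}}
N = 3                                   # number of documents in the collection
doc_len = {1: 2.0, 2: 3.0, 3: 1.5}      # Wd: document vector lengths (assumed)

def rank(query_terms):
    acc = {}                                        # one accumulator A_d per document
    for t in query_terms:
        if t not in postings:
            continue
        w_qt = math.log(1 + N / len(postings[t]))   # Wq,t: an IDF variant
        for d, f_dt in postings[t].items():         # fetch the inverted list for t
            w_dt = 1 + math.log(f_dt)               # Wd,t: a TF variant
            acc[d] = acc.get(d, 0.0) + w_qt * w_dt  # A = A + Wq,t * Wd,t
    for d in acc:
        acc[d] /= doc_len[d]                        # A = A/Wd
    return sorted(acc, key=acc.get, reverse=True)   # greatest A first
```

The thresholding strategy below would simply skip the `acc[d] = ...` update when `w_qt * w_dt` falls under the threshold and `d` has no accumulator yet.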
reducing accumulator cost
limiting:<br>process the most important terms first, namely those with the highest IDF<br>
thresholding:<br>Only create an accumulator when the TF-IDF value is above some threshold<br>
Add-ons
Phrase Queries
returning only documents where the query terms appear in sequence
use a "positional index": an index where the positions of terms (indices within the document)
Link Analysis<br>
a weighting (or re-ranking) procedure
based on the link structure of the Web
assumption: popular pages are more likely to be relevant
PageRank
we begin at all pages with equal probability
traversal
follow one of the outgoing links from this page, chosen with uniform probability<br>
teleport
jump to a random page, as if entering its URL directly, with uniform probability
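The random-surfer model above (start everywhere with equal probability, follow a uniform outgoing link, occasionally teleport) can be sketched as power iteration. The damping factor `d = 0.85` and the handling of dangling pages are conventional assumptions:

```python
def pagerank(links, d=0.85, iters=50):
    """links: page -> list of pages it links to.
    With probability d, follow a uniform outgoing link;
    with probability 1-d, teleport to a uniform random page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}      # begin at all pages with equal probability
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}   # teleport share
        for p, outs in links.items():
            if not outs:                        # dangling page: spread its mass evenly
                for q in pages:
                    new[q] += d * rank[p] / n
            else:                               # traversal share
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
        rank = new
    return rank
```

On the toy graph `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, page `c` outranks `b` because it gathers links from both `a` and `b`; the ranks always sum to 1.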