Make a Search Engine in PHP and MySQL
Why would you want to make a search engine anyway?
There already is a search engine to rule them all. You can use Google
to find just about anything on the Internet, and I doubt you will ever
have the same computing and storage capabilities as the big G.
So why then make your own search engine?
To make money of course!
... and to become famous as the creator of the next big search engine,
or because, as a programmer or engineer, you like challenges. Making a
search engine for the public Internet is tricky, and if you're like me
you like to solve tricky problems.
The third application is a customized, high-speed site search for your
website with thousands of pages. An indexed search engine will be a lot
faster than a full text search function, and if Google's site search
isn't flexible enough for your site you can make your own search
functionality.
THE BASICS OF SEARCH
The basis of any BIG search engine is a word to web page index,
basically a long list of words and how well they relate to different
web pages.
To make a search engine you have to do four things:
- Decide what pages to fetch and fetch them
- Parse out words, phrases and links from the page
- Give a score to every keyword or key phrase indicating how well
  the phrase relates to that page, and store the scores in the
  search engine index
- Provide a way for users to query the index and get a list of
  matching web pages
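To make the last three steps concrete, here is a minimal sketch in PHP. It is only an illustration under a few assumptions: the index lives in a plain array instead of MySQL so the code can run on its own, a word's score is simply how often it occurs on the page, and only ASCII words are handled. The function names extractWords, indexPage and search, and the example.com URLs, are made up for this sketch.

```php
<?php
// Sketch of steps 2 to 4 with the index held in a PHP array.
// A real engine would keep (word, url, score) rows in MySQL instead.

/** Step 2: pull lowercase ASCII words out of an HTML string. */
function extractWords(string $html): array
{
    // Remove script and style blocks, then replace remaining tags with spaces.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    $text = preg_replace('/<[^>]*>/', ' ', $text);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Split on anything that is not a letter or digit.
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

/** Step 3: add one page to the index, stored as word => [url => score]. */
function indexPage(array &$index, string $url, string $html): void
{
    // Assumed scoring rule: the score is the number of occurrences.
    foreach (array_count_values(extractWords($html)) as $word => $count) {
        $index[$word][$url] = $count;
    }
}

/** Step 4: return the URLs that contain every query word, best score first. */
function search(array $index, string $query): array
{
    $totals = [];
    foreach (extractWords($query) as $i => $word) {
        $pages = $index[$word] ?? [];
        if ($i === 0) {
            $totals = $pages;
            continue;
        }
        // Keep only pages that also contain this word, adding up the scores.
        $next = [];
        foreach ($totals as $url => $score) {
            if (isset($pages[$url])) {
                $next[$url] = $score + $pages[$url];
            }
        }
        $totals = $next;
    }
    arsort($totals);
    return array_keys($totals);
}

// Usage with two hypothetical pages.
$index = [];
indexPage($index, 'http://example.com/a', '<title>PHP search</title><p>Search engines in PHP</p>');
indexPage($index, 'http://example.com/b', '<p>MySQL tips</p>');
print_r(search($index, 'php search')); // lists only http://example.com/a
```

Fetching the pages (step one) is left out here, and replacing the array with a database table is what makes the index survive between requests.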
This is not hard for a seasoned programmer. It can
be done in a day if you know regular expressions and have some
experience with HTML and databases.
Now you have a working search engine. Just add a lot of computers and
hard drives and you'll soon index all of the Internet. If you're not
prepared to go that far, a one terabyte disk will hold an index of about
50 million pages, which works out to roughly 20 KB of index data per
page.
HOW TO SCORE PAGES
After completing basic search functionality there's a lot of work
before anyone will want to use your new machine.
An index is not enough. The challenge is scoring pages so that the end
user gets the search results most relevant to what they are actually
searching for.
You'll need to decide how much weight to put on keywords in the title
tag, description and main web page contents. For good scoring you
will also want to boost keywords found in the URL of the page and check
the anchor text of inbound links.
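One simple way to combine these signals is a weighted sum, sketched below. The weights are pure guesses on my part and would need tuning against real queries, and the field names and the scoreKeyword function are hypothetical. The score is just the number of whole-word hits in each field multiplied by that field's weight.

```php
<?php
// Hypothetical weights per field. Tune them against real queries.
const FIELD_WEIGHTS = [
    'title'       => 10.0,
    'url'         => 8.0,
    'anchor'      => 6.0, // anchor text of inbound links
    'description' => 4.0,
    'body'        => 1.0,
];

/**
 * Score one keyword for one page.
 * $fields maps a field name from FIELD_WEIGHTS to the text found there.
 */
function scoreKeyword(string $keyword, array $fields): float
{
    $pattern = '/\b' . preg_quote(strtolower($keyword), '/') . '\b/';
    $score = 0.0;
    foreach (FIELD_WEIGHTS as $field => $weight) {
        $text = strtolower($fields[$field] ?? '');
        // Count whole-word occurrences of the keyword in this field.
        $hits = preg_match_all($pattern, $text);
        $score += $weight * $hits;
    }
    return $score;
}

// Usage with a hypothetical page.
$page = [
    'title'       => 'PHP search engine',
    'url'         => 'http://example.com/php-search',
    'description' => 'Build a search engine',
    'body'        => 'search search php',
];
echo scoreKeyword('php', $page), "\n";
```

Keeping the weights in one table makes it easy to experiment: change a number, re-score, and compare the result lists.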
Keeping track of inbound links is the most useful and most challenging
of the above. You'll need to keep a separate database table with info
on all links between the pages you index.
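As a rough sketch of what that table holds, the code below records one row per link with the assumed columns from, to and anchor, using a PHP array as a stand-in for the MySQL table. The function names are hypothetical, the regular expression only catches quoted href attributes, and relative URLs are not resolved, so treat it as a starting point rather than a finished link extractor.

```php
<?php
// Each row mirrors what a links table could hold: (from, to, anchor).
// The column names are assumptions made for this sketch.

/** Extract [href, anchor text] pairs from a page with a simple regex. */
function extractLinks(string $html): array
{
    $pattern = '#<a\s[^>]*href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)</a>#is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    $links = [];
    foreach ($matches as $match) {
        // Drop any markup inside the link so only the visible text remains.
        $links[] = [$match[1], trim(strip_tags($match[2]))];
    }
    return $links;
}

/** Append one row to the link table for every link found on $fromUrl. */
function recordLinks(array &$linkTable, string $fromUrl, string $html): void
{
    foreach (extractLinks($html) as [$toUrl, $anchor]) {
        $linkTable[] = ['from' => $fromUrl, 'to' => $toUrl, 'anchor' => $anchor];
    }
}

/** Collect the anchor texts of all links pointing at $url. */
function inboundAnchors(array $linkTable, string $url): array
{
    $anchors = [];
    foreach ($linkTable as $row) {
        if ($row['to'] === $url) {
            $anchors[] = $row['anchor'];
        }
    }
    return $anchors;
}

// Usage with hypothetical pages.
$links = [];
recordLinks($links, 'http://a.example/', '<a href="http://b.example/">PHP <b>search</b></a>');
recordLinks($links, 'http://c.example/', '<a href="http://b.example/">search engine</a>');
print_r(inboundAnchors($links, 'http://b.example/'));
```

The anchor texts returned for a page can then be scored like any other text about that page.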
WHAT TO INDEX AND NOT TO INDEX
Another obstacle you will find when you start indexing real Internet
content is the vast amount of useless junk floating around everywhere.
Eventually your index will become full of spam, affiliate pages, parked
domains, work-in-progress homepages without content, link farms used by
search engine optimizers, mirror sites using data feeds to create
thousands of pages with product listings or other reproduced content,
and so on.
When indexing from the Internet you will have to find ways to filter
out the junk content from what people are actually reading.
To start with, you could limit how deep into subdirectories you crawl,
how many link hops from a domain index page you crawl and how many
links per web page you allow.
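Those three limits can be expressed as a few small checks, as in this sketch. The limit values are arbitrary examples and the function names are hypothetical. It also assumes that the last path segment without a trailing slash is a file name, which is not always true on real sites.

```php
<?php
// Hypothetical limits. The right values depend on the sites you crawl.
const MAX_DIR_DEPTH      = 3;   // subdirectories below the domain root
const MAX_LINK_HOPS      = 4;   // hops from the domain index page
const MAX_LINKS_PER_PAGE = 100; // links followed from a single page

/** Number of directory levels in the URL path: /a/b/page.html gives 2. */
function directoryDepth(string $url): int
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        $path = '/';
    }
    // Drop the file name (if any), then count the remaining segments.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return count(array_filter(explode('/', $dir), 'strlen'));
}

/** Decide whether a discovered URL should be queued for crawling. */
function shouldCrawl(string $url, int $hops): bool
{
    return $hops <= MAX_LINK_HOPS && directoryDepth($url) <= MAX_DIR_DEPTH;
}

/** Keep only the first MAX_LINKS_PER_PAGE links found on a page. */
function limitLinks(array $links): array
{
    return array_slice($links, 0, MAX_LINKS_PER_PAGE);
}

// Usage with hypothetical URLs.
var_dump(shouldCrawl('http://example.com/docs/page.html', 2));
var_dump(shouldCrawl('http://example.com/a/b/c/d/page.html', 1));
```

The crawler would call shouldCrawl before queueing a link and limitLinks on the list of links extracted from each page.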
There are a million ways, both right and wrong, to write HTML, and when
you index from the Internet you will need to handle all of them.
When parsing keywords from pages you not only need to handle the
complete HTML standard but also all the non-standard ways that are
unofficially supported by Internet browsers.
To be able to read all pages you will also need to parse client-side
scripts.
Being able to read all sorts of content is a large part of the work on
a general search engine.
WHY SO MANY URLS?
Finally you'll need to deal with the fact that many websites have many
URLs pointing to the same web page. Just look at this example:
All those URLs point to the same web page. If you don't write special
code to handle that, you'll soon have four results in your search engine
(one for every URL), all going to the same page. Users will not like you.
There is also the problem of query strings, where a session ID after
the question mark in the URL will create an almost infinite number of
URLs for the same web page.
To the search engine there will be a really big number of pages all
containing the same content.
The quick fix, of course, is to not index pages that include a query
string, or to strip the query string from their URLs. This works but will
also remove a lot of legitimate content (think forums) from your index.
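A gentler approach is to reduce every URL to one canonical form before it goes into the index, and to drop only the parameters that look like session IDs. The sketch below is one possible way to do that and rests on several assumptions: the list of session parameter names and index file names is my own guess, treating www.example.com and example.com as the same site is not safe for every domain, and the function name canonicalUrl and all URLs are hypothetical.

```php
<?php
// Parameter names treated as session IDs (an assumption, extend as needed).
const SESSION_PARAMS = ['phpsessid', 'sid', 'sessionid', 'jsessionid'];
// File names treated as the default document of a directory (assumption).
const INDEX_FILES = ['index.html', 'index.htm', 'index.php'];

/** Reduce a URL to a single canonical form for use as the index key. */
function canonicalUrl(string $url): string
{
    $p = parse_url($url);
    if ($p === false || empty($p['host'])) {
        return $url; // leave anything we cannot parse untouched
    }
    $scheme = strtolower($p['scheme'] ?? 'http');
    $host = strtolower($p['host']);
    // Treat www.example.com and example.com as one site.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    $port = isset($p['port']) ? ':' . $p['port'] : '';
    $path = $p['path'] ?? '/';
    // Treat /dir/index.html and /dir/ as the same page.
    $file = substr($path, strrpos($path, '/') + 1);
    if (in_array(strtolower($file), INDEX_FILES, true)) {
        $path = substr($path, 0, -strlen($file));
    }
    // Keep the query string, minus session ID parameters.
    $params = [];
    parse_str($p['query'] ?? '', $params);
    foreach (array_keys($params) as $name) {
        if (in_array(strtolower((string) $name), SESSION_PARAMS, true)) {
            unset($params[$name]);
        }
    }
    ksort($params); // the same parameters in any order give the same URL
    $query = http_build_query($params);
    return $scheme . '://' . $host . $port . $path . ($query !== '' ? '?' . $query : '');
}

// Usage: four hypothetical variants of one page, then a forum URL.
foreach (['http://example.com', 'http://example.com/', 'http://www.example.com/',
          'http://www.example.com/index.html'] as $variant) {
    echo canonicalUrl($variant), "\n";
}
echo canonicalUrl('http://example.com/forum/index.php?t=42&PHPSESSID=abc123'), "\n";
```

All four variants come out as http://example.com/, and the forum URL keeps its t=42 parameter while losing the session ID, so the thread stays in the index under a single URL.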
You now have all the information you need to make a site search engine.
If you're going for a general Internet search engine there are a lot more
details you need to handle, like robots.txt, site maps, redirects,
proxies, recognizing content types, advanced ranking algorithms as well
as handling terabytes of data.
I'll cover more detail in a future article. Good luck with your next
search engine project.
Article Source: http://www.articlesbase.com/programming-articles/
About the Author
Simon Byholm is building a new search engine
where he will test and describe new and old search algorithms. Simon is
a software engineer living on the west coast of Finland and has a B.Sc.
degree in telecommunication and a burning interest in search
engine algorithms.