WebIndex

A Site Search Tool for the Masses

Copyright © 2020, Erick Engelke

http://erickengelke.com/webindex

 

WebIndex is an easy-to-install, inexpensive web search engine suitable for most sites, from small personal pages to busy corporate sites.

 

Features

 

WebIndex produces fast, ranked result pages based on user-specified search words.

 

The system can be used for free with ads, or you can pay the license fee for an ad-free site.

 

It can inspect and index most unencrypted document types:

·      HTML

·      Word (.docx)

·      Excel (.xlsx)

·      PowerPoint (.pptx)

·      Many PDFs (though not all encrypted ones)

 

You can specify one or more sites or subsites, and the system will index only the documents within them.  You can also explicitly exclude sites or subdirectories.

 

The server can be any macOS, Linux or Unix web server with just PHP 7.3 and MySQL or MariaDB.  You will be up and running in minutes.

 

The client is pure HTML5 + JavaScript.

 

The banner picture is customizable for your site.

 

Sample Search Results

Search results are displayed quickly, ranked by the context of the search words within the indexed pages.  Results shown include nag text for the unregistered version.

 

[Screenshot: sample search results page]

 

Configuration

Configuration takes only about 10 minutes for a typical Mac, Linux or Unix administrator.

 

1.     You will need Apache, either MySQL or MariaDB, and PHP 7.3 with the PDO and ZipArchive extensions installed on the server.
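
To verify the prerequisites, a quick sanity check (extension names as reported by a stock command-line PHP; adjust for your distribution):

    php -v                          # expect PHP 7.3.x
    php -m | grep -i -E 'pdo|zip'   # the PDO and ZipArchive (zip) extensions should be listed
    mysql --version                 # MySQL or MariaDB client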

2.     Download the latest source files from https://erickengelke.com/webindex.tar and untar them into a fresh directory (e.g. /usr/local/webindex).
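
For example (the download URL is from above; the target directory is just a suggestion):

    mkdir -p /usr/local/webindex
    cd /usr/local/webindex
    curl -O https://erickengelke.com/webindex.tar
    tar -xf webindex.tar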

3.     The archive contains two subdirectories below the top level.  One is web, which holds the files visible to your users.  The other is server; chmod 700 that subdirectory so other users cannot read it.  You may move these subdirectories anywhere on your system, or leave them where they are.

4.     Either create a symbolic link from your HTTP or HTTPD serving subdirectory to the web directory, or copy the contents of the web directory to a subtree of your HTTP or HTTPD subdirectory.
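
For example, if your document root is /var/httpd (a placeholder path, adjust to suit):

    ln -s /usr/local/webindex/web /var/httpd/webindex
    # or, to copy instead of linking:
    cp -R /usr/local/webindex/web /var/httpd/webindex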

5.     Edit the server subdirectory’s serverconfig.php: set $webdirectory to point to the subdirectory holding the web pages, and set $serverdirectory to the path to the main directory.
e.g. $serverdirectory = "/usr/local/webindex";
and $webdirectory = "/var/httpd";
NOTE: the directory names must not contain a trailing / slash.

6.     Configure the database with MySQL or MariaDB,
e.g. create database webindex;
You will need an account called webindex@localhost with read/write permissions on the webindex database.  The assumed default password is HappyGilmore, but you should change it for production use.  These values are set in web’s config.php.
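
A minimal sketch of that setup, using the default password from above (GRANT ALL is used here so the same account can also create the tables in step 7; tighten the privileges later if you wish):

    create database webindex;
    create user 'webindex'@'localhost' identified by 'HappyGilmore';
    grant all privileges on webindex.* to 'webindex'@'localhost';
    flush privileges;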

7.     Install the database structure using MySQL’s source command from the server subdirectory:

-       use webindex;

-       source sqlconfiguration.sql;
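
A full session might look like this, assuming the files from step 2 are still in /usr/local/webindex:

    cd /usr/local/webindex/server
    mysql -u webindex -p
    mysql> use webindex;
    mysql> source sqlconfiguration.sql;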

8.     Edit the web subdirectory’s config.php

a.     Set the database password in $db_pass if you didn’t use the default

b.     Set $site_name to your site’s textual name

c.     Set $default_site to your base URL, e.g.
$default_site = "https://microsoft.com/Windows";

d.     You can set an optional $first_url, which will be the starting point.  By default it starts with the whole $default_site.
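
Putting steps a through d together, a hypothetical config.php fragment (every value below is a placeholder):

    $db_pass      = 'MyNewSecret';                 // 8a: database password
    $site_name    = 'Example Corp';                // 8b: your site's textual name
    $default_site = 'https://example.com';         // 8c: base URL to index
    $first_url    = 'https://example.com/docs/';   // 8d: optional starting point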

e.     Erase any backup files in the web subdirectory (e.g. config.bak or config.php~) because they would contain your database password.

f.      Run a test by going to the main server directory and typing
php site_start.php
This should cause the system to index a few pages, printing the URL and title of each page as it goes.  If it doesn’t, check that the URLs in config.php are correct.  It may continue for a few minutes; by default it stops after 5 minutes or when it runs out of work.  You can press Ctrl-C to stop the spider process.

g.     To add other web servers or other sites, go to the server directory and run php site_add_site.php
e.g. php site_add_site.php https://microsoft.com

Note that you must add both http://microsoft.com and https://microsoft.com if you want both the encrypted and unencrypted sites indexed.  Most sites are https-only these days.

Because you have added new sites, the pages already scanned must be rescanned so that any links pointing to the new sites are picked up automatically.  This is easily accomplished with
php site_clean.php

h.     Now try running php site_cron.php.  You should see no errors and no output; cron jobs are deliberately silent, because their output would otherwise fill up the system logs.

i.      Now you have a choice.  If this is your company web site, use cron to look for new and updated pages on a regular basis.  If you are just running your own personal pages, you do not need cron; simply run php site_start.php occasionally to update the index after you change your site.

        i.     If using cron, set up your Unix crontab to call php path_to_file/site_cron.php every 5 minutes.  The cron.lock file is updated each time cron starts a new job.
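
For example, assuming the server files are still in /usr/local/webindex/server, the crontab entry would be:

    */5 * * * * php /usr/local/webindex/server/site_cron.php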

9.     Your system is now configured; congratulations.  Point a web browser on a networked computer at index.php in the web subdirectory.  If you see an error, fix the problem before continuing.

10.  You should now be able to search for words that are already indexed.  Note that the demo license is limited and includes a nag screen and advertising.  If you do not want ads, register for the full license; it is inexpensive and required for an ad-free experience.

 

 

Other commands

1.     site_clean.php – erase all the indexed data, leaving the site names intact.

2.     site_cron.php – the script you call from a crontab.  It runs for 5 minutes or less, so schedule it every 5 or 6 minutes.

3.     site_popular.php – list URLs in descending order of how often they are clicked

4.     site_search_words.php – run a search from the command line; call it with the words to search for, e.g.
php site_search_words.php computer windows

5.     site_summary.php – print out a summary of the number of words and pages indexed

6.     site_urls.php – list the URLs indexed

7.     site_words.php – list the indexed words

 

Important Note:

WebIndex indexes every page it can reach within the sites you have configured.  But sometimes you have confidential internal information that you do not want indexed; e.g. https://asdf.com/internal may contain private data.

To exclude a subdirectory tree, add it to the sites table in the MySQL database with recording=0.

e.g.  insert into sites (site, recording) values ('https://asdf.com/internal', 0);
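
For a site that is already present in the table, the equivalent update is:

insert and update use the same site and recording columns, e.g.
update sites set recording=0 where site='https://asdf.com/internal';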

Then run php site_clean.php to purge any pages you want hidden from the database.

 

Registering the WebIndex Software

Registration is required for legal usage if you do not want ads.  The costs are affordable:

 

·      Personal Web Site: $40 US (Up to 500 pages)

·      Departmental Web Site: $100 US (Up to 20,000 pages)

·      Small Company: $200 US (Up to 50,000 pages)

·      Corporate Web Site: $1,000 US (Unlimited pages)

 

Contact erickengelke@gmail.com for details.  Include a copy of your config.php in the email, as the license code depends on its values.

 

We can perform custom coding for your requirements.  If the suggestion is of general use to everyone, there will be no charge.

 

 

Personal Webspaces

If you purchase or are given a personal web page and wish to use WebIndex to organize your files, you are in a different situation from a company, where many people update files at random times.

 

It is not necessary to set up a cron job.  Just run the site_start.php script every time you update a document.  Make it part of your routine.

 

Enterprise Performance

The system scales easily to most business sizes.  The spider can run on a separate machine from the web search tool, and all tasks can run multithreaded.  Contact the author for more information.

 

Disk Requirements

WebIndex is careful with disk space.

 

A sample site with 1,874 documents and 10,011 words required 33 megabytes of space for all the index files.

 

In our experience, the approximate required size is:

    document count × 16 KB + distinct words × 1 KB
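
As a rough check against the sample site above: 1,874 documents × 16 KB is about 29 MB, plus 10,011 words × 1 KB is about 10 MB, for an estimate of roughly 39 MB, the same order as the 33 MB actually measured.  For reference, the table files behind those 33 megabytes: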

 

   128K 28 Nov 00:51 sites.ibd

   8.0M 28 Nov 04:04 webpage.ibd

    25M 28 Nov 04:04 webpageassociation.ibd

    11M 28 Nov 04:04 webword.ibd

 

If you want to keep space consumption small, set $cache_documents = 0 in the web subdirectory’s config file.  The system will then store less information about each saved web page.

 

Setting $cache_documents = 1; offers more information to search users, often showing the context of the found information.  By default, the system caches the first 4 kilobytes of useful data from each page.  You can cut this in half by setting $cache_len = 2048;
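
A hypothetical cache section of config.php, showing the two modes side by side (uncomment one):

    // space-saving mode: store minimal information about each page
    $cache_documents = 0;

    // context mode: cache a 2 KB excerpt per page (default is 4 KB)
    // $cache_documents = 1;
    // $cache_len = 2048;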

 

Copyright © 2020, Erick Engelke : erickengelke@gmail.com