This text was written in December 2004 by dirk@dirkgomez.de for MIT Sloan School. It is copyright MIT Sloan School.
Vision Statement
To create a user-friendly, extensible, and scalable search package for .LRN.
Introduction
There currently is no working search solution for .LRN. This document will describe how to implement a search package for .LRN for Oracle 8i, which is the database system in use in MIT Sloan School. The described approach should be portable to PostgreSQL.
Oracle provides a search engine out of the box which has renamed been renamed a few times: Oracle Text, Oracle Context, Oracle Intermedia. I'll use the term Intermedia in this text.
Intranet Search vs. Internet Search
Intranet search is different from internet search:
- Intranet search has to take permissioning into account before it displays a result.
- You can use additional metadata like type of an object, when was an object created, how often has it been accessed etc. - and you can trust this metadata.
- You can access "real data" in a database and don't have to parse websites and then try to filter out navigational metadata or search engine tricks.
- Because an intranet search talks to the database fields in a database directly, it knows the value, source and semantics of data much better than an external solution.
- Intranet search works on a finite set of data: you can make better assumptions about the expected results, both in quality and in quantity.
Why not use Google (or Convera etc) search? They cost! Google offers an search appliance where the Google crawler requests your data via HTTP requests. But every page in .LRN is personalized and there is object-level permissioning, and the Google crawler will always only see an object's representation tailored to it.
Why not use htdig which is free? Because it runs outside of the database using HTTP to get documents to be indexed - it simply doesn't know about .LRN's permissioning: the search page would list results without considering permissions.
If you run an internet site which is mostly publically available, then you may be much better off with using an Internet search engine by using e.g. Google Free WebSearch or by using htdig and fine-tuning it to your needs.
Current state of the site-wide-search and the search package
There is no working search package for .LRN - the main reason is probably the effort of integrating the about 10 packages into a search solution. The site-side-search package in the OpenACS CVS repository has not been touched in about three years and the search package currently only works with PostgreSQL. Porting the search package to Oracle doesn't look too difficult, but the real work is making the .LRN packages work with it.
ACS3 and ACES, the predecessor to .LRN, had working search packages. When ArsDigita made the transition from ACS3 to ACS4, most of the packages were rewritten from scratch. When ArsDigita dropped ACS TCL altogether, many packages remained unfinished and site-wide-search is among them. Furthermore, this code has not been changed for almost three years, thus not reflecting the many data model changes.
I know of three search solutions based on OpenACS4 systems: the ShareNet system which is hopelessly forked and not freely available, the Greenpeace International system which is just hopelessly forked and the Greenpeace Cyberactivist Toolkit search which should be very similar in scope and functionality and currently only works with PostgreSQL.
My experience with the ACS3 version of search was pretty good and about one year ago I did a proof of concept port to a heavily customized version .LRN and achieved results pretty quickly. The approach described in this document will be a mixture of my past experience and the discussions I have conducted in the past months.
Lists of Requirements - MIT Sloan
This list of requirements was collected by Deirdre Kane from MIT Sloan School.
- Users have the ability to execute keyword searches against all public content and content in communities of which they are members.
- Search results should be returned in relevance ranked order with the ability to re-sort by date.
- Users should have the ability to search within a set of search results to refine their queries.
- Users should be able to search within a single community or group of communities.
- Users should be able to choose the document types in which to conduct the search.
- Search results should not contain results for content the user does not have permission to access.
- Different search types and/or repositories should be separated using a tabbed UI like Google (example tabs are documents, news, faq, people etc).
- Spellchecker integrated in search (like Google).
- The search system should log the queries entered to allow later evaluation of the effectiveness of the system.
- Index binary files like Word, Excel, or PDF
Other requirements under evaluation
- Ability for administrators to manage search results to direct important pages to the top of the search results
The Implementation Process
In short, these steps are necessary to create a search package:
- Discuss, agree on, implement the necessary changes to the OpenACS core.
- Implement the search "platform code".
- Implement more feature-rich or customized versions of search.
- Discuss, agree on, set up a process that makes sure that future versions of .LRN packages do not get out of sync with search.
Use-cases and User-scenarios
General Scenarios
- A user is searching for some particular content in a .LRN system.
- A user is searching for some particular content in a particular communities and all of its subcommunities in a .LRN system.
- A user is searching for another user of a .LRN system.
- A user is searching for particular content in a particular filetype.
User Interface
Every user-facing page will have a search field and a drop-down list that allows the user to scope the search. The following example assumes you are on a page of the TCL 101 class:
Every search request will be a GET request with the search form fields having the previous request's values as their defaults and the page itself describing the query and then listing the relevant pages so that continual refinement of a query is possible. It is possible to reorder the relevant items by date.
The result page should be similar to Google's result page: The title of the found page as a link, a more verbose blurb, the target URL and the target's name of the community or group:
Clean delete of packages in Elearning Introduction 2004
irc, maybe also somewhere in this forum), that deletion of data in general is a bad idea
http://some.tld/elearning/introduction/2004/forums/message-view?message_...
How do I get OpenFTS/Search working? in OpenACS Hacking 2003
installed at http:// www.thedesignexperience.org/search/ You can search the weblog from the #openacs irc channel there.
http://some.tld/openacs/hacking-2003/calendar/cal-item-view?item_id=24874
etp page links returned by search in OpenACS Hacking 2004
IRC channel and look for DavB. Thanks Jun, that actually looks kinda straightforward - I will
http://some.tld/openacs/hacking-2003/wiki-view?object_id=68487
ns_conn content payload in OpenACS Hacking 2004
IRC channel fairly regularly (openacs on opennet). As far as reconciling AOLserver changes made here
http://some.tld/openacs/hacking-2004/forums/message-view?message_id=68432
(Note that the community name should be a non-obvious link, too, non-obvious to not clutter the user interface.)
Search and Advanced Search page
The search and the advanced search page are virtual URLs mounted under community-name/search and community-name/advanced-search respectively. They should look like the Google search page and offer "All" - "Documents" - "People" - "News" - "FAQ" as its tabs and "All your communities", "This community", "This community and all its subcommunities" as radio buttons. Advanced search gives a user more options on how to exclude or include file types, limit the search to a particular date interval etc.
Spell Checker
I personally have no experience with spellcheckers. Here's an approach which I think is feasible: check the words against a dictionary, then if the search result is well beyond 10 rows and not scoped to the current community, present an alternative search string. I am sure that the spell checker's quality needs to be monitored for quality first and I propose to move te spell checker to a phase II.
Monitoring Search
Admins need to be able to look at searches that were run by users: search string, search url stub and the objects found by this search run.
This should be presented as a list of URLs that let the admin rerun the search query with the original user's credentials (note: this particular feature is limited to site-wide admins) to allow for tuning the search queries.
Here's an (ugly) example (without the pagination which will be necessary for the real system):
Discussion Points
This is a high-level description of what happens when you run a search query:
For all objects I have read permission on:
- Is the object's content relevant for my query?
- Is the object in the communities I want to search in?
The resulting objects need to be ordered then by either relevance or last modification date and rendered as links with a explanatory blurb.
This translates into the following requirements for the OpenACS/.LRN core:
- What does last_modified in acs_objects mean exactly per object? E. g. for the initial answer in a forum thread: is it the date of the last answer or the date of this particular message's last edit?
- Add a way to inexpensively retrieve the community_id to which an object belongs. Currently it is package-specific to figure out to which community a random object belongs. Note that augmenting ad_conn is not enough, this problem has to be addressed at the data model level.
- It is recommended that a central table called acs_objects_description or site_wide_index be added which holds an object's name and a few other columns which will be of use for the site-wide-package. This table will be filled by all the packages which feed the search package, usually with triggers.
- Every package which is "search"able needs a get_url and - depending on whether there is a central field for an object name - a get_name procedure on the TCL level. I suggest putting those procedures into a TCL namespace e. g. search::calendar for a calendar object, search::cal_item for a calendar item object, search::forum for a forum object, search::forum_message for a forum message. Currently, the approach is to use the acs-service-contracts which I find unnecessarily complicated for this purpose.
- Every package-specific permission type that allows a user to "read" something must inherit from the site-wide "read" permission type.
- There must not be permissioning islands: a user that has "read" on an object will be able to see the item in the search results regardless of private_p or published_p etc. flags.
- Items which a user cannot "read" will never be shown in the search result regardless of really_public_p etc. flags.
I do not know to which extent either of these items has been addressed in the latest stable .LRN release. This needs to be discussed and solved prior to implementing a search package.
Future data model changes may possibly break search: the community needs to find a way to propagate package changes quickly to the search module. I am not quite sure yet where to put the helper procs: it may be good to make them part of the respective package, so that it is clear, that a data model change may incur a change to the helper proc as well. On the other hand this is probably a nightmare for the maintainer of the search package.
Use get_url or site_node.url or /o or /object/?
Search will show an object's URL and thus need to retrieve it inexpensively. The site_node.url which operates on the database layer has had pretty bad performance especially with big site node trees, but it is certainly the most convenient to use. get_url works on the TCL layer and has pretty good performance, but I think it is a little less convenient.
We could also use the "redirect trick" which I brought up in the OpenACS forums a few times: we'd need to create a page like /object which takes an object_id and then figures out the object_type and the object's URL and URL stub and redirects to it. The /object approach offers by far the best performance because all you have to do is append two strings. We'd lose the speaking URLs though which I find unacceptable unless both site_node.url and get_url turn out to not scale well enough.
How to store and retrieve an object's Description
Oracle offers a package called CTX_DOC which helps in creating a verbose description with highlighted search terms. From reading the docs it seems that you need to do the following to put CTX_DOC to use: firstly use the CTX_DOC.Filter procedure to populate a table - the "result table" - with documents rendered in HTML or plain text. Then pass in the name of the result table, the object_id, and the query search term into either the Highlight or Markup function of the CTX_DOC package. From reading the docs it wasn't quite clear to me what these two functions do exactly - this needs to be tested in code. Anyway, we'd have to add this "result table".
Alternatively on a past project we just added a description field to a central table - called acs_objects_description - which took up to 4000 bytes and was kept uptodate the packages themselves who were also responsible for synthesizing the content: e. g. the calendar package would not only use the description field, but also add the start and end date to the central description.
I prefer using CTX_DOC.
The Tasks
Necessary Data Model or Core Code Changes
The datastore field has to be added to a central table, it is the place holder column for Intermedia. Candidates would be acs_objects or cr_items or newly created tables like acs_objects_description or site_wide_index.
Suggested Data Model or Core Code Changes
Information not provided by the search SQL query should be retrieved by a calling a TCL proc in a object_type-dependant namespace e.g.:
- get_url object_id - for given object_id retrieve the URL of the object
- get_name object_id - for given object_id retrieve the name of the object
- get_community_name object_id - for given object_id retrieve the name of the community
The namespace convention should be search::${object_type_name}::get_url etc.
How To Index
The .LRN search package will use Intermedia's USER_DATASTORE object which allows synthesizing of documents during indexing with stored procedure. This way we can create indexed text which is composed of information which is stored in many tables and/or rows like author, location of an event, original question etc. This is achieved by the following code snippet (stolen from ACES-Sloan) which creates a preference that holds the name for a synthesizer procedure called sws_user_proc_&1 in the CTXSYS schema - The &1 is replaced at creation time by the database username. - for a user_datastore:
begin ctx_ddl.create_preference('sws_user_datastore', 'user_datastore'); ctx_ddl.set_attribute('sws_user_datastore', 'procedure', 'sws_user_proc_&1'); end; /
sws_user_proc_&1 itself is only a wrapper procedure in the CTXSYS schema and you need to grant the necessary permissions on this procedure to &1. It is called once for each row indexed and it has to provide the document's text in its second parameter.
CREATE OR replace procedure sws_user_proc_&1 ( rid IN ROWID, tlob IN OUT nocopy clob ) AS BEGIN &1..sws_user_datastore_proc(rid, tlob); END; / show errors; grant execute on sws_user_proc_&1 to &1; grant ctxapp to &1;
The datastore proc in the user's schema will look like this:
CREATE OR replace procedure sws_user_datastore_proc ( rid IN ROWID, tlob IN OUT nocopy clob ) IS v_object_id in acs_objects.object_id%TYPE default null, v_object_type in acs_objects.object_type%TYPE default 'acs_object', BEGIN SELECT object_type, object_id INTO v_object_type, v_object_id FROM acs_objects WHERE rid = ROWID; dbms_lob.trim(tlob, 0); -- handle the object types IF v_object_type = 'forums_message' THEN forums_message_sws_helper(rid, tlob, v_object_id); ELSIF v_object_type = 'cal_item' THEN cal_item_sws_helper(rid, tlob, v_object_id); ELSIF ... END IF; END; / show errors;
The procedure that creates the text to be indexed may roughly look like this (code is untested at the time of writing of writing):
CREATE OR replace procedure cal_item_sws_helper (rid IN ROWID, tlob IN OUT nocopy clob, p_object_id IN acs_objects.object_id%TYPE) IS -- declare all the necessary variables BEGIN select to_char(start_date, 'YYYY-MM-DD HH24:MI:SS') as start_date_ansi, to_char(end_date, 'YYYY-MM-DD HH24:MI:SS') as end_date_ansi, nvl(e.name, a.name) as name, nvl(e.description, a.description) as description, cal_item_types.type as item_type, c.calendar_name, o.creation_user INTO v_start_date_ansi, v_end_date_ansi, v_name, v_description, v_item_type, v_calendar_name, v_creation_user from acs_activities a, acs_events e, timespans s, time_intervals t, cal_items, cal_item_types, calendars c, acs_objects o where e.timespan_id = s.timespan_id and s.interval_id = t.interval_id and e.activity_id = a.activity_id and e.event_id = p_object_id and cal_items.cal_item_id= p_object_id and cal_item_types.item_type_id(+)= cal_items.item_type_id and c.calendar_id = on_which_calendar and o.object_id = cal_items.cal_item_id; dbms_lob.writeappend(tlob, length('<creation_user>'), '<creation_user>'); dbms_lob.writeappend(tlob, length(v_creation_user) + 1, v_creation_user || ' '); dbms_lob.writeappend(tlob, length('</creation_user>'), '</creation_user>'); dbms_lob.writeappend(tlob, length('<name>'), '<name>'); dbms_lob.writeappend(tlob, length(v_name) + 1, v_name || ' '); dbms_lob.writeappend(tlob, length('</name>'), '</name>'); IF v_description IS NOT NULL THEN dbms_lob.writeappend(tlob, length(v_description) + 1, v_description || ' '); END IF; -- add the remaining attributes -- .. END; / show errors;
The <creation_user> and <name> tags allow for within searches, searching within such a section is possible if it was created with the following PL/SQL statement:
begin ctx_ddl.create_section_group('searchgroup', 'basic_section_group'); ctx_ddl.add_field_section('searchgroup, 'name, 'name, TRUE); ctx_ddl.add_field_section('searchgroup, 'creation_user', 'creation_user', TRUE); end;
name and creation_user are the two sections that make sense for all or almost all object types - they are the acs_object's name and creation user respectively. Both name and creation_user should also be assigned a higher relevance if they are matching a search term.
The datastore is created and then kept in sync with the data usually with triggers that just set the datastore to an arbitrary value. The Intermedia process then performs the appropriate action: inserting, updating, or deleting.
What to index?
One approach to index and subsequently present content is to index on an object level. Here it is necessary to figure out which table fields holds which particular content attritubes and feed those into search.
Alternatively, we could index a "page": with this approach you need to feed search with the same content you'll show the user on the "html" page - without the navigation overhead.
A few examples will make the differences clear:
- It is possible to either index every message of a forum thread individually or a whole thread as one.
- It is possible to index every blog comment individually or a blog posting with all its comments as one.
The page-level approach may be more familiar to the user whereas the object-level approach is only possible if your search engine understands the data model.
The object-level approach should be the default approach because it offers finer granularity of the search results stored and less initial programming and less subsequent maintenance costs:
Because a mixture of both approaches is possible, the page-level approach should be chosen where a package maintainer sees good reason for using it.
Indexing files
Most Oracle version - notable exceptions: Oracle 8i on Linux - come with INSO filters: utilities that take a file in binary format like word, excel, or pdf and convert them to ASCII, so that they can be indexed by Intermedia.
An alternative open source offering is the "wv" suite which can perform similar tasks on a more limited set of document formats.
Packages that go into Search
Integrating a package means:
- Write the necessary helper proc to synthesize the text to be indexed.
- Write the necessary get_url, get_name etc. procs.
- Write the insert, update, delete triggers on your content tables
These packages have a portlet module and thus need to be integrated into search for .LRN:
- bm-portlet
- calendar-portlet
- curriculum-portlet
- dotlrn-portlet
- faq-portlet
- forums-portlet
- fs-portlet
- news-aggregator-portlet
- news-portlet
- static-portlet
- survey-portlet
- weblogger-portlet
The content repository is OpenACS' abstraction layer for storing and accessing versioned content. Packages which use the content repository include file-storage, news, and forums. Integrating those packages into search must be first priority because we'll cover as many "content objects" as possible with minimal effort.
If we choose to index on an object level, then integrating these packages into search becomes trivial. Otherwise pages have to be built e. g. in the case of forums only initial postings will be indexed with all their corresponding answers.
Creating the Intermedia Index, Future Updates
The initial creation of the Intermedia index should be done over a weekend to minimize the impact on online users: It usually only means a lot of I/O. Future updates of Intermedia are usually rather painless: it is secondary information which can be safely dropped and then recreated, hence data model changes within search are usually not difficult, they just mean some downtime for search.
Converting the User Input
Oracle provided ArsDigita with a PL/SQL function that converts user input into something that Intermedia may like better. Here's the im_convert function from ACES-Sloan:
-- Query to take free text user entered query and frob it into something -- that will make interMedia happy. Provided by Oracle. create or replace function im_convert( query in varchar2 default null ) return varchar2 is i number :=0; len number :=0; char varchar2(1); minusString varchar2(256); plusString varchar2(256); mainString varchar2(256); mainAboutString varchar2(500); finalString varchar2(500); hasMain number :=0; hasPlus number :=0; hasMinus number :=0; token varchar2(256); tokenStart number :=1; tokenFinish number :=0; inPhrase number :=0; inPlus number :=0; inWord number :=0; inMinus number :=0; completePhrase number :=0; completeWord number :=0; code number :=0; begin len := length(query); -- we iterate over the string to find special web operators for i in 1..len loop char := substr(query,i,1); if(char = '"') then if(inPhrase = 0) then inPhrase := 1; tokenStart := i; else inPhrase := 0; completePhrase := 1; tokenFinish := i-1; end if; elsif(char = ' ') then if(inPhrase = 0) then completeWord := 1; tokenFinish := i-1; end if; elsif(char = '+') then inPlus := 1; tokenStart := i+1; elsif((char = '-') and (i = tokenStart)) then inMinus :=1; tokenStart := i+1; end if; if(completeWord=1) then token := '{ '||substr(query,tokenStart,tokenFinish-tokenStart+1)||' }'; if(inPlus=1) then plusString := plusString||','||token||'*10'; hasPlus :=1; elsif(inMinus=1) then minusString := minusString||'OR '||token||' '; hasMinus :=1; else mainString := mainString||' NEAR '||token; mainAboutString := mainAboutString||' '||token; hasMain :=1; end if; tokenStart :=i+1; tokenFinish :=0; inPlus := 0; inMinus :=0; end if; completePhrase := 0; completeWord :=0; end loop; -- find the last token token := '{ '||substr(query,tokenStart,len-tokenStart+1)||' }'; if(inPlus=1) then plusString := plusString||','||token||'*10'; hasPlus :=1; elsif(inMinus=1) then minusString := minusString||'OR '||token||' '; hasMinus :=1; else mainString := mainString||' NEAR '||token; mainAboutString := mainAboutString||' '||token; hasMain :=1; end if; mainString := substr(mainString,6,length(mainString)-5); mainAboutString := replace(mainAboutString,'{',' '); mainAboutString := replace(mainAboutString,'}',' '); mainAboutString := replace(mainAboutString,')',' '); mainAboutString := replace(mainAboutString,'(',' '); plusString := substr(plusString,2,length(plusString)-1); minusString := substr(minusString,4,length(minusString)-4); -- we find the components present and then process them based on the specific combinations code := hasMain*4+hasPlus*2+hasMinus; if(code = 7) then finalString := '('||plusString||','||mainString||'*2.0,about('||mainAboutString||')*0.5) NOT ('||minusString||')'; elsif (code = 6) then finalString := plusString||','||mainString||'*2.0'||',about('||mainAboutString||')*0.5'; elsif (code = 5) then finalString := '('||mainString||',about('||mainAboutString||')) NOT ('||minusString||')'; elsif (code = 4) then finalString := mainString; finalString := replace(finalString,'*1,',NULL); finalString := '('||finalString||')*2.0,about('||mainAboutString||')'; elsif (code = 3) then finalString := '('||plusString||') NOT ('||minusString||')'; elsif (code = 2) then finalString := plusString; elsif (code = 1) then -- not is a binary operator for intermedia text finalString := 'totallyImpossibleString'||' NOT ('||minusString||')'; elsif (code = 0) then finalString := ''; end if; return finalString; end; /
It does the following conversions which look pretty sane:
> select im_convert ('Dirk Gomez') from dual; IM_CONVERT('DIRKGOMEZ') ------------------------------------------------------------------------------------------------------------------------ ( { Dirk } NEAR { Gomez })*2.0,about( Dirk Gomez ) > select im_convert ('Dirk -Gomez') from dual; IM_CONVERT('DIRK-GOMEZ') ------------------------------------------------------------------------------------------------------------------------ ( { Dirk },about( Dirk )) NOT ({ Gomez }) >select im_convert ('"Dirk Gomez"') from dual; IM_CONVERT('"DIRKGOMEZ"') ------------------------------------------------------------------------------------------------------------------------ ( { "Dirk Gomez" })*2.0,about( "Dirk Gomez" ) >select im_convert ('+Dirk +Gomez') from dual; IM_CONVERT('+DIRK+GOMEZ') ------------------------------------------------------------------------------------------------------------------------ { Dirk }*10,{ Gomez }*10
This function needs more documentation - also user documentation - and someone at Sloan may still know why it was implemented that way. Also it probably makes sense to port this function to TCL.
How to query
An example search query could look like this:
select /*+ FIRST_ROWS */ score(10) as the_score, site_node.url(object_id), object_id, html_blurb, object_type, rownum from acs_objects where contains(datastore,'$final_query_string',10)>0 and acs_permission.permission_p(object_id, :user_id, 'read') = 't' and rownum < 10 order by the_score desc
Prior to running this query string, you'd rewrite the query string using the im_convert PL/SQL function. This query then selects the 10 most relevant objects a user has read permission on for this query string. The contains SQL function operates on the "central" datastore field.
Audit search performance
That is an on-going task: every package that is added requires test runs with realistic amounts of data.
Add pagination.
By default, search results should be paginated: you should see 10 results per page and then go through the results via 10 results-per-page pages.
Change the Master Template
The master template should be changed so that every page in the system offers search, either in the whole system or in the current subsite, that's one dropdown-list and a simple text field per page.
Monitoring Searches
We should monitor searches. That will help us creating additional services like Thesauri, finer-grained searches via categories, date intervals etc. And we will be able to determine the quality of search results. Here's what the search_statistics table should hold:
- Who performed the search? (user_id)
- When was the search run? (searched_when)
- How long did it take to answer the question? (time_elapsed)
- What were the search parameters? (search_url_stub)
- Which object_ids were shown for this search? (object_id_1, ...)1
Future Features
- Show "XXX hits found for your search" on the result page. The query for this is very expensive because the database has to run through all objects in questions, check for read permission and then throw away the result and just count how many items there are.
- Make search language-dependant, for example be able to search for content in a particular language, use language-specific thesauri etc.
- Use thesauri
- Categorized search
- Show only my objects
- Use sections with a better im_convert function
- Use themes
- Use fuzzy search
- Offer stored search results as RSS feeds