Erschienen im Splitter - IT-Nachrichten für die Berliner Verwaltung. (4/2003)
.LRN ist ein webbasiertes Open-Source-E-Learning-System, das auf dem ebenfalls frei verfügbaren Community-System OpenACS basiert.
This text was written in December 2004 by dirk@dirkgomez.de for MIT Sloan School. It is copyright MIT Sloan School.
To create a user-friendly, extensible, and scalable search package for .LRN.
There currently is no working search solution for .LRN. This document will describe how to implement a search package for .LRN for Oracle 8i, which is the database system in use in MIT Sloan School. The described approach should be portable to PostgreSQL.
Oracle provides a search engine out of the box which has renamed been renamed a few times: Oracle Text, Oracle Context, Oracle Intermedia. I'll use the term Intermedia in this text.
Intranet search is different from internet search:
Why not use Google (or Convera etc) search? They cost! Google offers an search appliance where the Google crawler requests your data via HTTP requests. But every page in .LRN is personalized and there is object-level permissioning, and the Google crawler will always only see an object's representation tailored to it.
Why not use htdig which is free? Because it runs outside of the database using HTTP to get documents to be indexed - it simply doesn't know about .LRN's permissioning: the search page would list results without considering permissions.
If you run an internet site which is mostly publically available, then you may be much better off with using an Internet search engine by using e.g. Google Free WebSearch or by using htdig and fine-tuning it to your needs.
There is no working search package for .LRN - the main reason is probably the effort of integrating the about 10 packages into a search solution. The site-side-search package in the OpenACS CVS repository has not been touched in about three years and the search package currently only works with PostgreSQL. Porting the search package to Oracle doesn't look too difficult, but the real work is making the .LRN packages work with it.
ACS3 and ACES, the predecessor to .LRN, had working search packages. When ArsDigita made the transition from ACS3 to ACS4, most of the packages were rewritten from scratch. When ArsDigita dropped ACS TCL altogether, many packages remained unfinished and site-wide-search is among them. Furthermore, this code has not been changed for almost three years, thus not reflecting the many data model changes.
I know of three search solutions based on OpenACS4 systems: the ShareNet system which is hopelessly forked and not freely available, the Greenpeace International system which is just hopelessly forked and the Greenpeace Cyberactivist Toolkit search which should be very similar in scope and functionality and currently only works with PostgreSQL.
My experience with the ACS3 version of search was pretty good and about one year ago I did a proof of concept port to a heavily customized version .LRN and achieved results pretty quickly. The approach described in this document will be a mixture of my past experience and the discussions I have conducted in the past months.
This list of requirements was collected by Deirdre Kane from MIT Sloan School.
Other requirements under evaluation
In short, these steps are necessary to create a search package:
Every user-facing page will have a search field and a drop-down list that allows the user to scope the search. The following example assumes you are on a page of the TCL 101 class:
Every search request will be a GET request with the search form fields having the previous request's values as their defaults and the page itself describing the query and then listing the relevant pages so that continual refinement of a query is possible. It is possible to reorder the relevant items by date.
The result page should be similar to Google's result page: The title of the found page as a link, a more verbose blurb, the target URL and the target's name of the community or group:
Clean delete of packages in Elearning Introduction 2004
irc, maybe also somewhere in this forum), that deletion of data in general is a bad idea
http://some.tld/elearning/introduction/2004/forums/message-view?message_id=82352
How do I get OpenFTS/Search working? in OpenACS Hacking 2003
installed at http:// www.thedesignexperience.org/search/ You can search the weblog from the #openacs irc channel there.
http://some.tld/openacs/hacking-2003/calendar/cal-item-view?item_id=24874
etp page links returned by search in OpenACS Hacking 2004
IRC channel and look for DavB. Thanks Jun, that actually looks kinda straightforward - I will
http://some.tld/openacs/hacking-2003/wiki-view?object_id=68487
ns_conn content payload in OpenACS Hacking 2004
IRC channel fairly regularly (openacs on opennet). As far as reconciling AOLserver changes made here
http://some.tld/openacs/hacking-2004/forums/message-view?message_id=68432
(Note that the community name should be a non-obvious link, too, non-obvious to not clutter the user interface.)
The search and the advanced search page are virtual URLs mounted under community-name/search and community-name/advanced-search respectively. They should look like the Google search page and offer "All" - "Documents" - "People" - "News" - "FAQ" as its tabs and "All your communities", "This community", "This community and all its subcommunities" as radio buttons. Advanced search gives a user more options on how to exclude or include file types, limit the search to a particular date interval etc.
I personally have no experience with spellcheckers. Here's an approach which I think is feasible: check the words against a dictionary, then if the search result is well beyond 10 rows and not scoped to the current community, present an alternative search string. I am sure that the spell checker's quality needs to be monitored for quality first and I propose to move te spell checker to a phase II.
Admins need to be able to look at searches that were run by users: search string, search url stub and the objects found by this search run.
This should be presented as a list of URLs that let the admin rerun the search query with the original user's credentials (note: this particular feature is limited to site-wide admins) to allow for tuning the search queries.
Here's an (ugly) example (without the pagination which will be necessary for the real system):
| Search String | In | By | When | Time to run query | Result 1 | Result 2 | Result 3 | Result 4 | Result 5 | Result 6 | Result 7 | Result 8 | Result 9 | Result 10 |
| New Year | All | Dirk Gomez | 31-Dec-2004 | 2 sec | 234 | 345 | ||||||||
| Reflection | Java 2004 | Dirk Gomez | 12-Dec-2004 | 2.1 sec | 3452 | 34523 | 34231 | 12523 |
This is a high-level description of what happens when you run a search query:
For all objects I have read permission on:
The resulting objects need to be ordered then by either relevance or last modification date and rendered as links with a explanatory blurb.
This translates into the following requirements for the OpenACS/.LRN core:
I do not know to which extent either of these items has been addressed in the latest stable .LRN release. This needs to be discussed and solved prior to implementing a search package.
Future data model changes may possibly break search: the community needs to find a way to propagate package changes quickly to the search module. I am not quite sure yet where to put the helper procs: it may be good to make them part of the respective package, so that it is clear, that a data model change may incur a change to the helper proc as well. On the other hand this is probably a nightmare for the maintainer of the search package.
Search will show an object's URL and thus need to retrieve it inexpensively. The site_node.url which operates on the database layer has had pretty bad performance especially with big site node trees, but it is certainly the most convenient to use. get_url works on the TCL layer and has pretty good performance, but I think it is a little less convenient.
We could also use the "redirect trick" which I brought up in the OpenACS forums a few times: we'd need to create a page like /object which takes an object_id and then figures out the object_type and the object's URL and URL stub and redirects to it. The /object approach offers by far the best performance because all you have to do is append two strings. We'd lose the speaking URLs though which I find unacceptable unless both site_node.url and get_url turn out to not scale well enough.
Oracle offers a package called CTX_DOC which helps in creating a verbose description with highlighted search terms. From reading the docs it seems that you need to do the following to put CTX_DOC to use: firstly use the CTX_DOC.Filter procedure to populate a table - the "result table" - with documents rendered in HTML or plain text. Then pass in the name of the result table, the object_id, and the query search term into either the Highlight or Markup function of the CTX_DOC package. From reading the docs it wasn't quite clear to me what these two functions do exactly - this needs to be tested in code. Anyway, we'd have to add this "result table".
Alternatively on a past project we just added a description field to a central table - called acs_objects_description - which took up to 4000 bytes and was kept uptodate the packages themselves who were also responsible for synthesizing the content: e. g. the calendar package would not only use the description field, but also add the start and end date to the central description.
I prefer using CTX_DOC.
The datastore field has to be added to a central table, it is the place holder column for Intermedia. Candidates would be acs_objects or cr_items or newly created tables like acs_objects_description or site_wide_index.
Information not provided by the search SQL query should be retrieved by a calling a TCL proc in a object_type-dependant namespace e.g.:
The namespace convention should be search::${object_type_name}::get_url etc.
The .LRN search package will use Intermedia's USER_DATASTORE object which allows synthesizing of documents during indexing with stored procedure. This way we can create indexed text which is composed of information which is stored in many tables and/or rows like author, location of an event, original question etc. This is achieved by the following code snippet (stolen from ACES-Sloan) which creates a preference that holds the name for a synthesizer procedure called sws_user_proc_&1 in the CTXSYS schema - The &1 is replaced at creation time by the database username. - for a user_datastore:
begin ctx_ddl.create_preference('sws_user_datastore', 'user_datastore'); ctx_ddl.set_attribute('sws_user_datastore', 'procedure', 'sws_user_proc_&1'); end; /
sws_user_proc_&1 itself is only a wrapper procedure in the CTXSYS schema and you need to grant the necessary permissions on this procedure to &1. It is called once for each row indexed and it has to provide the document's text in its second parameter.
CREATE OR replace procedure sws_user_proc_&1 ( rid IN ROWID, tlob IN OUT nocopy clob ) AS BEGIN &1..sws_user_datastore_proc(rid, tlob); END; / show errors; grant execute on sws_user_proc_&1 to &1; grant ctxapp to &1;
The datastore proc in the user's schema will look like this:
CREATE OR replace procedure sws_user_datastore_proc ( rid IN ROWID, tlob IN OUT nocopy clob ) IS v_object_id in acs_objects.object_id%TYPE default null, v_object_type in acs_objects.object_type%TYPE default 'acs_object', BEGIN SELECT object_type, object_id INTO v_object_type, v_object_id FROM acs_objects WHERE rid = ROWID; dbms_lob.trim(tlob, 0); -- handle the object types IF v_object_type = 'forums_message' THEN forums_message_sws_helper(rid, tlob, v_object_id); ELSIF v_object_type = 'cal_item' THEN cal_item_sws_helper(rid, tlob, v_object_id); ELSIF ... END IF; END; / show errors;
The procedure that creates the text to be indexed may roughly look like this (code is untested at the time of writing of writing):
CREATE OR replace procedure cal_item_sws_helper (rid IN ROWID, tlob IN OUT nocopy clob, p_object_id IN acs_objects.object_id%TYPE) IS -- declare all the necessary variables BEGIN select to_char(start_date, 'YYYY-MM-DD HH24:MI:SS') as start_date_ansi, to_char(end_date, 'YYYY-MM-DD HH24:MI:SS') as end_date_ansi, nvl(e.name, a.name) as name, nvl(e.description, a.description) as description, cal_item_types.type as item_type, c.calendar_name, o.creation_user INTO v_start_date_ansi, v_end_date_ansi, v_name, v_description, v_item_type, v_calendar_name, v_creation_user from acs_activities a, acs_events e, timespans s, time_intervals t, cal_items, cal_item_types, calendars c, acs_objects o where e.timespan_id = s.timespan_id and s.interval_id = t.interval_id and e.activity_id = a.activity_id and e.event_id = p_object_id and cal_items.cal_item_id= p_object_id and cal_item_types.item_type_id(+)= cal_items.item_type_id and c.calendar_id = on_which_calendar and o.object_id = cal_items.cal_item_id; dbms_lob.writeappend(tlob, length('<creation_user>'), '<creation_user>'); dbms_lob.writeappend(tlob, length(v_creation_user) + 1, v_creation_user || ' '); dbms_lob.writeappend(tlob, length('</creation_user>'), '</creation_user>'); dbms_lob.writeappend(tlob, length('<name>'), '<name>'); dbms_lob.writeappend(tlob, length(v_name) + 1, v_name || ' '); dbms_lob.writeappend(tlob, length('</name>'), '</name>'); IF v_description IS NOT NULL THEN dbms_lob.writeappend(tlob, length(v_description) + 1, v_description || ' '); END IF; -- add the remaining attributes -- .. END; / show errors;
The <creation_user> and <name> tags allow for within searches, searching within such a section is possible if it was created with the following PL/SQL statement:
begin ctx_ddl.create_section_group('searchgroup', 'basic_section_group'); ctx_ddl.add_field_section('searchgroup, 'name, 'name, TRUE); ctx_ddl.add_field_section('searchgroup, 'creation_user', 'creation_user', TRUE); end;
name and creation_user are the two sections that make sense for all or almost all object types - they are the acs_object's name and creation user respectively. Both name and creation_user should also be assigned a higher relevance if they are matching a search term.
The datastore is created and then kept in sync with the data usually with triggers that just set the datastore to an arbitrary value. The Intermedia process then performs the appropriate action: inserting, updating, or deleting.
One approach to index and subsequently present content is to index on an object level. Here it is necessary to figure out which table fields holds which particular content attritubes and feed those into search.
Alternatively, we could index a "page": with this approach you need to feed search with the same content you'll show the user on the "html" page - without the navigation overhead.
A few examples will make the differences clear:
The page-level approach may be more familiar to the user whereas the object-level approach is only possible if your search engine understands the data model.
The object-level approach should be the default approach because it offers finer granularity of the search results stored and less initial programming and less subsequent maintenance costs:
Because a mixture of both approaches is possible, the page-level approach should be chosen where a package maintainer sees good reason for using it.
Most Oracle version - notable exceptions: Oracle 8i on Linux - come with INSO filters: utilities that take a file in binary format like word, excel, or pdf and convert them to ASCII, so that they can be indexed by Intermedia.
An alternative open source offering is the "wv" suite which can perform similar tasks on a more limited set of document formats.
Integrating a package means:
These packages have a portlet module and thus need to be integrated into search for .LRN:
The content repository is OpenACS' abstraction layer for storing and accessing versioned content. Packages which use the content repository include file-storage, news, and forums. Integrating those packages into search must be first priority because we'll cover as many "content objects" as possible with minimal effort.
If we choose to index on an object level, then integrating these packages into search becomes trivial. Otherwise pages have to be built e. g. in the case of forums only initial postings will be indexed with all their corresponding answers.
The initial creation of the Intermedia index should be done over a weekend to minimize the impact on online users: It usually only means a lot of I/O. Future updates of Intermedia are usually rather painless: it is secondary information which can be safely dropped and then recreated, hence data model changes within search are usually not difficult, they just mean some downtime for search.
Oracle provided ArsDigita with a PL/SQL function that converts user input into something that Intermedia may like better. Here's the im_convert function from ACES-Sloan:
-- Query to take free text user entered query and frob it into something -- that will make interMedia happy. Provided by Oracle. create or replace function im_convert( query in varchar2 default null ) return varchar2 is i number :=0; len number :=0; char varchar2(1); minusString varchar2(256); plusString varchar2(256); mainString varchar2(256); mainAboutString varchar2(500); finalString varchar2(500); hasMain number :=0; hasPlus number :=0; hasMinus number :=0; token varchar2(256); tokenStart number :=1; tokenFinish number :=0; inPhrase number :=0; inPlus number :=0; inWord number :=0; inMinus number :=0; completePhrase number :=0; completeWord number :=0; code number :=0; begin len := length(query); -- we iterate over the string to find special web operators for i in 1..len loop char := substr(query,i,1); if(char = '"') then if(inPhrase = 0) then inPhrase := 1; tokenStart := i; else inPhrase := 0; completePhrase := 1; tokenFinish := i-1; end if; elsif(char = ' ') then if(inPhrase = 0) then completeWord := 1; tokenFinish := i-1; end if; elsif(char = '+') then inPlus := 1; tokenStart := i+1; elsif((char = '-') and (i = tokenStart)) then inMinus :=1; tokenStart := i+1; end if; if(completeWord=1) then token := '{ '||substr(query,tokenStart,tokenFinish-tokenStart+1)||' }'; if(inPlus=1) then plusString := plusString||','||token||'*10'; hasPlus :=1; elsif(inMinus=1) then minusString := minusString||'OR '||token||' '; hasMinus :=1; else mainString := mainString||' NEAR '||token; mainAboutString := mainAboutString||' '||token; hasMain :=1; end if; tokenStart :=i+1; tokenFinish :=0; inPlus := 0; inMinus :=0; end if; completePhrase := 0; completeWord :=0; end loop; -- find the last token token := '{ '||substr(query,tokenStart,len-tokenStart+1)||' }'; if(inPlus=1) then plusString := plusString||','||token||'*10'; hasPlus :=1; elsif(inMinus=1) then minusString := minusString||'OR '||token||' '; hasMinus :=1; else mainString := mainString||' NEAR '||token; mainAboutString := mainAboutString||' '||token; hasMain :=1; end if; mainString := substr(mainString,6,length(mainString)-5); mainAboutString := replace(mainAboutString,'{',' '); mainAboutString := replace(mainAboutString,'}',' '); mainAboutString := replace(mainAboutString,')',' '); mainAboutString := replace(mainAboutString,'(',' '); plusString := substr(plusString,2,length(plusString)-1); minusString := substr(minusString,4,length(minusString)-4); -- we find the components present and then process them based on the specific combinations code := hasMain*4+hasPlus*2+hasMinus; if(code = 7) then finalString := '('||plusString||','||mainString||'*2.0,about('||mainAboutString||')*0.5) NOT ('||minusString||')'; elsif (code = 6) then finalString := plusString||','||mainString||'*2.0'||',about('||mainAboutString||')*0.5'; elsif (code = 5) then finalString := '('||mainString||',about('||mainAboutString||')) NOT ('||minusString||')'; elsif (code = 4) then finalString := mainString; finalString := replace(finalString,'*1,',NULL); finalString := '('||finalString||')*2.0,about('||mainAboutString||')'; elsif (code = 3) then finalString := '('||plusString||') NOT ('||minusString||')'; elsif (code = 2) then finalString := plusString; elsif (code = 1) then -- not is a binary operator for intermedia text finalString := 'totallyImpossibleString'||' NOT ('||minusString||')'; elsif (code = 0) then finalString := ''; end if; return finalString; end; /
It does the following conversions which look pretty sane:
> select im_convert ('Dirk Gomez') from dual; IM_CONVERT('DIRKGOMEZ') ------------------------------------------------------------------------------------------------------------------------ ( { Dirk } NEAR { Gomez })*2.0,about( Dirk Gomez ) > select im_convert ('Dirk -Gomez') from dual; IM_CONVERT('DIRK-GOMEZ') ------------------------------------------------------------------------------------------------------------------------ ( { Dirk },about( Dirk )) NOT ({ Gomez }) >select im_convert ('"Dirk Gomez"') from dual; IM_CONVERT('"DIRKGOMEZ"') ------------------------------------------------------------------------------------------------------------------------ ( { "Dirk Gomez" })*2.0,about( "Dirk Gomez" ) >select im_convert ('+Dirk +Gomez') from dual; IM_CONVERT('+DIRK+GOMEZ') ------------------------------------------------------------------------------------------------------------------------ { Dirk }*10,{ Gomez }*10
This function needs more documentation - also user documentation - and someone at Sloan may still know why it was implemented that way. Also it probably makes sense to port this function to TCL.
An example search query could look like this:
select /*+ FIRST_ROWS */ score(10) as the_score, site_node.url(object_id), object_id, html_blurb, object_type, rownum from acs_objects where contains(datastore,'$final_query_string',10)>0 and acs_permission.permission_p(object_id, :user_id, 'read') = 't' and rownum < 10 order by the_score desc
Prior to running this query string, you'd rewrite the query string using the im_convert PL/SQL function. This query then selects the 10 most relevant objects a user has read permission on for this query string. The contains SQL function operates on the "central" datastore field.
That is an on-going task: every package that is added requires test runs with realistic amounts of data.
By default, search results should be paginated: you should see 10 results per page and then go through the results via 10 results-per-page pages.
The master template should be changed so that every page in the system offers search, either in the whole system or in the current subsite, that's one dropdown-list and a simple text field per page.
We should monitor searches. That will help us creating additional services like Thesauri, finer-grained searches via categories, date intervals etc. And we will be able to determine the quality of search results. Here's what the search_statistics table should hold: