<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6848574</id><updated>2011-09-21T22:34:35.260-05:00</updated><category term='π'/><category term='JRDF'/><category term='class path'/><category term='Lucene'/><category term='description logic'/><category term='SKOS'/><category term='MacBookPro'/><category term='WAR'/><category term='RETE'/><category term='Semantic Web'/><category term='storage'/><category term='disk'/><category term='date'/><category term='Prolog'/><category term='JAAS'/><category term='EulerSharp'/><category term='pool'/><category term='audio'/><category term='tragedy'/><category term='job'/><category term='RSS'/><category term='Terrapass'/><category term='HashSet'/><category term='RDFS'/><category term='performance'/><category term='astrophotography'/><category term='eclipse'/><category term='scalable'/><category term='Talis'/><category term='OCL'/><category term='notebook'/><category term='Web 3.0'/><category term='64 bit'/><category term='InverseFunctionalPredicate'/><category term='New media'/><category term='semantic'/><category term='RDF'/><category term='CSS'/><category term='logic'/><category term='jdk'/><category term='security'/><category term='Web Services'/><category term='AST'/><category term='BioMOBY'/><category term='JavaCC'/><category term='camping'/><category term='filter'/><category term='CodePoint'/><category term='Unicode'/><category term='patent'/><category term='integration'/><category term='iPhone'/><category term='compatibility'/><category term='memory map'/><category term='tracker'/><category term='emissions'/><category term='Muglara'/><category term='interviews'/><category 
term='SeaDragon'/><category term='certificate'/><category term='architecture'/><category term='PhotoSynth'/><category term='Spivack'/><category term='Mulgara'/><category term='rules'/><category term='SPARQL'/><category term='podcast'/><category term='weaknesses'/><category term='trust'/><category term='303'/><category term='Beaver'/><category term='Closed world'/><category term='classpath'/><category term='CST'/><category term='conference'/><category term='Pellet'/><category term='FOAF'/><category term='3G'/><category term='types'/><category term='GFS'/><category term='OS X'/><category term='AVL'/><category term='Jetty'/><category term='Dopplr'/><category term='IRC'/><category term='Currying'/><category term='layout'/><category term='OWL'/><category term='SSL'/><category term='code'/><category term='inferencing'/><category term='piano'/><category term='open world'/><category term='social network'/><category term='transitive'/><category term='Graphs'/><category term='idiot'/><category term='REST'/><category term='RLog'/><category term='random'/><category term='lunar'/><category term='TQL'/><category term='XA2'/><category term='JFlex'/><category term='indexing'/><category term='RNG'/><category term='Feynmann'/><category term='whuffie'/><category term='Java'/><category term='Web 2.0'/><category term='blog'/><category term='strengths'/><category term='Google'/><category term='Racer'/><category term='JDBC'/><category term='JAR'/><category term='Open Source'/><category term='time'/><category term='OPTIONAL'/><category term='GOPHER'/><category term='pragmatic'/><category term='Ruby'/><category term='WebOS'/><category term='topaz'/><category term='search'/><category term='EDGE'/><category term='bitchun society'/><category term='numbers'/><category term='TED'/><category term='WiFi'/><title type='text'>Working notes</title><subtitle type='html'>Day to day notes on what I'm doing at work</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' 
href='http://gearon.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default?start-index=101&amp;max-results=100'/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>392</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6848574.post-4564481076620223420</id><published>2011-09-04T12:24:00.003-05:00</published><updated>2011-09-08T08:47:31.086-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;SPARQL JSON&lt;/h3&gt; After commenting the other day that Fuseki was ignoring my request for a JSON response, I was asked to submit a bug report. It's easy to cast unsubstantiated stones in a blog, but a bug report is a different story, so I went back to confirm everything. In the process, I looked at the SPARQL 1.1 Query Results JSON Format spec (I can't find a public copy of it, so no link, sorry. UPDATE: &lt;a href="http://www.w3.org/2009/sparql/docs/json-results/json-results-lc.html"&gt;Found it&lt;/a&gt;.) and was chagrined to discover that it has its own MIME type of "application/sparql-results+json". I had been using the JSON type of "application/json", and this indeed does not work, but the corrected type does. 
I don't think it's a good idea to ignore "application/json" so I'll report that, but strictly speaking it's correct (at least, I think so. As Andy said to me in a back channel, I don't really know what the + is supposed to mean in the subtype). So Fuseki got it right. Sorry.&lt;br /&gt;&lt;br /&gt;When I finally get around to implementing this for Mulgara I'll try to handle both. Which reminds me... I'd better get some more SPARQL 1.1 implemented. My day job at Revelytix needs me to do some Mulgara work for a short while, so that may help get the ball rolling for me.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;separate&lt;/h3&gt; I commented the other day that I wanted a Clojure function that takes a predicate and a seq, and returns two seqs: one that matches the predicate and the other that doesn't match. I was thinking I'd build it with a &lt;code&gt;loop&lt;/code&gt; construct, to avoid recursing through the seq twice.&lt;br /&gt;&lt;br /&gt;The next day, Gary (at work) suggested the &lt;a href="http://clojuredocs.org/clojure_contrib/clojure.contrib.seq-utils/separate"&gt;separate&lt;/a&gt; function from &lt;a href="http://clojuredocs.org/clojure_contrib"&gt;Clojure Contrib&lt;/a&gt;. I know there's some good stuff in contrib, but unfortunately I've never taken the time to fully audit it.&lt;br /&gt;&lt;br /&gt;The implementation of this function is obvious:&lt;pre&gt;&lt;code&gt;(defn separate [f s]&lt;br /&gt;  [(filter f s) (filter (complement f) s)])&lt;/code&gt;&lt;/pre&gt;I was disappointed to learn that this function iterates twice, but Gary pointed out that there had been a discussion on exactly this point, and the counter argument is that this is the only way to build the results lazily. That's a reasonable point, and it is usually one of the top considerations for Clojure implementations. 
I don't actually have cause for complaint anyway, since the seqs I'm using are always small (being built out of query structures).&lt;br /&gt;&lt;br /&gt;This code was also a good reminder that &lt;code&gt;(complement x)&lt;/code&gt; offers a concise replacement for the alternative code:&lt;pre&gt;&lt;code&gt;  #(not (x %))&lt;/code&gt;&lt;/pre&gt;By extension, it's a reminder to brush up on my idiomatic Clojure. I should finish the book &lt;a href="http://joyofclojure.com/"&gt;The Joy of Clojure&lt;/a&gt; (which is a thoroughly enjoyable read, by the way).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-4564481076620223420?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/4564481076620223420/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=4564481076620223420' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4564481076620223420'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4564481076620223420'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2011/09/sparql-json-after-commenting-other-day.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-5406101921787200874</id><published>2011-09-01T20:45:00.003-05:00</published><updated>2011-09-02T01:30:04.656-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;ASTs&lt;/h3&gt; &lt;br 
/&gt;After yesterday's post, I noticed that a number of references came to me via Twitter (using a &lt;a href="http://t.co/"&gt;t.co&lt;/a&gt; URL). After looking for it, I realized that I'm linked to from the &lt;a href="http://planet.clojure.in/"&gt;Planet Clojure&lt;/a&gt; page. I'm not sure why, though I guess it's because I'm working for &lt;a href="http://revelytix.com/"&gt;Revelytix&lt;/a&gt;, and we're one of the larger Clojure shops around. Only my post wasn't about Clojure - it was about JavaScript. So now I feel obligated to write something about Clojure. Besides, I'm out of practice with my writing, so it would do me good.&lt;br /&gt;&lt;br /&gt;So the project I spend most of my time on (for the moment) is called &lt;a href="http://revelytix.com/content/rex-ea"&gt;Rex&lt;/a&gt;. It's a &lt;a href="http://www.w3.org/standards/techs/rif#w3c_all"&gt;RIF&lt;/a&gt; based rules engine written entirely in Clojure. It does lots of things, but the basic workflow for running rules is:&lt;ol&gt;&lt;li&gt;Parse the rules out of the rule file and into a &lt;a href="http://en.wikipedia.org/wiki/Concrete_syntax_tree"&gt;Concrete Syntax Tree&lt;/a&gt;.  We can parse the RIF XML format, the RIF Presentation Syntax, and the &lt;a href="http://www.w3.org/TR/2011/NOTE-rif-in-rdf-20110512/"&gt;RIF RDF format&lt;/a&gt;, and we plan on doing others.&lt;/li&gt;&lt;li&gt;Run a set of transformations on the CST to convert it into an appropriate &lt;a href="http://en.wikipedia.org/wiki/Abstract_syntax_tree"&gt;Abstract Syntax Tree&lt;/a&gt; (AST). This can involve some tricky analysis, particularly for aggregates (which are a RIF extension).&lt;/li&gt;&lt;li&gt;Transformation of the AST into &lt;a href="http://www.w3.org/TR/sparql11-query/"&gt;SPARQL 1.1 query&lt;/a&gt; fragments.&lt;/li&gt;&lt;li&gt;Execute the engine, by processing the rules to generate more data until the capacity to generate new data has been exhausted. (My... 
&lt;em&gt;that's&lt;/em&gt; a lot of hand waving).&lt;/li&gt;&lt;/ol&gt;It's step 2 that I was interested in today.&lt;br /&gt;&lt;br /&gt;The rest of this post is about how Rex processes a CST into an AST using Clojure, and about some subsequent refactoring that went on. You have been warned...&lt;br /&gt;&lt;br /&gt;When I first wrote the CST to AST transformation step, it was to do a reasonably straightforward analysis of the CST. Most importantly, I needed to see the structure of the rule so that I could see what kind of data it depends on, thereby figuring out which other rules might need to be run once a given rule was executed. Since the AST is a tree structure, this made for relatively straightforward recursive functions.&lt;br /&gt;&lt;br /&gt;Next, I had to start identifying some CST structures that needed to be changed in the AST. This is where it got more interesting. Again, I had to write recursive functions, but instead of simply analyzing the data, it had to be changed. It turns out that this is handled easily by having a different function for each type of node in the tree. In the normal case the function then recurses on all of its children, and constructs an identical node type using the new children. The leaf nodes then just return themselves. The "different function" for each type is actually accessed with the same name, but dispatches on the node type. In Java that would need a visitor pattern, or perhaps a map of types to functors, but in Clojure it's handled trivially with &lt;a href="http://clojure.org/multimethods"&gt;multimethods&lt;/a&gt; or &lt;a href="http://clojure.org/Protocols"&gt;protocols&lt;/a&gt;. Unfortunately, the online resources for describing the multi-dispatch aspects of Clojure protocols are not clear, but Luke VanderHart and Stuart Sierra's book &lt;a href="http://www.apress.com/9781430272311"&gt;Practical Clojure&lt;/a&gt; covers it nicely.
&lt;br /&gt;&lt;br /&gt;As an abstract example of what I mean, say I have an AST consisting of Conjunctions, Disjunctions and leaf nodes. Both Conjunctions and Disjunctions have a single field that contains a seq of the child nodes. These are declared with:&lt;pre&gt;&lt;code&gt;(defrecord Conjunction [children])&lt;br /&gt;(defrecord Disjunction [children])&lt;/code&gt;&lt;/pre&gt;The transformation function can be called &lt;code&gt;tx&lt;/code&gt;, and I'll define it with multiple dispatch on the node type using multimethods:&lt;pre&gt;&lt;code&gt;(defmulti tx  class)&lt;br /&gt;&lt;br /&gt;(defmethod tx Disjunction [{c :children}]&lt;br /&gt;    (Disjunction. (map tx c)))&lt;br /&gt;&lt;br /&gt;(defmethod tx Conjunction [{c :children}]&lt;br /&gt;    (Conjunction. (map tx c)))&lt;br /&gt;&lt;br /&gt;(defmethod tx :default [n] n)&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;This will reconstruct an identical tree to the original, though with all new nodes except the leaves. Now, the duplication in the Disjunction and Conjunction methods should be ringing alarm bells, but in real code the functions have more specific jobs to do. For instance, the Conjunction may want to group terms that meet a certain condition (call the test "&lt;code&gt;special-type?&lt;/code&gt;") into a new node type (call it "&lt;code&gt;Foo&lt;/code&gt;"):&lt;pre&gt;&lt;code&gt;;; A different definition for tx on Conjunction&lt;br /&gt;(defmethod tx Conjunction [{c :children}]&lt;br /&gt;  (let [new-children (map tx c)&lt;br /&gt;        special-nodes (filter special-type? new-children)&lt;br /&gt;        other-nodes (filter (comp not special-type?) new-children)]&lt;br /&gt;    (Conjunction. (conj other-nodes (Foo. special-nodes)))))&lt;/code&gt;&lt;/pre&gt;Hmmm... while writing that example I realized that I regularly run into the pattern of filtering out everything that meets a test, and everything that fails the test. 
Other than having to test everything twice, it seems too verbose. What I need is a function that will take a seq and a predicate and return a tuple containing a seq of everything that matches the predicate, and a second seq of everything that fails the predicate. I'm not seeing anything like that right now, so that may be a job for the morning.&lt;br /&gt;&lt;br /&gt;I should note that there is no need for a function to return the same type that came into it. There are several occasions where Rex returns a different type. For example, a conjunction between a Basic Graph Pattern (BGP) and the negation of another BGP becomes a MINUS operation between the two BGPs (a Basic Graph Pattern comes from SPARQL and is just a triple pattern for matching against the subject/predicate/object of a triple in an RDF store).&lt;br /&gt;&lt;br /&gt;Overall, this approach works very well for transforming the CST into the full AST. As I've needed to incorporate more features and optimizations over time, I found that I had two choices. Either I could expand the complexity of the operation for every type in the tree processing code, or I could perform different types of analysis on the entire tree, one after another. The latter makes the process far easier to understand, making the design more robust and debugging easier, so that's how Rex has been written. It makes analysis slightly slower, but analysis is orders of magnitude faster than actually running the rules, so that is not a consideration.
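&lt;br /&gt;&lt;br /&gt;As a rough sketch of the predicate-splitting function wished for a few paragraphs up (this is not code from Rex, and the name &lt;code&gt;split-by&lt;/code&gt; is made up for illustration), &lt;code&gt;group-by&lt;/code&gt; from clojure.core can partition a seq while testing each element only once. The trade-off, as far as I can tell, is that the result is built eagerly rather than lazily:&lt;pre&gt;&lt;code&gt;;; Hypothetical helper, not part of Rex or clojure.core.&lt;br /&gt;(defn split-by&lt;br /&gt;  "Returns [matches non-matches] for pred over coll, testing each item once."&lt;br /&gt;  [pred coll]&lt;br /&gt;  ;; group-by walks coll a single time; boolean normalizes&lt;br /&gt;  ;; truthy/falsey results into the keys true and false&lt;br /&gt;  (let [groups (group-by (comp boolean pred) coll)]&lt;br /&gt;    [(get groups true []) (get groups false [])]))&lt;/code&gt;&lt;/pre&gt;For example, &lt;code&gt;(split-by even? [1 2 3 4 5])&lt;/code&gt; returns &lt;code&gt;[[2 4] [1 3 5]]&lt;/code&gt;, and the Conjunction method above could destructure the pair with &lt;code&gt;(let [[special-nodes other-nodes] (split-by special-type? new-children)] ...)&lt;/code&gt;.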
&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Threading&lt;/h3&gt; The first time through, analysis was simple:&lt;pre&gt;&lt;code&gt;(defn analyse [rules]&lt;br /&gt;  (process rules))&lt;/code&gt;&lt;/pre&gt;Adding a new processing step was similarly easy:&lt;pre&gt;&lt;code&gt;(defn analyse [rules]&lt;br /&gt;  (process2 (process1 rules)))&lt;/code&gt;&lt;/pre&gt;But once a third, then a fourth step appeared, it became obvious that I needed to use the &lt;a href="http://clojure.github.com/clojure/clojure.core-api.html#clojure.core/-&gt;"&gt;Clojure threading macro&lt;/a&gt;:&lt;pre&gt;&lt;code&gt;(defn analyse [rules]&lt;br /&gt;  (-&gt; rules&lt;br /&gt;      process1&lt;br /&gt;      process2&lt;br /&gt;      process3&lt;br /&gt;      process4))&lt;/code&gt;&lt;/pre&gt;So now it's starting to look nice. Each step in the analysis process is a single function name implemented for various types. These names are then provided in a list of things to be applied to the rules, via the threading macro. There's a little more complexity (one of the steps picks up references to parts of the tree, and since each stage &lt;em&gt;changes&lt;/em&gt; the tree, then these references will be pointing to an old and unused version of the tree. So that step has to be last), but it paints the general picture.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Partial&lt;/h3&gt; Another thing that I've glossed over is that each rule is actually a record that contains the AST as one of its members. The rules themselves are a seq which is in turn a member of a record containing a "Rule Program". So each process step actually ended up being like this:&lt;pre&gt;&lt;code&gt;(defn process1 [rule-program]&lt;br /&gt;  (letfn [(process-rule [rule] (assoc rule :body (tx (:body rule))))]&lt;br /&gt;    (assoc rule-program :rules (map process-rule (:rules rule-prog)))))&lt;/code&gt;&lt;/pre&gt;Did you follow that?&lt;br /&gt;&lt;br /&gt;The bottom line is replacing the :rules field with a new one. 
It's mapping &lt;code&gt;process-rule&lt;/code&gt; onto the seq of rules, and storing the result in a new rule program, which is what gets returned. The &lt;code&gt;process-rule&lt;/code&gt; function is defined locally as associating the :body of a rule with the &lt;code&gt;tx&lt;/code&gt; function applied to the existing body. This creates a new rule that has had &lt;code&gt;tx&lt;/code&gt; applied to it.&lt;br /&gt;&lt;br /&gt;This all looked fine to start with. A new rule program is created by transforming all the rules in the old program. A transformed rule is created by transforming the AST (the :body) in the old rule. But after the third analysis it became obvious that there was duplication going on. In fact, it was all being duplicated except for the &lt;code&gt;tx&lt;/code&gt; step. But that was buried deep in the function. What was the best way to pull it out?&lt;br /&gt;&lt;br /&gt;To start with, the embedded &lt;code&gt;process-rule&lt;/code&gt; function came out. After all, it was just inside to hide it, and not because it had to pick up a closure anywhere. This function then accepts the kind of transformation that it needs to do as a parameter:&lt;pre&gt;&lt;code&gt;(defn convert-rule&lt;br /&gt;  [convert-fn rule]&lt;br /&gt;  (assoc rule :body (convert-fn (:body rule))))&lt;/code&gt;&lt;/pre&gt;Next, we want a general function for converting all the rules, which can accept a conversion function to pass on to &lt;code&gt;convert-rule&lt;/code&gt;. It does all the rules, so I just pluralized the name:&lt;pre&gt;&lt;code&gt;(defn convert-rules&lt;br /&gt;  [conversion-fn rule-prog]&lt;br /&gt;  (assoc rule-prog :rules (map #(convert-rule conversion-fn %) (:rules rule-prog))))&lt;/code&gt;&lt;/pre&gt;That works, but now the function getting mapped is looking messy (and messy leads to mistakes). I could improve it by defining a new function, but I just factored a function out of this function. 
Fortunately, there is a simpler way to define this new function. It's a "partial" application of &lt;code&gt;convert-rule&lt;/code&gt;. But I'll still move it into a &lt;code&gt;let&lt;/code&gt; block for clarity:&lt;pre&gt;&lt;code&gt;(defn convert-rules&lt;br /&gt;  [conversion-fn rule-prog]&lt;br /&gt;  (let [conv-rule (partial convert-rule conversion-fn)]&lt;br /&gt;    (assoc rule-prog :rules (map conv-rule (:rules rule-prog)))))&lt;/code&gt;&lt;/pre&gt;So now my original &lt;code&gt;process1&lt;/code&gt; definition becomes a simple:&lt;pre&gt;&lt;code&gt;(defn process1 [rule-program]&lt;br /&gt;  (convert-rules tx rule-program))&lt;/code&gt;&lt;/pre&gt;That works, but the &lt;code&gt;rule-program&lt;/code&gt; parameter is just sticking out like a sore thumb. Fortunately, we've already seen how to fix this:&lt;pre&gt;&lt;code&gt;(def process1 (partial convert-rules tx))&lt;/code&gt;&lt;/pre&gt;Indeed, all of the processing functions can be written this way:&lt;pre&gt;&lt;code&gt;(def process1 (partial convert-rules tx))&lt;br /&gt;(def process2 (partial convert-rules tx2))&lt;br /&gt;(def process3 (partial convert-rules tx3))&lt;br /&gt;(def process4 (partial convert-rules tx4))&lt;/code&gt;&lt;/pre&gt;It may seem strange that a function is now being defined with a &lt;code&gt;def&lt;/code&gt; instead of a &lt;code&gt;defn&lt;/code&gt;, but it's really not an issue. It's worth remembering that &lt;code&gt;defn&lt;/code&gt; is just a macro that uses &lt;code&gt;def&lt;/code&gt; to attach a symbol to a call to &lt;code&gt;(fn ...)&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Documenting&lt;/h3&gt; Of course, my functions don't appear as sterile as what I've been typing here. I do indeed use documentation. 
That means that the &lt;code&gt;process1&lt;/code&gt; function would look more like:&lt;pre&gt;&lt;code&gt;(defn process1&lt;br /&gt;  "The documentation for process1"&lt;br /&gt;  [rule-program]&lt;br /&gt;  (convert-rules tx rule-program))&lt;/code&gt;&lt;/pre&gt;One of the nice features of the &lt;code&gt;defn&lt;/code&gt; macro is the ease of writing documentation. This isn't as trivial with &lt;code&gt;def&lt;/code&gt; since it's a special form, rather than a macro, but it's still not too hard to do. You just need to attach some metadata to the object, with a key of &lt;code&gt;:doc&lt;/code&gt;. Unfortunately, I couldn't remember the exact syntax for this today, and rather than go trawling through books or existing code, &lt;a href="http://tech.puredanger.com/"&gt;Alex&lt;/a&gt; was kind enough to remind me:&lt;pre&gt;&lt;code&gt;(def ^{:doc "The documentation for process1"}&lt;br /&gt;  process1 (partial convert-rules tx))&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Framework&lt;/h3&gt; The upshot of all of this is a simpler framework for adding new steps to the existing analysis system. Adding a new analysis step just needs a new function thrown into the thread. I could put a call to &lt;code&gt;(partial convert-rules ...)&lt;/code&gt; directly into the thread, but by using a &lt;code&gt;def&lt;/code&gt; I get to name and document that step of the analysis. The only real work of the analysis is then done in the single multi-dispatch function, which is just as it should be.&lt;br /&gt;&lt;br /&gt;So right now my evening "hobby" has been JavaScript, while my day job is Clojure. I have to tell you, the day job is &lt;em&gt;much&lt;/em&gt; more fun. 
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-5406101921787200874?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/5406101921787200874/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=5406101921787200874' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/5406101921787200874'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/5406101921787200874'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2011/09/asts-after-yesterdays-post-i-noticed.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-7620051540900290700</id><published>2011-09-01T10:30:00.003-05:00</published><updated>2011-09-01T11:17:43.528-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Robusta&lt;/h3&gt;I was just posting this into Google+ when I realized I was typing more than I'd intended. It read more like a short blog post. Which reminded me.... "Oh yeah. I have a blog out there somewhere! Maybe I should write this in there." &lt;br /&gt;&lt;br /&gt;After spending a lot of recent nights on JavaScript I think I'm starting to get a feel for it. 
I started with Douglas Crockford's "&lt;a href="http://oreilly.com/catalog/9780596517748"&gt;JavaScript: The Good Parts&lt;/a&gt;", and am now plowing through David Flanagan's "&lt;a href="http://oreilly.com/catalog/9780596805524/"&gt;JavaScript: The Definitive Guide&lt;/a&gt;". (I met David when he came to Brisbane to run a short course on Java programming back in 1996. Nice guy. He said that he'd preferred to have called his book "Java in a Demitasse", but O'Reilly wanted it in their "Nutshell" series).&lt;br /&gt;&lt;br /&gt;After using languages like Ruby, Erlang, Scala and Clojure, I'm finding it a little frustrating, but it's hard to argue with the ubiquity of the platform. Fortunately it has closures and first class functions, though the variable scoping is bizarre. I've been enjoying the callback approach to asynchronous function calls, though the syntax tends to make the resulting code confusing to read. I'm mostly sticking to Crockford's subset of the language, and this does make things a little more sensible. Flanagan's book has been filling in the gaps for me, but it's especially useful for documenting libraries like HTML5 Canvas and the File API.&lt;br /&gt;&lt;br /&gt;As with all first attempts at using a language, mine is a little messy and inconsistent. However, it can't hurt to put it out there. I've built a simple tool for working with SPARQL endpoints (specifically aimed at Jena/Fuseki, but it should mostly work on others too). The important piece is the SPARQL connection object that it comes with (found in sparql.js). I'm hoping that this will be a useful object for more general application. It can even convert XML responses into a SPARQL-JSON structure (I wrote this after discovering Fuseki was ignoring my &lt;em&gt;Content-type&lt;/em&gt; settings on queries).&lt;br /&gt;&lt;br /&gt;Unfortunately, it's missing one important piece, which is the ability to upload a file from a browser. 
In general, it's possible to upload a file using a form submission, but that encodes all the parameters into the request body, and the SPARQL HTTP protocol requires that the graph URI appear as a parameter in the URL of the request. In an attempt to get the &lt;strong&gt;graph&lt;/strong&gt; parameter out of the body and into the URL, I even tried dynamically constructing the URL for the form submission, but the browser "cleverly" saw what I was doing and pushed the parameter back into the body. So I can't use form submission. Alternatively, JavaScript makes it easy to submit an HTTP POST operation with everything set the way you want it. However, the only way to read a local file is through the form submission process, which means I still can't do a file upload. In the end, I just used Fuseki's file upload servlet, but this has the problem of being non-standard, and it also doesn't like URIs that aren't http (yes Andy, that's why I asked you about this restriction - though I'd already run into it at work).&lt;br /&gt;&lt;br /&gt;The resulting system needed a name, so I called it &lt;em&gt;Robusta&lt;/em&gt;. Everyone seems to be enamored with Arabica beans, but no one ever talks about Robusta beans. Don't get me wrong.... if I had to choose between the two I'd definitely go for the Arabica. But by blending in a small portion of Robusta beans you add a richness to the flavor of your coffee (it's also used as a cheap "filler" and promotes crema in espresso, but I like the flavor aspect). At the time, I came up with the name because I really needed some caffeine, and Arabica was too obvious. But in retrospect, I like the name, since adding a bit of SPARQL to your scripts can really enhance a system (OK, that's tacky. I probably need another one of those coffees). It wasn't until after I'd had some coffee that I thought to look for other projects with the same name, but by that point it was already up there.
&lt;br /&gt;&lt;br /&gt;Robusta is still a work in progress, and it's mostly a late night project that fits around everything else that I'm doing. But I'm using it at work, and it makes my life easier. I'd like to know if anyone has ideas for it, or can point out errors, inefficiencies, or potential improvements. It's posted at GitHub as a part of the &lt;a href="https://github.com/revelytix"&gt;Revelytix project group&lt;/a&gt;, at:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&lt;a href="https://github.com/revelytix/robusta"&gt;http://github.com/revelytix/robusta&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-7620051540900290700?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/7620051540900290700/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=7620051540900290700' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/7620051540900290700'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/7620051540900290700'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2011/09/robusta-i-was-just-posting-this-into.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-4297271368044649062</id><published>2010-10-21T23:24:00.004-05:00</published><updated>2010-10-22T10:40:21.667-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Clojure 
Experiences&lt;/h3&gt; I've been learning Clojure recently, as my new job uses it frequently. I'm happy with this (it has me attending &lt;a href="http://clojure-conj.org/"&gt;(first clojure-conj)&lt;/a&gt; right now), but I'm still learning a lot about the language. I'm comfortable with functional programming, and of course, there's next to nothing to learn about syntax and grammar, but there is a steep learning curve to effectively use the libraries. Not only are there a lot of them, but they are not really well documented.&lt;br /&gt;&lt;br /&gt;OO languages tell you which functions belong to a class, but since Clojure functions don't really belong to anything it can sometimes be hard to find everything that is relevant to the structures you are dealing with. I've had a few people tell me that there are good websites out there that provide the sort of documentation I'd like, so I'll go looking for them soon.&lt;br /&gt;&lt;br /&gt;Meanwhile, I find myself looking at the libraries all the time. Other than the general principle of code reuse, the dominant paradigm in Clojure is to use the core libraries whenever possible.  This is because those libraries are likely to be much more efficient than anything you can come up with on your own. They're also "lazy", meaning that the functions do very little work themselves, with all of the work being put off until you actually need it. Laziness is a good thing, and should show up in Clojure code as much as possible. Using the core libraries is the best way to accomplish that.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Clojure Maps&lt;/h3&gt;&lt;br /&gt;Part of my current task is to take a data structure that represents an RDF graph, and process it. The structure that I have now maps resource identifiers (sometimes URIs, but mostly internal blank node IDs) to structures representing all of the property/values associated with that resource. 
Unfortunately, the values are simple identifiers too, so to find their data I have to go back to the structure and get the resources for that object. This has to be repeated until everything found has no properties or is a literal (which has no properties). Also, the anonymous resources in RDF lists are showing up as data structures, which makes them difficult to work with as well.&lt;br /&gt;&lt;br /&gt;Following this kind of structure is easy enough in an imperative language, though quite tedious. However, in Clojure it is simply too inelegant to work with. So I want some way to map this to a nice nested structure that represents the nested data from the RDF graph (fortunately, this kind of graph is guaranteed to not have any cycles). In particular, I'd like this new structure to be lazy wherever I can get it.&lt;br /&gt;&lt;br /&gt;The basic approach is simple: get the property/values for a given resource. Then convert the values (resources) that aren't literals to the actual data that they represent. This is done by getting the property/values of those resources, which is what I started with, so now I know I need recursion.&lt;br /&gt;&lt;br /&gt;Now there are a few caveats. First off, if a resource is actually the head of a list, I don't want to just get its properties. Instead, I want to traverse the list (lazily, of course) and get all the values attached to the nodes. That's easy enough, and is done with a function to identify resources that are lists, and another one to construct the list (a recursive function that uses &lt;a href="http://clojure.github.com/clojure/clojure.core-api.html#clojure.core/lazy-seq"&gt;lazy-seq&lt;/a&gt;). Second, there is one property/value that I want to preserve, and that is the &lt;code&gt;:id&lt;/code&gt; property. 
This property shows the URI or identifier for the original node, so the value here needs to be kept, and not expanded on (it would recurse indefinitely otherwise).&lt;br /&gt;&lt;br /&gt;After identifying these requirements, I figured I needed some way to convert all the key/value elements of a map into a new map. This sounds very much like converting all the elements in a sequence (a "seq") into a new sequence, a task which is done using the &lt;a href="http://clojure.github.com/clojure/clojure.core-api.html#clojure.core/map"&gt;map&lt;/a&gt; function. Is there a similar operation to be applied to a map? (The noun and the verb get confusing here, sorry).&lt;br /&gt;&lt;br /&gt;It turns out that Alex Miller had been dealing with exactly this problem just recently, and has built a new function called &lt;a href="http://tech.puredanger.com/2010/09/24/meet-my-little-friend-mapmap/"&gt;mapmap&lt;/a&gt; to do this operation. However, I wasn't aware of this at the time, so I had to pursue my own solution. I'm reasonably happy with my answer, and I learnt a bit in the process, so I'm glad I solved it myself, though it would have been nice to get that sleep back.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Morphing a Map&lt;/h3&gt;&lt;br /&gt;There have been a few times when I've needed to create a new map, and the first time I needed to I went to clojure.core to see what kind of map-creating functions were available to me. At that point I found &lt;a href="http://clojure.github.com/clojure/clojure.core-api.html#clojure.core/hash-map"&gt;hash-map&lt;/a&gt;, which takes a sequence containing an even number of items, and converts it into a map using the even-numbered offsets as the keys and the odd-numbered offsets as the values (with the first offset being "0"). So I could create a map of numbers to their labels using the following:&lt;pre&gt;&lt;code&gt;  (hash-map 1 "one" 2 "two" 3 "three")&lt;/code&gt;&lt;/pre&gt;So was there a way to create this list from my existing map? 
(Well, yes, but was it &lt;em&gt;easy&lt;/em&gt;?)&lt;br /&gt;&lt;br /&gt;I wanted to iterate over my existing map "entries" and create new entries. If you treat a map as a seq, then it appears as a sequence of 2-element vectors, with the first element being the key and the second element being the value. So if I want to morph the values in a map called "mymap" with a function called "morph", I could convert these pairs into new pairs with:&lt;pre&gt;&lt;code&gt;  (map (fn [[k v]] [k (morph v)]) mymap)&lt;/code&gt;&lt;/pre&gt;For anyone unfamiliar with the syntax, my little anonymous function in the middle there is taking as its single parameter a 2-element vector, and destructuring it into the values k and v, then returning a new 2-element vector of k and (morph v). So let's apply this to a map of numbers to their names (incidentally, I don't code anything like this, but I do blog this way)....&lt;pre&gt;&lt;code&gt;  (def mymap {1 "one", 2 "two", 3 "three"})&lt;br /&gt;  (def morph #(.toUpperCase %))&lt;br /&gt;  (map (fn [[k v]] [k (morph v)]) mymap)&lt;/code&gt;&lt;/pre&gt;The result is:&lt;pre&gt;&lt;code&gt;([1 "ONE"] [2 "TWO"] [3 "THREE"])&lt;/code&gt;&lt;/pre&gt;Looks good, but there is one problem. This result is a sequence of pairs, while &lt;code&gt;hash-map&lt;/code&gt; just takes a sequence. That's OK, we can just call &lt;code&gt;flatten&lt;/code&gt; before passing it in:&lt;pre&gt;&lt;code&gt;  (apply hash-map (flatten (map (fn [[k v]] [k (morph v)]) mymap)))&lt;br /&gt;{1 "ONE", 2 "TWO", 3 "THREE"}&lt;/code&gt;&lt;/pre&gt;The "&lt;code&gt;apply&lt;/code&gt;" was needed because &lt;code&gt;hash-map&lt;/code&gt; needed lots of arguments (6 in this case) while the result of &lt;code&gt;flatten&lt;/code&gt; would have been just 1 argument (a list).&lt;br /&gt;&lt;br /&gt;This is looking good. It's what I've done in the past, and it's hardly worthy of a blog post. But this time I had a new twist. 
My "values" were also structures, and often these structures were seqs as well. For instance, say I have a map of numbers to names in a couple of languages:&lt;pre&gt;&lt;code&gt;(def langmap {1 ["one" "uno"], 2 ["two" "due"], 3 ["three" "tre"]})&lt;/code&gt;&lt;/pre&gt;My morph function needs to change to accept a seq of strings instead of just a string, but that's trivial. So then the map step looks good, giving an output of:&lt;pre&gt;&lt;code&gt;([1 ("ONE" "UNO")] [2 ("TWO" "DUE")] [3 ("THREE" "TRE")])&lt;/code&gt;&lt;/pre&gt;But this fails to be loaded into a hash-map, with the error:&lt;pre&gt;&lt;code&gt;java.lang.IllegalArgumentException: No value supplied for key: TRE (NO_SOURCE_FILE:0)&lt;/code&gt;&lt;/pre&gt;This is exactly what bit me at 1am.&lt;br /&gt;&lt;br /&gt;The problem is simple. &lt;a href="http://clojure.github.com/clojure/clojure.core-api.html#clojure.core/flatten"&gt;flatten&lt;/a&gt; is recursive. It dives into the sub-sequences, and brings them up too. The result is trying to map 1 to "ONE", mapping "UNO" to 2, and so on, until it gets to "TRE" at which point it doesn't have a value to map it to, so it gives an error. Fortunately for me I had an odd number of items when everything was flattened, or else I would have been working with a corrupted map and may not have discovered it for some time.&lt;br /&gt;&lt;br /&gt;My first thought was to write a version of flatten that isn't recursive, but that's when I took a step back and asked myself if there was &lt;em&gt;another&lt;/em&gt; way to construct a map. Maybe I shouldn't be using &lt;code&gt;hash-map&lt;/code&gt;. 
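As an aside, the recursion is easy to see at a REPL. This is only a sketch: the name pairs is made up for illustration, and using concat to strip a single level of nesting is one possible non-recursive alternative, not what I ended up doing.

```clojure
;; Pairs of [key value], where each value is itself a seq (as with langmap).
(def pairs [[1 ["ONE" "UNO"]] [2 ["TWO" "DUE"]] [3 ["THREE" "TRE"]]])

;; flatten recurses into the nested vectors, so keys and values
;; no longer alternate (9 items instead of 6).
(def deep (flatten pairs))

;; concat only removes one level of nesting, leaving 6 items
;; that alternate key, value, key, value...
(def shallow (apply concat pairs))

;; With alternating keys and values, hash-map is happy again.
(def result (apply hash-map shallow))
```

The difference between deep and shallow is the whole problem: only the second keeps each value intact.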
So I looked at Clojure's &lt;a href="http://clojure.org/cheatsheet"&gt;cheatsheet&lt;/a&gt;, and discovered &lt;a href="http://clojure.github.com/clojure/clojure.core-api.html#clojure.core/zipmap"&gt;zipmap&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;zipmap&lt;/h3&gt;&lt;br /&gt;&lt;code&gt;zipmap&lt;/code&gt; wasn't quite what I wanted, in that instead of taking a sequence of pairs, it wants a pair of sequences. It then constructs a map where the &lt;em&gt;n&lt;/em&gt;th key/value comes from the &lt;em&gt;n&lt;/em&gt;th entry in the first sequence (the key) and the &lt;em&gt;n&lt;/em&gt;th entry from the second sequence (the value). So I just needed to rotate, or transpose my data. Fortunately, the day before one of my co-workers had asked me exactly the same question.&lt;br /&gt;&lt;br /&gt;In this case, he had a sequence of MxN entries, looking something like:&lt;pre&gt;&lt;code&gt;[[a1 a2 a3] [b1 b2 b3] [c1 c2 c3]]&lt;/code&gt;&lt;/pre&gt;And the result that he wanted was:&lt;pre&gt;&lt;code&gt;[[a1 b1 c1] [a2 b2 c2] [a3 b3 c3]]&lt;/code&gt;&lt;/pre&gt;My first thought was some wildly inefficient loop-comprehension, when he asked me if the following would work (for some arg):&lt;pre&gt;&lt;code&gt;  (apply map list arg)&lt;/code&gt;&lt;/pre&gt;Obviously he didn't have a repl going, so I plugged it in, and sure enough it worked. But why?&lt;br /&gt;&lt;br /&gt;I struggled to understand it until I looked up the documentation for &lt;code&gt;map&lt;/code&gt; again, and remembered that it can take &lt;em&gt;multiple&lt;/em&gt; sequences, not just one. I had totally forgotten this. When provided with &lt;em&gt;n&lt;/em&gt; sequences, &lt;code&gt;map&lt;/code&gt; requires a function that accepts &lt;em&gt;n&lt;/em&gt; arguments. It then goes through each of the sequences in parallel, with the output being a sequence where the &lt;em&gt;n&lt;/em&gt;th entry is the result of calling the function with the &lt;em&gt;n&lt;/em&gt;th entry of every source sequence. 
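A quick sketch of that behavior, using throwaway names (singles, sums and columns are just for illustration):

```clojure
;; With one seq, the function takes one argument.
(def singles (map inc [1 2 3]))

;; With two seqs, the function takes two arguments, one from each seq.
(def sums (map + [1 2 3] [10 20 30]))

;; With three seqs and list as the function, each result gathers
;; the nth item of every input seq.
(def columns (map list [:a1 :a2 :a3] [:b1 :b2 :b3] [:c1 :c2 :c3]))
```

The last form is the transposition from above, written out with the three rows passed as separate arguments.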
In this case, the function was just &lt;code&gt;list&lt;/code&gt;, meaning that it creates a list out of the items from each sequence. The &lt;code&gt;apply&lt;/code&gt; just expanded the single argument out into the required lists that &lt;code&gt;map&lt;/code&gt; needed. This one line is a lovely piece of elegance, and deserves its own name:&lt;pre&gt;&lt;code&gt;  (defn rotate [l] (apply map list l))&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;So the final composition for creating a new map with values modified from the original map is:&lt;pre&gt;&lt;code&gt;  (apply zipmap (rotate (map (fn [[k v]] [k (morph v)]) langmap)))&lt;br /&gt;{3 ("THREE" "TRE"), 2 ("TWO" "DUE"), 1 ("ONE" "UNO")}&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Now the actual function that I use inside map isn't nearly so simple. After all, I need to recursively traverse the RDF graph, plus I need to find RDF lists and turn them into sequences, etc. But this zipmap/rotate/map idiom certainly makes it easy.&lt;br /&gt;&lt;br /&gt;Incidentally, I could do exactly the same thing with Alex's &lt;code&gt;mapmap&lt;/code&gt;:&lt;pre&gt;&lt;code&gt;  (mapmap key (comp morph val) langmap)&lt;/code&gt;&lt;/pre&gt;Oh well. 
I'm still happy I discovered zipmap/rotate.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-4297271368044649062?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/4297271368044649062/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=4297271368044649062' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4297271368044649062'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4297271368044649062'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2010/10/clojure-experiences-ive-been-learning.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-5581888475103599038</id><published>2010-05-13T12:18:00.003-05:00</published><updated>2010-05-13T12:25:22.615-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Feedburner&lt;/h3&gt;&lt;br /&gt;Something reminded me that I had RSS going through Feedburner, so I tried to look at it. Turns out that after the Google move they want to update everyone's account, and I hadn't done it yet. So I gave them all my details, and the system told me that it didn't recognize me.&lt;br /&gt;&lt;br /&gt;So then I said I'd forgotten my password, at which point it recognized my email address and asked a "secret question". 
I know the answer to the question, and it's not even that secret, since I'm sure that most of my family could figure it out, but Feedburner claims I'm wrong. Could someone have changed this? Maybe, but there are only a handful of people who would know that answer, and I trust them not to do something like that.&lt;br /&gt;&lt;br /&gt;Not a problem, I'll just submit a report explaining my problem. Only it seems that Feedburner is exempt from that kind of thing. The closest you can get to help is an FAQ. Good one Google.&lt;br /&gt;&lt;br /&gt;So now what? Well, I guess I just create a new Feedburner link and take it from there. Sorry, but if you've been using RSS to follow this blog, then would you mind changing it please? It's annoying, I know.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-5581888475103599038?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/5581888475103599038/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=5581888475103599038' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/5581888475103599038'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/5581888475103599038'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2010/05/feedburner-something-reminded-me-that-i.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-736856052823901895</id><published>2010-05-13T07:03:00.004-05:00</published><updated>2010-05-13T12:00:36.718-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Wrongful Indexing&lt;/h3&gt; &lt;br /&gt;Some years ago &lt;a href="http://gearon.blogspot.com/2004/08/proof-reading-once-again-its-way-too.html"&gt;I commented on the number and type of indexes&lt;/a&gt; that can be used for tuples. At the time, I pointed out that indexing triples required 3 indexes, and there were 2 appropriate sets of indexes to use. Similarly, quads can be indexed with 6 indexes, and there are 4 such sets. (5-tuples get silly, requiring 10 indexes, and there are 12 possible sets.) In each case, I said that each set of indexes would work just as well as the others, and so I always selected the set that included the natural ordering of the tuples.&lt;br /&gt;&lt;br /&gt;So for RDF triples, the two sets of indexes are ordered by:&lt;pre&gt;  subject,predicate,object&lt;br /&gt;  predicate,object,subject&lt;br /&gt;  object,subject,predicate&lt;/pre&gt;and&lt;pre&gt;  object,predicate,subject&lt;br /&gt;  predicate,subject,object&lt;br /&gt;  subject,object,predicate&lt;/pre&gt;For convenience I have always chosen the first set, as this includes the natural ordering of subject/predicate/object, but it looks like I was wrong.&lt;br /&gt;&lt;br /&gt;In using these indexes I've always presumed random 3-tuples, but in reality the index is representing RDF. Whenever I thought about the data I was looking for, this seemed OK, but that's because I tended to think about properties on resources, and not other RDF structures. In particular, I was failing to consider lists.&lt;br /&gt;&lt;br /&gt;Since first building RDF indexes (2001) and writing about them (2004) I've learnt a lot about functional programming. 
This, in turn, led to an appreciation of lists, particularly in algorithms. I'm still not enamored of them in on-disk structures, but I do appreciate their utility and elegance in many applications. So it was only natural that when I was representing RDF graphs with Scala and I needed to read lists, then I used some trivial recursive code to build a Scala list, and it all looked great. But then I decided to port the Graph class to Java to avoid including the Scala Jars for a really lightweight library.&lt;br /&gt;&lt;br /&gt;I'd like to point out that I'm talking about a library function that can read a well-formed RDF list and return a list in whatever programming language the library is implemented in. The remainder of this post is going to presume that the lists are well formed, since any alternatives can never be returned as a list in an API anyway.&lt;br /&gt;&lt;br /&gt;Reading a list usually involves the subject/predicate/object (SPO) index. You start by looking up the head of the list as a subject, then the predicates &lt;code&gt;rdf:first&lt;/code&gt; for the data at that point in the list, and &lt;code&gt;rdf:rest&lt;/code&gt; for the rest of the list. Rinse and repeat until &lt;code&gt;rdf:rest&lt;/code&gt; yields a value of &lt;code&gt;rdf:nil&lt;/code&gt;. So for each node in the list, there is a lookup by subject, followed by two lookups by predicate. This is perfect for the SPO index.&lt;br /&gt;&lt;br /&gt;However, it's been bugging me that I have such a general approach, when the structure is predetermined. Why look up these two predicates so generally, when we know exactly what we want? What if we reduce the set we're looking in to just the predicates that we want and then go looking for the subjects? That would mean looking first by predicate, then subject, then object, leading to a PSO index. 
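To make the SPO traversal described above concrete, here's a rough sketch in Clojure. Everything in it is an assumption for illustration: the nested map spo stands in for a real on-disk index (subject, then predicate, then a set of objects), keywords stand in for URIs, and read-list is a hypothetical name, not the function from my library.

```clojure
;; Hypothetical in-memory stand-in for an SPO index:
;; subject -> predicate -> set of objects.
(def spo
  {:n1  {:rdf/first #{"a"} :rdf/rest #{:n2}}
   :n2  {:rdf/first #{"b"} :rdf/rest #{:n3}}
   :n3  {:rdf/first #{"c"} :rdf/rest #{:rdf/nil}}
   :doc {:ex/items #{:n1}}})

(defn read-list
  "Lazily walks a well-formed RDF list starting at head."
  [index head]
  (lazy-seq
    (when (not= head :rdf/nil)
      (let [props (get index head)]               ; lookup by subject
        (cons (first (get props :rdf/first))      ; then by predicate
              (read-list index (first (get props :rdf/rest))))))))

(def items (read-list spo :n1))
```

Every step of the walk starts with a lookup across all the subjects. An index keyed by predicate first would swap the outer two levels of that map.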
So what does that algorithm look like?&lt;br /&gt;&lt;br /&gt;First, look up the &lt;code&gt;rdf:rest&lt;/code&gt; predicate, leading to an index of subject/object containing all list structures. Next, look up the &lt;code&gt;rdf:first&lt;/code&gt; predicate, retrieving subject/objects containing all the list data. Now iterating down the list no longer involves finding the subject followed by the predicate in order to read the next list node. Instead, it just requires finding the subject, and the next list node is in the corresponding object. The same applies to the data stored in the node. We're still doing a fixed number of lookups in an index, which means that the overall complexity does not change at all. Tree indexes will still give &lt;em&gt;O(log(N))&lt;/em&gt; complexity, and hash indexes will still give &lt;em&gt;O(1)&lt;/em&gt; complexity. However, each step can involve disk seeks, so it's worth seeing the difference.&lt;br /&gt;&lt;br /&gt;To compare more directly, using an SPO index requires every node to:&lt;ul&gt;&lt;li&gt;Lookup across the entire graph by subject.&lt;/li&gt;&lt;li&gt;Lookup across the subject (2 or 3 predicates) for &lt;code&gt;rdf:first&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;Lookup across the subject (2 or 3 predicates) for &lt;code&gt;rdf:rest&lt;/code&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;For the PSO index we have some initial setup:&lt;ul&gt;&lt;li&gt;Lookup across the entire graph for the &lt;code&gt;rdf:first&lt;/code&gt; predicate.&lt;/li&gt;&lt;li&gt;Lookup across the entire graph for the &lt;code&gt;rdf:rest&lt;/code&gt; predicate.&lt;/li&gt;&lt;/ul&gt;Then for every node:&lt;ul&gt;&lt;li&gt;Lookup the &lt;code&gt;rdf:first&lt;/code&gt; data for the value.&lt;/li&gt;&lt;li&gt;Lookup the &lt;code&gt;rdf:rest&lt;/code&gt; data for the next node.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;It's important to note a few things, particularly for tree indexes. 
Trees are the most likely structure used when using a disk, so I'm going to concentrate on them. The number of subjects in a graph tends to scale up with the size of the graph, while the number of predicates is bounded. This is because predicates are used to express a model, with each predicate indicating a certain relationship. Any system trying to deal with the model needs some idea of the concepts it is dealing with, so it's &lt;em&gt;almost&lt;/em&gt; impossible to deal with completely arbitrary relationships. If we know what the relationships are ahead of time, then there must be a fixed number of them. In contrast, subjects represent individuals, and these can be completely unbounded. So if we look up an entire graph to find a particular subject, then we may have to dive down a very deep tree to find that subject. Looking across the entire graph for a given predicate will never have to go very deep, because there are so few of them.&lt;br /&gt;&lt;br /&gt;So the first algorithm (using the SPO index) iteratively looks across every subject in the graph for each node in the list. The next two lookups are trivial, since nodes in a list will only have properties of &lt;code&gt;rdf:first&lt;/code&gt;, &lt;code&gt;rdf:rest&lt;/code&gt; and possibly &lt;code&gt;rdf:type&lt;/code&gt;. The data associated with these properties will almost certainly be in the same block where the subject was found, meaning that there will be no more disk seeks.&lt;br /&gt;&lt;br /&gt;The second algorithm (using the PSO index) does a pair of lookups across every predicate in the graph. The expected number of disk seeks to find the first predicate is significantly fewer than for any of the "subject" searches in the first algorithm. 
Given how few predicates are in the system, finding the second predicate may barely involve any disk seeks at all, particularly since the first search will have populated the disk cache with a good portion of the tree, and the similarities in the URIs of the predicates are likely to make both predicates very close to each other. Of course, this presumes that the predicates are even in a tree. Several systems (including one I'm writing right now) treat predicates differently because of how few there are. Indeed, a lot of systems will cache them in a hashtable, regardless of the on-disk structure. So the initial lookup is very inexpensive.&lt;br /&gt;&lt;br /&gt;The second algorithm then iterates down the list, just like the first one does. However, this time, instead of searching for the nodes out of every subject in the graph, it will now be just searching for these nodes in the subjects that appear as list nodes. While lists are commonly used in some RDF structures, the subjects in all the lists typically form a very small minority out of all the subjects in a graph. Consequently, depending on the type and depth of trees being used, iterating through a list with the second algorithm could result in two or three (or more) fewer disk seeks for each node. That's a saving that can add up.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Solid State Disks&lt;/h3&gt;&lt;br /&gt;I've been talking about disk seeks, but this is an artificial restriction imposed by spinning disk drives. Solid State Disks (SSDs) don't have this limitation.&lt;br /&gt;&lt;br /&gt;People have been promoting solid state drives (SSDs) for some years now, but I've yet to use them myself. In fact, most people I know are still using traditional spinning platters. The price difference is still a big deal, and for really large data, disk drives are still the only viable option. 
But this will change one day, so am I right to be concerned about disk seeks?&lt;br /&gt;&lt;br /&gt;Disk seeks are a function of data locality. When data has to be stored somewhere else on a disk, the drive head must physically seek across the surface to this new address. SSDs don't require anything to move, but there are still costs in addressing scattered data.&lt;br /&gt;&lt;br /&gt;While it is possible to address every bit of memory in a device in one step, in practice this is never done. This is because the complexity of the circuit grows exponentially as you try to address more and more data in one step. Instead, the memory is broken up into "banks". A portion of the address can now be used to select a bank, allowing the remaining bits in the address to select the required memory in just that bank. This works well, but it does lead to some delays. Selecting a new bank requires "setup", "hold" and "settling" times, all leading to delays. These delays are an order of magnitude smaller than seek delays for a spinning disk, but they do represent a limit on the speed of the device. 
So while SSDs are much faster than disk drives, there are still limits to their speed, and improvements in data locality can still have a significant impact on performance.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-736856052823901895?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/736856052823901895/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=736856052823901895' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/736856052823901895'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/736856052823901895'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2010/05/wrongful-indexing-some-years-ago-i.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-5054173887147152325</id><published>2010-05-04T20:06:00.003-05:00</published><updated>2010-05-04T21:00:35.095-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Web Services Solved&lt;/h3&gt; &lt;br /&gt;It's been a long couple of days, and I really want to relax instead of write, but it's been a few days and I've been promising myself that I'd write, so I figured I need to get something written before I can open a beer.&lt;br /&gt;&lt;br /&gt;First of all, the web services problem was trivial. 
I recently added a new feature that allowed &lt;a href="http://jetty.codehaus.org/jetty/jetty-6/apidocs/org/mortbay/jetty/handler/ContextHandler.html"&gt;ContextHandlers&lt;/a&gt; in &lt;a href="http://jetty.codehaus.org/jetty/"&gt;Jetty&lt;/a&gt; to be configured. Currently the only configuration option I've put in there is the one that was requested, and that is the size of a form. Apparently this is 200k by default, but if you're going to load large files then that may not be enough. Anyway, the problem came about when my code tried to read the maximum form size from the configuration. I wasn't careful enough to check if the context was being configured in the first place, so an NPE was thrown if it was missing.&lt;br /&gt;&lt;br /&gt;Fortunately, most people would never see the problem, since the default configuration file includes details for contexts, and this ends up in every build by default. The reason I was seeing it is because Topaz replaces the configuration with their own (since it describes their custom resolvers), and this custom configuration file doesn't have the new option in it. Of course, I could just add it to Topaz, but the correct solution is to make sure that a configuration can't throw an NPE – which is exactly what I told the Ehcache guys, so it's fitting that I have to do it myself. :-)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Hosting&lt;/h3&gt;&lt;br /&gt;Since I'm on the topic of Topaz, it looks like the OSU/OSL guys and I have both the Topaz and Mulgara servers configured. They wouldn't typically be hosting individual projects (well, they do occasionally), but in this case it's all going in under the umbrella of Duraspace. Of course this has taken some time, and in the case of Topaz I'm still testing that it's all correct, but I think it's there. I'll be changing the DNS for Topaz over soon, and Mulgara was changed last week. 
Mulgara's DNS has propagated now, so I'm in the process of cutting a long-overdue release.&lt;br /&gt;&lt;br /&gt;One thing that changed in the hosting is that I no longer have a Linux host to build the distribution on. Theoretically, that would be OK, since I ought to be able to build on any platform. However, Mulgara is still distributed as a build for Java 1.5 (I've had complaints when I accidentally put out a release that was built for 1.6). This is easy to set up on Linux, since you just change the JAVA_HOME environment variable to make sure you're pointing to a 1.5 JDK. However, every computer I have here is a Mac. Once upon a time that didn't change anything, but now all JDKs point to JDK 1.6. That means I need to configure the compiler to output the correct version. It can be done, but Mulgara wasn't set up for it.&lt;br /&gt;&lt;br /&gt;If you read the &lt;a href="http://ant.apache.org/manual/CoreTasks/javac.html"&gt;Ant documentation on compiling&lt;/a&gt; you'll see that you can set the target to any JDK version you like. However, that would require editing 58 files (I just had to run a quick command to see that. Wow... I didn't realize it was so bad). I'm sure I'd miss a &amp;lt;&lt;code&gt;javac&lt;/code&gt;&amp;gt; somewhere. Fortunately, there is another option, even if the Ant documents discourage it. There's a system parameter called &lt;a href="http://ant.apache.org/manual/javacprops.html#target"&gt;ant.build.java.target&lt;/a&gt; which will set the default value globally. I checked to make sure that nothing was going to be missed by this (ie. that nothing was manually setting the target) and when it all looked good I changed the build script to set this to "1.5". I didn't change the corresponding script on Windows, but personally I only want this for distributions. 
Anyone who needs to set it up on Windows probably has the very JDK they want to run Mulgara on anyway.&lt;br /&gt;&lt;br /&gt;Well, that's my story, and I'm sticking to it.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Semantic Universe&lt;/h3&gt;&lt;br /&gt;What else? Oh yes. I wrote a &lt;a href="http://www.semanticuniverse.com/blogs-common-sparql-extension.html"&gt;post for Semantic Universe&lt;/a&gt;. It's much more technical than the other posts I've seen there, but I was told that would be OK. I'm curious to know how it will be received.&lt;br /&gt;&lt;br /&gt;I was interested in how it was promoted on Twitter. I wrote something that mixes linked data and SPARQL to create a kind of federated query (something I find to be very useful, BTW, and I think more people should be aware of it). However, in the process I mentioned that this shouldn't be necessary, since SPARQL 1.1 will be including a note on federated querying. Despite SPARQL 1.1 only being mentioned a couple of times, &lt;a href="http://twitter.com/SemUni/status/13325494782"&gt;the tweet&lt;/a&gt; said that I discussed "how/why SPARQL 1.1 plans to be a bit more dazzling". Well, admittedly SPARQL 1.1 &lt;em&gt;will&lt;/em&gt; be more dazzling, but my post didn't discuss that. Perhaps it was a hint to talk about that in a future post.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Miscellanea&lt;/h3&gt;&lt;br /&gt;Speaking of future posts, I realized that I've been indexing RDF backwards, at least for lists. It doesn't affect the maximum complexity of iterating a list, but it &lt;em&gt;does&lt;/em&gt; affect the &lt;em&gt;expected complexity&lt;/em&gt;. I won't talk about it tonight, but hopefully by mentioning it here I'll prompt myself to write about it soon.&lt;br /&gt;&lt;br /&gt;This last weekend was the final weekend that my in-laws were visiting from the other side of the planet, so I didn't get much jSPARQLc done. I hope to fix that tomorrow night. 
I'm even wondering if the Graph API should be factored out into its own sister project. It's turning out to be incredibly useful for reading and working with RDF data when you just want access to the structure and you don't need a full query engine. It would even plug directly into almost every query engine out there, so there's a lot of utility to it.&lt;br /&gt;&lt;br /&gt;I'm also &lt;em&gt;finally&lt;/em&gt; learning &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt;, since I've had more pressure to consider a clustered RDF store, much as &lt;a href="http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/"&gt;BBN have created&lt;/a&gt;. I've read the &lt;a href="http://labs.google.com/papers/mapreduce.html"&gt;MapReduce&lt;/a&gt;, &lt;a href="http://labs.google.com/papers/gfs.html"&gt;GFS&lt;/a&gt; and &lt;a href="http://labs.google.com/papers/bigtable.html"&gt;BigTable&lt;/a&gt; papers, so I went into it thinking I'd be approaching the problem one way, but the more I learn the more I think it would scale better if I went in other directions. So for the moment I'm trying to avoid getting too many preconceived notions of architecture until I've learnt some more and applied my ideas to some simple cases. Of course, &lt;a href="http://hadoop.apache.org/hive/"&gt;Hive&lt;/a&gt; tries to do the same thing for relational data, so I think I need to look at the code in that project too. I have a steep learning curve ahead of me there, but I've been avoiding those recently, so it will do me some good.&lt;br /&gt;&lt;br /&gt;Other than that, it's been interviews and immigration lawyers. These are horribly time consuming, and way too boring to talk about, so I won't. 
See you tomorrow.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-5054173887147152325?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/5054173887147152325/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=5054173887147152325' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/5054173887147152325'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/5054173887147152325'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2010/05/web-services-solved-its-been-long.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-3163625980747870769</id><published>2010-04-30T09:03:00.003-05:00</published><updated>2010-04-30T10:31:13.807-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Topaz and Ehcache&lt;/h3&gt; &lt;br /&gt;Don't ask what I did 2 days ago, because I forget. It's one of the reasons I need to blog more. I also forgot because my brain got clouded due to yesterday's tasks.&lt;br /&gt;&lt;br /&gt;I didn't get a lot done yesterday for the simple reason that I was filling in paperwork for an immigration lawyer. For anyone who has ever had to do this mind numbing task, they probably know that you end up filling in mostly the same things that you filled in 12 months ago, but just subtly different, so there's no possibility of using copy/paste. 
They will also know that getting all of the information together can take half a day. Strangely, the hardest piece of information was my mother's birthday (why do they want this? I have no idea). There is a bizarre story behind this, that I won't go into right now, but with my mother asleep in Australia there was no way to ask her. Fortunately, I had two brothers online at the time: one lives here in the USA, and the other is a student in Australia (who was up late drinking with friends, and decided to get online to say hi before going to bed). Unfortunately, neither of them knew either (the well-oiled brother being completely unaware of why we didn't know).&lt;br /&gt;&lt;br /&gt;But I finally got it done, cleared my immediate email queue (only 65 to go!) and got down to work.&lt;br /&gt;&lt;br /&gt;My first task was to get the Topaz version of Mulgara up and running the same way it used to run 10 months ago. I had already tried going back through the subversion history for the project (ah, that's one of the things I did two days ago!), but with no success. However, I &lt;em&gt;had been&lt;/em&gt; able to find out that others have had this error with &lt;a href="http://www.ehcache.org/"&gt;Ehcache&lt;/a&gt;. No one had a fix, since upgrading the library normally made their problem go away. Well I tried upgrading it myself, but without luck. Evidently the problem was in usage, but I didn't know if it was a problem in the code talking to Ehcache, or the XML configuration file that it uses. Since everything used to work without error, I figured that the code was probably OK, and that it was the configuration at fault. The complexity of the configuration file only deepened my suspicion.&lt;br /&gt;&lt;br /&gt;I didn't want to learn the ins-and-outs of an Ehcache configuration, so my first non-lawyer related task yesterday was to look at the code where the exception was coming from (thank goodness the Java compiler includes line numbers in class files by default). 
So it turned out that &lt;a href="http://www.terracotta.org/"&gt;Terracotta&lt;/a&gt; (the company who provides Ehcache) have nice navigable HTML versions of all their open source code, which made this task much more pleasant than having to get it all from Subversion. This led me to &lt;a href="http://ehcache.org/xref/net/sf/ehcache/distribution/MulticastKeepaliveHeartbeatSender.html#180"&gt;the line&lt;/a&gt; that was throwing the exception, which looked like:&lt;pre&gt;&lt;code&gt;List localCachePeers = cacheManager.getCachePeerListener("RMI").getBoundCachePeers();&lt;/code&gt;&lt;/pre&gt;Great, a compound statement. OK, so I use them myself, but they're annoying when you debug. Was it &lt;code&gt;cacheManager&lt;/code&gt; that was &lt;code&gt;null&lt;/code&gt; or was it the return value from &lt;code&gt;getCachePeerListener("RMI")&lt;/code&gt;?&lt;br /&gt;&lt;br /&gt;At this point I jumped around in the code for a bit (I quite like those hyperlinks. I've seen them before too. I should figure out which project creates these pages), looking for what initialized cacheManager. I didn't find definitive proof that it was set, but it looked pretty good. So I looked at &lt;a href="http://ehcache.org/xref/net/sf/ehcache/CacheManager.html#1128"&gt;getCachePeerListener("RMI")&lt;/a&gt; and discovered that it was a lookup in a HashMap. This is a prime candidate for returning &lt;code&gt;null&lt;/code&gt;, and indeed the documentation for the method even states that it will return &lt;code&gt;null&lt;/code&gt; if the scheme is not configured. Since the heartbeat code was making the presumption that it could perform an operation on the return value of this method, then the "RMI" scheme is evidently supposed to be configured in every configuration. 
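&lt;br /&gt;&lt;br /&gt;To make the failure mode concrete, here is a minimal sketch of a more defensive lookup. The classes &lt;code&gt;PeerLookupSketch&lt;/code&gt; and &lt;code&gt;PeerListener&lt;/code&gt; are hypothetical stand-ins made up for illustration, not the Ehcache classes, and the raw collection types just follow the style of the line quoted above. The idea is only to split the compound statement so that a missing scheme gets reported instead of dereferenced:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the Ehcache classes.
class PeerLookupSketch {

    // Stand-in for a listener that knows which peers are bound.
    static class PeerListener {
        private final List boundPeers;

        PeerListener(List boundPeers) {
            this.boundPeers = boundPeers;
        }

        List getBoundCachePeers() {
            return boundPeers;
        }
    }

    // Scheme name to listener; a HashMap returns null for a missing key.
    private final Map listeners = new HashMap();

    void register(String scheme, PeerListener listener) {
        listeners.put(scheme, listener);
    }

    // Split the compound statement so the null case is handled explicitly.
    List boundPeersFor(String scheme) {
        PeerListener listener = (PeerListener) listeners.get(scheme);
        if (listener == null) {
            // A missing scheme is a configuration problem: report it, don't throw an NPE.
            System.err.println("WARNING: no peer listener configured for scheme " + scheme);
            return Collections.emptyList();
        }
        return listener.getBoundCachePeers();
    }
}
```

With nothing registered, a call to boundPeersFor("RMI") prints the warning and returns an empty list rather than throwing.&lt;br /&gt;&lt;br /&gt;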
The fact that it's possible for this method to return &lt;code&gt;null&lt;/code&gt; (even if it's not supposed to) means that the calling code is not defensive enough (any kind of NullPointerException is unacceptable, even if you catch it and log it). Also, the fact that something is always supposed to be configured for "RMI" had me looking in the code to discover where listeners get registered. This turned out to come from some kind of configuration object, which looked like it had been built from an XML file.&lt;br /&gt;&lt;br /&gt;So the problem appears to be the combination of something that's missing from the configuration file, and a presumption that it will be there (i.e. the code couldn't handle it if the item was missing). At this point I joined a forum and described the issue, both to point out that the code should be more defensive, and also to ask what is missing. In the meantime, I tried creating my own version of the library with a fix in it, and discovered that the issue did indeed go away. Then this morning I received a message explaining what I &lt;a href="http://ehcache.org/documentation/distributed_caching_with_rmi.html"&gt;needed to configure&lt;/a&gt;, and also that the code now deals with the missing configuration. It still complains on every heartbeat (in 5 second intervals), but now it tells you what's wrong, and how to fix it:&lt;pre&gt;&lt;code&gt;WARNING: The RMICacheManagerPeerListener is missing. You need to configure&lt;br /&gt;  a cacheManagerPeerListenerFactory with&lt;br /&gt;  class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"&lt;br /&gt;  in ehcache.xml.&lt;/code&gt;&lt;/pre&gt;Kudos to "gluck" for the quick response. (Hey, I just realized – "gluck" is from Brisbane. My home town!)&lt;br /&gt;&lt;br /&gt;Incidentally, creating my own version of Ehcache was problematic in itself. It's a Maven project, and when I tried to build "package" it attempted to run all the tests, which took well over an hour. 
Coincidentally, it also happened to be dinner time, so I came back later, only to discover that not all of the tests had passed, and that the JAR files had not been built. Admittedly, it was an older release, but it &lt;em&gt;was&lt;/em&gt; a release, so I found this odd. In the end, I avoided the tests by removing the code, and running the "package" target again.&lt;br /&gt;&lt;br /&gt;With all the errors out of the way I went back to the Topaz system again and ran it. As I said earlier, it was no longer reporting errors. But then when I tried to use queries against it, it was completely unresponsive. A little probing found that it wasn't listening for HTTP at all, so I checked the log, and sure enough:&lt;pre&gt;&lt;code&gt;EmbeddedMulgaraServer&gt; Unable to start web services due to: null [Continuing]&lt;/code&gt;&lt;/pre&gt;Argh.&lt;br /&gt;&lt;br /&gt;Not only do I have to figure out what's going on here, it also appears that someone (possibly me) didn't code this configuration defensively enough! Sigh.&lt;br /&gt;&lt;br /&gt;At that point it was after dinner, and I had technical reading to do for a job I might have. 
Well, I've received the offer, but it all depends on me not being kicked out of the country.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-3163625980747870769?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/3163625980747870769/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=3163625980747870769' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3163625980747870769'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3163625980747870769'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2010/04/topaz-and-ehcache-dont-ask-what-i-did-2.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-2991388790499106088</id><published>2010-04-26T20:52:00.004-05:00</published><updated>2010-04-26T23:21:18.948-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='SPARQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><title type='text'></title><content type='html'>&lt;h3&gt;Multitasking&lt;/h3&gt; &lt;br /&gt;At the moment I feel like I have too many things on the boil. 
I'm regularly answering &lt;a href="http://osuosl.org/"&gt;OSU/OSL&lt;/a&gt; about porting data over from &lt;a href="http://topazproject.org/trac/"&gt;Topaz&lt;/a&gt; and &lt;a class="zem_slink" href="http://mulgara.org/" title="Mulgara (software)" rel="homepage"&gt;Mulgara&lt;/a&gt;, I'm supposed to be getting work done on &lt;a class="zem_slink" href="http://en.wikipedia.org/wiki/SPARQL" title="SPARQL" rel="wikipedia"&gt;SPARQL&lt;/a&gt; Update 1.1 (which suffered last week while I looked around for a way to stay in the USA), I'm trying to track down some unnecessary sorting that is being done by Mulgara queries in a Topaz configuration, I'm trying to catch up on reading (refreshing my memory on some important Semantic Web topics so that I keep nice and current), I'm trying to find someone who can help us not get kicked out of the country (long story), I'm responding to requests on Mulgara, and when I have some spare time (in my evenings and weekends) I'm trying to make &lt;a href="http://code.google.com/p/jsparqlc"&gt;jSPARQLc&lt;/a&gt; look more impressive.&lt;br /&gt;&lt;br /&gt;So how's it all going?&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;OSU/OSL&lt;/h3&gt;&lt;br /&gt;Well, OSU/OSL are responding slowly, which is frustrating, but also allows me the time to look at other things, so it's a mixed blessing. They keep losing my tickets, and then respond some time later apologizing for not getting back to me. However, they're not entirely at fault, as I have sudo access on our server, and could do some of this work for myself. The thing is that I've been avoiding the learning curves of Mailman and Trac porting while I have other stuff to be doing. All the same, we've made some progress lately, and I'm really hoping to switch the DNS over to the new servers in the next couple of days. 
Once that happens I'll be cutting an overdue release to Mulgara.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;SPARQL Update 1.1&lt;/h3&gt;&lt;br /&gt;I really should have done some of this work already, but my job (and impending lack thereof) have interfered. Fortunately another editor has stepped up to help here, so with his help we should have it under control for the next publication round.&lt;br /&gt;&lt;br /&gt;The biggest issues are:&lt;ol&gt;&lt;li&gt;Writing possible responses for each operation. In some cases this will simply be success/failure, but for others it will mean describing partial success. For instance, a long-running &lt;code&gt;LOAD&lt;/code&gt; operation may have loaded 100,000 triples before failing. Most systems want that data to stay in there, and not roll back the change, and we need some way to report what has happened.&lt;/li&gt;&lt;li&gt;Dealing with an equivalent for &lt;code&gt;FROM&lt;/code&gt; and &lt;code&gt;FROM NAMED&lt;/code&gt; in &lt;code&gt;INSERT/DELETE&lt;/code&gt; operations. Using &lt;code&gt;FROM&lt;/code&gt; in a &lt;code&gt;DELETE&lt;/code&gt; operation looks like this is the graph that you want to remove data from, whereas we really want to describe the list of graphs (and/or named graphs) that affect the &lt;code&gt;WHERE&lt;/code&gt; clause. The last I read, the suggestion to use &lt;code&gt;USING&lt;/code&gt; and &lt;code&gt;USING NAMED&lt;/code&gt; instead was winning out. The problem is that no one really likes it, though they don't like every other suggestion even more. :-)&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;I doubt I'll get much done before the next meeting, but at least I did a little today, and I've been able to bring the other editor up to speed.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Sorting&lt;/h3&gt;&lt;br /&gt;This is a hassle that's been plaguing me for a while. A long time back &lt;a href="http://www.plos.org/"&gt;PLoS&lt;/a&gt; complained about queries that were taking too long (like, up to 10 minutes!). 
After looking at them, I found that a lot of sorting of a lot of data was going on, so I investigated why.&lt;br /&gt;&lt;br /&gt;From the outset, Mulgara adopted "Set Semantics". This meant that everything appeared only once. It made things a little harder to code, but it also made the algebra easier to work with. In order to accomplish this cleanly, each step in a query resolution removed duplicates. I wasn't there, so I don't know why the decision wasn't made to just leave it to the end. Maybe there was a good reason. Of course, in order to remove these duplicates, it had to order the data.&lt;br /&gt;&lt;br /&gt;When SPARQL came along, the pragmatists pointed out that not having duplicates was a cost, and for many applications it didn't matter anyway. So they made duplicates allowable by default, and introduced the &lt;code&gt;DISTINCT&lt;/code&gt; keyword to remove them if necessary, just like SQL. Mulgara didn't have this feature (though the Sesame-Mulgara bridge hacked it to work by selecting all variables across the bridge and projecting out the ones that weren't needed), but given the cost of this sorting, it was obvious we needed it.&lt;br /&gt;&lt;br /&gt;The sorting in question came about because the query was a UNION between a number of product terms (or a disjunction of conjunctions). In order to make the UNION in order, each of the product terms was sorted first. Of course, without the sorting, a UNION can be a trivial operation, but with it the system was very slow. Actually, the query in question was more like a UNION between multiple products, with some of the product terms being UNIONS themselves. The resulting nested sorting was painful. 
Unfortunately, the way things stood, it was necessary, since there was no way to do a conjunction (product) without having the terms sorted, and since some of the terms could be UNIONS, then the result of a UNION had to be sorted.&lt;br /&gt;&lt;br /&gt;The first thing I did was to factor the query out into a big UNION between terms (a sum-of-products). Then I manually executed each one to find out how long it took. After I added up all the times, the total was about 3 seconds, and most of that time was spent waiting for &lt;a href="http://lucene.apache.org/java/docs/"&gt;Lucene&lt;/a&gt; to respond (something I have no control over), so this was looking pretty good.&lt;br /&gt;&lt;br /&gt;To make this work in a real query I had to make the factoring occur automatically, I had to remove the need to sort the output of a UNION, and I had to add a query syntax to TQL to turn this behavior on and off.&lt;br /&gt;&lt;br /&gt;The syntax was already done for SPARQL, but PLoS were using TQL through Topaz. I know that a number of people use TQL, so I wasn't prepared to break the semantics of that language, which in turn meant that I couldn't introduce a &lt;code&gt;DISTINCT&lt;/code&gt; keyword. After asking a couple of people, I eventually went with a new keyword of &lt;code&gt;NONDISTINCT&lt;/code&gt;. I hate it, but it also seemed to be the best fit.&lt;br /&gt;&lt;br /&gt;Next I did the factorization. Fortunately, Andrae had introduced a framework for modifying a query to a &lt;a href="http://en.wikipedia.org/wiki/Fixed_point_%28mathematics%29"&gt;fixpoint&lt;/a&gt;, so I was able to add to that for my algebraic manipulation. I also looked at other expressions, like differences (which was only in TQL, but is about to become a part of SPARQL) and Optional joins (which were part of SPARQL, and came late into TQL). 
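&lt;br /&gt;&lt;br /&gt;The heart of the expansion is that a join distributes over a union, so (A UNION B) JOIN C becomes (A JOIN C) UNION (B JOIN C). As a rough sketch of just that rule, assuming a toy expression tree (the &lt;code&gt;SumOfProducts&lt;/code&gt; and &lt;code&gt;Expr&lt;/code&gt; names are hypothetical, this is not the Mulgara transformation code, and differences and optional joins are ignored):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: expands a query of joins and unions into a
// sum-of-products. This is not Mulgara's transformation framework.
class SumOfProducts {

    // A query expression: a leaf constraint, or a JOIN/UNION of two expressions.
    static class Expr {
        final String op;    // "LEAF", "JOIN" or "UNION"
        final String name;  // set for leaves only
        final Expr left;
        final Expr right;

        Expr(String op, String name, Expr left, Expr right) {
            this.op = op;
            this.name = name;
            this.left = left;
            this.right = right;
        }
    }

    static Expr leaf(String name) { return new Expr("LEAF", name, null, null); }
    static Expr join(Expr a, Expr b) { return new Expr("JOIN", null, a, b); }
    static Expr union(Expr a, Expr b) { return new Expr("UNION", null, a, b); }

    // Returns a list of product terms; each term is a list of leaf names.
    static List expand(Expr e) {
        List terms = new ArrayList();
        if (e.op.equals("LEAF")) {
            List term = new ArrayList();
            term.add(e.name);
            terms.add(term);
        } else if (e.op.equals("UNION")) {
            // A union simply concatenates the terms of both sides.
            terms.addAll(expand(e.left));
            terms.addAll(expand(e.right));
        } else {
            // A join distributes over unions: pair every left term with every right term.
            List rightTerms = expand(e.right);
            for (Object l : expand(e.left)) {
                for (Object r : rightTerms) {
                    List term = new ArrayList((List) l);
                    term.addAll((List) r);
                    terms.add(term);
                }
            }
        }
        return terms;
    }
}
```

Expanding join(union(leaf("A"), leaf("B")), leaf("C")) gives the two product terms [A, C] and [B, C], each of which can be resolved on its own before a plain, unsorted UNION.&lt;br /&gt;&lt;br /&gt;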
It turns out that there is a lot that you can do to expand a query to a sum-of-products (or as close to as possible), and fortunately it was easy to accomplish (thanks Andrae).&lt;br /&gt;&lt;br /&gt;Finally, I put the code in to only do this factorization if a query was &lt;em&gt;not&lt;/em&gt; supposed to be &lt;code&gt;DISTINCT&lt;/code&gt; (the default in SPARQL, and if the new keyword is present for TQL). Unexpectedly, this ended up being the trickiest part. Part of the reason was because some UNION operations still needed to have the output sorted if they were embedded in an expression that couldn't be expanded out (a rare, though possible situation, but only when mixing with differences and optional joins).&lt;br /&gt;&lt;br /&gt;I needed lots of tests to be sure that I'd done things correctly. I mean, this was a huge change to the query engine. If I'd got it wrong, it would be a serious issue. As a consequence, this code didn't get checked in and used in the timeframe that it ought to have. But finally, I felt it was correct, and I ran my 10 minute queries against the PLoS data.&lt;br /&gt;&lt;br /&gt;Now the queries were running at a little over a minute. Well, this was an order of magnitude improvement, but still 30 times slower than I expected. What had happened? I checked where it was spending its time, and it was still in a &lt;code&gt;sort()&lt;/code&gt; method. Sigh. At a guess, I missed something in the code that allows sorting when needed, and avoids it the rest of the time.&lt;br /&gt;&lt;br /&gt;Unfortunately, the time taken to get to that point had led to other things becoming important, and I didn't pursue the issue. Also, the only way to take advantage of this change was to update Topaz to use &lt;code&gt;SELECT NONDISTINCT&lt;/code&gt; but that keyword was going to fail unless being run on a new Mulgara server. This meant that I couldn't update Topaz until I knew they'd moved to a newer Mulgara, and that didn't happen for a long time. 
Consequently, PLoS didn't see a performance change, and I ended up trying to improve other things for them rather than tracking it down. In retrospect, I confess that this was a huge mistake. PLoS recently reminded me of their speed issues with certain queries, but now they're looking at other solutions to it. Well, it's my fault that I didn't get it all going for them but that doesn't mean I should never do it, so I'm back at it again.&lt;br /&gt;&lt;br /&gt;The problem queries only look really slow when executed against a large amount of data, so I had to get back to the PLoS dataset. The queries also meant running the Topaz setup, since they make use of the Topaz Lucene resolvers. So I updated Topaz and built the system.&lt;br /&gt;&lt;br /&gt;Since I was going to work on Topaz, I figured I ought to add in the use of &lt;code&gt;NONDISTINCT&lt;/code&gt;. This was trickier than I expected, since it looked like the Topaz code was not only trying to generate TQL code, it was also trying to re-parse it to do transformations on it. The parser in question was &lt;a href="http://www.antlr.org/"&gt;Antlr&lt;/a&gt; which is one that I've limited experience with, so I spent quite a bit of time trying to figure out what instances of &lt;code&gt;SELECT&lt;/code&gt; could have a &lt;code&gt;NONDISTINCT&lt;/code&gt; appended to it. In the end, I decided that all of the parsing was really for their own &lt;a href="http://topazproject.org/trac/wiki/Topaz/Manual/Section11#ObjectQueryLanguage"&gt;OQL&lt;/a&gt; language (which looks a lot like TQL). I hope I was right!&lt;br /&gt;&lt;br /&gt;After spending way too long on Topaz, I took the latest updates from SVN, and compiled the Topaz version of Mulgara. Then I ran it to test where it was spending time in the query.&lt;br /&gt;&lt;br /&gt;Unfortunately, I immediately started getting regular INFO messages of the form:&lt;pre&gt;&lt;code&gt;MulticastKeepaliveHeartbeatSender&gt; Unexpected throwable in run thread. 
Continuing...null&lt;br /&gt;java.lang.NullPointerException&lt;br /&gt; at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.createCachePeersPayload(MulticastKeepaliveHeartbeatSender.java:180)&lt;br /&gt; at net.sf.ehcache.distribution.MulticastKeepaliveHeartbeatSender$MulticastServerThread.run(MulticastKeepaliveHeartbeatSender.java:137)&lt;/code&gt;&lt;/pre&gt;Now Mulgara doesn't make use of ehcache at all. That's purely a Topaz thing, and my opinion to date has been that it's more trouble than it's worth. This is another example of it. I really don't know what could be going on here, but luckily I kept open the window where I updated the source from SVN, and I can see that someone has modified the class:&lt;pre&gt;&lt;code&gt;  org.topazproject.mulgara.resolver.CacheInvalidator&lt;/code&gt;&lt;/pre&gt;I can't guarantee that this is the problem, but I've never seen it before, and no other changes look related.&lt;br /&gt;&lt;br /&gt;But by this point I'd reached the end of my day, so I decided I should come back to it in the morning (errr, maybe that will be &lt;em&gt;after&lt;/em&gt; the SPARQL Working Group meeting).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Jetty&lt;/h3&gt;&lt;br /&gt;Despite that describing a good portion of my day (at least, those parts not spent in correspondence), I also got a few things done over the weekend. The first of these was a request for a new feature in the Mulgara &lt;a href="http://jetty.codehaus.org/jetty/"&gt;Jetty&lt;/a&gt; configuration.&lt;br /&gt;&lt;br /&gt;One of our users has been making heavy use of the REST API (yay! That time wasn't wasted after all!) and had found that Jetty was truncating their POST methods. It turns out that Jetty restricts this to 200,000 characters by default, and it wasn't enough for them. I do have to wonder what they're sticking in their queries, but OK. Or maybe they're POSTing RDF files to the server? 
That might explain it.&lt;br /&gt;&lt;br /&gt;Jetty normally lets you define a lot of configuration with system parameters from the command line, or with an XML configuration file, and I was asked if I could allow either of those methods. Unfortunately, our embedded use of Jetty doesn't allow for either of these, but since I was shown exactly what was wanted I was able to track it down. A bit of 'grepping' for the system parameter showed me the class that gets affected. Then some Javadoc surfing took me to the appropriate interface (Context), and then I was able to go grepping through Mulgara's code. I found where we had access to these Contexts, and fortunately the Jetty configuration was located nearby. Up until this point Jetty's Contexts had not been configurable, but now they are. I only added in the field that had been requested, but everything is set up to add more with just two lines of code each - plus the XSD to describe the configuration in the configuration file.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;jSPARQLc&lt;/h3&gt;&lt;br /&gt;My other weekend task was to add &lt;code&gt;CONSTRUCT&lt;/code&gt; support to &lt;a href="http://code.google.com/p/jsparqlc/"&gt;jSPARQLc&lt;/a&gt;. Sure, no one is using it yet, but Java needs so much boilerplate to make SPARQL work, that I figure it will be of use to &lt;em&gt;someone&lt;/em&gt; eventually – possibly me. I'm also finding it to be a good learning experience for why JDBC is the wrong paradigm for SPARQL. I'm not too worried about that though, as the boilerplate stuff is all there, and it would be easy to clean it up into something that doesn't try to conform to JDBC. But for the moment it's trying to make SPARQL look like JDBC, and besides there's already another library that isn't trying to look like JDBC. I'd better stick to my niche.&lt;br /&gt;&lt;br /&gt;I've decided that I'm definitely going to go with StAX to make forward-only result sets. 
However, I'm not sure if there is supposed to be a standard configuration for JDBC to set the desired form of the result set, so I haven't started on that yet.&lt;br /&gt;&lt;br /&gt;The result of a &lt;code&gt;CONSTRUCT&lt;/code&gt; is a graph. By default we can expect a RDF/XML document, though other formats are certainly possible. I'm not yet doing content negotiation with jSPARQLc, though that may need to be configurable, so I wanted to keep an open mind about what can be returned. That means that standard &lt;code&gt;SELECT&lt;/code&gt; queries could return &lt;a href="http://www.w3.org/TR/rdf-sparql-XMLres/"&gt;SPARQL Query Result XML&lt;/a&gt; or &lt;a href="http://www.w3.org/TR/rdf-sparql-json-res/"&gt;JSON&lt;/a&gt;, and &lt;code&gt;CONSTRUCT&lt;/code&gt; queries could result in RDF/XML, &lt;a href="http://www.w3.org/DesignIssues/Notation3"&gt;N3&lt;/a&gt;, or &lt;a href="http://n2.talis.com/wiki/RDF_JSON_Specification"&gt;RDF-JSON&lt;/a&gt; (Mulgara supports all but the last, but maybe I should add that one in. I've already left space for it).&lt;br /&gt;&lt;br /&gt;Without content negotiation, I'm keeping to the XML formats for the moment, with the framework looking for the other formats (though it will report that the format is not handled). Initially I thought I might have to parse the top of the file, until I cursed myself for an idiot and looked up the content type in the header. Once the parameters have been removed, I could use the content type to do a "look up" for a parser constructor. I like this approach, since it means that any new content types I want to handle just become new entries in the look-up table.&lt;br /&gt;&lt;br /&gt;This did leave me wondering if every SPARQL endpoint was going to fill in the Content-Type header, but I presume they will. 
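&lt;br /&gt;&lt;br /&gt;The look-up could be sketched roughly as follows. This is an illustration under assumptions rather than the jSPARQLc code: the &lt;code&gt;ContentTypeLookup&lt;/code&gt; class is hypothetical, and the table maps to parser names where the real thing would presumably hold parser constructors:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: names are illustrative and not taken from jSPARQLc.
class ContentTypeLookup {

    // Media type to the name of the parser that handles it. A real table
    // would probably hold parser constructors or factories instead of names.
    private static final Map PARSERS = new HashMap();
    static {
        PARSERS.put("application/sparql-results+xml", "SPARQL XML result parser");
        PARSERS.put("application/rdf+xml", "RDF/XML graph parser");
    }

    // Drops parameters such as "; charset=utf-8" and normalizes case.
    static String mediaType(String contentTypeHeader) {
        int semicolon = contentTypeHeader.indexOf(';');
        String type = (semicolon == -1)
                ? contentTypeHeader
                : contentTypeHeader.substring(0, semicolon);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the parser name, or reports that the format is not handled.
    static String parserFor(String contentTypeHeader) {
        String parser = (String) PARSERS.get(mediaType(contentTypeHeader));
        if (parser == null) {
            throw new UnsupportedOperationException("Format not handled: " + contentTypeHeader);
        }
        return parser;
    }
}
```

Handling a new format then means adding one more entry to the table, and anything unrecognized is reported as not handled.&lt;br /&gt;&lt;br /&gt;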
I can always try a survey of servers once I get more features into the code.&lt;br /&gt;&lt;br /&gt;Parsing an RDF/XML graph is a complex process that I had no desire to attempt (it could take all week to get it right - if not longer). Luckily, Jena has the ARP parser to do the job for me. However, the ARP parser is part of the main Jena jar, which seemed excessive to me. Fortunately, Jena's license is BSD, so it was possible to bring the ARP code in locally. I just had to update the packages to make sure it wouldn't conflict if anyone happens to have their own Jena in the classpath.&lt;br /&gt;&lt;br /&gt;Funnily enough, while editing the ARP files (I'm doing this project "oldschool" with &lt;a href="http://www.vim.org/"&gt;VIM&lt;/a&gt;), I discovered copyright notices for Plugged In Software. For anyone who doesn't know, Plugged In Software was the company that created the Tucana Knowledge Store (later to be open sourced as Kowari, then renamed to Mulgara). The company changed its name later on to match the software, but this code predated that. Looking at it, I seem to recall that the code in question was just a few bugfixes that Simon made. But it was still funny to see.&lt;br /&gt;&lt;br /&gt;Once I had ARP installed, I could parse a graph, but into what? I'm not trying to run a triplestore here, just an API. So I reused an interface I came up with when I built my SPARQL test framework when I needed to read a graph. The interface isn't fully indexed, but it lets you do a lot of useful things if you want to navigate around a graph. For instance, it lets you ask for the list of properties on a subject, or to find the value(s) of a particular subject's property, or to construct a list from an RDF collection (usually an iterative operation). 
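&lt;br /&gt;&lt;br /&gt;As a rough idea of what those operations involve, here is a minimal sketch of a graph indexed by subject. The &lt;code&gt;SimpleGraph&lt;/code&gt; class and its method names are hypothetical, not the actual interface, and nodes are plain strings to keep it short:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a subject-indexed graph; not the jSPARQLc interface.
class SimpleGraph {
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String FIRST = RDF + "first";
    static final String REST = RDF + "rest";
    static final String NIL = RDF + "nil";

    // subject to (predicate to list of objects)
    private final Map index = new HashMap();

    void add(String s, String p, String o) {
        Map props = (Map) index.get(s);
        if (props == null) {
            props = new HashMap();
            index.put(s, props);
        }
        List values = (List) props.get(p);
        if (values == null) {
            values = new ArrayList();
            props.put(p, values);
        }
        values.add(o);
    }

    // The properties used on a subject (empty if the subject is unknown).
    List properties(String s) {
        Map props = (Map) index.get(s);
        return props == null ? new ArrayList() : new ArrayList(props.keySet());
    }

    // The value(s) of one property on a subject.
    List values(String s, String p) {
        Map props = (Map) index.get(s);
        List values = props == null ? null : (List) props.get(p);
        return values == null ? new ArrayList() : values;
    }

    // Walks rdf:first/rdf:rest iteratively until rdf:nil or a missing link.
    // Assumes a well-formed collection; a cyclic one would never terminate.
    List collection(String head) {
        List items = new ArrayList();
        String node = head;
        while (node != null) {
            if (node.equals(NIL)) break;
            List first = values(node, FIRST);
            if (!first.isEmpty()) items.add(first.get(0));
            List rest = values(node, REST);
            node = rest.isEmpty() ? null : (String) rest.get(0);
        }
        return items;
    }
}
```

Everything here hangs off the subject, which is why questions that start from an object or a predicate would need further indexes.&lt;br /&gt;&lt;br /&gt;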
Thinking that I might also want to ask questions about particular objects (or even predicates) I've added in the other two indexes this time, but I'm in two minds about whether they really need to be there.&lt;br /&gt;&lt;br /&gt;The original code for my graph interface was in Scala, and I was tempted to bring it in like this. But one of the purposes of this project was to be lightweight (unfortunately, I lost that advantage when I discovered that ARP needs &lt;a href="http://xerces.apache.org/"&gt;Xerces&lt;/a&gt;), so I thought I should try to avoid the Scala JARs. Also, I thought that the exercise of bringing the Scala code into Java would refresh the API for me, as well as refresh me on Scala (which I haven't used for a couple of months). It did all of this, as well as having the effect of reminding me why Scala is so superior to Java.&lt;br /&gt;&lt;br /&gt;Anyway, the project is getting some meat to it now, and it's been fun to work on in my evenings, and while I've been stuck indoors on my weekends. 
If anyone has any suggestions for it, then please feel free to let me know.</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/2991388790499106088/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=2991388790499106088' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2991388790499106088'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2991388790499106088'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2010/04/multitasking-at-moment-i-feel-like-i.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-67221495540229172</id><published>2010-04-24T16:39:00.003-05:00</published><updated>2010-04-24T20:49:09.949-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Program Modeling&lt;/h3&gt; &lt;br /&gt;The rise of &lt;a href="http://en.wikipedia.org/wiki/Model-driven_architecture"&gt;Model Driven Architecture&lt;/a&gt; in development has created a slew of modeling frameworks that can be used as tools for this design process. While some MDA tools involve a static design and a development process, others seek to have an interactive model that allows developers to work with the model at runtime. Since an RDF store allows models developed in RDFS and OWL to be read and updated at runtime, it therefore seems natural to use the modeling capabilities of RDFS and OWL to provide yet another framework for design and development.  RDF also has the unusual characteristic of seamlessly mixing instance data with model data (the so-called "ABox" and "TBox"), giving the promise of a system that allows both dynamic modeling and persistent data all in a single integrated system. However, there appear to be some common pitfalls that developers fall prey to, which make this approach less useful than it might otherwise be.&lt;br /&gt;&lt;br /&gt;For good or ill, two of the most common languages used on large scale projects today are Java and C#. Java in particular also has good penetration on the web, though for smaller projects more modern languages, such as Python or Ruby, are more often deployed. 
There are lots of reasons for Java and C# to be so popular on large projects: They are widely known and it can be easy to assemble a team around them; both the JVM and .Net engines have demonstrated substantial benefits in portability, memory management and optimization through JIT compiling; they have established a reputation for stability over their histories. Being a fan of functional programming and modern languages, I often find Java to be frustrating, but these strengths often bring me back to Java again, despite its shortcomings. Consequently, it is usually with Java or C# in mind that MDA projects start out trying to use OWL modeling.&lt;br /&gt;&lt;br /&gt;On a related note, enterprise frameworks regularly make use of &lt;a href="http://www.hibernate.org/"&gt;Hibernate&lt;/a&gt; to store and retrieve instance data using a relational database (RDBMS). Hibernate maps object definitions to an equivalent representation in a database table using an Object-Relational Mapping (ORM). While not a formal MDA modeling paradigm, a relational database schema forms a model, in the same way that UML or MOF does (only less expressive). While an ORM is not a tool for MDA, it nevertheless represents the code of a project in a form of model, with instance data that is interactive at runtime.&lt;br /&gt;&lt;br /&gt;Unfortunately, the ORM approach offers a very weak form of modeling, and it has no capability to dynamically update at runtime. Several developers have looked at this problem and reasoned that perhaps these problems could be solved by modeling in RDF instead. After all, an RDF store allows a model to be updated as easily as the data, and the expressivity of OWL is far greater than that of the typical RDBMS schema. 
To this end, we have seen a few systems which have created an Object-Triples Mapping (OTM), with some approaches demonstrating more utility than others.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Static Languages&lt;/h3&gt;&lt;br /&gt;When OTM approaches are applied to Java and C#, we typically see an RDFS schema that describes the data to be stored and retrieved. This can be just as effective as an ORM on a relational database, and has the added benefit of allowing expansions of the class definitions to be implemented simply as extra data to be stored, rather than the whole change management demanded with the update of an RDBMS. An RDF store also has the benefit of being able to easily annotate the class instances being stored, though this is not reflected in the Java/C# classes and requires a new interface to interact with it. Unfortunately, the flow of modeling information through these systems is typically one-way, and does not make use of the real power being offered by RDF.&lt;br /&gt;&lt;br /&gt;ORM in Hibernate embeds the database schema into the programming language by mapping table schemas to class descriptions in Java/C# and table entries to instances of those classes in runtime memory. Perhaps through this example set by ORM, we often see OTM systems mapping OWL classes to Java/C# classes, and RDF instances to Java/C# instances. This mapping seems intuitive, and it has its uses, but it is also fundamentally flawed.&lt;br /&gt;&lt;br /&gt;The principal issue with OTM systems that attempt to embed themselves in the language is that static languages (like the popular Java and C# languages) are unable to deal with arbitrary RDF. RDF and OWL work on an Open World Assumption, meaning that there may well be more of the model that the system is not yet aware of, and should be capable of taking into consideration. However, static languages are only able to update class definitions outside of runtime, meaning that they cannot accept new modeling information during runtime. 
It &lt;em&gt;is&lt;/em&gt; possible to define a new class at runtime using a bytecode editing library, but then the class may only be accessed through meta-programming interfaces like reflection, defeating the purpose of the embedding in the first place. This is what is meant by the flow of modeling information being one-way: updates to the program code can be dynamically handled by a model in an RDF store, but updates to the model cannot be reflected by corresponding updates in the program.&lt;br /&gt;&lt;br /&gt;But these programming languages are Turing Complete. We ought to be able to work with dynamic modeling in triples with static languages, so how do we approach it? The solution is to abandon the notion of embedding the model into the language. These classes are not dynamically reconfigurable, and therefore they cannot update with new model updates. Instead, object structures that &lt;em&gt;can&lt;/em&gt; be updated can be used to represent the model. Unfortunately, this no longer means that the model is being used to model the programming code (as desired in MDA), but it does mean that the models are now accurate, and can represent the full functionality being expressed in the RDFS/OWL model.&lt;br /&gt;&lt;br /&gt;As an example, it is relatively easy to express an instance of a class as a Java Map, with properties being the keys, and the "object" values being the values in the map. This is exactly the same as the way structures are expressed in Perl, so it should be a familiar approach to many developers. These instances should be constructed with a factory that takes a structure that contains the details of an OWL class (or, more likely, that subset of OWL that is relevant to the application). In this way it is possible to accurately represent any information found in an RDF store, regardless of foreknowledge. 
I can personally attest to the ease and utility of this approach, having written a version of it over two nights, and then providing it to a colleague who used it along with Rails to develop an impressive prototype ontology and data editor, complete with rules engine, all in a single day. I expect others can cite similar experiences.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Really Embedding&lt;/h3&gt;&lt;br /&gt;So we've seen that static languages like C# and Java can't dynamically embed "Open" models like RDF/OWL, but are there languages that can?&lt;br /&gt;&lt;br /&gt;In the last decade we've seen a lot of "dynamic" languages gaining popularity, and to various extents, several of these offer that functionality. The most obvious example is &lt;a href="http://www.ruby-lang.org/en/"&gt;Ruby&lt;/a&gt;, which has explicit support for opening up already defined classes in order to add new methods, or redefine existing ones. Smalltalk has &lt;a href="http://coweb.cc.gatech.edu/cs2340/6243"&gt;Metaprogramming&lt;/a&gt;. Meta-programming isn't an explicit feature for many other languages, but so long as the language is dynamic there is often a way, such as these methods for &lt;a href="http://blog.ianbicking.org/2007/08/08/opening-python-classes/"&gt;Python&lt;/a&gt; and &lt;a href="http://transfixedbutnotdead.com/2010/01/14/anyone-for-perl-6-metaprogramming/"&gt;Perl&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Despite the opportunity to embed models into these languages, I'm unaware of any systems which do so. It seems that the only forms of OTM I can find are in the languages which already have an ORM, and are probably better used with that paradigm. 
There are numerous libraries for accessing RDF in each of the above languages, such as the Redland RDF Language bindings for &lt;a href="http://librdf.org/docs/ruby.html"&gt;Ruby&lt;/a&gt; and &lt;a href="http://librdf.org/docs/python.html"&gt;Python&lt;/a&gt;, &lt;a href="http://raa.ruby-lang.org/project/rena/"&gt;Rena&lt;/a&gt; and &lt;a href="http://activerdf.org/"&gt;ActiveRDF&lt;/a&gt; in Ruby, &lt;a href="http://www.rdflib.net/"&gt;RDFLib&lt;/a&gt; and &lt;a href="http://infomesh.net/pyrple/"&gt;pyrple&lt;/a&gt; in Python, &lt;a href="http://www.perlrdf.org/"&gt;Perl RDF&lt;/a&gt;... the list goes on and continues to grow. But none of the libraries I know perform any kind of language embedding in a dynamic language. My knowledge in this space is not exhaustive, but the lack of obvious candidates tells a story on its own.&lt;br /&gt;&lt;br /&gt;Is this indicative of dynamic languages not needing the same kind of modeling storage that static languages seem to require? Java and C# often use Hibernate and similar systems in large scale commercial applications with large development teams, while dynamic languages are often used by individuals or small groups to quickly put together useful systems that aim at a very different target market. But as commercial acceptance of dynamic languages develops further, perhaps this kind of modeling would be useful in future. In fact a good modeling library like this could well show a Java team struggling with their own version of an OTM just what they've been missing in their closed world.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Topaz&lt;/h3&gt;&lt;br /&gt;I wrote this essay partly out of frustration with a system I've worked with called &lt;a href="http://www.topazproject.org/trac/"&gt;Topaz&lt;/a&gt;. Topaz is a very clever piece of OTM code, written for Java, and built over my own project &lt;a href="http://mulgara.org/"&gt;Mulgara&lt;/a&gt;. 
However, Topaz suffers from all of the closed world problems I outlined above, without any apparent attempt to mitigate the use of RDF by reading extra annotations, etc. It has been used by the &lt;a href="http://www.plos.org/"&gt;Public Library of Science&lt;/a&gt; for their data persistence, but they have been unhappy with it, and it will soon be replaced.&lt;br /&gt;&lt;br /&gt;While performance in Mulgara (something I'm working on), in Topaz's use of Mulgara, and in Topaz itself, has been an issue, I believe that a deeper problem lay in the use of a dynamic system to represent static data. My knowledge of Topaz has me wondering why the system didn't simply choose to use Hibernate. I'm sure the use of RDF and OWL provided some functionality that isn't easily accomplished by Hibernate, but I don't see the benefits being strong enough to make it worth the switch to a dynamic model.&lt;br /&gt;&lt;br /&gt;For my money, I'd either adopt the standard static approach that so many systems already employ to great effect, or go the whole hog and design an OTM system that is truly dynamic and open.</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/67221495540229172/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=67221495540229172' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/67221495540229172'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/67221495540229172'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2010/04/program-modeling-rise-of-model-driven.html' 
title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-6950581414187666399</id><published>2010-04-21T21:26:00.003-05:00</published><updated>2010-04-21T22:43:51.857-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Work&lt;/h3&gt; I've had a number of administrative things to get done this week, since work will be taking a dramatic new turn soon. I've been missing working in a team, so that part will be good, but there are too many unknowns right now, including a visa nightmare that has been unceremoniously dumped in my lap. So, I'm stressed and have a lot to do. But that doesn't mean I'm not working.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Multi-Results&lt;/h3&gt;&lt;br /&gt;I'd recently been asked to allow the HTTP interfaces to return multiple results. One look at the &lt;a href="http://www.w3.org/TR/rdf-sparql-XMLres/"&gt;SPARQL Query Results XML Format&lt;/a&gt; makes it clear that SPARQL isn't capable of it, but the TQL XML format has always allowed it - or at least, I think it did. The SPARQL structure is sort of flat, with a declaration of variables at the top, and bindings under it. The TQL structure is similar, but embeds it all in another element called a "query". That name seems odd (since it's a result, not a query), so I wonder if someone had intended to include the original query as an attribute of that tag. Anyway, the structure is available, so I figured I should add it.&lt;br /&gt;&lt;br /&gt;This was a little trickier than I expected, since I'd tried to abstract out the streaming of answers. This means that I could select the output type simply by using a different streaming class. 
For now, the available classes are SPARQL-XML, SPARQL-JSON and TQL-XML, but there could easily be others. However, I now had to modify all of those classes to handle multiple answers. Of course, the SPARQL streaming classes had to ignore them, while the TQL class didn't, but that wasn't too hard. However, I came away feeling that it was somehow messier than it ought to have been. Even so, I thought it worked OK.&lt;br /&gt;&lt;br /&gt;One bit of complexity was in handling the GET requests of TQL vs. SPARQL. In SPARQL we can only expect a single query in a GET, but TQL can have multiple queries, separated by semicolons. While I like to keep as much code as possible common to all the classes, in the end I decided that the complexity of doing this was more than it was worth, and I put special multi-query-handling code in the TQL servlet.&lt;br /&gt;&lt;br /&gt;All of this was done a little while ago, but because I was waiting on responses on the mulgara.org move, I decided not to release just yet. This was probably fortunate, since I got an email the other day explaining that subqueries were not being embedded properly. They were starting with a new &lt;code&gt;query&lt;/code&gt; element tag, but not closing it. However, these tags should not have appeared at this level at all. The suggested patch would have worked, but it relied on checking the indentation used for pretty-printing in order to find out if the &lt;code&gt;query&lt;/code&gt; element should be opened. This would work, but was covering the problem, rather than solving it. A bit of checking, and I realized that I had code to send a header for each answer, code to send the data for the answer, but no code for the "footer". The footer would have been the closing tag for the &lt;code&gt;query&lt;/code&gt; element, and this was being handled in other code, meaning that it only came up at the top level, and not in the embedded sub-answers. 
This in turn meant that it wasn't always matching up to the header. So I introduced a footer method for answers (a no-op in SPARQL-XML and SPARQL-JSON) which cleaned up the process well enough that avoiding the header (and footer) on sub-answers was now easy to see and get right.&lt;br /&gt;&lt;br /&gt;So was I done? No. The email also commented on warnings of transactions not being closed. So I went looking at this, and decided that all answers were being closed properly. In confusion, I looked at the email again, and this time realized that the bug report said that they were using &lt;code&gt;POST&lt;/code&gt; methods. Since I was only dealing with queries (and not update commands) I had only gone to the &lt;code&gt;GET&lt;/code&gt; method. So I looked at &lt;code&gt;POST&lt;/code&gt;, and sure enough it was a dog's breakfast.&lt;br /&gt;&lt;br /&gt;Part of the problem with a &lt;code&gt;POST&lt;/code&gt; is that it can include updates as well as queries. Not having a standard response for an update, I had struggled a little with this in the past. In the end, I'd chosen to only output the final result of all operations, but this was causing all sorts of problems. For a start, if there was more than one query, then only the last would be shown (OK in SPARQL, not in TQL). Also, since I was ignoring so many things, it meant that I wasn't closing anything if it needed it. This was particularly galling to have wrong, since I'd finally added SPARQL support for &lt;code&gt;POST&lt;/code&gt; queries.&lt;br /&gt;&lt;br /&gt;I'd really have liked to use the same multi-result code that I had for &lt;code&gt;GET&lt;/code&gt; requests, but that didn't look like it was going to mix well with the need to support commands in the middle. In the end I copied/pasted some of the GET code (shudder) and fixed it up to deal with the result lists that I'd already built through the course of processing the &lt;code&gt;POST&lt;/code&gt; request. 
It doesn't look too bad, and I've commented on the redundancy and why I've allowed it, so I think it's all OK. Anyway, it's all looking good now. Given that I also have a major bugfix from a few weeks back, I should get it out the door despite the mulgara.org shuffle not being done.&lt;br /&gt;&lt;br /&gt;I didn't mention that major bug, did I? For anyone interested, some time early last year a race bug was avoided by putting a lock into the transaction code. Unfortunately, that lock was too broad, and it prevented any other thread from reading while a transaction was being committed. This locked the database up during large commit operations. It's not the sort of thing that you're likely to see with unit tests, but I was still deeply embarrassed. At least I found it (a big thanks to the guys at PLoS for reporting this bug, and helping me find where it was).&lt;br /&gt;&lt;br /&gt;So before I get dragged into any admin stuff tomorrow morning (office admin or sysadmin), I should try to cut a release to clean up some of these problems.&lt;br /&gt;&lt;br /&gt;Meanwhile, I'm going to relax with a bit of Hadoop reading. I once talked about putting a triplestore on top of this, and it's an idea that's way overdue. I know others have tried exactly this, but each approach has been different, and I want to see what I can make of it. 
But I think I need a stronger background in the subject matter before I try to design something in earnest.</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/6950581414187666399/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=6950581414187666399' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6950581414187666399'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6950581414187666399'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2010/04/work-ive-had-number-of-administrative.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-3012391275739554247</id><published>2010-04-19T11:03:00.003-05:00</published><updated>2010-04-19T20:17:34.202-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;jSPARQLc&lt;/h3&gt; &lt;br /&gt;I spent Sunday and a little bit of this afternoon finishing up the code and testing for the SPARQL API. The whole thing had just a couple of typos in it, which surprised me no end, because I used VIM and not Eclipse. I must be getting more thorough as I age. :-)&lt;br /&gt;&lt;br /&gt;Anyway, the whole thing works, though it's limited in scope. 
To start with, the only data accessing methods I wrote were &lt;code&gt;getObject(int)&lt;/code&gt; and &lt;code&gt;getObject(String)&lt;/code&gt;. Ultimately, I'd like to include many of the other &lt;code&gt;get...&lt;/code&gt; methods, but these would require tedious mapping. For instance, &lt;code&gt;getInt()&lt;/code&gt; would need to map &lt;code&gt;xsd:int&lt;/code&gt;, &lt;code&gt;xsd:integer&lt;/code&gt;, and all the other integral types to an integer. I've done this work in some of the internal APIs for Mulgara (particularly when dealing with SPARQL), so I know how tedious it is. I suppose I can copy/paste a good portion of it out of Mulgara (it's all Apache 2.0), but for the moment I wanted to get it up and running the way I said.&lt;br /&gt;&lt;br /&gt;If I were to do all of this, then there are all sorts of mappings that can be done between data types. For instance, &lt;code&gt;xsd:base64Binary&lt;/code&gt; would be good for Blobs. I could even introduce some custom data types to handle things like serialized Java objects, with a data type like: &lt;code&gt;java:org.example.ClassName&lt;/code&gt;. Actually, that looks familiar. I should see if anyone has done it.&lt;br /&gt;&lt;br /&gt;Anyway, as I progressed, I found that while it was straightforward enough to get basic functionality in, the JDBC interfaces are really inappropriate.&lt;br /&gt;&lt;br /&gt;To start with, JDBC usually accesses a "cursor" at the server end, and this is accessed implicitly by &lt;code&gt;ResultSet&lt;/code&gt;. It's not absolutely necessary, but moving backwards and forwards through a result that isn't entirely in memory really does need a server-side cursor. 
Since I'm doing everything in memory right now, I was able to do an implementation that isn't &lt;code&gt;TYPE_FORWARD_ONLY&lt;/code&gt;, but if I were to move over to using StAX (in the comments from my last entry) then I'd have to fall back to that.&lt;br /&gt;&lt;br /&gt;The server-side cursor approach also makes it possible to write to a &lt;code&gt;ResultSet&lt;/code&gt;, since SQL results are closely tied to the tables they represent. However, this doesn't really apply to RDF, since statements can never be updated, only added and removed. SPARQL Update is coming (I ought to know, as I'm the editor for the document), but there is no real way to convert the update operations on a &lt;code&gt;ResultSet&lt;/code&gt; back into SPARQL-Update operations over the network. It might be theoretically possible, but it would need a lot of communication with the server, and it doesn't even make sense. After all, you'd be trying to map one operation paradigm to a completely different one. Even if it could be made to work, it would be confusing to use. Since my whole point in writing this API was to simplify things for people who are used to JDBC, it would be self-defeating.&lt;br /&gt;&lt;br /&gt;So if this API were to allow write operations as well, then it would need a new approach, and I'm not sure what that should be. Passing SPARQL Update operations straight through might be the best bet, though it's not offering a lot of help (beyond doing all the HTTP work for you).&lt;br /&gt;&lt;br /&gt;The other thing that I noticed was that a blind adherence to the JDBC approach created a few classes that I don't think are really needed. For instance, the &lt;code&gt;ResultSetMetaData&lt;/code&gt; interface only contains two methods that make any sense from the perspective of SPARQL: &lt;code&gt;getColumnCount()&lt;/code&gt; and &lt;code&gt;getColumnName()&lt;/code&gt;. The data comes straight out of the ResultSet, so I would have put them there if the choice were mine. 
The real metadata is in the list of "link" elements in the result set, but this could be encoded with anything (even text) so there was no way to make that metadata fit the JDBC API. Instead, I just let the user ask for the list of links directly (creating a new method to do so).&lt;br /&gt;&lt;br /&gt;Another class that didn't make too much sense to me was &lt;code&gt;Statement&lt;/code&gt;. It's a handy place to record some state about what you've been doing on a &lt;code&gt;Connection&lt;/code&gt;, but other than that, it just seems to proxy the &lt;code&gt;Connection&lt;/code&gt; it's attached to. I see there are some options for caching (that I've never used myself when I've been on JDBC), so I suspect that it does more than I give it credit for, but for the moment it just appears to be an inconvenience.&lt;br /&gt;&lt;br /&gt;Anyway, I thought I'd put it up somewhere, and since I haven't tried Google's code repository before, I've put it up there. It's a Java SPARQL Connectivity library, so for lack of anything better I called it &lt;a href="http://code.google.com/p/jsparqlc/"&gt;jSPARQLc&lt;/a&gt; (maybe JRC for Java RDF Connectivity would have been better, but there are lots of JRC things out there, while jSPARQLc didn't return any hits from Google, so I went with that). It's very raw and has very little configuration, but it passes its tests.  :-)&lt;br /&gt;&lt;br /&gt;Speaking of tests, if you want to try it, then the connectivity tests won't pass until you've done the following:&lt;ul&gt;&lt;li&gt;Start a SPARQL endpoint at http://localhost:8080/sparql/ (the sourcecode in the test needs to change if your endpoint is elsewhere).&lt;/li&gt;&lt;li&gt;Create a graph with the URI &lt;code&gt;&amp;lt;test:data&amp;gt;&lt;/code&gt;&lt;/li&gt;&lt;li&gt;Load the data in &lt;em&gt;test.n3&lt;/em&gt; into it (this file is in the root directory)&lt;/li&gt;&lt;/ul&gt;I know I shouldn't have hardcoded some of this, but it was just a test on a 0.1 level project. 
If it seems useful, and/or you have ideas for it, then please let me know.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Mulgara&lt;/h3&gt;&lt;br /&gt;Other than this, I have some administration I need to do to get both Mulgara and Topaz onto another server, and this seems to slow everything down as I wait for the admins to get back to me. It's also why there hasn't been a Mulgara release recently, even though it's overdue. However, I just got a message from an admin today, so hopefully things have progressed. Even so, I think I'll just cut the release soon anyway.&lt;br /&gt;&lt;br /&gt;One fortunate aspect of the delayed release was a message I got from David Smith about how some resources aren't being closed in the TQL REST interface (when subqueries are used). He's sent me a patch, but I need to spend some time figuring out why I got this wrong, else I could end up hiding the real problem. That's a job for the morning... right after the SPARQL Working Group meeting. Once all of that is resolved, I'll get a release out, and try to figure out what I can do to speed up the server migration.&lt;br /&gt;&lt;br /&gt;Oh, and I need to update Topaz to take advantage of some major performance improvements in Mulgara, and then I need to find even  more performance improvements. 
Hopefully I'll be onto some of that by the afternoon, but I don't want to promise the moon only to come back tomorrow night and confess I got stuck on the same thing all day.</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/3012391275739554247/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=3012391275739554247' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3012391275739554247'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3012391275739554247'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2010/04/jsparqlc-i-spent-sunday-and-little-bit.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-2768812035285723292</id><published>2010-04-17T19:08:00.003-05:00</published><updated>2010-04-17T20:46:55.618-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;SPARQL API&lt;/h3&gt; Every time I try to use SPARQL with Java I keep running into all that minutiae that Java makes you deal with. The &lt;a href="http://hc.apache.org/"&gt;HttpComponents from Apache&lt;/a&gt; make things easier, but there's still a lot of code that has to be written. Then after you get your data back, you still have to process it, which means XML or JSON parsing. 
All of this means a lot of code, just to get a basic framework going.&lt;br /&gt;&lt;br /&gt;I know there are a lot of APIs out there for working with RDF engines, but there aren't many for working directly over SPARQL. I eventually had a look and found &lt;a href="http://sparql.sourceforge.net/"&gt;SPARQL Engine for Java&lt;/a&gt;, but this seemed to have more client-side processing than I'd expect. I haven't looked too carefully at it, so this may be incorrect, but I thought it would be worthwhile to take all the boilerplate I've had to put together in the past, and see if I can glue it all together in some sensible fashion. Besides, today's Saturday, meaning I don't have to worry about my regular job, and I'm recovering from a procedure yesterday, so I couldn't do much more than sit at the computer anyway.&lt;br /&gt;&lt;br /&gt;One of my inspirations was a conversation I had with &lt;a href="http://bblfish.net/"&gt;Henry Story&lt;/a&gt; (hmmm, Henry's let that link get badly out of date) a couple of years ago about a standard API for RDF access, much like &lt;a href="http://java.sun.com/javase/6/docs/technotes/guides/jdbc/"&gt;JDBC&lt;/a&gt;. At the time I didn't think that Sun could make something like that happen, but if there were a couple of decent attempts at it floating around, then some kind of pseudo standard could emerge. I never tried it before, but today I thought it might be fun to try.&lt;br /&gt;&lt;br /&gt;The first thing I remembered was that when you write a library, you end up writing all sorts of tedious code while you consider the various ways that a user might want to use it. So I stuck to the basics, though I did add in various options as I dealt with individual configuration options. So it's possible to set the &lt;em&gt;default-graph-uri&lt;/em&gt; as a single item as well as with a list (since a lot of the time you only want to set one graph URI). 
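&lt;br /&gt;&lt;br /&gt;As a rough illustration of that single-item versus list convenience, here is a minimal sketch. It is not the library's actual code: the DefaultGraphs class and its setDefaultGraphs and toParameters methods are hypothetical names chosen for this example, and the parameter values are left unencoded.

```java
import java.util.Arrays;

/** Hypothetical holder for a query's default graphs; every name here is illustrative. */
final class DefaultGraphs {
  private String[] graphUris = new String[0];

  /** Convenience form for the common case of a single default graph. */
  void setDefaultGraph(String graphUri) {
    setDefaultGraphs(graphUri);
  }

  /** General form: any number of default graphs (a defensive copy is stored). */
  void setDefaultGraphs(String... uris) {
    graphUris = Arrays.copyOf(uris, uris.length);
  }

  /** Builds one unencoded default-graph-uri parameter per graph. */
  String[] toParameters() {
    String[] params = new String[graphUris.length];
    int i = 0;
    for (String uri : graphUris) {
      params[i++] = "default-graph-uri=" + uri;
    }
    return params;
  }
}
```

Having the single-item setter delegate to the list form keeps the two paths from drifting apart.&lt;br /&gt;&lt;br /&gt;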
I was eschewing Eclipse today, so I ended up making use of VIM macros for some of my more tedious coding. The tediousness also reminded me again why I like Scala, but given that I wanted it to look &lt;em&gt;vaguely&lt;/em&gt; JDBC-like, I figured that the Java approach was more appropriate.&lt;br /&gt;&lt;br /&gt;I remember that TKS (the name of the first incarnation of the Mulgara codebase) had attempted to implement JDBC. Apparently, a good portion of the API was implemented, but there were some elements that just didn't fit. So from the outset I avoided trying to duplicate that mistake. Instead, I decided to cherry pick the most obvious features, abandon anything that doesn't make sense, and add in a couple of new features where it seems useful or necessary. So while some of it might &lt;em&gt;look&lt;/em&gt; like JDBC, it won't have anything to do with it.&lt;br /&gt;&lt;br /&gt;I found a piece of trivial JDBC code I'd used to test something once-upon-a-time, and tweaked it a little to look like something I might try to do with SPARQL. My goal was to write the library that would make this work, and then take it from there. 
This is the example:&lt;br /&gt;&lt;pre&gt;&lt;code&gt;    final String ENDPOINT = "http://localhost:8080/sparql/";&lt;br /&gt;    Connection c = DriverManager.getConnection(ENDPOINT);&lt;br /&gt;&lt;br /&gt;    Statement s = c.createStatement();&lt;br /&gt;    s.setDefaultGraph("test:data");&lt;br /&gt;    ResultSet rs = s.executeQuery("SELECT * WHERE { ?s ?p ?o }");&lt;br /&gt;    rs.beforeFirst();&lt;br /&gt;    while (rs.next()) {&lt;br /&gt;      System.out.println(&lt;br /&gt;              rs.getObject(1).toString() + ", " +&lt;br /&gt;              rs.getObject(2) + ", " +&lt;br /&gt;              rs.getObject(3));&lt;br /&gt;    }&lt;br /&gt;    rs.close();&lt;br /&gt;    c.close();&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;My first thought was that this is not how I would design the API (the "Statement" seems a little superfluous), but that wasn't the point.&lt;br /&gt;&lt;br /&gt;Anyway, I've nearly finished it, but I'm dopey from pain medication, so I thought I'd write down some thoughts about it, and pick it up again in the morning. So if anyone out there is reading this (which I doubt, given how little I write here) these notes are more for me than for you, so don't expect to find it interesting.  :-)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Observations&lt;/h3&gt;&lt;br /&gt;The first big difference to JDBC is the configuration. A lot of JDBC is either specific to a particular driver, or to RDBM Systems in general. This goes for the structure of the API as well as the configuration. For instance, &lt;code&gt;ResultSet&lt;/code&gt; seems to be heavily geared towards cursors, which SPARQL doesn't support. I was momentarily tempted to try emulating this functionality through LIMIT and OFFSET, but that would have involved a lot of network traffic, and could potentially interfere with the user trying to use these keywords themselves. 
Getting the row number (&lt;a href="http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html#getRow()"&gt;getRow&lt;/a&gt;) would have been really tricky if I'd gone that way too.&lt;br /&gt;&lt;br /&gt;But ResultSet was one of the last things I worked on today, so I'll rewind.&lt;br /&gt;&lt;br /&gt;The first step was making the HTTP call. I usually use GET, but I've recently added in the &lt;a href="http://www.w3.org/TR/rdf-sparql-protocol/#query-bindings-http"&gt;POST binding&lt;/a&gt; for SPARQL querying in Mulgara, so I made sure the client code can do both. For the moment I'm automatically choosing to do a POST query when the URL gets above 1024 characters (I believe that was the URL limit for some version of IE), but I should probably make the use of POST vs. GET configurable. Fortunately, building parameters was identical for both methods, though they get put into different places.&lt;br /&gt;&lt;br /&gt;Speaking of parameters, I need to check this out, but I believe that graph URIs in SPARQL do not get encoded. Now that's not going to work if they contain their own queries (and why wouldn't they), but most graphs don't do that, so it's never bitten me before. Fortunately, doing a URL-Decode on an unencoded graph URI is usually safe, so that's how I've been able to get away with it until now. But as a client that has to do the encoding I needed to think more carefully about it.&lt;br /&gt;&lt;br /&gt;From what I can tell, the only part that will give me grief is the query portion of the URI. So I checked out the query, and if there wasn't one, I just sent the graph unencoded. If there was one, then I'd encode just the query, add it to the URI, and then see if decoding got me back to the original. If it does, then I send that. Otherwise, I just encode the whole graph URI and send that. 
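&lt;br /&gt;&lt;br /&gt;Those steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's actual code: the GraphUriEncoder class and encodeGraphUri method are hypothetical names, UTF-8 is assumed, and the encoding is done with the standard java.net.URLEncoder and URLDecoder classes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical helper for the graph URI check described above; names are illustrative. */
final class GraphUriEncoder {
  static String encodeGraphUri(String graphUri) {
    int q = graphUri.indexOf('?');
    // No query portion: send the graph URI unencoded.
    if (q == -1) return graphUri;
    try {
      // Encode only the query portion and re-attach it to the rest of the URI.
      String base = graphUri.substring(0, q + 1);
      String candidate = base + URLEncoder.encode(graphUri.substring(q + 1), "UTF-8");
      // Keep the partial encoding only if decoding gets back to the original.
      if (URLDecoder.decode(candidate, "UTF-8").equals(graphUri)) return candidate;
      // Otherwise fall back to encoding the whole graph URI.
      return URLEncoder.encode(graphUri, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is required on every Java platform, so this should not occur.
      throw new IllegalStateException(e);
    }
  }
}
```

In this sketch a graph such as test:data passes through untouched, a URI with a simple query gets only its query portion encoded, and a URI whose path contains a plus sign fails the round trip and is encoded whole.&lt;br /&gt;&lt;br /&gt;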
As I write it down, it looks even more like a hack than ever, but so far it seems to work.&lt;br /&gt;&lt;br /&gt;So now that I have all the HTTP stuff happening, what about the response? Since answers can be large, my first thought was &lt;a href="http://java.sun.com/javase/6/docs/api/org/xml/sax/package-summary.html"&gt;SAX&lt;/a&gt;. Well, actually, my first thought was Scala, since I've already parsed SPARQL response documents with Scala's XML handling, and it was trivial. But I'm in Java, so that means SAX or DOM. SAX can handle large documents, but possibly more importantly, I've always found SAX easier to deal with than DOM, so that's the way I went.&lt;br /&gt;&lt;br /&gt;Because SAX operates on a stream, I thought I could build a stream handler, but I think that was just the medication talking, since I quickly remembered that it's an event model. The only way I could do it as a stream would be if I built up a queue with one thread writing at one end and the consuming thread taking data off at the other. That's possible, but it's hard to test if it scales, and if the consumer doesn't drain the queue in a timely manner, then you can cause problems for the writing end as well. It's possible to slow up the writer by not returning from the event methods until the queue has some space, but that seems clunky. Also, when you consider that a ResultSet is supposed to be able to rewind and so forth, a streaming model just doesn't work.&lt;br /&gt;&lt;br /&gt;In the end, it seemed that I would have to have my ResultSets in memory. This is certainly easier than any other option I could think of, and the size of RAM these days means that it's not really a big deal. But it's still in the back of my mind that maybe I'm missing an obvious idea.&lt;br /&gt;&lt;br /&gt;The other thing that came to mind is to create an API that provides object events in the same way that SAX provides events for XML elements. 
This would work fine, but it's nothing like the API I'm trying to look like, so I didn't give that any serious thought.&lt;br /&gt;&lt;br /&gt;So now I'm in the midst of a SAX parser. There's a lot of work in there that I don't need when working with other languages, but it does give you a comfortable feeling knowing that you have such fine-grained control over the process. Java enumerations have come in handy here, as I decided to go with a state-machine approach. I don't use this very often (outside of hardware design, where I've always liked it), but it's made the coding so straightforward it's been a breeze.&lt;br /&gt;&lt;br /&gt;One question I have is whether the parser should create a ResultSet object, or whether it should &lt;em&gt;be&lt;/em&gt; the object. It's sort of easy to just create the object with the InputStream as the parameter for the constructor, but then the object you get back could be either a boolean result or a list of variable bindings, and you have to interrogate it to find out which one it is. The alternative is to use a factory that returns different types of result sets. I initially went with the former because both have to parse the header section, but now that I've written it out, I'm thinking that the latter is the better way to go. I'll change it in the morning.&lt;br /&gt;&lt;br /&gt;I'm also thinking of having a parser to deal with JSON (I did some abstraction to make this easy), but for now I'll just take one step at a time.&lt;br /&gt;&lt;br /&gt;One issue I haven't given a lot of time to yet is the CONSTRUCT query. These have to return a graph and not a result set. That brings a few questions to mind:&lt;ul&gt;&lt;li&gt;How do I tell the difference? I don't want to do it in the API, since that's something the user may not want to have to figure out. 
But short of having an entire parser, it could be difficult to see the form of the query before it's sent.&lt;/li&gt;&lt;li&gt;I can wait for the response, and figure it out there, but then my SAX parser needs to be able to deal with RDF/XML. I usually use Jena's parser for this, since I know it's a lot of work. Do I really want to go that way? Unfortunately, I don't know of any good way to move to a different parser once I've seen the opening elements. I &lt;em&gt;could&lt;/em&gt; try a &lt;a href="http://java.sun.com/javase/6/docs/api/java/io/BufferedInputStream.html"&gt;BufferedInputStream&lt;/a&gt;, so I could rewind it, but can that handle really large streams? I'll think on that.&lt;/li&gt;&lt;li&gt;How do I represent the graph at the client end?&lt;/li&gt;&lt;/ul&gt;Representing a graph goes way beyond ResultSet, and poses the question of just how far to go. A simple list of triples would probably suffice, but if I have a graph then I usually want to do interesting stuff with it.&lt;br /&gt;&lt;br /&gt;I'm thinking of using my normal graph library, which isn't out in the wild yet, but I find it very useful. I currently have implementations of it in Java, Ruby and Scala. I keep re-implementing it whenever I'm in a new language, because it's just so useful (it's trivial to put under a Jena or Mulgara API too). However, it also goes beyond the JDBC goal that I was looking for, so I'm cautious about going that way.&lt;br /&gt;&lt;br /&gt;Anyway, it's getting late on a Saturday night, and I'm due for some more pain medication, so I'll leave it there. I need to talk to people about work again, so having an active blog will be important once more (even if it makes me look ignorant occasionally). 
I'll see if I can keep it up.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-2768812035285723292?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/2768812035285723292/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=2768812035285723292' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2768812035285723292'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2768812035285723292'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2010/04/sparql-api-every-time-i-try-to-use.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-3040619251796237722</id><published>2009-05-21T13:15:00.004-05:00</published><updated>2009-05-21T18:15:07.306-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Federated Queries&lt;/h3&gt; A long time ago, TKS supported federated queries, though the approach was a little naive (bring all the matches of triple patterns in to a single place, and join them there). Then a few years ago I added this to Mulgara as well. I've always wanted to make it more intelligent in order to reduce network bandwidth, but at the same time, I was always happy that it worked. Unfortunately, it was all accomplished through RMI, and was Mulgara specific. 
That used to be OK, since RDF servers didn't have standardized communications mechanisms, but that changed with SPARQL.&lt;br /&gt;&lt;br /&gt;More recently, I've started running across distributed queries through another avenue. While working through the SPARQL protocol, I realized that the Mulgara approach of treating unknown HTTP URIs as data that can be retrieved can be mixed with SPARQL CONSTRUCT queries encoded into a URI. The result of an HTTP request on a SPARQL CONSTRUCT query is an RDF document, which is exactly what Mulgara is expecting when it does an HTTP GET on a graph URI. The resulting syntax is messy, but it works quite well. Also, while retrieving graph URIs is not standard in SPARQL, most systems implement this, making it a relatively portable idiom. I was quite amused at the exclamations of surprise and horror (especially the horror) when I &lt;a href="http://lists.w3.org/Archives/Public/semantic-web/2009May/0029.html"&gt;passed this along on a mailing list&lt;/a&gt; a few weeks ago.&lt;br /&gt;&lt;br /&gt;The ease with which this was achieved using SPARQL made me consider how federated querying might be done using a SPARQL-like protocol. Coincidentally, the &lt;a href="http://www.w3.org/2009/sparql/wiki/Main_Page"&gt;SPARQL Working Group&lt;/a&gt; has &lt;a href="http://www.w3.org/2009/sparql/wiki/Feature:BasicFederatedQuery"&gt;Basic Federated Queries&lt;/a&gt; as a proposed feature, and now I'm starting to see a lot of people asking about it on mailing lists (were people always asking about this, or am I just noticing it now?). I'm starting to think this feature may be more important in SPARQL, and think that perhaps I should have made it a higher priority when I voted on it. 
As it is, it's in the list of things we'll get to if we have time.&lt;br /&gt;&lt;br /&gt;Then, while I was thinking about this, one of the other Mulgara developers tells me that he absolutely has to have distributed queries (actually, he needs to run rules over distributed datasets) to meet requirements in his organization. Well, the existing mechanisms will sort of work for him, but to do it right it should be in SPARQL.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Requirements&lt;/h3&gt; So what would I want to see in federated SPARQL queries? Well, as an implementer I need to see a few things:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;A syntactic mechanism for defining the URI of a SPARQL endpoint containing the graph(s) to be accessed.&lt;/li&gt;&lt;li&gt;A syntactic mechanism for defining the query to be made on that endpoint (a subquery syntax would be fine here).&lt;/li&gt;&lt;li&gt;A means of asking the size of a query result.&lt;/li&gt;&lt;li&gt;A mechanism for passing existing bindings along with a query.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;The first item seemed trivial until I realized that SPARQL has no standard way of describing an endpoint. Systems like Mulgara simply use &lt;code&gt;http://hostname/sparql/&lt;/code&gt;, which provides access to the entire store (everything can be referred to using HTTP parameters, such as &lt;code&gt;default-graph-uri&lt;/code&gt; and &lt;code&gt;query&lt;/code&gt;). 
On the other hand, &lt;a href="http://www.joseki.org/"&gt;Joseki&lt;/a&gt; can do the /sparql/ thing, but also provides an option to access datasets through the path, and &lt;a href="http://www.openrdf.org/"&gt;Sesame&lt;/a&gt; can have several repositories, each of which is accessible by varying the path in the URL.&lt;br /&gt;&lt;br /&gt;The base URL for issuing SPARQL queries against would be easy enough to specify, but it introduces a new concept into the query language, and that has deeper ramifications than should be broached in this context.&lt;br /&gt;&lt;br /&gt;The query that can be issued against an endpoint should look like a standard query, and not just a CONSTRUCT, as this provides greater flexibility and also binds the columns to variable names that can appear in other parts of the query. This is basically identical to a subquery, which is exactly what we want.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Bandwidth Efficiency&lt;/h3&gt; The last 2 items are a matter of efficiency and not correctness. However, they can mean the difference between transferring a few bytes vs a few megabytes over a network.&lt;br /&gt;&lt;br /&gt;(BTW, when did "bandwidth" get subverted to describe data rates? When I was a boy this referred to the range of frequencies that a signal used, and this had a mathematical formula relating it to the number of symbols-per-second that could be transmitted over that signal - which does indeed translate to a data rate. However, it now gets used in completely different contexts which have nothing to do with frequency range. Oh well.. 
back to the story).&lt;br /&gt;&lt;br /&gt;If I want to ask for the identifiers of people named "Fred" (as opposed to something else I want to name with &lt;code&gt;foaf:givenname&lt;/code&gt;), then I could use the query:&lt;pre&gt;&lt;code&gt;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;&lt;br /&gt;SELECT ?person WHERE {&lt;br /&gt;  ?person foaf:givenname "Fred" .&lt;br /&gt;  ?person a foaf:Person&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;Now what if the "type" data and the "name" data appear on different servers? In that case we would use a distributed query.&lt;br /&gt;&lt;br /&gt;Using the HTTP/GET idiom I mentioned at the top of this post, I could send the query to the server containing the &lt;code&gt;foaf:givenname&lt;/code&gt; statements, and change it now to say:&lt;pre&gt;&lt;code&gt;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;&lt;br /&gt;SELECT ?person WHERE {&lt;br /&gt;  ?person foaf:givenname "Fred" .&lt;br /&gt;  GRAPH &amp;lt;http://hostname/sparql/?query=&lt;br /&gt;PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0A&lt;br /&gt;CONSTRUCT+%7B%3Fperson+a+foaf%3APerson%7D+&lt;br /&gt;WHERE+%7B%3Fperson+a+foaf%3APerson%7D&amp;gt; {&lt;br /&gt;    ?person a foaf:Person&lt;br /&gt;  }&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;So now the server will resolve all the entities with the name "Fred", then it will retrieve a graph and ask it for all the entities that are a &lt;code&gt;foaf:Person&lt;/code&gt;. Then it will join these results to create the final result.&lt;br /&gt;&lt;br /&gt;But what happens if there are only 3 things named "Fred", but 10,000 people in the data set? In that case the server will resolve the first pattern, getting 3 bindings for ?person, and then make a request across the network, getting back 10,000 statements which are then queried for those statements that describe a &lt;code&gt;foaf:Person&lt;/code&gt; (they all will), and only then does the join happen. 
Ideally, we'd have gone the other way, and asked the server with 10,000 people to request data from the server that had 3 entities named Fred, but we may not have known ahead of time that this would be better, and a more complex query could require a more complex access pattern than simply "reversing" the resolution order.&lt;br /&gt;&lt;br /&gt;First of all, we need a way to ask each server how large a set of results is likely to be. The &lt;a href="http://www.w3.org/2009/sparql/wiki/Feature:AggregateFunctions"&gt;COUNT&lt;/a&gt; function that is being discussed in the SPARQL &lt;abbr title="Working Group"&gt;WG&lt;/abbr&gt; at the moment could certainly be used to help here, though for the sake of efficiency it would also be nice to have a mechanism for asking for the upper-limit of the COUNT. That isn't appropriate for a query language (since it refers to database internals) but would be nice to have in the protocol, such as with an HTTP/OPTIONS request (though I &lt;em&gt;really&lt;/em&gt; don't see something like that being ratified by the SPARQL WG). But even without an "upper limit" option, a normal COUNT would give us what we need to find out how to move the query around.&lt;br /&gt;&lt;br /&gt;So once we realize that the server running the query has only a little data (call it "Server A"), and it needs to join it to a large amount of data on a different server (call this one "Server B"), then of course we want Server A to send the small amount of data to Server B instead of retrieving the large amount from it. One way to do this might be to invert the query at this point, and send the whole thing to Server B. That server then asks Server A for the data, and sends its response. Unfortunately, that is both complex, and requires a lot more hops than we want. 
The final chain here would be:&lt;ol&gt;&lt;li&gt;Client sends query as a request to Server A&lt;/li&gt;&lt;li&gt;Server A reverses the query and sends the new query as a request to Server B&lt;/li&gt;&lt;li&gt;Server B resolves its local data, and sends the remainder of the query as a request to Server A&lt;/li&gt;&lt;li&gt;Server A responds to Server B with the result of entities with the name "Fred"&lt;/li&gt;&lt;li&gt;Server B joins the data it got with the local data and responds to Server A with the results of the entire query&lt;/li&gt;&lt;li&gt;Server A responds to the client with the unmodified results it just received&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Yuck.&lt;br /&gt;&lt;br /&gt;Instead, when Server A detects a data size disparity like this, it needs a mechanism to package up its bindings for the &lt;em&gt;?person&lt;/em&gt; variable, and send these to Server B along with the request. Fortunately, we already have a format for this in the &lt;a href="http://www.w3.org/TR/rdf-sparql-XMLres/"&gt;SPARQL result set format&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Normally, a query would be performed using an HTTP/GET, but including a content-body in a GET request has never been formally recognized (though it has not been made illegal), so I don't want to go that way. Instead, a POST would work just as well here. 
The HTTP request with content could look like this (I've added line breaks to the request):&lt;pre&gt;&lt;code&gt;POST /sparql/?query=&lt;br /&gt;PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0A&lt;br /&gt;SELECT+%3Fperson+WHERE+%7B%3Fperson+a+foaf%3APerson%7D HTTP/1.1&lt;br /&gt;Host: www.example&lt;br /&gt;User-agent: my-sparql-client/0.1&lt;br /&gt;Content-Type: application/sparql-results+xml&lt;br /&gt;Content-Length: xxx&lt;br /&gt;&lt;br /&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;br /&gt;&amp;lt;sparql xmlns="http://www.w3.org/2005/sparql-results#"&amp;gt;&lt;br /&gt; &amp;lt;head&amp;gt;&lt;br /&gt;   &amp;lt;variable name="person"/&amp;gt;&lt;br /&gt; &amp;lt;/head&amp;gt;&lt;br /&gt; &amp;lt;results distinct="false" ordered="false"&amp;gt;&lt;br /&gt;   &amp;lt;result&amp;gt;&lt;br /&gt;     &amp;lt;binding name="person"&amp;gt;&amp;lt;uri&amp;gt;http://www.example/FredFlintstone&amp;lt;/uri&amp;gt;&amp;lt;/binding&amp;gt;&lt;br /&gt;   &amp;lt;/result&amp;gt;&lt;br /&gt;   &amp;lt;result&amp;gt;&lt;br /&gt;     &amp;lt;binding name="person"&amp;gt;&amp;lt;uri&amp;gt;http://www.example/FredKruger&amp;lt;/uri&amp;gt;&amp;lt;/binding&amp;gt;&lt;br /&gt;   &amp;lt;/result&amp;gt;&lt;br /&gt;   &amp;lt;result&amp;gt;&lt;br /&gt;     &amp;lt;binding name="person"&amp;gt;&amp;lt;uri&amp;gt;http://www.example/FredTheDog&amp;lt;/uri&amp;gt;&amp;lt;/binding&amp;gt;&lt;br /&gt;   &amp;lt;/result&amp;gt;&lt;br /&gt; &amp;lt;/results&amp;gt;&lt;br /&gt;&amp;lt;/sparql&amp;gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;I can't imagine that I could be successful in suggesting this as part of the underlying protocol for federated querying, but I'm thinking that I'll be incorporating it into Mulgara all the same.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-3040619251796237722?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://gearon.blogspot.com/feeds/3040619251796237722/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=3040619251796237722' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3040619251796237722'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3040619251796237722'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2009/05/federated-queries-long-time-ago-tks.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-1365826457813077288</id><published>2009-02-27T12:37:00.002-06:00</published><updated>2009-02-27T21:16:15.411-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OWL'/><category scheme='http://www.blogger.com/atom/ns#' term='integration'/><category scheme='http://www.blogger.com/atom/ns#' term='RLog'/><category scheme='http://www.blogger.com/atom/ns#' term='description logic'/><category scheme='http://www.blogger.com/atom/ns#' term='SKOS'/><category scheme='http://www.blogger.com/atom/ns#' term='RDF'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><category scheme='http://www.blogger.com/atom/ns#' term='logic'/><title type='text'></title><content type='html'>&lt;h3&gt;More Programmable Logic&lt;/h3&gt; In the last post I gave a basic description of how the Krule rules engine works. 
I left out a number of details, but it provides some details on the overall approach.&lt;br /&gt;&lt;br /&gt;The most important detail I skipped was the interaction with resolvers in order to identify resources with particular properties. This includes, among other things, finding URIs in a domain, inequality comparisons for numeric literals, regular expression matching on strings, the "type" of a resource (URI, blank node, or literal), and optimized transitivity of predicates. We also manage "subtractions" in our data, in the same way that Evren Sirin &lt;a href="http://clarkparsia.com/weblog/2009/02/11/integrity-constraints-for-owl/"&gt;described NOT&lt;/a&gt; two weeks ago. I should point out that Mulgara does subtractions directly, and not with a combined operation with the OPTIONAL/FILTER/!BOUND pattern. This was introduced in 2004 (or was it 2005?). I have to say that I'm a little surprised that SPARQL never included anything to express it directly, particularly since so many people ask for it (and hence, the popularization of OPTIONAL/FILTER/!BOUND as a pattern) and because the &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#sparqlAlgebra"&gt;SPARQL algebra&lt;/a&gt; provides a definition of a function called "Diff" that is used internally.&lt;br /&gt;&lt;br /&gt;Anyway, these extensions are not necessary to understand Krule or RLog, but I think it's useful to know that they're there.&lt;br /&gt;&lt;br /&gt;So now that I've described Krule, I've set the scene for describing RLog.&lt;br /&gt;&lt;h3&gt;Krule Configuration&lt;/h3&gt; When I first wrote &lt;abbr title="Kowari Rules"&gt;Krule&lt;/abbr&gt;, I was ultimately aiming at OWL, but I had a short term goal of RDFS. I find that I have to take these things one step at a time, or else I never make progress. Since I knew that my rules were going to expand, I figured I should not hard code anything into Mulgara, but that I should instead interpret a data structure which described the rules. 
That also meant I would be able to run rules for lots of systems: RDFS, SKOS, OWL, or anything else. Of course, some things would need more features than RDFS needed (e.g. both OWL and SKOS need "lists"), but my plan was to work on that iteratively.&lt;br /&gt;&lt;br /&gt;At the time, I designed an &lt;a href="http://mulgara.org/files/misc/krule.rdf"&gt;RDF schema&lt;/a&gt; to describe my rules, and built the Krule engine to initialize itself from this. This works well, since the whole system is built around RDF already. I also created a new TQL command for applying rules to data:&lt;pre&gt;&lt;code&gt;  &lt;strong&gt;apply&lt;/strong&gt; &amp;lt;&lt;em&gt;rule_graph_uri&lt;/em&gt;&amp;gt; &lt;strong&gt;to&lt;/strong&gt; &amp;lt;&lt;em&gt;data_graph_uri&lt;/em&gt;&amp;gt; [&amp;lt;&lt;em&gt;output_graph_uri&lt;/em&gt;&amp;gt;]&lt;/code&gt;&lt;/pre&gt;By default all of the entailed data goes into the graph the rules are being applied to, but by including the optional output graph you can send the entailed data there instead.&lt;br /&gt;&lt;br /&gt;This worked as planned, and I was able to build a &lt;a href="http://mulgara.org/files/misc/rdfs-krule.rdf"&gt;Krule configuration graph for RDFS&lt;/a&gt;. Then life and work interfered and the rules engine was put on the back burner before I got to add some of the required features (like consistency checking).&lt;br /&gt;&lt;br /&gt;Then about 18 months ago I thought I'd have a go at writing OWL entailment, at least for that part that the rules engine would support. So I set out to write a new Krule file. 
The complexity of the file was such that I started writing out the rules that I wanted using a kind of Prolog notation with second order programming, in a very similar way to how Raphael Volz represented the same constructs in some of &lt;a href="http://www.daml.org/listarchive/joint-committee/att-1254/01-bubo.pdf"&gt;his&lt;/a&gt; &lt;a href="http://lists.w3.org/Archives/Public/www-webont-wg/2002Oct/att-0033/Paper.pdf"&gt;papers&lt;/a&gt;. This grammar uses binary predicates to represent general triples, and unary predicates to indicate "type" statements, i.e. statements with a predicate of &lt;em&gt;rdf:type&lt;/em&gt;. As an example, the &lt;code&gt;owl:sameAs&lt;/code&gt; predicate indicates that if the subject of a statement is the &lt;code&gt;owl:sameAs&lt;/code&gt; another resource, then that statement can be duplicated with the other resource as the subject. This was easily expressed as:&lt;pre&gt;&lt;code&gt;  A(Y,Z) :- A(X,Z), owl:sameAs(X,Y).&lt;/code&gt;&lt;/pre&gt;I wrote out about 3 rules before I realized that converting these to Krule was going to be tedious and prone to error. In fact, I had unthinkingly demonstrated that I already had a language I wanted to use, and the process of translation was an easily automated task. Since the language was allowing me to describe RDF with logic, I decided to call it RLog (for RDF Logic).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Iterations&lt;/h3&gt; Andrae and I had been discussing how much we disliked &lt;a href="http://sablecc.org/"&gt;SableCC&lt;/a&gt; for generating the &lt;abbr title="Tucana Query Language"&gt;TQL&lt;/abbr&gt; parser for Mulgara, and so I started looking around at other parsers. The power of &lt;a href="http://en.wikipedia.org/wiki/LALR_parser"&gt;LALR parsers&lt;/a&gt; appealed to me, and so I went with &lt;a href="http://beaver.sourceforge.net/"&gt;Beaver&lt;/a&gt;. Along with the &lt;a href="http://jflex.de/"&gt;JFlex&lt;/a&gt; lexer, this software is a pleasure to use. 
I had learned how to use them both, &lt;em&gt;and&lt;/em&gt; created the RLog grammar in about an hour. I then converted the &lt;a href="http://mulgara.org/files/misc/rdfs-krule.rdf"&gt;Krule configuration for RDFS&lt;/a&gt; into this new grammar, and convinced myself that I had it right. Then life got in the way again, and I put it away.&lt;br /&gt;&lt;br /&gt;Last year while waiting for some tests to complete, I remembered this grammar, and spent some of my enforced downtime making it output some useful RDF in the Krule schema. Anyone who's looked at Krule may have noticed that triggers for rules (which rules cause which other rules to be run) are explicitly encoded into the configuration. I did this partly because I already had the list of trigger dependencies for RDFS rules, and partly because I thought it would offer more flexibility. However, I had realized some time before that these dependencies were easy to work out, and had been wanting to automate this. I decided that RLog was the perfect place to do it, partly because it meant not having to change much, but also because it still allowed me the flexibility of tweaking the configuration.&lt;br /&gt;&lt;br /&gt;Once I'd finished writing a system that could output Krule, I tested it against my &lt;a href="http://mulgara.org/files/misc/rdfs.dl"&gt;RDFS RLog file&lt;/a&gt;, and compared the generated Krule to the original configuration. Initially I was disappointed to see too many dependencies, but on closer inspection I realized that they were all valid. The original dependencies were a reduced set because they applied some of the semantics of the predicates and classes they were describing, which was not something that a grammar at the level of RLog could deal with. Semi-na&amp;iuml;ve evaluation was going to stop unnecessary rules from running anyway, so I decided that these extra triggers were fine. 
I ran it against the various test graphs that I had, and was pleased to see that it all worked perfectly.&lt;br /&gt;&lt;br /&gt;But once again, work and life got in the way, and I put it aside again.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;SKOS&lt;/h3&gt; A couple of months ago Brian asked me about running rules for generating &lt;a href="http://www.w3.org/TR/skos-reference/"&gt;SKOS&lt;/a&gt; entailments, as he was writing &lt;a href="http://www.devx.com/semantic/Article/39348/1954"&gt;a paper&lt;/a&gt; on this topic. I pointed him towards RLog and knocked up a couple of useful rules for him. However, as I got into it, I realized that I could actually do most of SKOS quite easily, and before I knew it, I'd written an entire &lt;a href="http://mulgara.org/trac/attachment/wiki/SKOS/skos.rlog"&gt;RLog program for it&lt;/a&gt;. The only thing I could not do was "&lt;a href="http://www.w3.org/TR/skos-reference/#L3312"&gt;S35&lt;/a&gt;", as this requires a predicate for list membership (also a requirement for OWL, and on my TODO list).&lt;br /&gt;&lt;br /&gt;The &lt;em&gt;really&lt;/em&gt; interesting thing about this document, is that almost everything is an axiom and not a rule. It only requires 2 RDFS rules and 5 OWL rules to make the whole thing work. This is quite important, as the complexity in running the rules is generally exponential in the number of rules.&lt;br /&gt;&lt;br /&gt;This is (&lt;a href="http://en.wiktionary.org/wiki/IMNSHO"&gt;IMNSHO&lt;/a&gt;) the power of ontologies. By providing properties of classes and properties, they reduce the need for many rules. To demonstrate what I mean, I've seen a few systems (such as &lt;a href="http://www.dbai.tuwien.ac.at/proj/dlv/"&gt;DLV&lt;/a&gt;) which define a predicate to be transitive in the following way:&lt;pre&gt;&lt;code&gt;  pred(A,C) :- pred(A,B), pred(B,C).&lt;/code&gt;&lt;/pre&gt;This works, but it creates a new rule to do it. Every new transitive predicate also gets its own rule. 
As I have already said, this has a significant detrimental effect on complexity.&lt;br /&gt;&lt;br /&gt;Conversely, models such as OWL are able to declare properties as "transitive". Each such declaration then becomes a statement rather than a rule. Indeed, all the transitive statements get covered with a single second-order rule:&lt;pre&gt;&lt;code&gt;  P(A,C) :- P(A,B), P(B,C), owl:TransitiveProperty(P).&lt;/code&gt;&lt;/pre&gt;"Second-order" refers to the fact that variables can be used for the predicates (such as the variable &lt;em&gt;P&lt;/em&gt; in the expression &lt;em&gt;P(A,B)&lt;/em&gt;), and that predicates can appear as parameters for other predicates, such as &lt;em&gt;owl:TransitiveProperty(...)&lt;/em&gt;. The symmetry of Mulgara indexes for RDF statements allows such second-order constructs to be evaluated trivially.&lt;br /&gt;&lt;br /&gt;Using the OWL construct for transitivity, any number of predicates can be declared as transitive with no increase to the number of rules. The complexity of rules does have a component derived from the number of statements, but this is closer to linear or polynomial (depending on the specific structure of the rules), and is therefore far less significant for large systems. It is also worth noting that several OWL constructs do not need an exhaustive set of their own rules, as their properties can be described using other OWL constructs. For instance, &lt;em&gt;owl:sameAs&lt;/em&gt; is declared as being an &lt;em&gt;owl:SymmetricProperty&lt;/em&gt;. This means that the entailment rule for &lt;em&gt;owl:sameAs&lt;/em&gt; (shown above) need only be written once for &lt;em&gt;owl:sameAs(A,B)&lt;/em&gt; and is not needed for the symmetric case of &lt;em&gt;owl:sameAs(B,A)&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;General Acceptance&lt;/h3&gt; Brian wasn't the only one to like RLog. I've had reports and feature requests from a few other people who like using it as well. 
The most commonly requested feature has been the generation of blank nodes. The main reason for this is to handle existential formula, which makes me wary, as this can lead to infinite loops if not carefully controlled. On the other hand, I &lt;em&gt;can&lt;/em&gt; see the usefulness of it, so I expect to implement it eventually.&lt;br /&gt;&lt;br /&gt;A related feature is to create multiple statements based on a single matched rule. This can usually be handled by introducing a new rule with the same body and a different head, but if a blank node has been generated by the rule, then there needs to be some way to re-use it in the same context.&lt;br /&gt;&lt;br /&gt;A problem with general usage is that the domains understood by RLog have been preset with the domains that I've wanted, namely: &lt;a href="http://www.w3.org/1999/02/22-rdf-syntax-ns#"&gt;RDF&lt;/a&gt;, &lt;a href="http://www.w3.org/2000/01/rdf-schema#"&gt;RDFS&lt;/a&gt;, &lt;a href="http://www.w3.org/2002/07/owl#"&gt;OWL&lt;/a&gt;, &lt;a href="http://www.w3.org/2001/XMLSchema#"&gt;XSD&lt;/a&gt;, &lt;a href="http://mulgara.org/mulgara#"&gt;MULGARA&lt;/a&gt;, &lt;a href="http://mulgara.org/owl/krule/#"&gt;KRULE&lt;/a&gt;, &lt;a href="http://xmlns.com/foaf/0.1/"&gt;FOAF&lt;/a&gt;, &lt;a href="http://www.w3.org/2004/02/skos/core#"&gt;SKOS&lt;/a&gt;, and &lt;a href="http://purl.org/dc/elements/1.1/"&gt;DC&lt;/a&gt;. The fix to this can be isolated in the parser, so I anticipate this being fixed by Monday. :-)&lt;br /&gt;&lt;br /&gt;Despite it being limited, RLog was proving to be useful, allowing me to encode systems like SKOS very easily. 
However, the fact that it was a separate program that translated an RLog file into Krule configuration files that &lt;em&gt;then&lt;/em&gt; had to be loaded and applied to data was a serious impediment to its use.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Integration&lt;/h3&gt; The Mulgara "Content Handler" interface is a mechanism for loading any kind of data as triples, and optionally writing it back out. The two main ones are the &lt;a href="http://www.w3.org/TR/rdf-syntax-grammar/"&gt;RDF/XML&lt;/a&gt; handler and the &lt;a href="http://www.w3.org/DesignIssues/Notation3.html"&gt;N3&lt;/a&gt; handler, but there are others in the default distribution as well. There is an MBox handler for representing Unix Mailbox files as RDF, and an &lt;abbrev title="Moving Pictures Expert Group-1, Audio Layer 3"&gt;MP3&lt;/abbrev&gt; handler which maps ID3 metadata from MP3 files. These handlers complement the "Resolver" interface which represents external data sources as a dynamic graph.&lt;br /&gt;&lt;br /&gt;Since RLog has a well-defined mapping into RDF (something the RLog program was already doing when it emitted RDF/XML), reimplementing this system as a content handler would integrate it into Mulgara with minimal effort. I had been planning on this for some time, but there always seemed to be more pressing priorities. These other priorities are still there (and still pressing!) but a few people (e.g. &lt;a href="http://prototypo.blogspot.com/2009/02/desperately-seeking-skos-vendors.html"&gt;David&lt;/a&gt;) have been pushing me for it recently, so I decided to bite the bullet and get it done.&lt;br /&gt;&lt;br /&gt;The first problem was that the parser was in Beaver. This is yet another &lt;abbrev title="Java ARchive"&gt;JAR&lt;/abbrev&gt; file to include at a time when I'm trying to cut down on our plethora of libraries. 
It also seemed excessive, since we already have both JavaCC &lt;em&gt;and&lt;/em&gt; SableCC in our system - the former for SPARQL, the latter for TQL, and I hope to redo TQL in JavaCC eventually anyway. So I decided to re-implement the grammar parser in JavaCC.&lt;br /&gt;&lt;br /&gt;Unfortunately, it's been over a year since I looked at JavaCC, and I was very rusty. So my first few hours were spent relearning token lookahead, and various aspects of JavaCC grammar files. I actually think I know it better now than I did when I first did the SPARQL parser (that's a concern). There are a few parts of the grammar which are not LL(1) either, which forced me to think through the structure more carefully, and I think I benefited from the effort.&lt;br /&gt;&lt;br /&gt;I was concerned that I would need to reimplement a lot of the AST for RLog, but fortunately this was not the case. Once I got a handle on the translation it all went pretty smoothly, and the JavaCC parser was working identically to the original Beaver parser by the end of the first day.&lt;br /&gt;&lt;br /&gt;After the parser was under control I moved on to emitting triples. This was when I was reminded that writing RDF/XML can actually be a lot easier than writing raw triples. I ended up making slow progress, but I finally got it done last night.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Testing&lt;/h3&gt; Before running a program through the content handler for the first time, I wanted to see that the data looked as I expected it to. &lt;a href="http://www.junit.org/"&gt;JUnit&lt;/a&gt; tests are going to take some time to write, and with so much framework around the rules configuration, it was going to be clear very quickly if things weren't &lt;em&gt;just&lt;/em&gt; right. I considered running it all through the debugger, but that was going to drown me in a sea of RDF nodes. 
That was when I decided that I could make the resolver/content-handler interfaces work for me.&lt;br /&gt;&lt;br /&gt;Mulgara can't usually tell the difference between data in its own storage, and data sourced from elsewhere. It always opts for internal storage first, but if a graph URI is not found, then it will ask the various handlers if they know what to do with the graph. By using a &lt;strong&gt;file:&lt;/strong&gt; URL for the graphs, I could make Mulgara do all of its reading and writing to files, using the content handlers to do the I/O. In this case, I decided to "export" my RLog graph to an N3 file, and compare the result to an original Krule RDF/XML file that I exported to another N3 file.&lt;br /&gt;&lt;br /&gt;The TQL command for this was:&lt;pre&gt;&lt;code&gt;  export &amp;lt;file:/path/rdfs.rlog&amp;gt; to &amp;lt;file:/path/rlog-output.n3&amp;gt;&lt;/code&gt;&lt;/pre&gt;Similarly, the RDF/XML file was transformed to N3 with:&lt;pre&gt;&lt;code&gt;  export &amp;lt;file:/path/rdfs-krule.rdf&amp;gt; to &amp;lt;file:/path/krule-output.n3&amp;gt;&lt;/code&gt;&lt;/pre&gt;I love it when I can glue arbitrary things together and it all "just works". (This may explain why I like the Semantic Web).&lt;br /&gt;&lt;br /&gt;My first test run demonstrated that I was allowing an extra # into my URIs, and then I discovered that I'd fiddled with the literal token parsing, and was now including quotes in my strings (oops). These were trivial fixes. The third time through was the charm. I spent some time sorting my N3 files before deciding it looked practically identical, and so off I went to run an RLog program directly.&lt;br /&gt;&lt;br /&gt;As I mentioned in my last post, applying a set of rules to data is done with the &lt;strong&gt;apply&lt;/strong&gt; command. 
While I could have loaded the rules into an internal graph (pre-compiling them, so to speak) I was keen to "run" my program straight from the source:&lt;pre&gt;&lt;code&gt;  apply &amp;lt;file:/path/rdfs.rlog&amp;gt; to &amp;lt;test:data:uri&amp;gt;&lt;/code&gt;&lt;/pre&gt;...and whaddaya know? It worked. :-)&lt;br /&gt;&lt;br /&gt;Now I have a long list of features to add, optimizations to make, bugs to fix, and all while trying to stay on top of the other unrelated parts of the system. Possibly even more importantly, I need to document how to write an RLog file! But for the moment I'm pretty happy about it, and I'm going to take it easy for the weekend. See you all on Monday!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-1365826457813077288?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/1365826457813077288/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=1365826457813077288' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/1365826457813077288'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/1365826457813077288'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2009/02/more-programmable-logic-in-last-post-i.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-2523761822983983225</id><published>2009-02-27T10:46:00.003-06:00</published><updated>2009-02-27T12:37:03.960-06:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Programmable Logic&lt;/h3&gt; I had hoped to blog a lot this week, but I kept putting it off in order to get some actual work done. I still have a lot more to do, but I'm at a natural break, so I thought I'd write about it.&lt;br /&gt;&lt;br /&gt;I have finally integrated RLog into Mulgara! In some senses this was not a big deal, so it took a surprising amount of work.&lt;br /&gt;&lt;br /&gt;To explain what RLog is, I should describe Krule first.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Krule&lt;/h3&gt; The Mulgara rules engine, &lt;abbrev title="Kowari Rules"&gt;Krule&lt;/abbrev&gt;, implements my design for executing rules on large data sets. For those familiar with rules engines, it is similar to &lt;a href="http://en.wikipedia.org/wiki/Rete_algorithm"&gt;RETE&lt;/a&gt;, but it runs over data as a batch, rather than iteratively. It was designed this way because our requirements were to perform inferencing on gigabytes of RDF. This was taking a long time to load, so we wanted to load it all up, and &lt;em&gt;then&lt;/em&gt; do the inferencing.&lt;br /&gt;&lt;br /&gt;RETE operates by setting up a network describing the rules, and then as statements are fed through it, each node in that network builds up a memory of the data it has seen and passed. As data comes in the first time, the nodes that care about it will remember this data and pass it on, but subsequent times through, the node will recognize the data and prevent it from going any further. In this way, every statement fed into the system will be processed as much as it needs to be, and no more. 
There is even a proof out there that says that RETE is the optimal rules &lt;em&gt;algorithm&lt;/em&gt;. Note that the implementation of an algorithm can, and does, have numerous permutations which can allow for greater efficiency in some circumstances, so RETE is often treated as the basis for efficient engines.&lt;br /&gt;&lt;br /&gt;One variation on algorithms of this sort is to trade &lt;em&gt;time&lt;/em&gt; for &lt;em&gt;space&lt;/em&gt;. This is a fancy way of saying that if we use more memory then we can use less processing time, or we can use more processing time to save on memory. Several variations on RETE do just this, and so does Krule.&lt;br /&gt;&lt;br /&gt;When looking at the kinds of rules that can be run on RDF data, I noticed that the simple structure of RDF meant that each "node" in a RETE network corresponds to a constraint on a triple (or a &lt;abbrev title="Basic Graph Pattern"&gt;BGP&lt;/abbrev&gt; in &lt;a href="http://www.w3.org/TR/rdf-sparql-query/"&gt;SPARQL&lt;/a&gt;). Because Mulgara is indexed in "every direction", this means that every constraint can be found as a "slice" out of one of the indexes (while our current system usually takes &lt;em&gt;O(log(n))&lt;/em&gt; to find, upcoming systems can do some or all of these searches in &lt;em&gt;O(1)&lt;/em&gt;). Consequently, instead of my rule network keeping a table in memory associated with every node, there is a section out of an index which exactly corresponds to this table.&lt;br /&gt;&lt;br /&gt;There are several advantages to this. First, the existence of the data in the database is defined by it being in the indexes. This means that all the data gets indexed, rules engine or not. Second, when the rules engine is run, there is no need to use the data to iteratively populate the tables for each node, as the index slices (or &lt;strong&gt;constraint resolutions&lt;/strong&gt;) are &lt;em&gt;already&lt;/em&gt; fully populated, by definition. 
Finally, our query engine caches constraint resolutions, and they do not have to be re-resolved if no data has gone in that can affect them (well, some of the caching heuristics can be improved for better coverage, but the potential is there). This means that the "tables" associated with each node will be automatically updated for us as the index is updated, and the work needed to handle updates is minimal.&lt;br /&gt;&lt;br /&gt;During the first run of Krule, none of the potential entailments have been made yet, so everything is potentially relevant. However, during subsequent iterations of the rules, Krule has no knowledge of which statements are new in the table on any given node. This means it will produce entailed statements that already exist, and are duplicates. Inserting these is unnecessary (and hence, suboptimal) and creates unwanted duplicates. We handle this in two ways.&lt;br /&gt;&lt;br /&gt;The first and simpler mechanism is that Mulgara uses &lt;a href="http://en.wikipedia.org/wiki/Set_(mathematics)"&gt;Set semantics&lt;/a&gt;, meaning that any duplicates are silently (and efficiently) ignored. Set semantics are important when dealing with RDF, and this is why I'm so frustrated at the non-distinct nature of SPARQL queries... but that's a discussion for another time. :-)&lt;br /&gt;&lt;br /&gt;The more important mechanism for duplicate inserts is based on RDF having a property of being monotonically increasing. This is because RDF lets you assert data, but not "unassert" it. OWL 2 has introduced explicit denial of statements, but this is useful for preventing entailments and consistency checking... it does not remove previously existing statements. In non-monotonic systems a constraint resolution may keep the same size if some statements are deleted while an equal number of statements are inserted, but in a monotonic system like RDF, keeping the same size means that there has been no change. 
So a node knows to pass its data on if the size of its table increases, but otherwise it will do nothing. I stumbled across this technique as an obvious optimization, but I've since learned that it has a formal name: &lt;em&gt;semi-na&amp;iuml;ve evaluation&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Krule Extensions&lt;/h3&gt; While this covers the "batch" use-case, what about the "iterative" use-case, where the user wants to perform inferences on streams of data as it is inserted into an existing database? In this case, the batch approach is too heavyweight, as it will infer statements that are almost entirely pre-existing. We might handle the non-insertion of these statements pretty well, but if you do the work again and again for every statement you try to insert, then it will add up. In this case, the iterative approach of standard RETE is more appropriate.&lt;br /&gt;&lt;br /&gt;Unfortunately, RETE needs to build its tables up by iterating over the entire data set, but I've already indicated that this is expensive for the size of set that may be encountered. However, the Krule approach of using constraint resolutions as the tables is perfect for pre-populating these tables in a standard RETE engine. I mentioned this to Alex a few months ago, and he pointed out that he did exactly the same thing once before when implementing RETE in &lt;abbrev title="Tucana Knowledge Store"&gt;TKS&lt;/abbrev&gt;.&lt;br /&gt;&lt;br /&gt;I haven't actually done this extension, but I thought I'd let people know that we haven't forgotten it, and it's in the works. 
It will be based on Krule configurations, so a lot of existing work will be reused.&lt;br /&gt;&lt;br /&gt;I don't want to overdo it in one post, so I'll write about RLog in the next one.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-2523761822983983225?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/2523761822983983225/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=2523761822983983225' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2523761822983983225'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2523761822983983225'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2009/02/programmable-logic-i-had-hoped-to-blog.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-7513571316750121818</id><published>2009-02-19T00:34:00.002-06:00</published><updated>2009-02-19T00:47:08.730-06:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Is it Me?&lt;/h3&gt; After struggling against nonsensical errors all day, I finally realized that either Eclipse was going mad, or I was. 
In frustration I finally updated a method to look like this:&lt;pre&gt;&lt;code&gt;  public void setContextOwner(ContextOwner owner) {&lt;br /&gt;    contextOwner = owner;&lt;br /&gt;    if (owner != contextOwner) throw new AssertionError("VM causing problems");&lt;br /&gt;  }&lt;/code&gt;&lt;/pre&gt;This code throws an AssertionError, and yes, there is only 1 thread. If it were multithreaded at least I'd have a starting point. The problem only appears while debugging in Eclipse.&lt;br /&gt;&lt;br /&gt;I'm not sure whether I'm happy to have isolated my problem down to a single point, or to be unhappy at something that is breaking everything I am trying to work on. I guess I'm unhappy, because it's kept me up much later than I'd hoped.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-7513571316750121818?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/7513571316750121818/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=7513571316750121818' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/7513571316750121818'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/7513571316750121818'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2009/02/is-it-me-after-struggling-against.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-6773794379073821893</id><published>2009-02-17T00:48:00.003-06:00</published><updated>2009-02-17T02:16:08.381-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='REST'/><category scheme='http://www.blogger.com/atom/ns#' term='RDF'/><title type='text'></title><content type='html'>&lt;h3&gt;Resting&lt;/h3&gt; I've had a couple of drinks this evening. Let's see if that makes me more or less intelligible.&lt;br /&gt;&lt;br /&gt;Now that I've added a &lt;abbr title="REpresentational State Transfer"&gt;REST&lt;/abbr&gt;-like interface to Mulgara, I've found that I've been using it more and more. This is fine, but to modify data I've had to either upload a document (a very crude tool) or issue write commands on the &lt;abbr title="Tucana Query Language"&gt;TQL&lt;/abbr&gt; endpoint. Neither of these was very RESTful, and so I started wondering if it would make sense to do something more direct.&lt;br /&gt;&lt;br /&gt;From my perspective (and I'm sure there will be some who disagree with me), the basic resources in an RDF database are graphs and statements. Sure, the URIs themselves are resources, but the perspective of RDF is that these resources are infinite. Graphs are a description of how a subset of the set of all resources is related. Of course, these relationships are described via statements.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Graphs and Statements&lt;/h3&gt; So I had to work out how to represent a graph in a RESTful way. Unfortunately, graphs are already their own URI, and this probably has nothing to do with the server it is on. However, REST requires a URL which identifies the host and service, and then the resource within it. So the graph URI has to be embedded in the URL, after the host. While REST URLs typically try to reflect structure in a path, encoding a URL makes this almost impossible. 
Instead I opted to encode the graph URI as a "graph" parameter.&lt;br /&gt;&lt;br /&gt;Statements posed a similar though more complex challenge. I still needed the graph, so this had to stay. Similarly, the other resources also needed to be encoded as parameters, so I added these as well. This left me with 2 issues: blank nodes and literals.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Literals&lt;/h3&gt; Literals were reasonably easy... sort of. I simply decided that anything that didn't look like a URI would be a literal. Furthermore, if it was structured like a SPARQL literal, then this would be parsed, allowing a datatype or language to be included. However, nothing is ever &lt;em&gt;really&lt;/em&gt; easy (of course) and I found myself wondering about relative URIs. These had never been allowed in Mulgara before, but I've brought them in recently after several requests. Most people will ignore them, but for those people who have a use, they can be handy. That all seems OK, until you realize that the single quote character &amp;quot; is an &lt;em&gt;unreserved&lt;/em&gt; character in URIs, and so the apparent literal &lt;em&gt;&amp;quot;foo&amp;quot;&lt;/em&gt; is actually a valid relative URI. (Thank goodness for unit tests, or I would never have realized that). In the end, I decided to treat any valid URI as a URI and not a literal, &lt;em&gt;unless&lt;/em&gt; it starts with a quote. If you really want a relative URI of &lt;em&gt;&amp;quot;foo&amp;quot;&lt;/em&gt; then you'll have to choose another interface.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Blank Nodes&lt;/h3&gt; Blank nodes presented another problem. Initially, I decided that any missing parameter would be a blank node. That worked well, but then I started wondering about using the same blank node in more than one statement. 
I'm treating statements as resources, and you can't put more than one "resource" into a REST URL, so that would mean referring to the same "nameless" thing in two different method calls, which isn't possible. Also, adding statements with a blank node necessarily creates a new blank node every time, which breaks idempotency.&lt;br /&gt;&lt;br /&gt;Then what about deletion? Does nothing match, or does the blank node match everything? But doing matches like that means I'm no longer matching a single statement, which was what I was trying to do to make this REST and not RPC for a query-like command.&lt;br /&gt;&lt;br /&gt;Another option is to refer to blanks with a syntax like &lt;code&gt;_:123&lt;/code&gt;. However, this has all of the same problems we've had with exactly this idea in the query language. For instance, these identifiers are not guaranteed to match between different copies of the same data. Also, introducing new data that includes the same ID will accidentally merge these nodes incorrectly. There are other reasons as well. Essentially, you are using a name for something that was supposed to be nameless, and because you're not using URIs (like named things are supposed to use) then you're going to encounter problems. URIs were created for a reason. If you need to refer to something in a persistent way, then use a name for it. (Alternatively, use a query that links a blank node through a functional/inverse-functional predicate to uniquely identify it, but that's another discussion).&lt;br /&gt;&lt;br /&gt;So in the end I realized that I can't refer to blank nodes at all in this way. But I think that's OK. 
There are other interfaces available if you need to work with blank nodes, and &lt;a href="http://www.talis.com/platform/"&gt;some applications&lt;/a&gt; prohibit them anyway.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Reification&lt;/h3&gt; Something I wanted to come back to is this notion of representing a statement as 3 parameters in a URL (actually 4, since the graph is needed). The notion of representing a statement as a URI has already been addressed in reification. However, I dismissed this as a solution here, since reifying a statement does not imply that the statement exists (indeed, the purpose of the reification may be to say that the statement is false). All the same, it's left me thinking that I should consider a way to use this interface to reify statements.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Methods&lt;/h3&gt; So the methods as they stand now are:&lt;br /&gt;&lt;table border="1"&gt;&lt;tr&gt;&lt;th&gt;Method / Resource&lt;/th&gt;&lt;th&gt;Graph&lt;/th&gt;&lt;th&gt;Statement&lt;/th&gt;&lt;th&gt;Other&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th&gt;GET&lt;/th&gt;&lt;td align="center"&gt;N/A&lt;/td&gt;&lt;td align="center"&gt;N/A&lt;/td&gt;&lt;td align="center"&gt;Used for queries.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th&gt;POST&lt;/th&gt;&lt;td align="center"&gt;Upload graphs&lt;/td&gt;&lt;td align="center"&gt;N/A&lt;/td&gt;&lt;td align="center"&gt;Write commands (not SPARQL)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th&gt;PUT&lt;/th&gt;&lt;td align="center"&gt;Creates graph&lt;/td&gt;&lt;td align="center"&gt;Creates statement&lt;/td&gt;&lt;td align="center"&gt;N/A&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th&gt;DELETE&lt;/th&gt;&lt;td align="center"&gt;Deletes graph&lt;/td&gt;&lt;td align="center"&gt;Deletes statement&lt;/td&gt;&lt;td align="center"&gt;N/A&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br /&gt;I haven't done HEAD yet (I intend to indicate if a graph or statement exists), and I'm ignoring OPTIONS.&lt;br /&gt;&lt;br /&gt;I've also considered what it might mean to GET a statement or a 
graph. When applied to a graph, I could treat this as a synonym for the query:&lt;pre&gt;&lt;code&gt;  construct {?s ?p ?o} where {?s ?p ?o}&lt;/code&gt;&lt;/pre&gt;Initially I didn't think it made much sense to GET a statement, but while writing this it occurs to me that I could return a reification URI, if one exists (this is also an option for HEAD, but I think &lt;em&gt;existence&lt;/em&gt; is a better function there).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Is There a Point?&lt;/h3&gt; Everything I've discussed here may seem pointless, especially since there are alternatives, none of it is standard, and I'm sure there will be numerous criticisms of my choices. On the other hand, I wrote this because I found uploading entire documents at a time to be too crude for real coding. I also find that constructing TQL commands to modify data is a little too convoluted in many circumstances, and that a simple PUT is much more appropriate.&lt;br /&gt;&lt;br /&gt;So, I'm pretty happy with it, for the simple fact that &lt;em&gt;I&lt;/em&gt; find it useful. 
If anyone has suggested modifications or features, then I'll be more than happy to take them on board.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-6773794379073821893?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/6773794379073821893/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=6773794379073821893' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6773794379073821893'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6773794379073821893'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2009/02/resting-ive-had-couple-of-drinks-this.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-8431227637282433975</id><published>2009-02-13T09:02:00.003-06:00</published><updated>2009-02-13T12:09:00.515-06:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Prodding&lt;/h3&gt; Going back through a recent flurry of activity by &lt;a href="http://twitter.com/wmudge"&gt;Webster Mudge&lt;/a&gt; on Google Groups, I noticed a couple of things directly related to me.&lt;br /&gt;&lt;br /&gt;First was a link to &lt;a href="http://prototypo.blogspot.com/2009/02/skos-in-mulgaras-rlog.html"&gt;David Wood's post&lt;/a&gt; from last month in which he talks about how I did &lt;a href="http://www.w3.org/2004/02/skos/"&gt;SKOS&lt;/a&gt; using &lt;a 
href="http://www.mulgara.org/trac/wiki/Rules"&gt;RLog&lt;/a&gt; (some nice compliments BTW, thanks David). Both in this post and personally, David has been hassling me to integrate RLog into Mulgara. I'd love to get this done, but SPARQL and scalability have been priorities for me, and no one ever asked for RLog before. But it's been shuffling to the top of my list recently, so I'm going to see what I can get done in the next week, before I get loaded with new priorities.&lt;br /&gt;&lt;br /&gt;The other link was to &lt;a href="http://jena.sourceforge.net/SquirrelRDF/"&gt;SquirrelRDF&lt;/a&gt; and included the comment, &lt;span style="font-style:italic;"&gt;“Great idea, bummer it's tied to Jena.”&lt;/span&gt; This intrigued me, and I wondered if it was something Mulgara could do, so I checked it out. Only, once I got there I discovered that Mulgara already does it, and has done for years!&lt;br /&gt;&lt;br /&gt;That's one of the biggest problems with Mulgara: lack of documentation. People just aren't aware of what the system can do, and there's no easy way to find out. I'd love to fix this on the Wiki, but when I'm accountable for getting things done, and not for telling people how to do it, then I tend to opt for the "getting things done" work instead.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Resolvers&lt;/h3&gt; For anyone interested, Mulgara has a system we call the "Resolver" interface. The purpose of this is to present any data as live RDF data. 
Included with Mulgara are resolvers for accessing: &lt;a href="http://lucene.apache.org/"&gt;Lucene&lt;/a&gt; indexes; RDF files over HTTP; filesystem data; &lt;abbr title="Geographic Information System"&gt;GIS&lt;/abbr&gt; data; &lt;abbr title="Relational Data Base Management System"&gt;RDBMS&lt;/abbr&gt;s (via &lt;a href="http://java.sun.com/javase/technologies/database/"&gt;JDBC&lt;/a&gt;, and using &lt;a href="http://www4.wiwiss.fu-berlin.de/bizer/d2rq/"&gt;D2RQ&lt;/a&gt;); &lt;abbr title="Java ARchive"&gt;JAR&lt;/abbr&gt; files; plus a few resolvers specifically for representing aspects of literals and URIs stored in the database. Most are read-only interpretations of external data, but some are writable.&lt;br /&gt;&lt;br /&gt;We also have a related system called "Content Handlers". These are for handling raw file formats and returning RDF triples. We support the obvious RDF/XML and N3 file formats, but also interpret Unix MBox files and MP3 files (the latter was done as a tutorial). This mixes well with things like the HTTP and file resolvers, as it lets us refer to a graph such as &lt;a href="http://www.w3.org/2000/01/rdf-schema"&gt;http://www.w3.org/2000/01/rdf-schema&lt;/a&gt; in a query. In this example the graph will not be in the local database (it could be, but only if you'd created it), so the HTTP resolver will be asked to retrieve the contents from the URL. Once the data arrives, it is sent to the RDF/XML content handler (having recognized the "application/rdf+xml" &lt;abbr title="Multipurpose Internet Mail Extensions"&gt;MIME&lt;/abbr&gt; type), which then turns it into a queryable local graph in memory. The query can then continue as if everything were local. If the data is on the local filesystem, or the MIME type isn't recognized, then it will fall back to relying on filename extensions.&lt;br /&gt;&lt;br /&gt;It's the way these things hook together that allows us to hook SPARQL sources together easily. 
It may be messy, but it is perfectly possible to select from a graph with a URI like:&lt;br /&gt;&lt;pre&gt;&lt;code&gt;http://host/sparql?default-graph-uri=my%3Agraph&amp;&lt;br /&gt;query=prefix+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E&lt;br /&gt;+construct+%7B+%3Fs+%3Fp+%3Fo+%7D+where+%7B+%3Fs+%3Fp+%3Fo+.&lt;br /&gt;+%3Fp+rdfs%3Adomain+%3Cmy%3AClass%3E+%7D&lt;/code&gt;&lt;/pre&gt;I've split the URI over a few lines to make it fit better, and I also used the graph name of &lt;code&gt;my:graph&lt;/code&gt; just to keep it shorter. It's legal, though unusual.&lt;br /&gt;&lt;br /&gt;Mulgara originally aimed at being highly scalable, and we're in the process of regaining that title (honest... the modest improvements we've had recently are orders of magnitude short of XA2). However, the sheer number of features and flexibility of the system is probably its most compelling attribute at the moment. If only I could document it all, and spread the word.&lt;br /&gt;&lt;br /&gt;Oh well, back to the grind. At the moment I'm alternating between &lt;abbr title="REpresentational State Transfer"&gt;REST&lt;/abbr&gt;ful features (I want to PUT and DELETE individual statements) and a class that will transparently memory map a file larger than 2GB. For the latter, I'd love to offer an extension to &lt;a href="http://java.sun.com/javase/6/docs/api/java/nio/Buffer.html"&gt;java.nio.Buffer&lt;/a&gt;, but this package has been completely locked down by Sun. I hate not being able to extend on built-in functionality.  
:-(&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-8431227637282433975?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/8431227637282433975/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=8431227637282433975' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/8431227637282433975'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/8431227637282433975'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2009/02/prodding-going-back-through-recent.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-4698312072094959141</id><published>2008-12-02T22:29:00.002-06:00</published><updated>2008-12-02T23:23:13.921-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='indexing'/><title type='text'></title><content type='html'>&lt;h3&gt;Dropping Indexes&lt;/h3&gt; One of the optimizations I'm making for XA 1.1 is the removal of 3 of our 6 statement indexes. The reason for this is pretty clear: they're almost &lt;em&gt;never&lt;/em&gt; used. Why would I want to double our space, and double our contention for the hard drive on data structures that are superfluous?&lt;br /&gt;&lt;br /&gt;To date, Mulgara's indexes have been completely symmetric. 
I still want to maintain this with respect to &lt;em&gt;subject&lt;/em&gt;, &lt;em&gt;predicate&lt;/em&gt; and &lt;em&gt;object&lt;/em&gt;, but I don't really see the need for it with graphs. (That said, the 2-column index in XA2 will have optimizations around common predicates, but in general there will still be symmetry). I've had people say that they want to use millions of graphs, but in reality I've yet to see it. The query languages (TQL, SPARQL, etc) haven't really supported large numbers of graphs anyway.&lt;br /&gt;&lt;br /&gt;The index orderings we've had to date have been:&lt;pre&gt;&lt;code&gt;  SPOG&lt;br /&gt;  POSG&lt;br /&gt;  OSPG&lt;br /&gt;  GSPO&lt;br /&gt;  GPOS&lt;br /&gt;  GOSP&lt;/code&gt;&lt;/pre&gt;For &lt;strong&gt;G&lt;/strong&gt;=Graph, &lt;strong&gt;S&lt;/strong&gt;=Subject, &lt;strong&gt;P&lt;/strong&gt;=Predicate, &lt;strong&gt;O&lt;/strong&gt;=Object.&lt;br /&gt;&lt;br /&gt;For anyone unfamiliar with these indexes, they permit a group of statements to be found given any possible pattern of 0, 1, 2, 3 or 4 elements.&lt;br /&gt;&lt;br /&gt;The first 3 indexes allow for searching on statements that may occur in any graph. However, almost all queries identify the graphs to be searched in, meaning that we always end up binding the "graph" node before looking for statements. That means that the first 3 indexes are &lt;em&gt;almost&lt;/em&gt; never used. However, it's the "almost" which is my problem at the moment.&lt;br /&gt;&lt;br /&gt;Fortunately, the first 3 indexes can be easily emulated with our "System graph". This contains a list of all the known graphs, particularly the graphs stored with the "System Resolver" (this is the part of the system that uses the above indexes). Using this information, it is possible to pre-bind the graph node for every possible query. 
However, I really want to do this at the lowest possible level, so the interface on the resolver remains unchanged.&lt;br /&gt;&lt;br /&gt;Dropping the first 3 indexes went smoothly, and 97% of the tests still work (OK, it's 96.99%, but who's counting?). However, the emulation of these indexes will probably take me a few days. That's a shame, as I'd prefer to get it all into the next release, but since I want to do a release before I go to Australia for Christmas (on Monday), I'm pretty sure I can't do it in time (not if I want robust testing anyway).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Compromises&lt;/h3&gt; Emulating the indexes which allow unbound graphs means that I'll need to bind the graph to a series of ALL the graphs in the system. Then for each of those graphs, I'll need to re-execute the resolution of the graph pattern being resolved. That means that these types of queries will increase in complexity with the number of graphs in the system. This goes completely against what we want in Mulgara, but as I said, it's such a rarely used feature that the cost seems mitigated.&lt;br /&gt;&lt;br /&gt;I had thought that I'd be doing a query to find the graphs, and then joining this to the resolution of the graph pattern that we want, but that failed to take several things into account. First, resolutions from the resolver come back with a particular order, and the kind of join I was proposing was not going to be ordered the way we wanted (it would have been ordered by graph first, and then ordered within each graph). Reordering may have been prohibitively expensive (depending on context), so this was out.&lt;br /&gt;&lt;br /&gt;It was while thinking this through that I realized I can create a new Tuples "append" operation. The new append will take arguments that all have the same variables and the same ordering, and will perform a streamed merge-sort. 
This should give me exactly what I want.&lt;br /&gt;&lt;br /&gt;So the next thing I need is the complete list of graphs to bind the "G" node to when querying the indexes. I had thought that I'd be doing a query of the system graph for this, but my thinking has moved on from there. To start with, in order to make this query, I'll need the local node values for &lt;code&gt;&amp;lt;rdf:type&amp;gt;&lt;/code&gt;, the URI for the "type" of graphs stored in the system resolver, and the system graph itself (a relative URI of &lt;code&gt;&amp;lt;#&amp;gt;&lt;/code&gt;). The creation of these occurs during bootstrapping, and is fortunately over before any possibility of my "unusual" queries.&lt;br /&gt;&lt;br /&gt;While thinking about getting the local node values for these URIs, it occurred to me that something similar to the mechanism to do this can be used to record whenever a graph is being created in the system graph. That means that I can store each of the graphs in a list (and re-populate this list on startup with a simple constraint resolution). This list then becomes the basis for the graph bindings when I'm trying to emulate the missing indexes.&lt;br /&gt;&lt;br /&gt;My first concern was that this might take too much space, thereby limiting the number of graphs that someone can have (as I said, some people have proposed using millions), but then I realized that my merge-join was going to need to reference the same number of resolutions as the number of graphs, and this would take more RAM anyway. It's really a moot point anyway, since the system would choke from performing a million lookups before you need to worry about an Out Of Memory condition. All this reminds me... I shouldn't worry too much about optimizations at such an early juncture. Premature optimization is the root of all evil.&lt;br /&gt;&lt;br /&gt;Anyway, I'll probably spend a day on this, and may even get it all going, but I won't have it tested in time for a release before the weekend. 
I'd better let Amit (my boss) know that he won't get it until Christmas.  :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-4698312072094959141?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/4698312072094959141/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=4698312072094959141' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4698312072094959141'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4698312072094959141'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/12/dropping-indexes-one-of-optimizations.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-1789534819774988294</id><published>2008-12-02T22:20:00.002-06:00</published><updated>2008-12-02T22:29:42.303-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='performance'/><category scheme='http://www.blogger.com/atom/ns#' term='disk'/><category scheme='http://www.blogger.com/atom/ns#' term='Graphs'/><title type='text'></title><content type='html'>&lt;h3&gt;Size&lt;/h3&gt; Disk usage is probably the second most common question I get about Mulgara, after speed. 
So to complement the plots from Monday, I've also plotted the disk usage for these "number" graphs.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GtcSYVnYjYQ/STYI5hJqN8I/AAAAAAAAAAw/9MwWC31q1zQ/s1600-h/image001.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 273px;" src="http://1.bp.blogspot.com/_GtcSYVnYjYQ/STYI5hJqN8I/AAAAAAAAAAw/9MwWC31q1zQ/s400/image001.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5275413797755566018" /&gt;&lt;/a&gt;&lt;br /&gt;The lowest line represents the space being used for URIs and Literals. The middle line is for the statements themselves. For convenience, the top line is the sum of the other two.&lt;br /&gt;&lt;br /&gt;This storage mechanism is doing no compression on the data whatsoever. The current code in XA2 is already using an order of magnitude less space, both because of more intelligent storage, and also because many blocks will be gzip compressed in our structures. Andrae's reasoning for that is that while CPUs are getting faster all the time, disks are not. This means that any processing we do on the data is essentially free, since the CPU work can usually be done in less than the time it takes to wait for a hard drive to return a result, even a solid state drive.&lt;br /&gt;&lt;br /&gt;I should note that these graphs are all on a version of XA1.1 that is not yet released (it's in SVN, but not in the trunk yet). 
I've been hoping to get this into the next release, but because I'm doing a release by the end of this week, then I'm thinking it will have to be in the release after (before Christmas).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-1789534819774988294?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/1789534819774988294/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=1789534819774988294' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/1789534819774988294'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/1789534819774988294'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/12/size-disk-usage-is-probably-second-most.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_GtcSYVnYjYQ/STYI5hJqN8I/AAAAAAAAAAw/9MwWC31q1zQ/s72-c/image001.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-644599254200113973</id><published>2008-12-01T09:36:00.005-06:00</published><updated>2008-12-01T12:43:38.761-06:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Mulgara Stats&lt;/h3&gt; The one question everyone asks me about Mulgara is always some variation on "How does it scale?" 
It's never easy to answer, as it depends on the hardware you're using, the OS (Linux does memory mapping better than Windows), and of course, the data.&lt;br /&gt;&lt;br /&gt;I wanted to put off benchmarking until XA2 was released (early next year). I've also been hoping to have decent hardware to do it with, though I'm less sure about when that might happen. However, I've improved things by releasing the XA1.1 system recently, and it doesn't hurt to see how things run on desktop hardware.&lt;br /&gt;&lt;br /&gt;RDF data varies according to the number of triples, the number of distinct URIs and literals, and the size of the literals. Some data uses only a few predicates, a modest number of URIs, and enormous literals. Other data uses only a few short literals, and has lots of URIs. Then there is the number of triples being stored. As an example, doing complete RDFS inferencing will introduce no new resources, but can increase the number of triples in a graph by an order of magnitude.&lt;br /&gt;&lt;br /&gt;There are various standard sets and I intend to explore them, but in the meantime I'm going with a metric that Kowari/TKS used back in the days of Tucana, when we had a big 64bit machine with fast RAID storage. Funnily enough, these days I have a faster CPU, but I still don't have access to storage that is as fast as that box had.&lt;br /&gt;&lt;br /&gt;The data I've been working with is the "Numbers" data set that I first created about 4 years ago. I tweaked the program a little bit, adding a couple of options and updating the output. There are probably better ways to model numbers, but the point of this is just to grow a large data set, and it does that well. 
You can find the code &lt;a href="http://mulgara.org/files/misc/rdf-generator.tar.gz"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Hardware&lt;/h3&gt; The computer I've been using is my MacBook Pro, which comes with the following specs:&lt;pre&gt;Mac OS 10.5.5&lt;br /&gt;2.6 GHz Intel Core 2 Duo&lt;br /&gt;4GB 667 MHz DDR2 SDRAM&lt;br /&gt;4MB L2 Cache&lt;br /&gt;HDD: Hitachi HTS722020K9SA00&lt;br /&gt;     186.31GB&lt;br /&gt;     Native Command Queuing: Yes&lt;br /&gt;     Queue Depth: 32&lt;br /&gt;     File System Journaled HFS+&lt;/pre&gt;&lt;br /&gt;Note that there is &lt;strong&gt;nothing&lt;/strong&gt; about this machine that is even slightly optimized for benchmarking. If I had any sense, I'd be using Linux, and I wouldn't have a journaled filesystem (since Mulgara maintains its own integrity). Even if I couldn't have RAID, it would still be beneficial to use another hard drive. But as I said, this is a standard desktop configuration.&lt;br /&gt;&lt;br /&gt;Also, being a desktop system, it was impossible to shut down everything else, though I did turn off backups, and had as few running programs as possible.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;The Test&lt;/h3&gt; I used a series of files, generating numbers to each million mark, up to 30 million. The number of triples was approximately 8-9 times the largest number, with the numbers from 1 to 30 million generating 267,592,533 triples, or a little over a quarter of a billion triples.&lt;br /&gt;&lt;br /&gt;Each load was done with a clean start to Mulgara, and was done in a single transaction. The data was loaded from a gzipped RDF/XML file. I ignored caching in RAM, since the data far exceeded the amount of RAM that I had.&lt;br /&gt;&lt;br /&gt;At the conclusion of the load, I ran a query to count the data. 
We still have linear counting complexity, so this is expected to be an expensive operation (this will change soon).&lt;br /&gt;&lt;br /&gt;Due to the time needed for larger loads, I skipped most of the loads in the 20 millions. However, the curve for load times is smooth enough that interpolation is easy. The curve for counting is all over the place, but you'll have to live with that.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_GtcSYVnYjYQ/STQsArRJV8I/AAAAAAAAAAk/EUJusSZ_E-U/s1600-h/image001.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 321px;" src="http://2.bp.blogspot.com/_GtcSYVnYjYQ/STQsArRJV8I/AAAAAAAAAAk/EUJusSZ_E-U/s400/image001.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5274889453683955650" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The axis on the left is the number of seconds for loading, and the axis on the right is the number of seconds for counting. The X-axis is the number of triples loaded.&lt;br /&gt;&lt;br /&gt;Counting was less than a second up to the 8 million mark (70.8 million triples). This would be because most of the index could fit into memory. While the trees in the indexes do get shuffled around as they grow, I don't think that explains the volatility in the counting times. I'm guessing that external processes had a larger influence here, since the total time was still within just a few minutes (as opposed to the full day required to load the quarter billion triples in the final load).&lt;br /&gt;&lt;br /&gt;Overall, the graph looks to be gradually increasing beyond linear growth. From experience with tests on XA1, we found linear growth, followed by an elbow, and then an asymptotic approach to a new, much steeper gradient. This occurred at the point where RAM could no longer effectively cache the indexes. 
If that is happening here, then the new gradient is still somewhere beyond where I've tested.&lt;br /&gt;&lt;br /&gt;My next step is to start profiling load times with the XA1 store. I don't have any real comparison here, except that I know that there is a point somewhere in the middle of this graph (depending on my RAM) where XA1 will suddenly turn upwards. I've already seen this from Ronald's tests, but I've yet to chart it against this data.&lt;br /&gt;&lt;br /&gt;I'm also very excited to see how this will compare with XA2. I'm meeting Andrae in Brisbane next week, so I'll find out more about the progress then.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-644599254200113973?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/644599254200113973/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=644599254200113973' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/644599254200113973'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/644599254200113973'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/12/mulgara-stats-one-question-everyone.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_GtcSYVnYjYQ/STQsArRJV8I/AAAAAAAAAAk/EUJusSZ_E-U/s72-c/image001.png' height='72' 
width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-6135747305180855637</id><published>2008-09-18T22:57:00.004-05:00</published><updated>2008-10-16T21:12:58.378-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Disclaimer&lt;/h3&gt; New babies are wonderful, but the resulting sleep patterns are far from optimal. Please excuse me if I stop making sense halfway through any of the ensuing sentences.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Hash Tables&lt;/h3&gt; Whenever I have a spare moment (and sometimes when I don't) I'm forever re-examining how I think RDF should be indexed. After all, I've already found a few different ways to do it, both in Mulgara and out of it, and each has its own pros and cons.&lt;br /&gt;&lt;br /&gt;One of the most interesting indexes to consider is the Hash Table. Constant time reads and writes make for a compelling argument in terms of scalability. Re-hashing an index during expansion is painful on systems that should scale, but I was recently reminded that amortized complexity is still linear, so I shouldn't be &lt;strong&gt;too&lt;/strong&gt; scared.&lt;br /&gt;&lt;br /&gt;Years ago &lt;acronym title="Tucana Knowledge Store"&gt;TKS&lt;/acronym&gt; (the forerunner to both Mulgara and Kowari) used a few on-disk hash tables, but they proved ineffective for us, and we moved to trees. But many of our assumptions back in 2000 no longer apply to modern systems, and I've already found several things worth re-examining for this reason. On top of that, Andy Seaborne was discussing using them for Jena, and while I was initially dubious, on further reflection I can see the reasoning.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Pros&lt;/h4&gt; &lt;strong&gt;It's &lt;em&gt;O(1)&lt;/em&gt;:&lt;/strong&gt; That's kind of a trump card. 
Everything else I have to say here is a discussion as to what could possibly be more important than being &lt;em&gt;O(1)&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Opaque Data:&lt;/strong&gt; Data stored in a hash is treated as an atom, meaning there is no ordering or other meaning on the data. While this creates problems (mentioned below) it also provides the opportunity to distribute the data across a cluster like &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt;. That's a big deal these days.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Cons&lt;/h4&gt; &lt;strong&gt;Re-Hashing:&lt;/strong&gt; The first problem I think about with on-disk hash tables is the cost of a re-hashing operation. These are expensive in memory, but on disk they are going to be frightful. Reading the original hash will be OK, as this is a linear scan through the file, but writing will be problematic, as the seeks are essentially random. That's a cost of &lt;em&gt;N&lt;/em&gt; seeks, for &lt;em&gt;N&lt;/em&gt; entries (ignoring seeks for reads, but they're amortized, and could even be on another drive). There may be some algorithms for clustering the writes, but if you're trying to scale on the size of your data, then this would be overwhelmed.&lt;br /&gt;&lt;br /&gt;The best way to address this is to allocate as much space as you can, and to be generous when growing. That could be a problem for some systems, but if you're &lt;em&gt;really&lt;/em&gt; in the business of scaling on data, then you'll be up for it.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Space:&lt;/strong&gt; Hash tables require a lot of empty space to work, else you end up with a lot of hashing collisions, and those lovely &lt;em&gt;O(1)&lt;/em&gt; properties go out the window (until you expand and re-hash, but I've already talked about that). 
I shouldn't really make a big deal out of this, especially when you consider that Mulgara was built using the idea that "disk is cheap", but it does still feel a little strange to be that lavish. Also, being extravagant with space can lead to speed issues as well, so it's always worth looking at with a critical eye, even if the final decision is to use the space.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;No Ordering:&lt;/strong&gt; Data in a hash table cannot be ordered. Well, OK, a &lt;em&gt;linked&lt;/em&gt; hash table can do it, but you only want to link by insertion order, or else all your &lt;em&gt;O(1)&lt;/em&gt; benefits are gone.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Hashes and SPARQL&lt;/h3&gt; Of the three cons I listed here, it's relatively easy to justify the concerns about re-hashing and space. In fact, once you decide that "space is no object", then re-hashing isn't such a big deal since you can just start with an enormous table that never (or &lt;em&gt;almost&lt;/em&gt; never) gets rehashed.&lt;br /&gt;&lt;br /&gt;The ordering issue bugged me for a while, and it was then that I realized that this actually works well for SPARQL. In fact, this looks like yet another case where the heritage of filtering is showing up again (though maybe it's a coincidence this time).&lt;br /&gt;&lt;br /&gt;When you use the appropriate resolvers in Mulgara (in either TQL or SPARQL, since resolvers are just mapped onto graph names) then data can be selected by "range". This lets us select an ordered set of date/times, numbers, or strings that occur between a pair of boundaries. (Particularly useful for something like selecting events during a particular time window). It is even useful for selecting URIs based on namespace. These selections are then joined to the remainder of the query to create a result. The end effect is processing much less data than simply selecting it all, and FILTERing it down to the data that meets the given criteria. 
We always pursued this in Mulgara, as we found that filtering could slow down certain queries by orders of magnitude.&lt;br /&gt;&lt;br /&gt;However, SPARQL was never designed for this kind of thing, and as a result it relies entirely on filtering to do its work. This usually bothers me, but for hash tables it actually works, since they don't provide the ability to select a range anyway, and hence &lt;em&gt;require&lt;/em&gt; filtering if you want to use them.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;To Tree or not to Tree?&lt;/h3&gt; I've been wedded to trees in Mulgara for so long that it feels weird just examining a system without them. Of course, I've already moved away from the use of trees with the new &lt;a href="http://gearon.blogspot.com/2008/04/writing-2-columns-in-my-last-post-i.html"&gt;statement store design&lt;/a&gt;, but I thought that the data pool still had to be ordered, and hence, no hash tables.&lt;br /&gt;&lt;br /&gt;Now I can see the utility of using hash tables in this part of the system, provided you are prepared to use filtering for your results. Jena was always designed around these principles (it's easy to use, it's easy to implement, and it's &lt;em&gt;correct&lt;/em&gt;), so I understand why Andy would be attracted to it. However, I know that range queries are a big deal in Mulgara, so we really do need a tree somewhere.&lt;br /&gt;&lt;br /&gt;But perhaps we can mitigate some of the expense of tree indexes?&lt;br /&gt;&lt;br /&gt;Trees are really only needed for two types of query: ranges of data (meaning literals); and selecting strings or URIs by prefix. Neither of these is a common operation, and neither is needed during the time-consuming load operation. So perhaps loads could be done entirely with a fast hash index, and afterwards a slow tree-based indexer could come through to order everything. 
Background indexing is nothing new, and even AllegroGraph does it, though I'm not sure how to manage a range query while waiting for an index to proceed.&lt;br /&gt;&lt;br /&gt;Another possibility would be to do inserts into a tree index, and simultaneously index the tree node with a hash index. After all, the tree nodes are not being reclaimed, and while their position in a tree may change, their data does not. This would require another seek/write during writing, but would save on &lt;em&gt;log(N)&lt;/em&gt; seeks when looking to see if a string or URI exists, which is the single most common operation during a load. That way there would be no background indexing to worry about waiting for, and the most common task drops from &lt;em&gt;log(N)&lt;/em&gt; to a single seek. Now &lt;em&gt;that&lt;/em&gt; has promise. I'll have to see if I can think of a decent Hadoop angle for it.&lt;br /&gt;&lt;br /&gt;So now I need to write the hash table file. We already have a few things that are close, so maybe I can leverage off one of those?&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Other Stuff&lt;/h3&gt; There's a LOT to write about with Mulgara, but this year I've tended to do the work rather than write about it. I believe that this is a false economy, since writing about things provides me with an invaluable log of what I did and when, and also helps me work out just what I need to be doing.&lt;br /&gt;&lt;br /&gt;On the other hand, late at night is not the time for me to be writing, especially when a baby is going to be waking me at various times between now and morning.&lt;br /&gt;&lt;br /&gt;All the same, I'll mention that I now have a couple of cute little servlets that let me do "HTTP GET" requests with &lt;a href="http://www.w3.org/TR/rdf-sparql-protocol/"&gt;SPARQL protocol&lt;/a&gt; parameters, and I get back either XML or &lt;a href="http://www.json.org/"&gt;JSON&lt;/a&gt; (depending on the value of an optional parameter called &lt;em&gt;out&lt;/em&gt;. 
Hmmm, maybe I should have called it &lt;em&gt;format&lt;/em&gt;?). One of the servlets is for TQL queries, while the main one is for SPARQL.&lt;br /&gt;&lt;br /&gt;These servlets also accept "HTTP POST" requests. In this case, the TQL servlet will allow commands that update data. The SPARQL servlet will eventually do this too, but not until I've implemented "&lt;a href="http://jena.hpl.hp.com/~afs/SPARQL-Update.html"&gt;SPARQL/Update&lt;/a&gt;". They will also accept MIME encoded files containing RDF data (RDF/XML, N3 and I think Turtle) and will load them into the default graph, which can be specified with the &lt;em&gt;default-graph-uri&lt;/em&gt; parameter.&lt;br /&gt;&lt;br /&gt;I haven't committed all of this code yet, since I ran into a bug when loading an RDF file. It turned out that this file finishes with the line:&lt;pre&gt;&lt;code&gt;&amp;lt;/rdf:RDF&amp;gt;&lt;/code&gt;&lt;/pre&gt; This line does not finish with a newline character, and this is confusing the ARP parser we are using. Of course, I could just wrap the &lt;code&gt;InputStream&lt;/code&gt; object in something that appends a newline, but this is an unnecessary (and &lt;em&gt;horrible&lt;/em&gt;) hack, so I decided to look for the source of the problem.&lt;br /&gt;&lt;br /&gt;At this point I realized that we are still on Jena 2.1, while the world has moved on to &lt;a href="http://jena.sourceforge.net/downloads.html"&gt;2.5.6&lt;/a&gt;. Hopefully a move to 2.5.6 will fix this issue, so I decided to upgrade the Jar. Of course, this led to upgrading 2 other jars (&lt;code&gt;icu.jar&lt;/code&gt; and &lt;code&gt;arq.jar&lt;/code&gt;) along with &lt;em&gt;other&lt;/em&gt; tests failing (I think they were trying to compensate for a timezone bug, but this has been fixed now).&lt;br /&gt;&lt;br /&gt;While trawling through the Mulgara XSD classes I found what I think is the problem (compensation code for Jena not handling 0 months, though now it should). 
While there, I also learnt that despite parsing everything needed, the same data was being sent to a Jena object for parsing. This seems quite redundant. It is also one of the few places that Jena classes are used (as opposed to just the ARP parser), so it would be great to drop this dependency if I can.&lt;br /&gt;&lt;br /&gt;So now a simple bug fix (not handling a missing newline character) seems to be leading me into all sorts of updates. Story of my life.&lt;br /&gt;&lt;br /&gt;OK, now I'm falling asleep between words, and have even caught myself starting to type something I started dreaming on 3 occasions. I think I've overstayed on my blog.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-6135747305180855637?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/6135747305180855637/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=6135747305180855637' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6135747305180855637'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6135747305180855637'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/09/disclaimer-new-babies-are-wonderful-but.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-6531274686443076623</id><published>2008-07-31T15:32:00.003-05:00</published><updated>2008-07-31T18:01:34.711-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='classpath'/><category scheme='http://www.blogger.com/atom/ns#' term='JAR'/><category scheme='http://www.blogger.com/atom/ns#' term='WAR'/><category scheme='http://www.blogger.com/atom/ns#' term='Jetty'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><category scheme='http://www.blogger.com/atom/ns#' term='class path'/><title type='text'></title><content type='html'>&lt;h3&gt;SPARQL&lt;/h3&gt; Perpetual coding doesn't leave much time for blogging. I'm in the middle of a long-running set of tests, so I figured I should take the time to write, even if I'm too tired.  :-)&lt;br /&gt;&lt;br /&gt;SPARQL on Mulgara always seems to have more to do than I have time or mandate for. That should be OK, given that SPARQL is now available through the SAIL API, but it's never quite that simple.&lt;br /&gt;&lt;br /&gt;To properly work with Sesame/SAIL we need to build (or at least deploy) Mulgara using Maven. Now I understand what Maven does... I've just never used it. On top of that, we have the &lt;em&gt;horrible&lt;/em&gt; build scripts that go into Mulgara, making the whole notion of re-creating the build system a little daunting. All the same, I've learned about creating a pom.xml, along with modules and inheritance, but I still need to read more docs on the topic. I'd like to get to this soon, but there are so many other pressing things.&lt;br /&gt;&lt;br /&gt;So working with SAIL isn't an out-of-the-box distribution yet, which is an impediment to using SPARQL. At this stage I think the Mulgara SAIL API is more of an advantage to Sesame than it is to us. 
Another reason why it would be good to get SPARQL going is because people are always &lt;em&gt;asking&lt;/em&gt; me for it. So even if I don't get 100% conformance, I should try to get it close. Anyone who needs it perfect can use the SAIL API.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Web Services&lt;/h3&gt; The best way to get SPARQL compliance is to run the test suite. That means you need some way to issue the queries and check the results. Now, I &lt;em&gt;could&lt;/em&gt; write the code to do this, but I know that other systems for running the test suite exist out there, and it would be better to use one of those if I could. However, those systems will all be using the SPARQL protocol for issuing queries, and that's one part I hadn't really touched yet.&lt;br /&gt;&lt;br /&gt;Fortunately, the protocol is just a web service, and Mulgara is already running web services. The response is just in XML, and I've written some code to do that already (though it's not checked in anywhere yet). I just need to glue it to a web service.&lt;br /&gt;&lt;br /&gt;Looking at it, the protocol is so simple that the service should be implementable with a relatively straightforward servlet. Servlets are quite easy to write, but deploying them is system dependent, so I thought I'd get the deployment part going first. I built a simple "hello world" servlet with the intent of expanding it into the real thing once it was integrated correctly.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Servlets&lt;/h3&gt; To start with, I followed the directions given for deploying a servlet in the &lt;a href="http://docs.codehaus.org/display/JETTY/Embedding+Jetty"&gt;Quick Start&lt;/a&gt; guide, and it all worked fine. Then I went to Mulgara to see how this would work.&lt;br /&gt;&lt;br /&gt;Now I'd been aware that Jetty hadn't been updated in Mulgara for a while, and I thought that this would be a good chance to update it. However the existing version was 4.2.19, while the latest (released) version is 6.1.19. 
Some of the APIs appeared to be completely incompatible, and while there was an upgrade guide from Jetty 5 to Jetty 6, there was nothing about Jetty 4. Obviously this task had been left for too long.&lt;br /&gt;&lt;br /&gt;So the first order of the day was &lt;strong&gt;not&lt;/strong&gt; to get a servlet deployed in Mulgara, but rather to upgrade Mulgara to use the latest Jetty. This also dovetailed with another task I've been wanting to do for some time, which was to clean up the file where all of the Jetty configuration happens: &lt;code&gt;EmbeddedMulgaraServer&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Upgrade&lt;/h3&gt; I eventually want to completely remove EmbeddedMulgaraServer and replace it with a lightweight program that loads up configurable modules. This will give us the benefits of having those modules ready for other types of deployment (another request I often get) as well as letting people customize the server, which is currently monolithic and unwieldy. I don't have time to get all of that done right now, but at least I got to tidy the code up to the point where this will be less intimidating. It also gave me a better view of what was going on in there (it regularly confuses developers who look at it).&lt;br /&gt;&lt;br /&gt;Mulgara had been deploying two sets of static pages and 2 web services in Jetty. The static pages included the documentation that is both obsolete (to be replaced by the gradually expanding Wiki), and available on the website. The other pages are all data files, which I believe are used for example scripts. I think it's a terrible idea to have these in the system, so I ripped them out. Moments later I thought better of it, and so I emailed the list to see what people thought. I was bemused to see that not only was this a welcome move, people wanted to get rid of the HTTP server altogether! (These people obviously want to access those individual modules I mentioned earlier). 
So then I created an option in the config file and a system property, either of which can disable the server (the system property takes precedence).&lt;br /&gt;&lt;br /&gt;That just left me with the 2 web applications in &lt;strong&gt;W&lt;/strong&gt;eb &lt;strong&gt;AR&lt;/strong&gt;chive files to deploy. This is where I came unstuck.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;WAR Files&lt;/h3&gt; I could not find any documentation on how to deploy a WAR file using the APIs in Jetty. So I muddled through the JavaDocs, picking up anything that looked promising. After an entire night of this, I eventually got something I thought might work, replacing the &lt;code&gt;WebApplicationContext&lt;/code&gt; class with &lt;code&gt;WebAppContext&lt;/code&gt; and trying to translate the differences in their APIs. I immediately got back an &lt;code&gt;IllegalStateException&lt;/code&gt; that occurred while the system was accessing the WAR file. While trying to work it out I delved into the Java libraries, and discovered that something had closed off the archive file while it was still in the process of reading it. It seemed too far down in the system to be anything I could have caused (or prevented), so I went searching online to see if anyone knew about it.&lt;br /&gt;&lt;br /&gt;It didn't take me long to see people mentioning this bug in relation to Jetty 5 about 2 years ago. It seemed strange that there wouldn't be a more recent reference, but that was the best I could get. Unfortunately, the response at the time was that the problem was indeed a bug with some of the Apache libraries that were used for this, which meant I was out of luck (sure, I could fix it, but that won't get me a deployed version of those libs any time soon).&lt;br /&gt;&lt;br /&gt;I saw Brian online (apparently traveling as a passenger in a car) and he told me that he'd heard of the problem, and suggested that I "expand" the archive to deploy it. 
I did this by manually pulling the WAR files into a temporary directory before pointing the WebAppContext at it. This avoided the IllegalStateException.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Class Paths&lt;/h3&gt; The deployment of these WAR files into Jetty 4 had a few things that didn't translate so well. The first was the configuration of something called a &lt;code&gt;SocketListener&lt;/code&gt;, which I figured out was replaced by a &lt;code&gt;Connector&lt;/code&gt;. The second was in setting up the class paths. The code for this used to be:&lt;pre&gt;&lt;code&gt;  HttpContext contexts[] = httpServer.getContexts();&lt;br /&gt;  for (int i = 0; i &amp;lt; contexts.length; i++) {&lt;br /&gt;    contexts[i].setParentClassLoader(this.getClass().getClassLoader());&lt;br /&gt;  }&lt;/code&gt;&lt;/pre&gt;This seemed reasonable, though I wasn't sure why it was being done. I was about to learn.&lt;br /&gt;&lt;br /&gt;Jetty 6 no longer has the &lt;code&gt;Context.setParentClassLoader()&lt;/code&gt; method, though it is now possible to set the actual class loader for the context. However, the class loader I had available in that context (&lt;code&gt;this.getClass().getClassLoader()&lt;/code&gt;) was the same one that was already being used by that class. So I wasn't sure what to replace this with. Unfortunately, I made the mistake of choosing to set the class loader here anyway.&lt;br /&gt;&lt;br /&gt;When I tried running the program again, I was immediately being told of missing classes. Of course, neither these classes, nor any code for them existed on my system. I eventually worked out that these were classes that were generated from JavaServer Pages (JSPs), which took me into the configuration for generating these pages.&lt;br /&gt;&lt;br /&gt;I hadn't realized we had JSPs in the system (will the cruft never end?!?) and I'd eventually like to get rid of these, even if I keep the web applications they're a part of. 
But for the moment, I had to upgrade those libs, and then update various build scripts which were trying to refer to the libs by name, and not with a generic variable (which we do for everything else - this lets us change versions relatively easily). I also discovered a "Tag" library for accessing Mulgara from JSPs. We don't seem to use it anywhere ourselves, and it just seems to be provided as a utility for users. The presence of this has me feeling reluctant to remove JSPs, but I'm still considering it.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Embedded JARs&lt;/h3&gt; Once the JSPs were running, I started getting errors about missing libraries that I expected were already in the class path. However, when I checked, I found that those libraries had NOT been included. It used to work, so I kept searching, and it didn't take me long to find them in the WAR file.&lt;br /&gt;&lt;br /&gt;So &lt;em&gt;this&lt;/em&gt; was the reason for the fancy classloader stuff. The classloader was supposed to find these JARs in the WAR file, and include them in its search. Only there was no such class loader in place. Hence my error.&lt;br /&gt;&lt;br /&gt;The Javadoc mentions a class called &lt;code&gt;WebAppClassLoader&lt;/code&gt;, which looked like an obvious candidate. However, the documentation made it appear that this class may not do very much, as it just extended the standard library class &lt;code&gt;URLClassLoader&lt;/code&gt;. All the same, I tried it, but it didn't seem to do anything. (This was my big mistake).&lt;br /&gt;&lt;br /&gt;I finally started adding the sources for all my libraries into my Eclipse environment, so I could debug it and see exactly what was happening. While time-consuming, it finally got me over the line. 
I also had a nice side benefit of learning just how the architecture of Jetty 6 works.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Deployed At Last&lt;/h3&gt; Tracing through the program, I found that a &lt;code&gt;WebAppContext&lt;/code&gt; calls &lt;code&gt;configureClassLoader&lt;/code&gt; on a &lt;code&gt;WebInfConfiguration&lt;/code&gt; that it creates. This explicitly checks if the class loader is a &lt;code&gt;WebAppClassLoader&lt;/code&gt;, and if it is, then it goes through the lib/ directory of the application, and adds any JAR files that it finds into its classpath.&lt;br /&gt;&lt;br /&gt;Since the configuration is checking for this specific class loader, this is obviously the only way to do it, unless you write a class loader for yourself. The application never creates one for you, which seems strange. The creation of the object is also strange in that it needs to be provided the web application that it works on (so it knows where to find the classes and libs), and it has to be explicitly set as the class loader for that application. So you need to say something like:&lt;pre&gt;&lt;code&gt;  webapp.setClassLoader(new WebAppClassLoader(webapp));&lt;/code&gt;&lt;/pre&gt;I'm confused why &lt;code&gt;WebAppContext&lt;/code&gt; doesn't automatically create a &lt;code&gt;WebAppClassLoader&lt;/code&gt; for itself, giving it a &lt;em&gt;this&lt;/em&gt; reference. You can always override it, but it would be rare to need to.&lt;br /&gt;&lt;br /&gt;Anyway, I now knew what to do, and so I did it. Of course, it still didn't work. More debugging. That was when I ran headlong into that class loader code I wrote back at the start of this process. After setting the class loader for the &lt;code&gt;WebAppContext&lt;/code&gt; this code was setting it back to the normal system class loader. That'll teach me for including code blindly.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Threads&lt;/h3&gt; So now everything was running "error free". 
I decided to throw a web browser at the WebUI application. Only, it wouldn't respond at all. I got a connection to the server, but it just sat there doing nothing.&lt;br /&gt;&lt;br /&gt;Finally, I tried duplicating what I was doing in a short application using a simple servlet. It all looked OK, so I went through step by step, making sure I had it exactly the same... and it locked up there too. So then I started changing settings one at a time until I found the one that was causing the problem.&lt;br /&gt;&lt;br /&gt;On Jetty 4, two of the options we were setting on the &lt;code&gt;SocketListener&lt;/code&gt; were &lt;em&gt;minThreads&lt;/em&gt; and &lt;em&gt;maxThreads&lt;/em&gt;, however neither of these were options for &lt;code&gt;Connector&lt;/code&gt;. So I decided to make do with &lt;code&gt;AbstractConnector.setAcceptors(int)&lt;/code&gt;, which does a similar thing. However, I made the mistake of setting the number of acceptors to our previous &lt;em&gt;maxThreads&lt;/em&gt; value, which was 255.&lt;br /&gt;&lt;br /&gt;If the number of acceptors is set this high, then the server is guaranteed to lock up. So I looked for the threshold at which this occurred. It turned out that the maximum value I could use was 24. It consistently works fine right up to this level, but any more and the system just blocks indefinitely. I checked out the source code, and discovered that all the acceptors are &lt;code&gt;Runnable&lt;/code&gt; objects that get invoked by threads in a thread pool, but there is nothing about the size of that pool or anything else I could see that would create this limit of 24.&lt;br /&gt;&lt;br /&gt;It also doesn't seem to matter what kind of Connector I'm using either, as the Acceptors are always the same.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;A New Servlet&lt;/h3&gt; I'm finally at a point where the system works as well as it did at the beginning of the week, only now it's doing it with Jetty 6. 
It needed to happen, but I wish it hadn't been so painful.&lt;br /&gt;&lt;br /&gt;I have other things to get to now, but I'll be trying to write this new SPARQL servlet soon. At least I have a modern framework to do it with now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-6531274686443076623?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/6531274686443076623/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=6531274686443076623' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6531274686443076623'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6531274686443076623'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/07/sparql-perpetual-coding-doesnt-leave.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-4089886937172218415</id><published>2008-06-18T00:32:00.002-05:00</published><updated>2008-06-18T00:46:51.381-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;TV&lt;/h3&gt; Know what would make the &lt;a href="http://www.apple.com/appletv/"&gt;Apple TV&lt;/a&gt; a no-brainer for me? Allow it to &lt;a href="http://docs.info.apple.com/article.html?artnum=307319"&gt;share a DVD&lt;/a&gt; from your desktop machine, like the MacBookAir can, and start including Blu-ray as a shareable disc type.&lt;br /&gt;&lt;br /&gt;But no. 
I bet that interferes with a business model somewhere. :-(&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-4089886937172218415?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/4089886937172218415/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=4089886937172218415' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4089886937172218415'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4089886937172218415'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/06/tv-know-what-would-make-apple-tv-no.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-3606862386089652557</id><published>2008-06-10T23:03:00.002-05:00</published><updated>2008-06-10T23:58:21.107-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Thesis&lt;/h3&gt; I've finally started writing my thesis, so don't expect to see me blog much in the near term. I know I haven't been blogging much at all this year, but I'm guessing I'm about to get worse (or who knows? Maybe I'll procrastinate and blog more).&lt;br /&gt;&lt;br /&gt;I'm still in the introductory chapters, so I'm reviewing everyone else's work. 
I have a stack of references from a few years ago, but need to update some of it, and finally read some of the papers I put off all that time ago.&lt;br /&gt;&lt;br /&gt;One of the really startling things is reading about stuff that I had to discover for myself while implementing Mulgara. As a database developer you just do things because they seem pragmatic, and you figure that &lt;em&gt;everyone&lt;/em&gt; must do it that way. Then you read a paper where someone formalizes your assumptions and gives a name to it. I can think of several here, but the first that comes to mind is "DL-safe rules".&lt;br /&gt;&lt;br /&gt;DL-safe rules are simply rules where the variables in the head must also occur in the body. Well, building rules for OWL that meet this criterion seems obvious to me, but apparently it merited a &lt;a href="http://pellet.owldl.org/papers/kolovski06extending.pdf"&gt;couple&lt;/a&gt; of &lt;a href="http://www.comlab.ox.ac.uk/people/boris.motik/pubs/mss05query-journal.pdf"&gt;papers&lt;/a&gt; on the topic. For a start, I'm not sure how you'd even do this without making sure your variables in the head all come from the body. Second, the only way this would work (that I know of) is to start introducing blank nodes for existential statements.... and that way lies madness.&lt;br /&gt;&lt;br /&gt;For instance, if you define (somewhat informally):&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;#x2200;x &amp;#x2208; Man &amp;#x2192; &amp;#x2203;y : Man(y) &amp;#x22c0; father(y,x)&lt;br /&gt;&lt;br /&gt;Then simply by saying &lt;em&gt;Man(fred)&lt;/em&gt; you have an infinite loop. Incidentally, this is a trivial demonstration of how hard it can be to model the real world. The simple solution is to somehow incorporate a new type, like &lt;em&gt;Men-without-fathers&lt;/em&gt;, and put that in your rule (hmmm, doesn't the DL-Handbook mention something like that?). 
Whether you introduce an entity named &lt;em&gt;adam&lt;/em&gt; or somehow model evolution (good luck there) is up to you.&lt;br /&gt;&lt;br /&gt;Back to the example... Of course, in OWL you can just create a blank node for an unknown father, but if you're going to take it that far then you want to create a blank node for the father of the first blank node, etc. Maybe it's reasonable to simply create that first step, and not reason further on blank nodes, but now you're making a judgment call that:&lt;br /&gt;a) May not prove to be as useful as you'd envisaged.&lt;br /&gt;b) May have implications for your logic.&lt;br /&gt;&lt;br /&gt;Besides, what's the point in inferring a new node that you can't perform further inferences on? You'd just have a node there not saying anything except that it's a "father". But if you want to include it in a rule for determining &lt;em&gt;ancestor(x,y)&lt;/em&gt;, then suddenly it can be re-inferred on again, and you run the risk of an infinite loop once more.&lt;br /&gt;&lt;br /&gt;So DL-safe rules just make sense in OWL (at least, they do to me). It's strange to see people like Boris Motik take them so seriously.&lt;br /&gt;&lt;br /&gt;Speaking of Boris, he basically wrote the thesis I was hoping to write (well, sort of - fortunately I have a &lt;em&gt;few&lt;/em&gt; ideas of my own). I came to many of the same conclusions that he has, simply by virtue of implementing stuff for Mulgara (though by virtue of having another child, moving countries, interrupting my candidature, and holding down a full time job, I didn't publish anything in time). The difference between what I would have written and what Boris &lt;em&gt;did&lt;/em&gt; write, is that he knows the theory &lt;em&gt;way&lt;/em&gt; better than I'm ever going to have the time for. I mean, I can follow it all, but it would never have occurred to me to give such algebraic formalism to everything the way he did. 
It's a little humbling to see someone do something like that so much better than you would have done.&lt;br /&gt;&lt;br /&gt;Oh well. I guess I'd better stop procrastinating and write some more.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-3606862386089652557?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/3606862386089652557/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=3606862386089652557' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3606862386089652557'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3606862386089652557'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/06/thesis-ive-finally-started-writing-my.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-77846031883464424</id><published>2008-05-26T22:02:00.005-05:00</published><updated>2008-05-26T23:11:14.290-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='SPARQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><title type='text'></title><content type='html'>&lt;h3&gt;Mulgara Alpha&lt;/h3&gt; My last few weeks were spent trying to get Mulgara's &lt;a href="http://en.wikipedia.org/wiki/SPARQL" title="SPARQL" rel="wikipedia" target="_blank" class="zem_slink"&gt;SPARQL&lt;/a&gt; interfaces ready before the &lt;a 
href="http://www.semantic-conference.com/"&gt;Semantic Technology Conference 2008&lt;/a&gt;. I met the criteria Amit (from &lt;a href="http://www.topazproject.org/"&gt;Topaz&lt;/a&gt;) and I had agreed to beforehand, which allowed me to get out an &lt;a href="http://mulgara.org/news.html#date150508"&gt;Alpha release&lt;/a&gt; for the next version of Mulgara. There are still a couple of things missing, but the basics are all there now.&lt;br /&gt;&lt;br /&gt;The road to SPARQL took a couple of turns I hadn't expected.&lt;br /&gt;&lt;br /&gt;Back in February we were approached by &lt;a href="http://www.aduna-software.com/"&gt;Aduna&lt;/a&gt; who asked if we would be willing to support a level of integration between &lt;a href="http://www.openrdf.org/"&gt;Sesame&lt;/a&gt; and Mulgara. While none of the Mulgara developers had the time to work with them directly, we said that we would be very happy to try to support Aduna where we could. The majority of this work was done by &lt;a href="http://leighnet.ca/"&gt;James Leigh&lt;/a&gt; (a programmer who commands my respect more and more on a daily basis), and he was able to get it all done in remarkable time. Even more impressive was that his integration work is 100% SPARQL compliant, even though some of the underlying structure isn't quite there yet!&lt;br /&gt;&lt;br /&gt;My own work was to:&lt;ul&gt;&lt;li&gt;Parse SPARQL queries.&lt;/li&gt;&lt;li&gt;Convert this into the Mulgara Algebra.&lt;/li&gt;&lt;li&gt;Write new algebraic operations in the Mulgara query engine.&lt;/li&gt;&lt;/ul&gt;The work by Aduna was going to overcome the need for the first and second tasks, but I had already completed the first when we heard from Aduna, with most of the work left to be done required for both the &lt;a href="http://www.openrdf.org/doc/sesame/api/org/openrdf/sesame/sail/package-summary.html"&gt;SAIL&lt;/a&gt; interface and my own SPARQL implementation. 
Since this was the case, I decided to continue with my own interface, since there wasn't going to be much redundant work from that point onwards. Even with both interfaces working correctly, the SAIL API will be the one to use, as it also includes a SPARQL Protocol endpoint, which I haven't looked at yet.&lt;br /&gt;&lt;br /&gt;While the SAIL integration may have appeared to be independent from my own work, it turned out that James's contribution was invaluable. His need to pass all the SPARQL tests drove a lot of my query engine work, pointing out both missing features and bugs I was unaware of. I still have a couple of things to go, but James has been able to work around them at the higher layers for the time being. This has a performance penalty, but these will be dealt with in the next couple of weeks.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Notable Feature Implementations&lt;/h3&gt; &lt;h4&gt;Language Tags&lt;/h4&gt; One missing feature that completely floored me was that Mulgara was not supporting language tags on untyped literals. It turns out that this was slated for addition just as Tucana was closed, which is why it never made it. Even so, I must admit that I was surprised that it took that long for this feature to be scheduled!&lt;br /&gt;&lt;br /&gt;Fortunately, language tags were quick and easy to implement. The main issue was in the existing tests, as nearly half of our files use literals with language tags in them, and none of the "expected results" included them.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Repeating Variables&lt;/h4&gt; Another issue was in "basic graph patterns" that use a repeating variable. Mulgara already had some code to deal with this, but it was failing in most cases. Unfortunately, I responded to this as a "bug report", and fell into the trap of fixing the existing code. 
I got it working after a day, only to be told the next day that it still failed if the variable is repeated in the position of the graph name.&lt;br /&gt;&lt;br /&gt;At that point I stepped back from the problem, and realized that the solution was actually quite easy. All you need do is replace the repeating variable with a set of unique names, and create a conjunction of the constraint repeated with the variables in rotating positions. After mentioning this to Andrae he informed me that he'd worked this out a few years before (even though someone else was implementing the code at the time), but he forgot to let me know. Oh well, at least I'm doing it correctly now.&lt;br /&gt;&lt;br /&gt;While looking to implement this fix, I realized that the best way to perform this substitution would be via Andrae's query transformation SPI. This lets you search through a query structure, and replace elements with something more appropriate for the engine to work with. It was while working with this that I realized it provides me with a tool that will let me solve a problem I've had for some time.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Transitive&lt;/h4&gt; The &lt;em&gt;trans&lt;/em&gt; feature in Mulgara is a mechanism that lets the user mark the predicate in a constraint as &lt;em&gt;transitive&lt;/em&gt;. While it works really well, the syntax in TQL is ugly. However, the query transformer offers an alternative. Instead of wrapping a standard constraint in a &lt;em&gt;trans(...)&lt;/em&gt; operator, the predicate can be typed as being &lt;em&gt;transitive&lt;/em&gt; in a separate constraint. I was tempted to use the URI of &lt;code&gt;owl:TransitiveProperty&lt;/code&gt; for this task, but this will interfere with declarations in ontologies, so a local URI will be much more appropriate (something like &lt;code&gt;mulgara:TransitivePredicate&lt;/code&gt;). The &lt;em&gt;really&lt;/em&gt; cool thing is that this will be sharable with SPARQL queries as well. 
That means we can start opening some of our functionality up to SPARQL users, while not needing to extend the syntax of that language. In fact, there are a few functions we can implement in this way, allowing us to do a lot in SPARQL without sacrificing the speed and functionality of TQL.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Date Times&lt;/h4&gt; One question I regularly received from James was about date times. Unfortunately, Mulgara stores these canonically (using UTC), and hence does not round-trip these values. The solution is to store the timezone offset along with the value. Another tricky thing is to record if a time of "midnight" is recorded as "00:00:00" or as "24:00:00", as both are valid, and both need to be returned as they were provided, and not in a normalized form. I haven't done this one yet, but I expect to get it done by the end of the week.&lt;br /&gt;&lt;br /&gt;I had a comment from Andy Seaborne that despite timezones being described in hours and minutes, this only requires a resolution of quarter-hour intervals, so I can probably squeeze this into some existing storage somewhere. I appreciate the advice, but it leaves me wondering which timezone appears with a 15 minute offset from its nearest neighbors!&lt;br /&gt;&lt;br /&gt;In the meantime, James got around the problem by removing the &lt;code&gt;xsd:dateTime&lt;/code&gt; specific code from the version of Mulgara he is working with, so it gets treated as an unknown type. This modification can be removed as soon as I fix the issue (which I expect to be by the end of this week).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Memorial Day&lt;/h3&gt; There is still an enormous amount of information to cover on Mulgara, SPARQL, and especially the SemTech conference, but I'm falling asleep as I type. It's currently Memorial Day here in the USA, and since getting back from the conference on Friday night, I've had a huge weekend with my family. 
Yesterday I took both of the boys in a trailer for "&lt;a href="http://www.bikethedrive.org/"&gt;Bike the Drive&lt;/a&gt;", which is a lot more cycling than I've done for a few months. Swimming and running have kept me relatively fit, but it still tired me out! Consequently I just can't think now, so I'll pick this up again later.&lt;div class="zemanta-pixie" style="margin: 5px 0pt; width: 100%;"&gt;&lt;a class="zemanta-pixie-a" href="http://www.zemanta.com/" title="Zemified by Zemanta"&gt;&lt;img class="zemanta-pixie-img" src="http://img.zemanta.com/pixie.png?x-id=d609d394-58ad-44ee-a3eb-a6d4d7e94e85" style="border: medium none ; float: right;"&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-77846031883464424?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/77846031883464424/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=77846031883464424' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/77846031883464424'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/77846031883464424'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/05/mulgara-alpha-my-last-few-weeks-were.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-6102132533686860203</id><published>2008-04-13T14:01:00.002-05:00</published><updated>2008-04-13T21:12:35.224-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='indexing'/><category scheme='http://www.blogger.com/atom/ns#' term='XA2'/><category scheme='http://www.blogger.com/atom/ns#' term='RDF'/><title type='text'></title><content type='html'>&lt;h3&gt;Writing 2-columns&lt;/h3&gt; In my last post I described a scheme for representing 2 columns. But the moment I first thought of it, I decided it was too impractical. After all, each "triple" gets represented with 10 entries. If I want to include a graph identifier (i.e. a Quad store) then it goes up to 12 entries. If I want to cut down on disk seeking, then the idea seemed to be of little more than academic interest.&lt;br /&gt;&lt;br /&gt;Then a little while ago I was explaining this scheme to a friend (Inderbir), and in the process I tried to explain why this was going to be impractical, but in the course of the discussion a few things occurred to me.&lt;br /&gt;&lt;br /&gt;The statements to represent form a series of "doubles", which need to be indexed two ways: by each column. The data for a single statement will appear like this:&lt;pre&gt;&lt;code&gt;  Statement, _statement_x&lt;br /&gt;  SubjectIdentifier, _subject_x&lt;br /&gt;  PredicateIdentifier, _predicate_x&lt;br /&gt;  ObjectIdentifier, _object_x&lt;br /&gt;  _statement_x, _subject_x&lt;br /&gt;  _statement_x, _predicate_x&lt;br /&gt;  _statement_x, _object_x&lt;br /&gt;  _subject_x, my:subject&lt;br /&gt;  _predicate_x, my:predicate&lt;br /&gt;  _object_x, my:object&lt;/code&gt;&lt;/pre&gt;Where anything whose name starts with an underscore is a unique identifier. 
As I'd already mentioned, now that we use 64 bit identifiers in Mulgara, it makes sense to create these from an incrementing &lt;code&gt;long&lt;/code&gt; value.&lt;br /&gt;&lt;br /&gt;Given that each identifier only gets used for one statement, then the statement, subject, predicate, and object identifiers will all be allocated together, and will be consecutive. Indeed, if these identifiers are kept separate from the identifiers that will be allocated for the URIs and Literals of the statement, then the statement can be presumed to always be a multiple of 4, and the subject, predicate, and object identifiers will be 1, 2, and 3 greater, respectively. This means that the bottom two bits of the IDs can be used to represent the type of the ID, meaning that the first 4 statements in the above list can be inferred, rather than stored. Also, since the IDs for the subject, predicate, and object positions can be calculated by adding 1, 2, or 3, then the next three statements don't need to be stored either. Cutting the data down to 3 entries suddenly makes it look more interesting.&lt;br /&gt;&lt;br /&gt;I should note at this point that I still expect to represent URIs and Literals with IDs that can be mapped to or from the data they represent. While the mechanism for doing this in Mulgara needs to be improved, it is still an important concept, as it reduces redundant storage of strings, and the comparison of Long values allows for faster joins. However, I do intend to return to this idea.&lt;br /&gt;&lt;br /&gt;After reducing the data to be stored, we now have:&lt;pre&gt;&lt;code&gt;  _subject_x, my:subject&lt;br /&gt;  _predicate_x, my:predicate&lt;br /&gt;  _object_x, my:object&lt;/code&gt;&lt;/pre&gt;Indeed, since each of those IDs are consecutive, and always increasing, then in the index that is sorted by the first column, all three statements will go to the end of the file. 
This means that the file need not ever have a seek operation performed on it while it is being written to. Operating systems are usually optimized for append-only writing, so this is another bonus.&lt;br /&gt;&lt;br /&gt;It is also worth noting that since the predicates are always consecutive, there is no need to write each of them either. Instead, the following can be written for each statement:&lt;pre&gt;&lt;code&gt;  _statement_x, my:subject, my:predicate, my:object&lt;/code&gt;&lt;/pre&gt;With this data, all of the above can be inferred. Indeed, the need for the statement to take up an ID on its own can be dropped, and the subject, predicate, and object IDs are calculated by adding 0, 1, and 2 to the first ID. This leaves space for a fourth element, such as a graph identifier, before needing more than 2 of the low-order bits to give the type of the identifier.&lt;br /&gt;&lt;br /&gt;In the other file, we will be storing the same data in reverse order:&lt;pre&gt;&lt;code&gt;  my:subject, _subject_x&lt;br /&gt;  my:predicate, _predicate_x&lt;br /&gt;  my:object, _object_x&lt;/code&gt;&lt;/pre&gt;In this case, the identifiers for the URIs, blank nodes, and literals of the subject, predicate and object will be all over the place (and will be regularly re-used), so there is no guarantee of ordering here. This means we have to go back to standard tree-based indexing of the data. However, we only have 3 search operations to go through here, which is significantly better than the searching we currently do in Mulgara.&lt;br /&gt;&lt;br /&gt;Note that all of the above applies to statements with more than 3 elements as well. Each new element in a statement increases the size of the single write on the first index by one more &lt;code&gt;long&lt;/code&gt; value, and adds one more seek/write operation to the second index. 
This is far less expensive than expanding the size of the "complete" indexes used in Mulgara.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Retrieving&lt;/h3&gt; I'll stop for a moment, and take a look at what a read operation looks like.&lt;br /&gt;&lt;br /&gt;The first index file is written to linearly. Each record is identical in size, and the ID that starts the record is monotonically increasing. If the store were write-once-read-many (WORM), then the ID could be skipped altogether as this information would be inferred from the offset within the file. This may be useful for some applications, but I'd prefer to delete information in place (rather than creating a white-out list for later merging), meaning that the ID is still required in this case.&lt;br /&gt;&lt;br /&gt;For this kind of structure, the file can be searched using a binary search. Also, the largest offset that an ID can appear at is the value of that ID multiplied by the size of a record, meaning that the number of seeks required for a search can be greatly reduced.&lt;br /&gt;&lt;br /&gt;The second index is a standard tree. B-Trees are well known for not seeking much, so for a first cut, I would suggest this (though Andrae has other plans further down the line).&lt;br /&gt;&lt;br /&gt;Finding all statements that match one element (say, the predicate) requires a search on the tree index, to find the first time that URI appears. The associated predicate ID is paired with a set of IDs that represent the use of that URI in statements (sometimes as predicate, sometimes as subject or object). These IDs are in consecutive order, and so can be merged with the first index as a linear operation. 
Adding in another element to search by (say, we are looking for a given predicate/object pair) requires another search on the second index, and another linear merge.&lt;br /&gt;&lt;br /&gt;Linear merges aren't too bad here, as it is always a linear operation to go through all of the data anyway (meaning that it can't be avoided). The only case where this is an unnecessary expense is if the "count" of a set of statements is required.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Efficiency in the Tree&lt;/h3&gt; While considering the above structure, it occurred to me that this index is having to store identifiers for the RDF nodes over and over, even though they all appear next to one another. There are ways of compressing this, but it made me question the redundancy altogether. What if the item was just stored once, and the "satellite data" (to use the term for data associated with a key) was instead its own structure? I thought that maybe this could be a tree, but then it occurred to me that the data represents statement IDs, and will therefore always be inserted in increasing order. So a list is most appropriate.&lt;br /&gt;&lt;br /&gt;So now I could have each entry in this tree point to a list of statements that this RDF node participates in. Since the list will always be appended to, it makes sense that this is kept in another file, using a linked list of blocks. However, to cut down on seeks, the first few elements of the list would do well to appear with the node in the original tree.&lt;br /&gt;&lt;br /&gt;So what sort of satellite data should be stored? For reading, the head of the list has to be stored, though as just mentioned, I think that this should be inline with the satellite data. The tail of the list should also be stored, else it would require a linear seek to work out where to insert, and this is not scalable. To give some help with management of the list, the size should also be recorded. 
This also makes counting trivial.&lt;br /&gt;&lt;br /&gt;Up until now there has been a presumption that the identifiers of elements in a statement follow a particular bit pattern. However, if the satellite data contains three lists instead of one, then the number of the list is enough to indicate which position the node is used in. For instance, the node of &lt;code&gt;&amp;lt;rdf:type&amp;gt;&lt;/code&gt; may have a few entries in the list for &lt;em&gt;subject&lt;/em&gt; (indicating that it is the "subject" in just a few statements), may have a few entries in the &lt;em&gt;object&lt;/em&gt; list (indicating that there are a few statements which refer to this URI), but will have millions (or more) statements in the &lt;em&gt;predicate&lt;/em&gt; list, because this URI indicates a commonly used predicate.&lt;br /&gt;&lt;br /&gt;If the presence of a statement ID in one list or another indicates that this node is used in a particular capacity for that statement, then this means that the presumption of using the low-order bits of the ID for this purpose is removed. That gives us a little more flexibility.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Data Pool&lt;/h3&gt; All of the above presumes that there exists a mechanism to map URIs, strings, and other literal data onto an ID, and to map those IDs back into the original data. Historically, Mulgara has referred to the store that performed this operation as the "String Pool". Since URIs are encoded as a string, and the first iteration of Mulgara only stored literals in a lexical form, this name was accurate. However, with the inclusion of numbers, dates, and other datatypes, it might be more accurate to refer to this construct as a "Data Pool" instead.&lt;br /&gt;&lt;br /&gt;Part of the data pool structure of Mulgara uses a tree containing some (or all) of the data as a key, and a long value as the ID it is mapped to. Storing entries that are keyed on strings or other data is a lot like the second index just mentioned. 
So now I started to reconsider the presumption of a separate data pool altogether.&lt;br /&gt;&lt;br /&gt;Instead of writing to the linear file first, the idea is to write to the tree index first. This involves a search. If the data is found, then the statement ID will be appended to the end of the appropriate list (this updates the linked list block, possibly spilling over into a new block, and then rewrites the tail/size of this list in the tree). If the data is &lt;strong&gt;not&lt;/strong&gt; found, then a new entry is placed in the tree, two lists are initialized to nil, and the third is given the allocated statement ID. The list is not yet long enough to spill into the file full of linked lists, so this isn't too expensive. For a B-Tree with space, this will require writing of just a single block!&lt;br /&gt;&lt;br /&gt;Now it isn't feasible to store &lt;em&gt;everything&lt;/em&gt; in the tree as a key, so only the head of the data would need to go directly into the tree. The remainder of the data is still needed, but rather than trying to manage this data re-usably, the ideas from last post about keeping all the data in the pool can be adopted. In this case the data can simply be appended to a third file. The offset of this append then becomes the ID of that data. This ID is stored along with the rest of the satellite data in the tree. It is also the ID that gets stored in the first linear index file which can now be written to.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;More Efficiency&lt;/h3&gt; So now instead of a "Data Pool" and 2 files, the design is now for 4 files. Two of them are only ever appended to, one always has direct seeks before writing, and only one of them is a tree that requires searching before a write can happen. Given that this is the entire store, then that's not too shabby! 
It's a darn sight better than the 196 files in Mulgara, almost all of which need multiple seeks to do &lt;em&gt;anything&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;But can I do better?&lt;br /&gt;&lt;br /&gt;Andrae had already been looking at reworking the string/data pool, and a lot of things are quite obvious to do. For a start, any data that can fit into 54 bits (or so) ought to have its value encoded into its ID, with the top bits used for type identification. That many bits lets you encode all bytes, chars, shorts, ints and floats, as well as the majority of long values (and possibly a lot of doubles as well). Any date within a century of now will also fit in. This means that many items that are not strings don't need any extra storage. So along with the type bits, there would be another bit to indicate whether or not the data is encoded in the ID, or if it is found in the data file. Anything that can be encoded into the ID won't have to go into the data file, though it would still go into the indexes so statements using it can still be found. The main difference is that any statements discovered to contain one of these IDs would not require the extra seek to get the remaining information.&lt;br /&gt;&lt;br /&gt;Another significant change has already been proposed by Andrae over a year ago. In this case, the different types of the data will be stored in different indexes, which are each optimized to handle such data. This increases the number of files, but only one of these files will be accessed at a time. 
Also, since all of these types are literals, there is no need for lists describing &lt;em&gt;subject&lt;/em&gt; or &lt;em&gt;predicate&lt;/em&gt; statements.&lt;br /&gt;&lt;br /&gt;Similarly, blank nodes will have their own file, only they will not require any extra data beyond the lists, and no predicate list will be required.&lt;br /&gt;&lt;br /&gt;Getting back to the fundamental types of strings and URIs, Andrae pointed out that &lt;a href="http://en.wikipedia.org/wiki/Trie"&gt;Tries&lt;/a&gt; are an appropriate structure for reducing space requirements. This is perfect for managing the plethora of URIs that appear in the same namespace (or that just start with "http://"), as common prefixes to strings are not repeated in this structure. Like other tree structures, this would let us store arbitrary satellite data, meaning they are perfectly adaptable to this structure.&lt;br /&gt;&lt;br /&gt;Interestingly, if we expand the trie to become a suffix trie, then we can get full-text searching, which is one of the most common requests that Mulgara gets.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Hashed Predicates&lt;/h3&gt; The example I gave above about how &lt;em&gt;&amp;lt;rdf:type&amp;gt;&lt;/em&gt; mostly participates in statements as a predicate is typical of many predicates. In many situations, the list of predicates to be used is quite small. In particular, there are likely to be just a few predicates that will be used the majority of the time, such as &lt;em&gt;&amp;lt;rdf:type&amp;gt;&lt;/em&gt;, &lt;em&gt;&amp;lt;rdfs:domain&amp;gt;&lt;/em&gt;, &lt;em&gt;&amp;lt;rdfs:range&amp;gt;&lt;/em&gt;, as well as many application-specific values.&lt;br /&gt;&lt;br /&gt;Since these URIs are going to be accessed all the time, there isn't a lot of point in burying them deep in the URI tree. Instead, the most common URIs could each be given their own file, which indicates the "predicate statement list" for those URIs. 
Those URIs can be included in the tree for their subject and object lists, but the code that searches for predicates would skip the tree and go directly to the file instead. Any operations which require iterating over all the predicates can insert these values in via the algorithm, rather than getting them from the tree structure.&lt;br /&gt;&lt;br /&gt;However, which URIs would be stored this way? This may vary from one application to another. So instead of hard-coding the values in, they could be placed in a configuration file. Then the application would know to map these values directly to their own files instead. Since the filenames can be allocated by the system, they can be created with a hashing algorithm, or possibly be placed in the configuration file along with the predicate URI list.&lt;br /&gt;&lt;br /&gt;I'd still prefer to configure this rather than allow ALL predicates to be done this way, as any predicates that are not used so commonly will not take the resources of another file. It also allows the system to have an arbitrary number of predicates beyond the most commonly used. But by having these files dedicated to common predicates, any requests for statements with a given predicate will require a single seek to the start of that file, and will immediately give the list, along with its size.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Comparisons&lt;/h3&gt; The evening after I presented this to the Mulgara/Topaz developers back in March, I happened to attend a presentation given on applying columnar databases to RDF. This described storing subject/object pairs in files, with one file per predicate. This particular optimization is similar, but it has a good fallback for when you run out of files for your predicates (after all, searching in a good-sized B-Tree typically only requires a couple of seeks). 
This scheme also provides the ability to search for statements on subject or predicate, which apparently is less efficient in the presented system.&lt;br /&gt;&lt;br /&gt;A nice feature that is shared by both this scheme and the columnar scheme is that selecting statements always gives sorted values that can be joined with linear merge-joins.&lt;br /&gt;&lt;br /&gt;However, given the flexibility of this structure, I've been encouraged to write it up and let people know about it. Well, I've started that, but I thought it would be good to get &lt;em&gt;something&lt;/em&gt; out there straight away, hence this post.&lt;br /&gt;&lt;br /&gt;In the meantime, in amongst my SPARQL work I'm trying to build a proof-of-concept. I've done the complexity calculations to see both the worst case and the expected case, but it doesn't take much effort to see that it involves a massive reduction in the seeking, reading and writing done by Mulgara at the present. I won't be including all the optimizations discussed here, but I still expect it to be around two orders of magnitude faster, and to take up a couple of orders of magnitude less space.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Final Notes&lt;/h3&gt; None of the above discusses deletions, transactions, or any of that other stuff needed to make a database useful in the real world. These issues haven't been forgotten, but in order to present the structure I wanted to concentrate on the minimalism in reading and writing to the structure.&lt;br /&gt;&lt;br /&gt;My plan for deletions is to go through the various lists and mark them with invalid identifiers (e.g. -1). These will have to be skipped linearly during read operations, which means that removing data has little impact on speed (except that blank IDs will never need to be converted into URIs, Literals, etc). At a later time, either by an explicit cleanup operation, or a background task, a cleanup thread will compact the data by shifting it all down to fill the gaps. 
Of course, this will require some locking for consistency, though since everything is ordered, there may be a chance to minimize locking by skipping any data that repeats or appears out of order.&lt;br /&gt;&lt;br /&gt;Andrae has also spent a lot of time working on a theoretic framework for concurrent write transactions in RDF. His work is quite detailed and impressive. Fortunately, the engineering application of this work is completely consistent with this framework, so we hope to eventually integrate the two. In the meantime, Andrae's work will form the basis for XA2, which in turn will be taking a few avenues to permit this scheme to be easily integrated at a later date.&lt;br /&gt;&lt;br /&gt;So for now, I have to get SPARQL up and running, while also looking for time to finish the proof of concept and writing everything up. I suppose I should be doing that instead of blogging.  :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-6102132533686860203?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/6102132533686860203/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=6102132533686860203' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6102132533686860203'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6102132533686860203'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/04/writing-2-columns-in-my-last-post-i.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-6149780138894045592</id><published>2008-04-08T23:35:00.003-05:00</published><updated>2008-04-09T11:50:36.591-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='indexing'/><category scheme='http://www.blogger.com/atom/ns#' term='64 bit'/><category scheme='http://www.blogger.com/atom/ns#' term='storage'/><category scheme='http://www.blogger.com/atom/ns#' term='RDF'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><title type='text'></title><content type='html'>&lt;h3&gt;Indexing&lt;/h3&gt; I'm in the process of writing a number of things up at the moment, including the following description of RDF storage. But since academic papers take so long to write, and they're boring, I thought I'd blog the main bit of one of the things I'm writing about.&lt;br /&gt;&lt;br /&gt;This all came about due to &lt;a href="http://gearon.blogspot.com/2004/08/proof-reading-once-again-its-way-too.html"&gt;a description&lt;/a&gt; I wrote a few years ago about the number of indices needed to store data that was &lt;em&gt;N&lt;/em&gt; columns wide. (Wow! Is it really 4 years?) It came down to a process and equation, of finding the maximum value of an expression, as &lt;em&gt;S&lt;/em&gt; varies from 1 to &lt;em&gt;N&lt;/em&gt;:&lt;pre&gt;max&lt;sub&gt;S=1..N&lt;/sub&gt; (&lt;em&gt;N&lt;/em&gt;!/((&lt;em&gt;N&lt;/em&gt;-&lt;em&gt;S&lt;/em&gt;)!&lt;em&gt;S&lt;/em&gt;!))&lt;/pre&gt;This gives a result of 3 indices for &lt;em&gt;symmetrically&lt;/em&gt; storing triples, 6 indices for quads, 10 indices for quintuples, and so on. Note that this is the number of indices needed if you want to be able to use &lt;em&gt;any&lt;/em&gt; search criteria on your tuples. 
This may indeed be the case for triples and quads, but if an element of the tuple becomes a unique ID (like it does for reification), then there is no need for symmetric indexing.&lt;br /&gt;&lt;br /&gt;The rapid growth of this expression is a clear indicator that we want to keep the number of columns as low as possible. For expediency Mulgara moved from 3 columns to 4, so that we could encode graph identifiers with the triples, but that came at the expense of doubling the number of indices. This is really a big deal, as each index in Mulgara takes several files for managing the resources in the index, and for holding the index itself. Each piece of information that has to be read or written means another disk seek. This can be mitigated by read and write-back caching by the operating system, but once the amount of data exceeds what can be handled in memory, these benefits evaporate. So keeping the number of indices down matters.&lt;br /&gt;&lt;br /&gt;Ronald Brachman's work in '77 shaped the future direction of description logics, including the use of the idea that everything can be represented using binary and unary predicates. RDF is defined using binary predicates, and unary predicates are simulated using the &lt;code&gt;rdf:type&lt;/code&gt; predicate, which means that RDF is inherently capable of representing description logics, and indeed, any kind of knowledge representation. The issue is that it can be inefficient to represent certain kinds of structures.&lt;br /&gt;&lt;br /&gt;The RDF representation of &lt;a href="http://www.w3.org/TR/rdf-schema/#ch_reificationvocab"&gt;reification&lt;/a&gt; requires 3 statements (plus one that can be inferred) and these are independent of the actual statement itself. An extra column can eliminate these 3 statements altogether, but the indexes grow accordingly. Graph membership can be accomplished using extra statements as well, and again, this can be trivially eliminated with an extra column. 
The question is, when do the extra columns (with the consequent combinatorial growth in indices) become more expensive than adding in more statements? Should the number of columns be limited to 4? To 3?&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;2 Columns&lt;/h3&gt; I always found it interesting that the equation above has a solution for &lt;em&gt;N&lt;/em&gt;=2. I considered this to be an artifact of the equation, but it bugged me all the same. So then a couple of years ago I gave it some thought, and realized that it is indeed possible to represent a triple using "doubles". Of course, once a triple can be represented, then anything can be represented. The question is efficiency.&lt;br /&gt;&lt;br /&gt;If the indices were to contain only 2 columns, then this means that only unary predicates could be used. This implies that the predicates define a type. After some thought I realized that I could use unique types to identify each element of an RDF statement, and then a unique type to represent the statement itself. 
Of course, there is nothing new under the sun, and just recently I discovered that the &lt;a href="http://citeseer.comp.nus.edu.sg/36655.html"&gt;CLASSIC&lt;/a&gt; system introduced unique atomic concepts for each individual in the system in a similar way.&lt;br /&gt;&lt;br /&gt;To map the following triple:&lt;pre&gt;&lt;code&gt;  &amp;lt;my:subject&amp;gt; &amp;lt;my:predicate&amp;gt; &amp;lt;my:object&amp;gt;&lt;/code&gt;&lt;/pre&gt; to unary predicates, I used a scheme like the following:&lt;pre&gt;&lt;code&gt;  Statement(_statement_x)&lt;br /&gt;  SubjectIdentifier(_subject_x)&lt;br /&gt;  PredicateIdentifier(_predicate_x)&lt;br /&gt;  ObjectIdentifier(_object_x)&lt;br /&gt;  _statement_x(_subject_x)&lt;br /&gt;  _statement_x(_predicate_x)&lt;br /&gt;  _statement_x(_object_x)&lt;br /&gt;  _subject_x(my:subject)&lt;br /&gt;  _predicate_x(my:predicate)&lt;br /&gt;  _object_x(my:object)&lt;/code&gt;&lt;/pre&gt;Where each of &lt;code&gt;_statement_x&lt;/code&gt;, &lt;code&gt;_subject_x&lt;/code&gt;, &lt;code&gt;_predicate_x&lt;/code&gt; and &lt;code&gt;_object_x&lt;/code&gt; is a unique identifier, never to be used again. In fact, my use of underscores as a prefix here indicates that I was thinking of them as a kind of blank node: unique, but without a distinguishing label.&lt;br /&gt;&lt;br /&gt;When I first came up with this scheme, I thought it a curiosity, but hardly useful. It seemed that significant work would need to be done to reconstruct a triple, and indexing so many items would require a lot of seeking on disk. 
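&lt;br /&gt;&lt;br /&gt;To make that reconstruction cost concrete, here is a minimal sketch in Python (used for brevity; Mulgara itself is Java). It assumes the ten pairs above sit in a simple in-memory map from predicate to arguments, and the names &lt;code&gt;build_index&lt;/code&gt; and &lt;code&gt;reconstruct&lt;/code&gt; are hypothetical. Each map lookup stands in for an index search that could cost a disk seek:&lt;pre&gt;&lt;code&gt;
from collections import defaultdict

# Each pair is (unary predicate, argument), i.e. one row of a 2-column index.
PAIRS = [
    ("Statement", "_statement_x"),
    ("SubjectIdentifier", "_subject_x"),
    ("PredicateIdentifier", "_predicate_x"),
    ("ObjectIdentifier", "_object_x"),
    ("_statement_x", "_subject_x"),
    ("_statement_x", "_predicate_x"),
    ("_statement_x", "_object_x"),
    ("_subject_x", "my:subject"),
    ("_predicate_x", "my:predicate"),
    ("_object_x", "my:object"),
]

ROLES = ("SubjectIdentifier", "PredicateIdentifier", "ObjectIdentifier")


def build_index(pairs):
    """Group arguments by predicate, standing in for one sorted index."""
    index = defaultdict(set)
    for predicate, argument in pairs:
        index[predicate].add(argument)
    return index


def reconstruct(index):
    """Rebuild (subject, predicate, object) tuples from the pairs."""
    triples = []
    for statement in sorted(index["Statement"]):
        parts = index[statement]
        triple = []
        for role in ROLES:
            # The element playing this role is in both the role type
            # and the statement.
            (element,) = parts.intersection(index[role])
            # Each element identifier has exactly one value in this sketch.
            (value,) = index[element]
            triple.append(value)
        triples.append(tuple(triple))
    return triples


print(reconstruct(build_index(PAIRS)))
&lt;/code&gt;&lt;/pre&gt;Under these assumptions, rebuilding this single triple takes eight map lookups, which is the kind of overhead that made the scheme look impractical at first.&lt;br /&gt;&lt;br /&gt;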
I was also concerned about the "reckless" use of the address space for identifiers in creating unique IDs for so many elements.&lt;br /&gt;&lt;br /&gt;Then recently I was describing this scheme to a friend, and I realized that, in light of some other ideas I'd been working on lately, there was something to this scheme after all.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Disk Seeking&lt;/h3&gt; I've been very disappointed with Mulgara's loading speed on certain types of data recently. If the data had a lot of unique URIs and strings, then the store was getting too large, and the data was taking too long to load. I was also surprised at the gigabytes of file storage being used when the data files were only a few hundred megabytes. Mulgara is supposed to be scalable, and this wasn't acceptable behavior.&lt;br /&gt;&lt;br /&gt;Consequently, I've been doing more work with algorithms and data structures recently. I have not been trying to supplant &lt;a href="http://mulgara.org/trac/wiki/XA2Proposals"&gt;Andrae's work&lt;/a&gt; but was instead hoping to tweak the existing system a little in order to improve performance.&lt;br /&gt;&lt;br /&gt;The first thing that becomes apparent is that the plethora of files in Mulgara is a real bottleneck. Each file on its own may be efficient (not all are), but cumulatively they cause a disk to seek all over the place. Since this is probably the single most expensive action a computer can take (other than a network request), reducing the seeks is a priority.&lt;br /&gt;&lt;br /&gt;Profiling the code led to a couple of improvements (these have been rolled into the &lt;a href="http://mulgara.org/news.html#date030408"&gt;Mulgara 1.2 release&lt;/a&gt;), but also showed that the biggest issue is the String Pool (more properly called the "Data Pool" since it now stores any kind of data). 
This is a facility that maps any kind of data (like a URI or a string) to a unique number, and maps numbers into the data they represent. With a facility like this, Mulgara is able to store triples (or quads) as groups of numbers. We call these numbers "Graph Nodes", or &lt;em&gt;gNodes&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;The string pool was spending a lot of time just searching to see if a URI or string to be inserted into the graph was already mapped to a number, and inserting it if not. Some work was also being done to keep track of what had been allocated in a given transaction phase, so that any allocated resources (like disk blocks) could be freed and reallocated if the data were ever removed. However, items are rarely removed from the string pool. Removals mostly occur when an entire graph is dropped, and these graphs are often dropped just before a slightly modified version of the same data is to be inserted. In this case, the same data will be removed from the string pool, and then re-inserted. That's a lot of work for nothing. It makes much more sense to leave everything in the string pool, and only remove unused items when explicitly requested, or perhaps as a background task. (Unused items can be easily identified since they don't exist in the statement indices).&lt;br /&gt;&lt;br /&gt;If the string pool were changed to be a write-once-read-many pool, then a lot of the structures that support resource reuse (Free Lists, which are a few files each) can be removed from the string pool. Of course, the reduced reading/writing involved with removing and re-inserting data would also help. So this looked promising.&lt;br /&gt;&lt;br /&gt;Another idea is to take any data that fits into fewer than 64 bits (say, 58 bits) and store it directly in the ID number instead of in the pool. The top bits can then indicate the type of the value, and whether or not it is "stored" or if it is simply encoded in the ID. 
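&lt;br /&gt;&lt;br /&gt;A minimal sketch of that packing, in Python for brevity (Mulgara itself is Java). The 6/58 bit split follows the "say, 58 bits" figure above; the tag values and function names are hypothetical, and the ID is treated as an unsigned 64-bit number:&lt;pre&gt;&lt;code&gt;
# Assumed layout: 6 tag bits above a 58-bit payload in a 64-bit ID.
PAYLOAD_BITS = 58
PAYLOAD_LIMIT = 2 ** PAYLOAD_BITS

# Hypothetical tag values; the real assignment of type codes is not given.
TAG_POOL = 0       # payload is a key into the string pool
TAG_INTEGER = 1    # payload is a non-negative integer stored inline
TAG_TIMESTAMP = 2  # payload is milliseconds since the Unix epoch, inline


def encode(tag, payload):
    """Pack a type tag and a payload into a single ID."""
    if tag not in range(2 ** (64 - PAYLOAD_BITS)):
        raise ValueError("tag does not fit in the top bits")
    if payload not in range(PAYLOAD_LIMIT):
        raise ValueError("payload does not fit inline")
    return tag * PAYLOAD_LIMIT + payload


def decode(node_id):
    """Split an ID back into (tag, payload)."""
    return divmod(node_id, PAYLOAD_LIMIT)


def is_inline(node_id):
    """True when the value is held in the ID itself rather than the pool."""
    tag, _ = decode(node_id)
    return tag != TAG_POOL
&lt;/code&gt;&lt;/pre&gt;With these assumptions, &lt;code&gt;encode(TAG_INTEGER, 42)&lt;/code&gt; gives an ID that &lt;code&gt;decode&lt;/code&gt; turns back into the tag and the number 42 without touching the pool at all. Negative numbers and wider types would need their own encoding, which this sketch leaves out.&lt;br /&gt;&lt;br /&gt;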
Inline encoding covers a surprising range of required numbers, and most dates as well. This idea was mentioned to me in SF last year, and it sounded good, only I had &lt;em&gt;completely&lt;/em&gt; forgotten that Andrae had already proposed it a year before (sorry Peter, you weren't first). But wherever the idea came from, it promised to dramatically help dates and numbers. In fact, it helps all the data, since the tree no longer has as many elements stored in it.&lt;br /&gt;&lt;br /&gt;There were also other ideas, such as changing the tree type of the index. We mitigated the use of AVL trees in the indices by using pointers to large blocks of data. However, this becomes a subtraction of a constant in the complexity analysis, while a wider tree becomes a division by a constant. Constants don't usually mean much in complexity analysis, but when each operation represents a disk seek, then the difference becomes significant. While this is something that must be looked at, it didn't make sense to do it now, knowing that XA2 is coming and that the trees will change anyway.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Address Space&lt;/h3&gt; You may have noticed that I'm talking a lot about resource reallocation, and 64 bits in the same breath. This shows some of the history of Mulgara. The system originally ran on 32 bits, where not reusing resources was a guaranteed way to wrap around in the number space and cause no end of problems. When the system was upgraded to 64 bits, it still made sense to manage resources for reallocation, as some resources were still limited. However, resources that represented IDs in an address space were not reconsidered, and they ought to have been. 
Looking at what literals could be encoded in a 64 bit value (and how many bits should be reserved for type data) was the impetus I needed to make me look at this again.&lt;br /&gt;&lt;br /&gt;Given that every resource we allocated took a finite time that was often bounded by disk seeks, it occurred to me that we were not going to run out of IDs. If we only used 58 bits, then we could still allocate a new resource every microsecond and not run out of IDs for over 9000 years. A more reasonable design period is 100 years (yes, this is a wide margin of safety), and constant allocation of resources at a microsecond per resource means that we still only need 52 bits. So we're safe not reusing IDs, and indeed, we have over a byte of information we can use in this ID to do some interesting engineering tricks.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Structure&lt;/h3&gt; So I had a number of these lessons fresh in mind when I recently tried to describe just why a 2 column store was inefficient. During the course of the conversation I started seeing ways in which I could apply some of these techniques in a useful way. It took a while for it to come together, but I now have something that really shows some promise.&lt;br /&gt;&lt;br /&gt;The details are reasonably involved, so it makes sense to take a break here, and write it all up in a fresh post in the next day or so. A little more sleep might also help prevent the rambling that I've noticed coming into this post.  
:-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-6149780138894045592?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/6149780138894045592/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=6149780138894045592' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6149780138894045592'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6149780138894045592'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/04/indexing-im-in-process-of-writing.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-8793337319364749737</id><published>2008-04-01T23:24:00.002-05:00</published><updated>2008-04-02T00:42:06.270-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OWL'/><category scheme='http://www.blogger.com/atom/ns#' term='RDF'/><title type='text'></title><content type='html'>&lt;h3&gt;Collections&lt;/h3&gt; So I'm trying to work out what is necessary in OWL, and what is necessary and sufficient. Actually, I just want "necessary and sufficient", but knowing the difference helps.  :-)&lt;br /&gt;&lt;br /&gt;Anyway, while working through this blog, I worked it out. 
But it probably won't hurt to write it down anyway...&lt;br /&gt;&lt;br /&gt;I had narrowed my problem down to the following:&lt;br /&gt;&lt;br /&gt;If I had a Collection like:&lt;pre&gt;  &amp;lt;rdf:Description rdf:about="http://example.org/basket"&amp;gt;&lt;br /&gt;    &amp;lt;ex:hasFruit rdf:parseType="Collection"&amp;gt;&lt;br /&gt;      &amp;lt;rdf:Description rdf:about="ex:banana"/&amp;gt;&lt;br /&gt;      &amp;lt;rdf:Description rdf:about="ex:apple"/&amp;gt;&lt;br /&gt;      &amp;lt;rdf:Description rdf:about="ex:pear"/&amp;gt;&lt;br /&gt;    &amp;lt;/ex:hasFruit&amp;gt;&lt;br /&gt;  &amp;lt;/rdf:Description&amp;gt;&lt;/pre&gt;Then this is translated to:&lt;pre&gt;&amp;lt;ex:basket&amp;gt; &amp;lt;ex:hasFruit&amp;gt; _:l1 .&lt;br /&gt;_:l1 &amp;lt;rdf:first&amp;gt; &amp;lt;ex:banana&amp;gt; .&lt;br /&gt;_:l1 &amp;lt;rdf:rest&amp;gt; _:l2 .&lt;br /&gt;_:l2 &amp;lt;rdf:first&amp;gt; &amp;lt;ex:apple&amp;gt; .&lt;br /&gt;_:l2 &amp;lt;rdf:rest&amp;gt; _:l3 .&lt;br /&gt;_:l3 &amp;lt;rdf:first&amp;gt; &amp;lt;ex:pear&amp;gt; .&lt;br /&gt;_:l3 &amp;lt;rdf:rest&amp;gt; &amp;lt;rdf:nil&amp;gt; .&lt;/pre&gt;Now is this list open or closed? This is an important question for OWL, since collections are used to construct sets such as intersections.&lt;br /&gt;&lt;br /&gt;If it's open, then I could add in another piece of fruit...&lt;pre&gt;&amp;lt;ex:basket&amp;gt; &amp;lt;ex:hasFruit&amp;gt; _:l0 .&lt;br /&gt;_:l0 &amp;lt;rdf:first&amp;gt; &amp;lt;ex:orange&amp;gt; .&lt;br /&gt;_:l0 &amp;lt;rdf:rest&amp;gt; _:l1 .&lt;/pre&gt;This would work, but it implies that I can infer that every element of the list can be directly connected to the basket.  i.e.&lt;pre&gt;&amp;lt;ex:basket&amp;gt; &amp;lt;ex:hasFruit&amp;gt; _:l0 .&lt;br /&gt;&amp;lt;ex:basket&amp;gt; &amp;lt;ex:hasFruit&amp;gt; _:l1 .&lt;br /&gt;&amp;lt;ex:basket&amp;gt; &amp;lt;ex:hasFruit&amp;gt; _:l2 .&lt;br /&gt;&amp;lt;ex:basket&amp;gt; &amp;lt;ex:hasFruit&amp;gt; _:l3 .&lt;/pre&gt;Now this makes sense to me, but I don't recall seeing it anywhere in RDF. 
For instance, it's not in the &lt;a href="http://www.w3.org/TR/rdf-mt/"&gt;semantics document&lt;/a&gt; for RDF or RDFS. The &lt;a href="http://www.w3.org/TR/rdf-mt/#collections"&gt;section on Collections&lt;/a&gt; does say that RDF does not require any well-formedness on the structure of the list (indeed, branched structures are explicitly mentioned), but since only OWL-Full allows arbitrary RDF structures, it isn't generally applicable to what I'm interested in.&lt;br /&gt;&lt;br /&gt;I'd come to this question while I was checking that an owl:intersectionOf with "complete" modality was necessary and sufficient. I presumed that it was, but it doesn't hurt to check. After all, I've been caught out in the open world before. :-)&lt;br /&gt;&lt;br /&gt;I first went to the abstract syntax for &lt;a href="http://www.w3.org/TR/owl-semantics/syntax.html#owl_Class_syntax"&gt;class axioms&lt;/a&gt; to find out how "partial" modalities were encoded, vs. "complete". The &lt;a href="http://www.w3.org/TR/owl-semantics/mapping.html#owl_equivalentClass_mapping"&gt;triples encoding of the abstract syntax&lt;/a&gt; shows that "partial" is simply a list of rdfs:subClassOf statements for each element in the intersection, while "complete" uses an RDF collection. 
Actually, the expression "SEQ" is used, but sequences are then described as being of type rdf:List, and not rdf:Seq (which, incidentally, &lt;em&gt;are&lt;/em&gt; extensible, but no OWL aficionado will have anything to do with them, so I knew &lt;em&gt;that&lt;/em&gt; wasn't a possibility).&lt;br /&gt;&lt;br /&gt;Now to make sure that "complete" really &lt;em&gt;is&lt;/em&gt; complete, I needed to ensure that lists couldn't be extended.&lt;br /&gt;&lt;br /&gt;There &lt;em&gt;is&lt;/em&gt; a hint that lists can't be extended in OWL-DL in the &lt;a href="http://www.w3.org/TR/2004/REC-owl-guide-20040210/#differentFrom"&gt;OWL Guide&lt;/a&gt;:&lt;br /&gt;&lt;em&gt;"If we wanted to add a new winery in some other ontology and assert that it was disjoint from all of those that have already been defined, we would need to cut and paste the original owl:AllDifferent assertion and add the new maker to the list. There is not a simpler way to extend an owl:AllDifferent collection in OWL DL. In OWL Full, using RDF triples and the rdf:List constructs, other approaches are possible."&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;That raises the intriguing possibility that in OWL-Full an intersection can never be complete. But since OWL-Full is undecidable anyway, I guess that's not something I need to worry about.&lt;br /&gt;&lt;br /&gt;That then brought me back to the description for &lt;a href="http://www.w3.org/TR/2004/REC-owl-guide-20040210/#SetOperators"&gt;Set Operators&lt;/a&gt; which I haven't read in a while. And in reading this I realized that I was a moron for forgetting it...&lt;br /&gt;&lt;em&gt;The members of the class are completely specified by the set operation.&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;The text then goes on to describe that an individual that is a member of each element of an intersection is then a member of the intersection. In other words, membership in each element is a &lt;em&gt;necessary and sufficient&lt;/em&gt; condition for membership in the intersection. 
Had lists been open, then membership would have merely been necessary, but not sufficient, since there could be another class in the intersection that has not been asserted (yet).&lt;br /&gt;&lt;br /&gt;So &lt;em&gt;complete&lt;/em&gt; is indeed "necessary and sufficient". But if I'd just looked at the Guide in the first place I could have saved myself a bit of time. Sometimes I feel like an idiot... and then I go and compound it by writing about my stupidity on my blog.&lt;br /&gt;&lt;br /&gt;Oh well, this SPARQL implementation won't write itself. I'm down to OPTIONAL - which I expect to take about an hour, and the algebra integration. I'd better make that transformation clean, as I expect to be doing it again soon for the &lt;a href="http://www.openrdf.org/doc/sesame2/api/org/openrdf/query/algebra/package-summary.html"&gt;Sesame algebra&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Somehow I also need to find some time to finish writing that paper about 2 column RDF indexes. Did I mention that I think they're a cool idea?  
:-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-8793337319364749737?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/8793337319364749737/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=8793337319364749737' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/8793337319364749737'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/8793337319364749737'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/04/collections-so-im-trying-to-work-out.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-1419147329725354127</id><published>2008-03-18T20:59:00.003-05:00</published><updated>2008-03-18T21:48:51.176-05:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Functions&lt;/h3&gt; Whew! I've finally finished filter functions.&lt;br /&gt;&lt;br /&gt;I was just about done when I had two issues show up for me. First off, I realized that each parameter of &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#funcex-regex"&gt;regex&lt;/a&gt; takes an expression that resolves to a simple literal. In other words, it is possible to calculate a different pattern and/or flag for every line &lt;em&gt;&amp;lt;shudder/&amp;gt;&lt;/em&gt;. OK, so I wouldn't do it, but the spec says it, so I did it. Not that it was hard. 
It just seems obtuse.&lt;br /&gt;&lt;br /&gt;While I'm on it, the &lt;a href="http://www.w3.org/TR/xpath-functions/#regex-syntax"&gt;flags&lt;/a&gt; for &lt;em&gt;regex&lt;/em&gt; don't quite match the flags in Java. Granted, they're ALMOST the same, but if I want to be a stickler about these things, then it's not quite there. The most apparent difference is that the "x" flag is not the same as enabling the &lt;a href="http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#COMMENTS"&gt;COMMENTS&lt;/a&gt; flag in Java - though it's similar. In fact, in Java 5, the COMMENTS flag does not even appear as an option in the Javadoc, though a quick scan of the library source shows that it is supported.&lt;br /&gt;&lt;br /&gt;Once I found small differences (which frankly I expected to find) I decided not to look for any more. The point is that I am &lt;strong&gt;not&lt;/strong&gt; going to implement my own regex engine. Sure, it would be a great learning experience (I know that suffix trees get me part of the way - but I'd have to learn some more to get all of it), but it would take me months, and for no useful purpose. I'm surprised they didn't just choose a standard engine and say "use a standards-compliant regex engine, like &lt;em&gt;XXX&lt;/em&gt;". As it is, it looks like everyone will be &lt;em&gt;nearly&lt;/em&gt; there, but never quite make it.&lt;br /&gt;&lt;br /&gt;The next problem was that I hadn't looked carefully enough at the definition of &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#func-RDFterm-equal"&gt;&lt;code&gt;equal&lt;/code&gt;&lt;/a&gt;. I was mostly right, but it turns out that if you compare two literals that are different, then you don't return false: you throw a &lt;em&gt;type&lt;/em&gt; exception. That just feels broken. Yes, I understand the semantics, but it's a perfectly common thing to do to check that two literals are the same. 
Having unexpected data throw an exception from a perfectly formed query might make the type theoreticians happy, but from the perspective of a software developer it looks like bad judgement.&lt;br /&gt;&lt;br /&gt;Ironically, you &lt;em&gt;CAN&lt;/em&gt; choose to return true for two different literals if you have a specific extension that handles direct comparisons between their types. For instance, you can check if "5"^^xsd:integer is equal to "5"^^xsd:long. Or perhaps you want to compare "5"^^temp:celsius and "41"^^temp:fahrenheit. If you want to get the same lexical form, then you use the &lt;code&gt;sameterm()&lt;/code&gt; function, so that case is covered. But what if you want to compare two literals to have the same semantic value, and simply return &lt;code&gt;false&lt;/code&gt; if they don't? Maybe I need to re-read this spec, because it doesn't work for me. Still, I've implemented it as asked, even if it was more annoying to do so.&lt;br /&gt;&lt;br /&gt;So now I have a lot of unit tests to write. Yes, I know the &lt;acronym title="Test Driven Development"&gt;TDD&lt;/acronym&gt; purists will be out to get me, but the exact implementation and interfaces were still floating a little when I started, and besides, it &lt;em&gt;is&lt;/em&gt; faster to write code with the tests written after. This is mostly because you don't have to change the tests if you realize you need to change the interfaces. And time is something I'm working hard against at the moment.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Filter&lt;/h3&gt; Andrae had a go at me for looking to make filters annotations on the constraints in the &lt;acronym title="Abstract Syntax Tree"&gt;AST&lt;/acronym&gt; for the query. 
I didn't see a problem with this (and there is no operational difference) until Andrae pointed out that it would have a big impact on the optimizer and query re-writer, since each node can have more than one type: a filtered version and an unfiltered version.&lt;br /&gt;&lt;br /&gt;He was suggesting that I use the conjunction code to apply filters (and the concrete syntax of SPARQL &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#GroupPatterns"&gt;almost seems to imply&lt;/a&gt; that FILTER is added in as a conjunction - though this might just be to allow alternative syntaxes) but I pointed out that this will get awkward as the &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#func-bound"&gt;&lt;code&gt;BOUND()&lt;/code&gt;&lt;/a&gt; function requires that variables not be guaranteed to be pre-bound. This led to a discussion of the use of &lt;code&gt;BOUND()&lt;/code&gt;, and I was able to show that it is often used in conjunction with &lt;code&gt;NOT&lt;/code&gt; and &lt;code&gt;OPTIONAL&lt;/code&gt; to emulate &lt;code&gt;subtraction&lt;/code&gt; functionality. When he saw what I meant, he was quite congratulatory of SPARQL for taking a &lt;em&gt;log(n)&lt;/em&gt; operation and making it linear in &lt;em&gt;n&lt;/em&gt;.&lt;br /&gt;&lt;em&gt;(For any non-Australians reading this.... yes, that was sarcasm)&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;At least this conversation made me realize that filtering the output of each &lt;code&gt;Tuple&lt;/code&gt; would be a mistake (good thing I haven't written this yet). Instead I'll be implementing FILTER in the AST as a new constraint element that wraps another constraint (this makes it easy for the optimizer and transformer to ignore) and creating a new operation akin to &lt;code&gt;MINUS&lt;/code&gt; that will do the work. Currently &lt;code&gt;MINUS&lt;/code&gt; removes elements on the left that match (via variable bindings) elements on the right. 
The new code will remove them based on failing the &lt;code&gt;FILTER&lt;/code&gt; test.  Simple.  :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-1419147329725354127?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/1419147329725354127/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=1419147329725354127' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/1419147329725354127'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/1419147329725354127'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/03/functions-whew-ive-finally-finished.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-2562630207413132643</id><published>2008-03-15T12:05:00.004-05:00</published><updated>2008-03-17T14:44:00.970-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='types'/><category scheme='http://www.blogger.com/atom/ns#' term='SPARQL'/><category scheme='http://www.blogger.com/atom/ns#' term='filter'/><title type='text'></title><content type='html'>&lt;h3&gt;Writing&lt;/h3&gt; I've been trying to sit down and write for over a week, but each time I try I end up writing code instead. I've even fallen behind reading &lt;a href="http://slashdot.org/"&gt;Slashdot&lt;/a&gt;. 
I've been getting a lot of messages from people wanting to know what happened last week, what our plans are for Mulgara, etc, but I just haven't been able to respond. That's what happens when a developer tries to work in the real world. I can handle the real world, and I can handle code, but not at the same time. :-(&lt;br /&gt;&lt;br /&gt;For the moment, I have priorities with work that I have to see to, so I'll be concentrating on technical things for a while. However, there &lt;em&gt;are&lt;/em&gt; a few things happening with Mulgara, so I'll try to mention them as I go. In the meantime, I'm working on SPARQL queries.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;SPARQL&lt;/h3&gt; The two main features that we're missing now are &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#optionals"&gt;&lt;code&gt;OPTIONAL&lt;/code&gt;&lt;/a&gt; and &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#termConstraint"&gt;&lt;code&gt;FILTER&lt;/code&gt;&lt;/a&gt;. Looking at &lt;code&gt;OPTIONAL&lt;/code&gt; some time ago I realized that it's a hybrid between &lt;code&gt;ConstraintConjunction&lt;/code&gt; (the inner join aspect), and &lt;code&gt;ConstraintDisjunction&lt;/code&gt; (matches on the left side leaving unbound columns). I worked on something similar when I did &lt;code&gt;ConstraintDifference&lt;/code&gt; a few years ago, so I know that this is easy. Hence, I put this part off until last.&lt;br /&gt;&lt;br /&gt;In the last week or so (in between the meeting in San Francisco, and getting a nasty virus) I've been on filters. Right now I'm down to some classes to represent the operator definitions for all the functions like &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#func-bound"&gt;&lt;code&gt;bound()&lt;/code&gt;&lt;/a&gt;, &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#func-isIRI"&gt;&lt;code&gt;isIRI()&lt;/code&gt;&lt;/a&gt; and &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#funcex-regex"&gt;&lt;code&gt;regex()&lt;/code&gt;&lt;/a&gt;. 
I already have the functionality implemented, but you still need to represent it in an abstract syntax if you're going to construct expressions at query time. So it's all just some boilerplate code to represent the parameters and pass the context on down to any variables that need resolving. After that, I'm on to the unit tests. In an ideal world, I'd test &lt;em&gt;everything&lt;/em&gt; but in reality I have less time than that. Many of the functions are so similar that I'll just be testing a good sample of each of them.&lt;br /&gt;&lt;br /&gt;Looking at the list in the SPARQL definition, you might think that there aren't too many functions at all, but you would be wrong. To a first approximation, many of the functions have to be reimplemented for each type of parameter. I've even gone to the effort of making sure that working on an &lt;code&gt;&amp;lt;xsd:int&amp;gt;&lt;/code&gt; returns an &lt;code&gt;&amp;lt;xsd:int&amp;gt;&lt;/code&gt; (when appropriate), and that an &lt;code&gt;&amp;lt;xsd:short&amp;gt;&lt;/code&gt; returns an &lt;code&gt;&amp;lt;xsd:short&amp;gt;&lt;/code&gt;. Since I was already trying to keep floating point numbers and integers apart, this seemed to be a natural extension. Then I have to consider the types of numbers typed into the SPARQL query, literal numbers typed into the query, and variables that get bound to numbers during processing. This raises the complexity considerably.&lt;br /&gt;&lt;br /&gt;My first attempt had me doing largish methods that have copious "&lt;code&gt;if (value instanceof ...)&lt;/code&gt;" statements in them. This is clunky and brittle. The moment I went to do it a second time, I decided to throw it out, and do it all with maps to functors (where are &lt;a href="http://blogs.sun.com/jag/date/20080131"&gt;closures&lt;/a&gt;?!?). This actually worked well, and has the advantage of giving short and simple functions, and consistent patterns to follow in implementations. 
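&lt;br /&gt;&lt;br /&gt;As a rough sketch of the map-to-functors idea, here it is in Python for brevity rather than the Java used in Mulgara. Literals are modelled as (datatype, value) pairs, and the promotion table, the &lt;code&gt;CAST&lt;/code&gt; map and the function names are all hypothetical, covering only a few type combinations and ignoring overflow:&lt;pre&gt;&lt;code&gt;
import operator

# Illustrative promotion table: (left type, right type) to result type.
# Only a few pairs are listed; a real table would cover every combination.
RESULT_TYPE = {
    ("xsd:short", "xsd:short"): "xsd:short",
    ("xsd:short", "xsd:int"): "xsd:int",
    ("xsd:int", "xsd:short"): "xsd:int",
    ("xsd:int", "xsd:int"): "xsd:int",
    ("xsd:int", "xsd:double"): "xsd:double",
    ("xsd:double", "xsd:int"): "xsd:double",
    ("xsd:double", "xsd:double"): "xsd:double",
}

# Python type used to hold each datatype's value in this sketch.
CAST = {"xsd:short": int, "xsd:int": int, "xsd:double": float}


def make_functor(op, result_type):
    """Build one small function object for a single type combination."""
    cast = CAST[result_type]

    def functor(x, y):
        return (result_type, cast(op(x, y)))

    return functor


# One functor per (operator, left type, right type), built once up front.
FUNCTORS = {
    (name, left, right): make_functor(op, result)
    for name, op in (("*", operator.mul), ("+", operator.add))
    for (left, right), result in RESULT_TYPE.items()
}


def evaluate(name, left, right):
    """Apply an operator to two (datatype, value) literals via the map."""
    (left_type, x), (right_type, y) = left, right
    return FUNCTORS[(name, left_type, right_type)](x, y)


print(evaluate("*", ("xsd:int", 6), ("xsd:short", 7)))
print(evaluate("*", ("xsd:int", 2), ("xsd:double", 1.5)))
&lt;/code&gt;&lt;/pre&gt;The lookup replaces the chain of type tests with a single map access, and an unsupported combination shows up as a missing key rather than a silently wrong branch.&lt;br /&gt;&lt;br /&gt;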
I'd have liked to use generics a little more, but they are really suited for interpreting code you are writing, rather than code that is being structured from a parser. Consequently, in one class I ended up writing a little &lt;a href="http://www.ruby-lang.org/en/"&gt;Ruby&lt;/a&gt; script to write the series of functor classes I needed for arithmetic operations! Scary, I know, but it works quite well.  It was either that or a series of if/then/else blocks taking me down dark passages I never want to enter.&lt;br /&gt;&lt;br /&gt;The frustrating thing is that via autoboxing, you can write the same arithmetic over and over again, and have it do different things. For instance, the expression:&lt;pre&gt;&lt;code&gt;x * y&lt;/code&gt;&lt;/pre&gt; can result in totally different return types depending on whether x and y are Doubles, Floats, Integers, etc. This is common when programming in Java using the native types (like &lt;code&gt;double&lt;/code&gt; and &lt;code&gt;int&lt;/code&gt;) but this must be established at compile time, not when processing a query. That means you want to have access to every combination of parameters at run time. This can be done with autoboxing, and defining classes with interfaces that return &lt;code&gt;java.lang.Number&lt;/code&gt;s. Then the code &lt;code&gt;x*y&lt;/code&gt; can be written over and over, and it means something different each time. &lt;a href="http://java.sun.com/j2se/1.5.0/docs/guide/language/generics.html"&gt;Java generics&lt;/a&gt; are nice, but they are a long way short of C++ templates, a fact especially obvious when you want to use them on native types (along with a hundred other reasons). But Generics + &lt;a href="http://java.sun.com/j2se/1.5.0/docs/guide/language/autoboxing.html"&gt;Autoboxing&lt;/a&gt; can sometimes get you some of the way.&lt;br /&gt;&lt;br /&gt;OK, so that gave me access to each combination of parameters, but surely there's a better way to do it dynamically? Well, not in Java. 
The only approaches I've seen in the past either use heuristics to work out which version of arithmetic to run, or else promote everything into a standard type (like Double). The latter has arithmetic problems, and gives an inappropriate type for the result.  The former can just be complex to read, write, and verify.&lt;br /&gt;&lt;br /&gt;The problem comes back to the CPU having different instructions for the different forms of arithmetic. A compiler has no problems selecting which one to use, but that is because it has access to the entire library of instructions. Conversely, a parser is not expected to have access to all instructions, leading to the problems I'm talking about. So you either choose a subset of instructions to work with (ie. upcast everything), or else you provide all instructions in a library, and then map the parameters into the correct instruction - either with the heuristic tree or something like a hash map.&lt;br /&gt;&lt;br /&gt;Dynamic languages have a much easier time of it. For a start, they usually have all instructions at their disposal in the interpreter. Many (though not all) of them also simplify their numeric types to only a couple of types. Whatever they use, the poor programming schmuck writing his own interpreter (that would be me) need only write &lt;code&gt;x*y&lt;/code&gt; and let the dynamic language developer work out what he wanted. At the very least, we can emit it in a string and do an &lt;code&gt;eval()&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;Oh well, I shouldn't complain. I have all the functions written out (via Ruby) and a hash map that lets me get what I need trivially. With the exception that there is a lot of machine generated code that looks like the same thing over and over, the whole system comes down to just a few lines of easily verifiable code - which is what I like to see. 
Following the code path you'll see that any kind of operation just goes through a few steps and it's done.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-2562630207413132643?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/2562630207413132643/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=2562630207413132643' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2562630207413132643'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2562630207413132643'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/03/writing-ive-been-trying-to-sit-down-and.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-903178916431797225</id><published>2008-03-09T12:30:00.003-05:00</published><updated>2008-03-10T10:32:03.104-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='job'/><category scheme='http://www.blogger.com/atom/ns#' term='Google'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><title type='text'></title><content type='html'>&lt;h3&gt;Review&lt;/h3&gt; You know you've been lax keeping up with your blog when your mother comments that you haven't updated it in a while.&lt;br /&gt;&lt;br /&gt;Part of the reason for my silence has been due to a lot of changes going on for me lately, some of which I was obliged to keep quiet 
about at the time. More recently, I've been working hard on Mulgara, and when it's come to a choice between coding or blogging, then coding had a higher imperative. But today I find myself in &lt;a href="http://www.flysfo.com/"&gt;SFO&lt;/a&gt; feeling too wrung out to code, so it seems like a good opportunity to play some catch up on my blog.&lt;br /&gt;&lt;h3&gt;Talis&lt;/h3&gt; Way back in the middle of 2007 I was contacted by &lt;a href="http://www.talis.com/"&gt;Talis&lt;/a&gt; who were wondering if I would be interested in working with them on semantic web systems, and possibly on Mulgara. My job at the time (with &lt;a href="http://www.herzumsoftware.com/"&gt;Herzum Software&lt;/a&gt; and the spin-off &lt;a href="http://www.fourthcodex.com/"&gt;fourthcodex&lt;/a&gt;) was supposed to be based on Semantic Web technology, with a sizable proportion devoted to &lt;a href="http://mulgara.org/"&gt;Mulgara&lt;/a&gt;. However, this had not happened for the 2 years I had been there, and so I was willing to consider this proposal. Also, I was getting great enjoyment and occasional inspiration from Paul Miller's &lt;a href="http://talk.talis.com/"&gt;Talking with Talis&lt;/a&gt; interviews (and even gaining an interest in libraries, courtesy of Richard Wallis's productions). I'd also met &lt;a href="http://iandavis.com/blog/about"&gt;Ian Davis&lt;/a&gt; at &lt;a href="http://www.semantic-conference.com/"&gt;SemTech&lt;/a&gt; earlier in the year, and had noted with interest that &lt;a href="http://dannyayers.com/"&gt;Danny Ayers&lt;/a&gt; had recently made the move as well.&lt;br /&gt;&lt;br /&gt;So in August I took a few days from work and flew to England for an interview. I was really impressed with the guys in Birmingham, both technically and personally, and had a great time. While my understanding of the details has changed at various times, it seems that Talis have an approach of investing in Semantic Web technology without a requirement of immediate return. 
They are also providing support to a growing Semantic Web community with the expectation that this will lead to a data infrastructure on which they can layer semantic applications at a higher level than is possible today. To me this seems to be both very forward thinking, as well as operating for the mutual benefit of themselves and the community at large. As an Australian I also found that the similarities in culture with the British gave me a level of comfort beyond what I usually have here in America.&lt;br /&gt;&lt;br /&gt;Whether I would be working in semantics, or in the storage layer to enable semantic work by others, this really seemed like a place I'd enjoy working. However, the position would be telecommuting, and I need a visa sponsor while I live here in the USA. Talis were aware of this, and though they said they were in the process of setting up a legal entity over here, the delays this brought about have led to events overtaking this opportunity.&lt;br /&gt;&lt;br /&gt;That said, I'm still trying to keep channels open with everyone there, and I'm hoping that I'll be able to work with them in the future, in whatever capacity that may be.&lt;br /&gt;&lt;h3&gt;Google&lt;/h3&gt; Shortly before the trip to England, I found myself thinking of distributing immutable tree nodes (from Mulgara's internal storage) over a cluster, with the idea of improving scalability of speed and size for RDF storage. These thoughts led to ideas of leveraging a system like the &lt;a href="http://research.google.com/archive/gfs-sosp2003.pdf"&gt;GFS&lt;/a&gt; or &lt;a href="http://research.google.com/archive/bigtable-osdi06.pdf"&gt;BigTable&lt;/a&gt;. &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; is also interesting in this regard, but not as advanced or scalable as the systems at Google. 
With this in mind, and being particularly frustrated at work, I checked out the &lt;a href="http://www.google.com/intl/en/jobs/index.html"&gt;Google jobs page&lt;/a&gt;, and discovered that they had engineering positions available in Chicago. So I filled in their online forms and sent it off. Disappointingly, the next day I received a form-reply email explaining that I wasn't what they were after.&lt;br /&gt;&lt;br /&gt;A few weeks later I met &lt;a href="http://www.ericjohnolson.com/blog/"&gt;Eric Olson&lt;/a&gt; at &lt;a href="http://techcocktail.com/"&gt;Tech Cocktail&lt;/a&gt;. Eric was still working at Google at the time, and said that he'd mention my name. I have no idea if he did or not, but a couple of weeks later, a Google recruiter in California rang me and asked if I would be available for a phone interview. This was delayed while I went to England, and then delayed further as that recruiter left and another took on my case, but it finally happened in September. It was very strange to do an interview again, when I've conducted so many in the last couple of years. I've also managed to avoid the "normal" interview process for most of the last decade, since I have usually been interviewed or offered positions by people who already knew me, either personally or by reputation.&lt;br /&gt;&lt;br /&gt;All the same, this interview went well, as did the next phone interview. So Google organized tickets for me to fly out to Mountain View and interview on site. I hadn't seriously considered a job with them to this point, but I thought it would be interesting to follow the process through.&lt;br /&gt;&lt;br /&gt;Visiting the Mountain View campus was quite an experience. It is vast, and has been gradually subsuming the surrounding business district in recent years. Getting around is often done by shuttle bus, or bicycle. 
People bring their own bikes, but there are a number of Google bikes parked around the place, with helmets available in large bins in the lobby of each building. Not having been given a building number to go to, I started at the central building, where I was quickly spotted and assisted by a security guard. Indeed, I was very impressed at the rapid and efficient response of on-campus security, especially as they were also very helpful and courteous.&lt;br /&gt;&lt;br /&gt;The receptionist I was directed to was also helpful, showing where I needed to go, arranging a shuttle bus, providing a visitor's badge and directions, and a fruit juice (Google have large fridges full of &lt;a href="http://www.nakedjuice.com/"&gt;Naked&lt;/a&gt; juice in every lobby I saw. They also have more exotic flavors available than I have seen anywhere before or since).&lt;br /&gt;&lt;br /&gt;Passing by the truck that had come to provide cheap haircuts to staff, I proceeded by a central courtyard which had a full sized Tyrannosaurus Rex skeleton (with &lt;a href="http://blogs.sun.com/jag/entry/rip_pink_flamingo_1957_2006"&gt;pink flamingo&lt;/a&gt; in its mouth - several of its cousins were scattered across the lawn) and a large sign proclaiming that there would be a Farmers' Market there at 11am that day.&lt;br /&gt;&lt;br /&gt;One bus trip later, I was where I needed to be, and being given a tour of the building. The variety of free coffee and other beverages was really impressive, as was the local version of Google's famous cafeterias. But the thing that really got me was seeing a projected list of Google's text searches scrolling up the wall. These are not done in real time (they would go by too fast) and have been filtered for inappropriate content (no searches for pornography, for instance), but they still served to drive home exactly where you were. This was ground zero. 
Those searches were resolved &lt;strong&gt;here&lt;/strong&gt;.&lt;br /&gt;&lt;br /&gt;The queries were also interesting to watch go by. There were questions on movies, Britney Spears, medical conditions, landmarks, and many questions in foreign languages, some of which were in foreign character sets, like Simplified Chinese. Watching these going by, it is immediately apparent where ideas like &lt;a href="http://www.google.com/zeitgeist"&gt;Google Zeitgeist&lt;/a&gt; came from.&lt;br /&gt;&lt;br /&gt;I then went on to have my interviews. There were about 4 of them, with a break for lunch which I had with one of the people I'd had a phone interview with. While a few of the questions were more general, most of them were about how I'd solve programming problems, with an emphasis on doing things to a "Google level of scaling". Funnily enough, my last few years of Mulgara work were perfect for this. On a couple of occasions I even found myself describing code I had written, rather than describing an abstract answer. I also got the chance to ask more about how Google works, and what it's like to be there. I was impressed by everyone's enthusiasm for their work, and for the company culture in general. A couple of people I spoke with also had children, and while they admitted that in the past Google had not been very good at supporting people with young children, in recent years this had improved significantly. But the thing that everyone talked about the most was the "perks". These extend into areas you couldn't imagine, and they are constantly evolving. Unlike most companies who occasionally institute a perk for their staff, possibly guided by a suggestion box, Google has a department whose sole mission it is to identify and implement perks.&lt;br /&gt;&lt;br /&gt;Finally the day came to an end, and I was able to head up to San Francisco. 
I had a very enjoyable evening with &lt;a href="http://fotap.org/~osi/"&gt;Peter&lt;/a&gt; and Trish, and the next day spent several hours having Mulgara discussions with Amit and Ronald at &lt;a href="http://www.topazproject.org/"&gt;Topaz&lt;/a&gt;. I was very pleased to get in this last meeting, and had shuffled things around with Google to make sure it could happen.&lt;br /&gt;&lt;br /&gt;As most of my friends know, a few weeks later Google made me an offer. While the base salary was simple enough, I was bemused at the complexity of the arrangements for paying bonuses, stock options, and common stock. It is the first job offer I've ever had that came with a set of equations attached. While not going into details, I &lt;em&gt;will&lt;/em&gt; say that it was very lucrative - if you came close to meeting your goals. I hadn't really considered accepting an offer until this point, but an offer like that would make anyone seriously reconsider. Consequently I agonized over this for a couple of weeks, right up to the deadline that Google set. In the meantime, I visited the Chicago site (where I insisted I would want to work, despite being asked several times if I'd move to Mountain View), and again was impressed with their setup. In fact, I've had a few people suggest that the setup at Mountain View is getting a little out of control in some ways, but this was not an issue for Chicago at all.&lt;br /&gt;&lt;br /&gt;I finally decided to turn Google down, and let them know as soon as I got back from Thanksgiving. I'd had advice from a few people, including some from inside of Google, who all pointed out that my work in the Semantic Web would be totally subsumed by working at Google. 
I had thought to do something with the "&lt;a href="http://googleblog.blogspot.com/2006/05/googles-20-percent-time-in-action.html"&gt;20% projects&lt;/a&gt;" that Google is known for, but it was pointed out that because bonuses are based on meeting (and exceeding) goals, then the option to use 20% of your time on something not related to your immediate work was often forgone. You also have to wonder how much of your bonuses, options, and common stock you'd get to see if you tried to keep a balanced lifestyle and didn't achieve your annual goals (apparently these are supposed to be set at a level that is challenging to achieve).&lt;br /&gt;&lt;br /&gt;Another serious consideration was one I hadn't expected. Despite having signed an &lt;acronym title="Non Disclosure Agreement"&gt;NDA&lt;/acronym&gt;, I learned nothing about Google that isn't already known to the public. Consequently, to an outsider it looked like the company was not doing anything really "interesting". I'm sure they are, but there was nothing inspiring about what they had to tell me. For most of the things I considered to be "cool" technology, I was told that those things were pretty much done, and the work they now do is in different areas altogether. In fact, the majority of the people I spoke to worked in AdWords and Billing. They were very enthusiastic about their work, and given the novelty of their service and the scale they have to work at, then I'm sure it's challenging and interesting work, but it didn't inspire me at all.&lt;br /&gt;&lt;br /&gt;Most of all, I've spent my career working with people who know a lot more than I do, to my enjoyment and benefit, and yet, no one I spent time with really impressed me with their knowledge or skills. Don't get me wrong - they were all quite competent and intelligent people. But I really expect something special out of the people I work with, if they are to bring out the best in me. 
Now I &lt;em&gt;know&lt;/em&gt; that Google has employed some of the brightest people in the industry, but the sheer size of the company convinced me that I'm unlikely to find myself working with those people.&lt;br /&gt;&lt;br /&gt;For those not paying attention, these last few paragraphs are all a means of justifying to myself that I made the right choice. It wasn't an easy choice to make, since Google &lt;em&gt;does&lt;/em&gt; seem like a cool company, the perks were huge, and the remuneration was potentially substantial. But I'm pretty sure I did the right thing, and as one friend said, he thinks it is &lt;em&gt;much&lt;/em&gt; cooler to say that you've turned down a Google offer than to have accepted one.  :-)&lt;br /&gt;&lt;h3&gt;Fedora Commons&lt;/h3&gt; Coming up to Christmas, I was finally getting a chance to do some Mulgara work during office hours. This was a huge thing for me, as I had been getting more and more frustrated about it for the previous two years when I was supposed to be doing this. Then in the final days before Christmas my boss, and several others I worked with in fourthcodex, decided that they wanted to do something different in semantic technologies, and resigned. Without a team to work with, there wasn't a lot of scope for me to do semantic work any more, and I was told to stop working on Mulgara again. &lt;em&gt;Sigh&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;While some semantic options were being pursued, the fact remained that Herzum Software desperately needed some more senior coders, and it looked very much like I would end up on projects that were of little interest to me. A notable one here was a .Net project that would have me working on site in Pittsburgh. 
This was something that nobody wanted, including my family, and everyone I was working with on Mulgara.&lt;br /&gt;&lt;br /&gt;Talis tried to help at this point (and I'm very grateful that they did), but their interim solution would have made it illegal for Anne to keep her &lt;a href="http://www.rumsumsum.com/"&gt;new business&lt;/a&gt; running, and I couldn't do that to her. But then, &lt;a href="http://www.topazproject.org/"&gt;Topaz&lt;/a&gt; and &lt;a href="http://www.fedora-commons.org/"&gt;Fedora Commons&lt;/a&gt; came back to me with an offer to work for them (while distinct organizations, there is an administrative relationship between them, and both are contributing to the &lt;a href="http://www.plos.org/"&gt;Public Library of Science&lt;/a&gt;). I've already written about my decision to accept this, which brings me up to today.&lt;br /&gt;&lt;br /&gt;I've officially been working for Fedora Commons for about a month now. I've been dividing my time between the &lt;a href="http://www.w3.org/TR/rdf-sparql-query/"&gt;SPARQL&lt;/a&gt; implementation and responding to support and debugging requests. However, this week has been different. We got all the developers from Topaz and Fedora Commons together, to discuss our plans for the year, and how to manage the process. Mulgara has also been generating some more external interest again, and since we form the core of the active developers, we wanted to discuss ways in which we can work with the community, particularly developers.&lt;br /&gt;&lt;h3&gt;Features&lt;/h3&gt; The most important features we are implementing in the coming year are SPARQL, multiple concurrent writers, and significantly greater scalability. We have been talking about the last one for a long time, but no one has had the time (or money) to do anything about it. This has now changed, and the work is commencing very soon now. 
It's been a long time in coming, so I'm quite inspired to get it done now.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://netymon.blogspot.com/"&gt;Andrae&lt;/a&gt; was present for the meeting, and presented some very impressive results from his research on transactionality for multiple writers on an &lt;acronym title="Resource Description Framework"&gt;RDF&lt;/acronym&gt; graph. Not only has he demonstrated a mathematically sound foundation for this work, but he has also included an impressive level of engineering for scalability in his designs.&lt;br /&gt;&lt;br /&gt;In the meantime, I have come up with a new scheme for indexing RDF, which appears to have significantly better complexity results than what we currently do. Fortunately, the majority of this work is orthogonal to Andrae's designs, with the consequence that the improvement to scalability will be cumulative between both redesigns. I'm pretty chuffed at this.  :-)  I will be writing more on the indexing shortly, but I have been under some pressure to write this up as an academic paper as well, so that may take priority over my blog.&lt;br /&gt;&lt;br /&gt;Significantly, we had James Leigh from &lt;a href="http://www.aduna-software.com/"&gt;Aduna&lt;/a&gt; at the meeting as well. Aduna are the company behind the &lt;a href="http://www.openrdf.org/"&gt;Sesame RDF store&lt;/a&gt;, which has been one of the big open source alternatives to Mulgara. They are interested in merging our systems to a certain extent, to the benefit of both. After hearing James out, it sounds like a really good idea (though I may end up throwing away the SPARQL parsing that I've finished - &lt;em&gt;sigh&lt;/em&gt; again). I'm not sure when it will happen as everyone has a lot of immediate priorities to get through, but everyone has expressed support for implementing the &lt;a href="http://www.openrdf.org/doc/sesame2/api/org/openrdf/sail/package-summary.html"&gt;SAIL&lt;/a&gt; API on Mulgara. 
This is very significant for us, as it will provide a host of new reasoning features, the ability for existing Sesame users to easily try Mulgara, and a SPARQL protocol interface (I'd just been working on the query language for the moment). In turn, I'm hoping that we can demonstrate these new levels of scalability and concurrency for Sesame.&lt;br /&gt;&lt;br /&gt;A lot more came out of the meeting, but that was the crux of it. Rather than pre-empt some of the things that are still in motion, I'll let others explain their end of things.&lt;br /&gt;&lt;br /&gt;I'm very happy to see this level of interest in Mulgara, and I'm excited to see all these new features starting to be realized at last.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-903178916431797225?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/903178916431797225/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=903178916431797225' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/903178916431797225'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/903178916431797225'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/03/review-you-know-youve-been-lax-keeping.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-3307821216995406546</id><published>2008-01-24T23:41:00.000-06:00</published><updated>2008-01-25T00:13:36.148-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='JRDF'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><title type='text'></title><content type='html'>&lt;h3&gt;More Correspondence&lt;/h3&gt; Going back through my emails with Andy, I realize there's still a lot that might be of interest to include here. Hopefully the parts I choose to copy/paste don't appear too disjoint.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;JRDF&lt;/h3&gt; Andy was confused about the role of &lt;a href="http://jrdf.sourceforge.net/"&gt;JRDF&lt;/a&gt; in Mulgara, as well he might be. He was trying to work with a minimal set of classes, yet he kept needing JRDF even when he wasn't using the JRDF interfaces.&lt;br /&gt;&lt;br /&gt;You can think of JRDF as having 2 faces.  First, it provides definitions for classes that represent RDF nodes - URIResource, Literal, BlankNode.  Second, there's an interface for inserting and querying for RDF statements.  Initially, someone decided to use JRDF as the definition for nodes (I think it was Andrew, and I think he chose to use it since he'd already written this code while he was at home, and it made sense for him to reuse it).  Some time later, Andrew decided that the interfaces for manipulating and querying for statements should also be implemented, since we were already using the JRDF code. (Now this is in my blog, I'm sure Andrew will clarify this point!)&lt;br /&gt;&lt;br /&gt;So internally, yes we use JRDF.  It's mostly for the interfaces and abstract classes associated with &lt;code&gt;URIResource&lt;/code&gt;, &lt;code&gt;Literal&lt;/code&gt;, and &lt;code&gt;BlankNode&lt;/code&gt;.  
There are also interfaces for &lt;code&gt;SubjectNode&lt;/code&gt;, &lt;code&gt;PredicateNode&lt;/code&gt;, and &lt;code&gt;ObjectNode&lt;/code&gt; which are used when putting triples in and out of Mulgara.  At the lower levels, Mulgara is 100% symmetric around all 4 nodes (it used to be 3 nodes, but as most people should know now, we moved it up to 4).  However, when the data gets pushed through these interfaces, this imposes certain type restrictions.  This is why Mulgara won't let you use a literal as a subject, or a blank node as a predicate.&lt;br /&gt;&lt;br /&gt;For this reason, you'll need the JRDF classes, even if you never use the JRDF interfaces.  Yes, I know it's annoying.  One of the many reasons I want to reimplement a lot of Mulgara (another big reason is that I want to use a less restrictive licence - specifically &lt;a href="http://www.opensource.org/licenses/apache2.0.php"&gt;Apache&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Blank Nodes&lt;/h3&gt; Blank nodes can have any label an implementor chooses to use, so long as it meets certain criteria. In Mulgara, they are shown as an underscore, colon, and then a number. Andy was trying to figure out the significance of the numbers shown here, and how they get allocated.&lt;br /&gt;&lt;br /&gt;These numbers are actually the raw graph node identifiers (a 64 bit &lt;em&gt;long&lt;/em&gt;), or &lt;em&gt;gNodes&lt;/em&gt;. All gNodes are allocated from the Node Pool, which is just a Free List.&lt;br /&gt;&lt;br /&gt;... describing free lists..... Oh boy.....&lt;br /&gt;&lt;br /&gt;To start with, any new requests for gNodes just come from an incrementing long.  However, if you ever delete all the statements that use a gNode, then that gNode will be "released", meaning that it's added to the FreeList.  
So now, whenever you ask for a new gNode, the FreeList will try to give you any released gNodes first before it returns the incremented internal long value.&lt;br /&gt;&lt;br /&gt;However, that's a vast simplification.  If you released a gNode in the current transaction, then these will be given back to you first (until exhausted).  Next, it will try to give you any nodes released in old transactions that are not part of a currently "open" result set.  Once all open resources that refer to a set of gNodes have been closed, then the FreeList is able to hand them out.  Finally, it uses the incrementing long.&lt;br /&gt;&lt;br /&gt;All of this reflects the 32 bit thinking that the system started with.  There is little need to re-use gNode values when you have a 64 bit system (if you allocate a gNode every millisecond, then it will take you half a billion years to use them all up, so I think we're safe).  We need to update it, but unfortunately, there are arrays which are indexed by the gNode ID, meaning we can't just increment the long all the time.  With the 32 bit approach this was OK, since the ID values were packed from the bottom.  But if we move to an incrementing number for gNodes (simplifying things greatly - and speeding them up) then we will need a new on-disk structure for this array.&lt;br /&gt;&lt;br /&gt;OK, this isn't describing Mulgara now. It's really my recent musings on making it all faster.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;The Server&lt;/h3&gt; Andy was waiting for the server to start up, and his TQL client appeared to be getting confused with the intermediate startup state. There wasn't a lot said here, but I want to reiterate it anyway.&lt;br /&gt;&lt;br /&gt;IMHO The server is WAY too heavy.  I'm all for the services provided... but I think they need to be provided in an external framework, and let the database be a module that gets loaded by that framework.  The fact that it starts so many services really bothers me.  
I'd fix this, if I had time.&lt;br /&gt;&lt;br /&gt;Mind you, I'm being a bit harsh when I say "fix".  It works.  It's just I believe it needs to be made of smaller parts, which are either independent, or build on one another.  The current server is monolithic.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-3307821216995406546?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/3307821216995406546/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=3307821216995406546' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3307821216995406546'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3307821216995406546'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/01/more-correspondence-going-back-through.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-1475954423439578881</id><published>2008-01-24T22:40:00.000-06:00</published><updated>2008-01-24T23:41:21.028-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='performance'/><category scheme='http://www.blogger.com/atom/ns#' term='memory map'/><category scheme='http://www.blogger.com/atom/ns#' term='RDF'/><title type='text'></title><content type='html'>&lt;h3&gt;Current Work&lt;/h3&gt; Since I can't do much at work during my "two weeks notice", I've been asked to stay at home this week. 
I'm still being paid, and have to be available if I'm needed, but in reality it's just a holiday. With the visa interview next week I'm not as relaxed as I'd like, but it's been a good week. I've enjoyed spending more time with Anne and the boys, along with my mother-in-law, who left here on Tuesday. But after having a few days to clear my head, I'm trying to get back to Mulgara. Unfortunately, my new computer has not arrived yet, so I'm back to my old G4 PowerBook in the meantime. It's fine for use with &lt;a href="http://www.vim.org/"&gt;VIM&lt;/a&gt; and even Safari, but it's choking whenever I try to do real work on it.&lt;br /&gt;&lt;br /&gt;I've spent a couple of days trying to catch up on email, and now I'm looking at getting back to actual coding. I &lt;em&gt;should&lt;/em&gt; be doing SPARQL (and I'm looking forward to that), but I allowed myself to get side-tracked on some performance code.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;String Pools&lt;/h3&gt; Indy offered to do some profiling of a large load, and immediately came back to show me that we spend most of our time reading the string pool. This makes sense, as every triple that gets inserted needs to be "localized" into internal graph nodes (&lt;em&gt;gNodes&lt;/em&gt;). This means searching for the URI or literal, and getting back the associated gNode, or creating one if it doesn't exist.&lt;br /&gt;&lt;br /&gt;There are two ways to index something like a string and map it to something else. You can use a hashmap, or a tree. Hashmaps are much faster with constant complexity, but have a number of problems. They take up a lot of disk space, they can be expensive if they need to expand, and they provide no ordering, making it impossible (well, not impossible, but you have to do tricky things) to get ranges of values.  
Trees don't suffer from any of these problems, but they have logarithmic complexity, and require a lot of seeking around the disk.&lt;br /&gt;&lt;br /&gt;For the moment, the string pool maps URIs and literals to gNodes by storing them in a tree. It's an AVL tree, to reduce write complexity to O(1), though we don't store multiple values per node (unlike the triple indices), meaning the tree is very deep.&lt;br /&gt;&lt;br /&gt;The code has many, many possibilities for optimization. We'll be re-architecting this soon in XA2 (there's going to be a big meeting about it in SF next month), but for the moment, we're working with what we have.&lt;br /&gt;&lt;br /&gt;The first thing that Indy noticed was some code in &lt;code&gt;Block.get(int,ByteBuffer)&lt;/code&gt;. This was iterating its way through copying bytes from one buffer to another. This seems ludicrous, especially when the documentation for &lt;a href="http://java.sun.com/javase/6/docs/api/java/nio/ByteBuffer.html#put(java.nio.ByteBuffer)"&gt;ByteBuffer.put(ByteBuffer)&lt;/a&gt; explicitly notes that it is potentially much more efficient than doing the same thing in an iterative loop. A simple fix to this apparently sped up loads by 30%! I wasn't profiling this, so I can't vouch for it, but Indy seemed certain of the results.&lt;br /&gt;&lt;br /&gt;Initially I had thought that this couldn't have been code from David and myself, but I checked back in old versions of Kowari, and it's there too. All I can think of is that one of us must have been sick, and the other absent. At least it's fixed. I've also noticed a couple of other places where iterative copies seem to be happening. I'd like to fix them, but there may be no point. Instead I'll let the profiler guide me.&lt;br /&gt;&lt;br /&gt;After thinking about it for a while, I started wondering why one buffer was being copied into another in the first place. 
The AVL trees in particular are memory mapped, whenever possible, and memory mapping is explicitly supposed to avoid copying between buffers. Buffer copies may seem cheap compared to disk seeks, but these are regularly traversed indices, so the majority of the work will be done in cached memory.&lt;br /&gt;&lt;br /&gt;A little bit of inspection showed me what was going on. The first part of a URI or string (or any kind of literal) is kept in the AVL tree, while anything that overflows 72 bytes is kept in another file. The code that does comparisons loads the first part of the data into a fresh buffer, and appends the remainder if it exists, before working with it. However, much of the time the first part is all that is needed. When this is the case there is no need to concatenate 2 separate buffers together, meaning that the original (hopefully memory mapped) buffer can be used. I fixed this in one place, but I think it's appearing in other areas as well. I'll have to work through this, but again I shouldn't go down any paths that the profiler doesn't deem necessary.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Dublin Core PURL&lt;/h3&gt; After making my adjustments, I tried running the tests, and was upset to see that 5 had failed. This seemed odd, since I'd worked on such fundamental code that a failure in my implementation should have prevented anything from working at all, rather than stopping only 5 tests. I had to go out this morning, so it bothered me for hours until I could check the reason.&lt;br /&gt;&lt;br /&gt;It turned out that the problem was coming from tests which load RDF from a URL: &lt;a href="http://purl.org/dc/elements/1.1"&gt;http://purl.org/dc/elements/1.1&lt;/a&gt;. &lt;a href="http://purl.org/"&gt;Purl.org&lt;/a&gt; is the home of persistent URLs, so if a document ever changes location, the URL for it does not need to be changed. So using this URL in a test seems appropriate, providing you can assume an internet connection while testing. 
But unexpectedly, the contents of this file changed just last week, which led to the problems.&lt;br /&gt;&lt;br /&gt;Given that this is a document describing a standard for Dublin Core, and given that it has a version associated with it, I am startled to see that the contents of the file changed. Shouldn't the version number increase? While I find it bizarre, at least I found it before people started complaining about it. It will be in the next release.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Moving On&lt;/h3&gt; Now that I've addressed the initial profiling black spots, I can move on to the things I ought to be doing, namely SPARQL (being an engineer I would prefer to be squeezing every last drop of performance out of this thing, but I have to manage priorities). I have to talk to a few people about Mulgara tomorrow, but aside from that, I'll be working in the JavaCC file most of the day.... I hope.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-1475954423439578881?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/1475954423439578881/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=1475954423439578881' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/1475954423439578881'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/1475954423439578881'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/01/current-work-since-i-cant-do-much-at.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-8515229346979139052</id><published>2008-01-13T17:39:00.000-06:00</published><updated>2008-01-14T00:13:17.296-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pool'/><category scheme='http://www.blogger.com/atom/ns#' term='AVL'/><category scheme='http://www.blogger.com/atom/ns#' term='JavaCC'/><category scheme='http://www.blogger.com/atom/ns#' term='CST'/><category scheme='http://www.blogger.com/atom/ns#' term='AST'/><category scheme='http://www.blogger.com/atom/ns#' term='SPARQL'/><category scheme='http://www.blogger.com/atom/ns#' term='architecture'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><title type='text'></title><content type='html'>&lt;h3&gt;Mulgara Correspondence&lt;/h3&gt; Recently I've been in a few email discussions with &lt;a href="http://seaborne.blogspot.com/"&gt;Andy Seaborne&lt;/a&gt; about the architecture of Mulgara. He's been looking at a new &lt;a href="http://seaborne.blogspot.com/2008/01/jena-mulgara-example-of-implementing.html"&gt;Jena-Mulgara bridge&lt;/a&gt;, but when he's had the time it appears he's been looking into how Mulgara works. There are certainly areas where Mulgara could be a &lt;em&gt;lot&lt;/em&gt; better (distressingly so), so we will be changing a number of things in the not-too-distant future. But in the meantime I'm more than happy to explain how things currently work. It's been a worthwhile exchange, as Andy knows what he's on about, so he's given me some good ideas. It's also nice to talk about some of the issues with indexing with someone who understands the needs, and can see the trade-offs.&lt;br /&gt;&lt;br /&gt;Since I wrote so much detail in some of these emails, I asked Andy (just before he suggested it himself) if he'd mind me posting some of the exchange up here. 
One could argue that if I hadn't been writing to him then I'd have had the time to write here, but the reality is that his questions got me moving whereas the self-motivation required for blogging has failed me of late.&lt;br /&gt;&lt;br /&gt;There will be a lack of context from the emails, but hopefully I'll be able to edit it into submission. I should also issue a warning that what I wrote presumes you have some idea of what RDF is, and that you can look at the &lt;a href="http://mulgara.org/svn/mulgara/trunk/"&gt;Mulgara code&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;AVL Trees&lt;/h3&gt; If you want to keep read/write data on disk, and have it both ordered and efficient to search at the same time, then a &lt;a href="http://www.nist.gov/dads/HTML/tree.html"&gt;tree&lt;/a&gt; is usually the best approach.  There are other things you can do, but they all involve a tradeoff.  Trees are usually considered the best thing to go with.&lt;br /&gt;&lt;br /&gt;Databases usually have &lt;a href="http://www.nist.gov/dads/HTML/btree.html"&gt;B-trees&lt;/a&gt; of some type (there are a few types). These work well, but Mulgara instead opted to go with &lt;a href="http://www.nist.gov/dads/HTML/avltree.html"&gt;AVL trees&lt;/a&gt;, with a sorted list associated with each node. This structure is much more efficient for writing data, but less efficient for deletion. This suits us well, as RDF is often loaded in bulk, and it gets updated regularly, but bulk deletions are less frequent. I mention the complexity of this later on.&lt;br /&gt;&lt;br /&gt;Andy asked about our AVL trees, with comments showing that he was only looking at one of their uses.  
I think that understanding a particular application of this structure is easier if the general structure is understood first.&lt;br /&gt;&lt;br /&gt;AVL trees are used in two places: The triple indexes (indices), and the "StringPool" (which is really an Object pool now).&lt;br /&gt;&lt;br /&gt;The trees themselves don't hold large amounts of data.  Instead each node holds a "payload" which is specific to the thing they are being used to index.  In the case of the "triples" indexes, this payload includes:&lt;ul&gt;&lt;li&gt;The number of triples in the block being referenced.&lt;/li&gt;&lt;li&gt;The smallest triple stored in the block.&lt;/li&gt;&lt;li&gt;The largest triple stored in the block.&lt;/li&gt;&lt;li&gt;The ID of the 8K block where the triples are stored (up to 256 of them).&lt;/li&gt;&lt;/ul&gt;I'm only using the word "triple" because that's what we stored once upon a time (circa 2001).  In reality, we store quads.  On the first pass, the fourth value was a set of security values, but this quickly became a graph ID.  Unfortunately, this happened back when everyone referred to graphs as "models", so you'll see the code uses the name "model" instead of "graph" everywhere.  (I'd like to change this).&lt;br /&gt;&lt;br /&gt;There is also some inefficiency, as we use a lot of 64 bit values, which means that there are a lot of bits set to zero.  There are plans to change the on-disk storage this year to make things much more efficient.  Fortunately, the storage is completely modular, so all we need to do to use a new storage mechanism is to enter the factory classes into an XML configuration file.&lt;br /&gt;&lt;br /&gt;The code in &lt;code&gt;org.mulgara.store.statement.xa.XAStatementStoreImpl&lt;/code&gt;, shows that there are 6 indices.  These are ordered according to "columns" 0, 1, 2, and 3, with the following patterns:  0123, 1203, 2013, 3012, 3120, 3201.  The numbers here are just a mapping of:  Subject=0, Predicate=1, Object=2, Model=3.  
Of course, using this set of indices lets you find the result of any "triple pattern" (in SPARQL parlance) as a range inside the index, with the bounds of the range being found with a pair of binary searches.&lt;br /&gt;&lt;br /&gt;We use AVL trees because they are faster for writing than B-Trees.  This is because they have an O(1) complexity for write operations when doing insertions.  They can have O(log(n)) complexity while deleting, but since RDF is supposed to be about asserting data rather than removing it, then the extra cost is usually OK.  :-)&lt;br /&gt;&lt;br /&gt;The other important thing to know about Mulgara AVL trees is that they are stored in &lt;em&gt;phases&lt;/em&gt;.  This means we have multiple roots for the trees, with each root representing a &lt;em&gt;phase&lt;/em&gt;.  All phases are read-only, except the most recent.  The moment a phase updates a node, then it does a copy-on-write for that node, and all parents (transitively) up to a node that has already been copied for the current phase, or the root (whichever comes first).  In this way, there can be multiple representations of the data on disk, meaning that old read operations are always valid, no matter what write operations have occurred since then.  Results of a query are therefore referencing phases, the nodes of which can be reclaimed and reused when the result is closed, or garbage collected (we log a warning if the &lt;acronym title="Garbage Collector"&gt;GC&lt;/acronym&gt; cleans up a phase).&lt;br /&gt;&lt;br /&gt;Because all reads and writes are done on phases, the methods inside &lt;code&gt;TripleAVLFile&lt;/code&gt; are of less interest than the methods in the inner class &lt;code&gt;TripleAVLFile.Phase&lt;/code&gt;.  Here you will find the find methods that select a range out of an index, based on one, two, or three fixed values.&lt;br /&gt;&lt;br /&gt;The String Pool also uses an AVL tree (just one), though it has a very different payload.  
However, the whole phase mechanism is still there.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Object Pool&lt;/h3&gt; Andy noted that the comments for the &lt;code&gt;ObjectPool&lt;/code&gt; class say that it is for reducing constructor overhead, but a cursory inspection revealed that it does more.&lt;br /&gt;&lt;br /&gt;There's some complexity to avoid pool contention between multiple threads.  Each pool contains an array of "type pools" (see the inner type called &lt;code&gt;ObjectStack&lt;/code&gt;), indexed by a (manually) enumerated type.  If you want an object of type ID=5, then you go to element 5 in that array, and you get a pool for just that type.  This pool is an &lt;code&gt;ObjectStack&lt;/code&gt;, which is just an array that is managed as a stack.&lt;br /&gt;&lt;br /&gt;Whenever a new &lt;code&gt;ObjectPool&lt;/code&gt; is created it is chained onto a singleton &lt;code&gt;ObjectPool&lt;/code&gt; called the &lt;code&gt;SHARED_POOL&lt;/code&gt;.  To avoid a synchronization bottleneck, each thread uses the pool that it created, but will fall back to using the "next" pool in the chain (almost always the &lt;code&gt;SHARED_POOL&lt;/code&gt;) if it has run out of objects for some reason.  Since this is only a fallback, then there shouldn't be much waiting.&lt;br /&gt;&lt;br /&gt;I know that some people will cringe at the thought of doing object pooling with modern &lt;acronym title="Java Virtual Machine"&gt;JVM&lt;/acronym&gt;s. However, when Mulgara was first written (back when it was called &lt;acronym title="Tucana Knowledge Store"&gt;TKS&lt;/acronym&gt;) this sort of optimization was essential for efficient operation. With more recent JVMs, we have been advised to drop this pooling, but there have been a few reasons to hold back on making this change. First, we have tried to maintain a level of portability into new versions of the JVM (this is not always possible, but we have tried nonetheless), and this change could destroy performance on an older JVM. 
Second, we do some level of caching of objects while pooling them. This means that we don't always have to initialize objects when they are retrieved. Since some of this initialization comes from disk, and we aren't always comfortable relying on the buffer cache having what we need, then this may have an impact. Finally, it would take some work to remove all of the pooling we do, and recent profiles have not indicated that it is a problem for the moment. I'd hate to do all that work only to find that it did nothing for us, or worse, that it slowed things down.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;32/64 Bits and Loading&lt;/h3&gt; Andy was curious about our rate of loading data on 32 bit or 64 bit systems, given simple data of short literals and URIs (&lt;100 characters or so). Unfortunately, I didn't have many answers here.&lt;br /&gt;&lt;br /&gt;In 2004 Kowari could load 250,000,000 triples of the type described in under an hour on a 64 bit Linux box (lots of RAM, and a TB of striped disks).  However, it &lt;em&gt;seems&lt;/em&gt; that something happened in 2005 that slowed this down. I don't know for certain, but I've been really disappointed to see slow loads recently.  However, I don't have a 64 bit Linux box to play with at the moment, so it's hard to compare apples with apples.  After the SPARQL implementation is complete, profiling the loads will be my highest priority.&lt;br /&gt;&lt;br /&gt;64 bit systems (excluding Apples, since they don't have a 64 bit JVM) operate differently to 32 bit systems.  For a start, they memory map all their files (using an array of maps, since no single map can be larger than 2GB in Java).  Also, I &lt;em&gt;think&lt;/em&gt; that the "&lt;code&gt;long&lt;/code&gt;" native type is really an architecturally  native 64 bit value, instead of 2x32 bit values like they have to be on a 32 bit system.  
Since we do &lt;strong&gt;everything&lt;/strong&gt; with 64 bit numbers, then this helps a lot.&lt;br /&gt;&lt;br /&gt;After writing this, Inderbir was able to run a quick profile on a load of this type. He immediately found some heavily used code in &lt;code&gt;org.mulgara.store.xa.Block&lt;/code&gt; where someone was doing a copy from one &lt;code&gt;ByteBuffer&lt;/code&gt; to another by iterating over characters. I cannot imagine who would have done this, since only DavidM and I have had a need to ever be in there, and we certainly would not have done this. I also note that the operation involves copying the contents of &lt;code&gt;ByteBuffer&lt;/code&gt;s, but this doesn't make a lot of sense either, since the class was built as an abstraction to avoid exactly that (whenever possible).&lt;br /&gt;&lt;br /&gt;I haven't seen the profile, but Inderbir said that a block copy gave him an immediate improvement of about 30%. I'd also like to check the stack trace to confirm if a block copy is really needed here anyway. Thinking about it, it might be needed for copying one part of a memory-mapped file to another, but it should be avoided for files that are being accessed with read/write operations.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Embedded Servers&lt;/h3&gt; Andy was also intrigued at the (sparse) documentation for Embedded Mulgara Servers. While on this track, he pointed out that &lt;code&gt;LocalSession&lt;/code&gt; has "DO NOT USE" written across it. I've seen this comment, but don't know why it's there. I should look into what &lt;code&gt;LocalSession&lt;/code&gt; was supposed to do. In the meantime, I recommended not worrying about it - I don't.&lt;br /&gt;&lt;br /&gt;The &lt;code&gt;Session&lt;/code&gt; implementation needed for local (non-client-server) access is &lt;code&gt;org.mulgara.resolver.DatabaseSession&lt;/code&gt;. 
It should be fine, as this is what the server is using.&lt;br /&gt;&lt;br /&gt;When doing &lt;acronym title="Remote Method Invocation"&gt;RMI&lt;/acronym&gt;, you use &lt;code&gt;RemoteSessionWrapperSession&lt;/code&gt;. I didn't name these things, but the standard here is that the part before "&lt;em&gt;Wrapper&lt;/em&gt;" is the interface being wrapped, and the thing after "&lt;em&gt;Wrapper&lt;/em&gt;" is the interface that is being presented. So &lt;code&gt;RemoteSessionWrapperSession&lt;/code&gt; means that it's a &lt;code&gt;Session&lt;/code&gt; that is a wrapper around a &lt;code&gt;RemoteSession&lt;/code&gt;.  The idea is to make the &lt;code&gt;Session&lt;/code&gt; look completely local.  The reason for wrapping is to pick up the &lt;code&gt;RemoteException&lt;/code&gt;s needed for RMI and convert them into local exceptions. At the server end, you're presenting a &lt;code&gt;SessionWrapperRemoteSession&lt;/code&gt; to RMI. This is wrapping a &lt;code&gt;Session&lt;/code&gt; to look like a &lt;code&gt;RemoteSession&lt;/code&gt; (meaning that all the methods declare that they throw &lt;code&gt;RemoteException&lt;/code&gt;). Obviously, from the server's perspective, the &lt;code&gt;Session&lt;/code&gt; being wrapped here must be local.  And the session that is local for the server is &lt;code&gt;DatabaseSession&lt;/code&gt;.  So to "embed" a database in your code, you use a &lt;code&gt;DatabaseSession&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;The way to get one of these is to create an &lt;code&gt;org.mulgara.resolver.Database&lt;/code&gt;, and call &lt;code&gt;Database.newSession()&lt;/code&gt;. Databases need a lot of parameters, but most of them are just configuration parameters that are handled automatically by &lt;code&gt;org.mulgara.resolver.DatabaseFactory&lt;/code&gt;.  
Look in this factory for the method:&lt;pre&gt;&lt;code&gt; public static Database newDatabase(URI uri, File directory, MulgaraConfig config);&lt;/code&gt;&lt;/pre&gt;A &lt;code&gt;MulgaraConfig&lt;/code&gt; is created using the URL of an XML configuration file.  By default, we use the one found in &lt;em&gt;conf/mulgara-x-config.xml&lt;/em&gt;, which is loaded into the jar:&lt;pre&gt;&lt;code&gt; URL configUrl = ClassLoader.getSystemResource("conf/mulgara-x-config.xml");&lt;br /&gt; MulgaraConfig config = MulgaraConfig.unmarshal(new InputStreamReader(configUrl.openStream()));&lt;br /&gt; config.validate();&lt;/code&gt;&lt;/pre&gt;(configUrl has a default of:  &lt;em&gt;jar:file:/path/to/jar/file/mulgara-1.1.1.jar!/conf/mulgara-x-config.xml&lt;/em&gt;)&lt;br /&gt;&lt;br /&gt;As an aside, it's supposed to be possible to do all of this by creating an &lt;code&gt;EmbeddedMulgaraServer&lt;/code&gt; with a &lt;code&gt;ServerMBean&lt;/code&gt; parameter that isn't doing RMI.  Unfortunately, there are no such &lt;code&gt;ServerMBeans&lt;/code&gt; available.  (Maybe I should write one?)&lt;br /&gt;&lt;br /&gt;Also, I believe that the purpose of the embedded-dist Ant target is to create a Jar that has these classes along with all the supporting code, but without anything related to RMI.  So the embedded Jar should be all you need for this, but I haven't used it myself, so I'm just making an educated guess.  :-)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;SPARQL&lt;/h3&gt; Since I had been working on this up until mid-December, it is worth noting where I am with it.&lt;br /&gt;&lt;br /&gt;&lt;acronym title="Tucana Query Language"&gt;TQL&lt;/acronym&gt; includes graph names as a part of the query, and graph names have been URLs (not URIs) - meaning that they include information on how to find the server containing the graph.  
Unfortunately, the guys who wrote TQL integrated the session management into the query parsing (if that doesn't make you shake your head in disbelief then you're a more forgiving person than I am).  I've successfully decoupled this, and now return an &lt;acronym title="Abstract Syntax Tree"&gt;AST&lt;/acronym&gt; that a session manager can work with.  This also means that graph names no longer have to describe the location of a server, meaning we can now support arbitrary URIs as graph names.  This now puts the burden on the session manager to find a server, but that's easy enough to set up with configuration, a registry, or scanning the AST for graph names if we want backward compatibility.&lt;br /&gt;&lt;br /&gt;The next part has been parsing SPARQL.  (Something that Andy should be intimately familiar with, given that his name is all over the &lt;a href="http://www.w3.org/TR/rdf-sparql-query/"&gt;documents&lt;/a&gt; I reference).&lt;br /&gt;&lt;br /&gt;With so many people talking about extensions to SPARQL, and after discussing this with a few other people, we decided to go with an &lt;a href="http://en.wikipedia.org/wiki/LALR"&gt;LALR parser&lt;/a&gt;.  This means I've had to write my own lexer/parser definition, instead of going with one of the &lt;a href="http://www.w3.org/2001/sw/DataAccess/rq23/parsers/"&gt;available definitions&lt;/a&gt;, like the &lt;a href="http://www.w3.org/2001/sw/DataAccess/rq23/parsers/sparql.jj"&gt;JavaCC definition&lt;/a&gt; written by Andy.  We do have &lt;a href="http://sablecc.org/"&gt;SableCC&lt;/a&gt; already in Mulgara, but everyone present agrees that this is a BAD LALR parser so I had to use something new.  I chose &lt;a href="http://beaver.sourceforge.net/"&gt;Beaver&lt;/a&gt;/&lt;a href="http://jflex.de/"&gt;JFlex&lt;/a&gt;.  It's going well, but I still have a lot of classes to write for the &lt;acronym title="Concrete Syntax Tree"&gt;CST&lt;/acronym&gt;.  
The time taken to do this has me wondering if everyone is being a little too particular about the flexibility of an LALR solution, and maybe I should just go back to Andy's JavaCC definition. OTOH, I really like Beaver/JFlex and having an independent module that can do SPARQL using this parser may be a good thing.&lt;br /&gt;&lt;br /&gt;Fortunately the SPARQL spec now has a pretty good grammar specification and terminals, though one or two elements seemed redundant, and I've jumped over them (such as &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#rPNAME_LN"&gt;PNAME_LN&lt;/a&gt;; instead I defined &lt;code&gt;IRIref ::= IRI_REF | PNAME_NS PN_LOCAL?&lt;/code&gt;).  I've been getting some simple CSTs out of it so far, but have a way to go yet.&lt;br /&gt;&lt;br /&gt;Once I have it all parsing, of course, I have to transform the result into an AST.  Fortunately, most of the SPARQL AST is compatible with Mulgara.  The only exceptions are the &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#termConstraint"&gt;FILTER&lt;/a&gt; operator (SPARQL's "constraints") and the &lt;a href="http://www.w3.org/TR/rdf-sparql-query/#optionals"&gt;OPTIONAL&lt;/a&gt; operator.  I'm pretty sure I can handle OPTIONAL as a cross between a disjunction (which can leave unbound variables) and a conjunction (which matches variables between the left and the right).  Filters should be easy, since all our resolutions are performed through nested, lazy evaluation of conjunctions and disjunctions.  Handling the syntax of filters is another matter, but I expect it to be more time consuming than difficult.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Holdup&lt;/h3&gt; Since writing these comments about implementing SPARQL, I haven't had time to work on it again. Hopefully that will change soon with the new job. 
But in the meantime, the loss of time has me thinking that I should reconsider using a pre-built SPARQL definition for a less expressive parser, and come back to Beaver/JFlex at a later date.&lt;br /&gt;&lt;br /&gt;I've heard that the &lt;a href="http://jena.sourceforge.net/"&gt;Jena&lt;/a&gt; JavaCC grammar may be a little heavily geared towards Jena, but I've been given another definition by my friend &lt;a href="http://fotap.org/~osi/"&gt;Peter&lt;/a&gt; which is more general and apparently passes all the relevant tests. I suppose I should go and learn &lt;a href="https://javacc.dev.java.net/"&gt;JavaCC&lt;/a&gt; now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-8515229346979139052?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/8515229346979139052/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=8515229346979139052' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/8515229346979139052'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/8515229346979139052'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/01/mulgara-correspondence-recently-ive.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-4851658162208329748</id><published>2008-01-12T12:22:00.000-06:00</published><updated>2008-01-12T13:40:48.872-06:00</updated><title 
type='text'></title><content type='html'>&lt;h3&gt;Work&lt;/h3&gt; Between the realities of working hours, young children, the need for exercise, etc, I just don't make the time to blog that I used to.  But the main reason I rarely blog now is because of my job. It can be hard to  know what you can talk about when you work in a closed source world. However, there have been a few changes here lately.&lt;br /&gt;&lt;br /&gt;In the latter part of last year, I was asked to write a SPARQL implementation for Mulgara. Two and a half years ago I was told I'd get to do a reasonable amount of Mulgara work, but when it came down to it I could only write for Mulgara in my evenings and weekends. I know that most open source developers are limited like this, but it's still not easy when you have small children. It was also frustrating, given that I had different expectations.&lt;br /&gt;&lt;br /&gt;So I was pleased to be given this new task. SPARQL is sorely needed, and I was more than happy to do it during working hours. Since I was back to open source work, I &lt;em&gt;could&lt;/em&gt; have blogged more, but I was trying to use all my spare moments on the computer to get ahead with the project. That isn't always as productive as it appears, as the process of blogging can really help with programming, but it can work for a short term push.&lt;br /&gt;&lt;br /&gt;Then just before Christmas, a number of people left the company I work for, including the guy who authorized me to work on Mulgara. So I was asked to stop, while everything was worked out. With all of the Semantic Web staff leaving except for me, the company can't really continue in this area. The word I was getting over the break was that the owner of the company didn't "know what to do" with me. I can speculate on what might happen as a result, but I won't do that here. 
It certainly wouldn't involve Mulgara work.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Fedora Commons&lt;/h3&gt; There is a lot I want to accomplish on Mulgara in the near term, as I think an open source framework with the capabilities we are aiming for will enable a number of significant developments. If I can participate in getting Mulgara there, then perhaps I can play a part in what happens later. To this end, I have accepted a position with &lt;a href="http://www.fedora-commons.org/"&gt;Fedora Commons&lt;/a&gt; to work on Mulgara full time.&lt;br /&gt;&lt;br /&gt;Fedora have been building their code on top of Mulgara (and Kowari before that) for some years. They work with the &lt;a href="http://www.topazproject.org/trac/"&gt;Topaz Project&lt;/a&gt; and together these groups have provided technical infrastructure for the &lt;a href="http://www.plos.org/"&gt;Public Library of Science&lt;/a&gt; (PLoS) (see the &lt;a href="http://www.plosone.org/"&gt;PLoS-ONE&lt;/a&gt; open access journal for an example of a deployment that uses Topaz and Fedora, along with Mulgara). I'm just starting to get a feel for the various relationships, so I'll leave the description there.&lt;br /&gt;&lt;br /&gt;The important thing from my perspective is that both Topaz and Fedora Commons have a charter that supports the use and deployment of open source software. Also, PLoS is about making research material available to the entire community, enabling research to reach everyone who should see it. Despite commercial interests to the contrary, there are many people who think this needs to happen (a good interview on this is &lt;a href="http://talk.talis.com/archives/2007/05/peter_murrayrus.html"&gt;here&lt;/a&gt;), as even the US government has &lt;a href="http://slashdot.org/article.pl?sid=07/12/27/0219228"&gt;made moves in this direction&lt;/a&gt;. 
So this work not only fits in with my own goals, it also helps enable something I really believe in.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Role&lt;/h3&gt; In the course of negotiating this position with Fedora Commons, I realized that an exact statement of my roles and responsibilities had not been made. So I thought about exactly what I'd &lt;strong&gt;like&lt;/strong&gt; to do, and proposed that. While I knew that our goals were aligned, I was still pleasantly surprised to have them come back and agree with me completely. I'm pretty happy about this... I've never been in a position to name exactly what I wanted to do before.  :-)&lt;br /&gt;&lt;br /&gt;So my work will basically come down to 3 things:&lt;ul&gt;&lt;li&gt;Mulgara development.&lt;/li&gt;&lt;li&gt;Consulting with Topaz and Fedora Commons on architecture and design.&lt;/li&gt;&lt;li&gt;Supporting and growing the Mulgara community.&lt;/li&gt;&lt;/ul&gt;Of course, all of these are to be done in alignment with Fedora Commons priorities, but this has already been the case for some time with my after-hours work (Fedora Commons and Topaz are the heaviest users of Mulgara at the moment). The second point I put in because I am always happy to do this whenever asked, and I think it is important to keep my hand in when it comes to the bigger picture. And the last point? Well, that's what makes this new position so cool.  :-)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Visa&lt;/h3&gt; Of course, I need to get a new visa for the new position. Since visas are only issued expeditiously by a US consulate, I need to travel to Canada in order to get it (it takes 6 months if I don't want to travel). It &lt;strong&gt;ought&lt;/strong&gt; to be easy, but if for some reason the application gets denied, then I'm not even allowed back in the USA in order to "settle my affairs". Of course, that would be a nightmare for Anne, having to pack up our house while looking after two children under 4. 
I understand the risk is small, but with such dire consequences I'm feeling a little nervous. I don't think I'll feel really happy about the new job until I have the visa paperwork that guarantees it for me.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Notice&lt;/h3&gt; So I've handed in my two weeks' notice. I made sure I had met all my commitments before resigning, and my current boss is traveling overseas for 3 weeks starting tomorrow, so I have no idea what I'll be doing for the next 14 days. If I can, I'll start on SPARQL again. After all, the company I'm leaving is still using Mulgara in some of its projects, so they'll still benefit from the work. I guess I'll find out for sure on Friday.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-4851658162208329748?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/4851658162208329748/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=4851658162208329748' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4851658162208329748'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4851658162208329748'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2008/01/work-between-realities-of-working-hours.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-4136427749824845787</id><published>2007-12-02T22:46:00.000-06:00</published><updated>2007-12-02T23:12:02.017-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tragedy'/><title type='text'></title><content type='html'>&lt;h3&gt;Tragedy&lt;/h3&gt; This evening I heard some &lt;a href="http://www.cleveland.com/news/plaindealer/index.ssf?/base/cuyahoga/119658854495020.xml&amp;coll=2&amp;thispage=1"&gt;tragic news&lt;/a&gt;, involving &lt;a href="http://www.semantic-conference.com/2007/sessions/r3.html"&gt;Chimezie Ogbuji&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I don't really know Chime, though I've met his brother &lt;a href="http://zepheira.com/team/uche/"&gt;Uche&lt;/a&gt; a few times (to quote a friend, "A very stylish man").  Anyone who follows the semantic web mailing lists will have seen numerous messages from both men.&lt;br /&gt;&lt;br /&gt;I just want to express my deepest condolences to the Ogbuji family.  
I also sincerely hope that the youngest will recover.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-4136427749824845787?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/4136427749824845787/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=4136427749824845787' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4136427749824845787'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/4136427749824845787'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/12/tragedy-this-evening-i-heard-some.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-2127912868568085178</id><published>2007-11-13T23:42:00.000-06:00</published><updated>2007-11-14T23:19:25.056-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='code'/><title type='text'></title><content type='html'>&lt;h3&gt;Synchronicity&lt;/h3&gt; After writing about complex behavior emerging from networks of simple non-linear elements this morning, I read &lt;a href="http://slashdot.org/"&gt;Slashdot&lt;/a&gt; this evening to see a story on &lt;a href="http://science.slashdot.org/science/07/11/13/2319204.shtml"&gt;just that topic&lt;/a&gt;.  Strange.&lt;br /&gt;&lt;br /&gt;Other than that I worked to get the new interpreter system working against the existing test suite.  
It's mostly there, but there are still a few bugs left.&lt;br /&gt;&lt;br /&gt;Ironically, the transaction bug of the day was occurring in a section of code where I was doing a lot of testing to see exactly what command had been issued, and responding accordingly.  However, I have an &lt;acronym title="Abstract Syntax Tree"&gt;&lt;a href="http://en.wikipedia.org/wiki/Abstract_syntax_tree"&gt;AST&lt;/a&gt;&lt;/acronym&gt; that works for me, and after staring at it for 10 minutes I suddenly realized that all the problems would go away if I used the same code for each type of command.  Consequently, 12 lines turned into 2, and all the bugs went away. Thank goodness for being able to call into a clean interface design.&lt;br /&gt;&lt;br /&gt;This isn't the first time that I've fixed a problem by removing code. Sometimes I wonder if real software engineering is about &lt;em&gt;removing&lt;/em&gt; lines of code rather than inserting them.  Pretty much destroys the "Lines of Code" metric that some companies like to employ (word to the wise - don't work for these companies).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-2127912868568085178?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/2127912868568085178/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=2127912868568085178' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2127912868568085178'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2127912868568085178'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/11/synchronicity-after-writing-about.html' 
title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-3043705592934934884</id><published>2007-11-13T12:00:00.000-06:00</published><updated>2007-11-13T13:11:06.663-06:00</updated><title type='text'></title><content type='html'>&lt;h3&gt;Laryngitis&lt;/h3&gt; I'm unable to speak above a hoarse whisper today (a fact that my two year old son is delighting in) due to some kind of virus. I'm over the worst of it, but I'm a little lightheaded, so if this post rambles more than usual you'll know why. :-)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;OWL... Again&lt;/h3&gt; I'm much further into the 2nd Edition of the Description Logic Handbook (and to my deep satisfaction, wading through stuff I already know), and can see some interesting stuff coming up in the next few chapters. I'm also learning some interesting points in the discussions that go on in the &lt;a href="http://lists.w3.org/Archives/Public/public-owl-dev/2007OctDec/"&gt;OWL developers list&lt;/a&gt;. And it reoccurs to me that something is wrong here.&lt;br /&gt;&lt;br /&gt;If it takes professional developers, and even professional academics, this long to get a real handle on how OWL works, then how on earth can we expect the rest of the world to get it right? The idea of the Semantic Web is to link data from lots of different sources, but that implies we need lots of people out there who can structure that data in a way that will allow the linking to be consistent (and I'm referring to the &lt;em&gt;logic&lt;/em&gt; meaning for "consistent").&lt;br /&gt;&lt;br /&gt;Conversely, in order to create a semantic web, we need precise descriptions of things, and that implies Description Logic. 
The inventors of OWL were not trying to be obtuse - indeed, I think they desired the opposite effect.  However, years of Description Logic research have led to an understanding that seemingly insignificant details in a language can have dramatic effects. So OWL had to be carefully built and constrained so as to prevent the future semantic web from shooting itself in the foot. But this leads us directly into this language of horrible complexity with subtle rules that even catch the experts off guard occasionally.&lt;br /&gt;&lt;br /&gt;So what's the solution? Well, for the moment, the industry is doing what it always does. It muddles through using what expertise the developer community has, and incrementally drags itself up to greater consistency (hopefully) and complexity (certainly). It's hardly ideal, but then, it's no different to what usually happens with software. This is why Windows used to blue-screen all the time, and why I'm unable to run &lt;a href="http://www.microsoft.com/windowsxp/home/default.mspx"&gt;Windows XP&lt;/a&gt; in &lt;a href="http://www.parallels.com/"&gt;Parallels&lt;/a&gt; without &lt;a href="http://www.apple.com/macosx/"&gt;Leopard&lt;/a&gt; losing the ability to start new programs or kill off old ones (I'm really hoping Apple fix that one!). It leaves me concerned about the wisdom of this approach.&lt;br /&gt;&lt;br /&gt;On the other hand, there seems to be little alternative to this kind of design if we want to design for semantics. OWL is simply a representation of an underlying mathematics that is fundamental to what we are trying to represent. But if it turns out to be too complex to design this stuff as a community (I believe individuals are capable of it, but not enough to make a "web" out of the semantics), then that means we can't really design this at all. 
But we &lt;em&gt;know&lt;/em&gt; that semantics are possible, since our brains deal with them, and our brains are little more (ha ha) than enormous networks of simple, non-linear elements. There are general guidelines (giving functional areas like the &lt;a href="http://en.wikipedia.org/wiki/Prefrontal_cortex"&gt;prefrontal cortex&lt;/a&gt; for higher thoughts and the &lt;a href="http://en.wikipedia.org/wiki/Amygdala"&gt;amygdala&lt;/a&gt; for starting emotional thoughts), but the details can vary dramatically, even between identical twins, and as we grow and learn the network starts to adapt and modify itself. In other words, build a large enough network of simple constructs, with general design guidelines, and let the details vary.&lt;br /&gt;&lt;br /&gt;Despite the randomness (neural network theory even demonstrated that randomness is essential), and despite the lack of any detailed "design", the brain is the only instrument we currently have that can process semantics. Almost all of its processing capabilities come about as an emergent property simply from building up a large enough network of interacting elements. So maybe the idea of the semantic web isn't that far fetched after all.  We just need to get things mostly right at a local level, and when we link it all together something special will emerge. I don't think this is what the proponents of the semantic web had in mind when they first set out, but it might be what we end up with.&lt;br /&gt;&lt;br /&gt;We are already seeing emergent properties coming out of networks that hit some critical mass. This is the effect behind &lt;a href="http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html"&gt;Web 2.0&lt;/a&gt; - whatever that means. And that is the point here. The label "Web 2.0" is a recognition of &lt;em&gt;something&lt;/em&gt; that has "emerged" from these networks when connected with the right technologies. 
Because it wasn't explicitly designed, it's hard to pinpoint exactly what it is, but most people in the industry agree that it's there - even if they don't agree where its boundaries lie.&lt;br /&gt;&lt;br /&gt;Having semantics emerge rather than being designed in would seem to be a natural extension of what we're seeing now, especially when we are getting partial semantics in small systems already (courtesy of such technologies as OWL).  But is there enough structure, and is it of the correct type for true semantics to finally emerge from the network?&lt;br /&gt;&lt;br /&gt;OK, now I'm just going off on a wild tangent. At least I didn't look at the whole OWL problem and give up on it today. Perhaps our partial and not-quite-correct systems will have a part to play in a larger network.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-3043705592934934884?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/3043705592934934884/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=3043705592934934884' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3043705592934934884'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/3043705592934934884'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/11/laryngitis-im-unable-to-speak-above.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-6903146802567802423</id><published>2007-10-25T23:33:00.000-05:00</published><updated>2007-10-26T00:25:55.249-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='TQL'/><category scheme='http://www.blogger.com/atom/ns#' term='AST'/><category scheme='http://www.blogger.com/atom/ns#' term='SPARQL'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><title type='text'></title><content type='html'>&lt;h3&gt;The Road to SPARQL&lt;/h3&gt; A long time ago, an interface was built for talking to Mulgara. At the time a query language was needed. We did implement &lt;a href="http://www.w3.org/Submission/2004/SUBM-RDQL-20040109/"&gt;RDQL&lt;/a&gt; from &lt;a href="http://jena.sourceforge.net/"&gt;Jena&lt;/a&gt; in some earlier versions (that code is still in there), but quickly realized that we wanted more and slightly different functionality to what this language offered, and so TQL was born. Initially it was envisioned that there would be a text version for direct interaction with a user (interactive TQL, or iTQL), and an equivalent programmatic structure more appropriate for computers, written in &lt;acronym title="eXtensible Markup Language"&gt;XML&lt;/acronym&gt; (XML TQL, or xTQL). The latter was never developed, but this was the start of iTQL. (The lack of xTQL is the reason I've been advocating that we just call it TQL).&lt;br /&gt;&lt;br /&gt;But a language cannot exist in a vacuum. Queries have to go somewhere. This led to the development of the &lt;code&gt;ItqlInterpreter&lt;/code&gt; class, and the associated developer interface, &lt;code&gt;ItqlInterpreterBean&lt;/code&gt;. These classes accept queries in the form of strings, and give back the appropriate response.&lt;br /&gt;&lt;br /&gt;So far so good, but here is where it comes unstuck. 
For some reason, someone decided that because the &lt;acronym title="Uniform Resource Identifier"&gt;URI&lt;/acronym&gt;s (really, &lt;acronym title="Uniform Resource Locator"&gt;URL&lt;/acronym&gt;s) of Mulgara's graphs describe the server that the query should be sent to, the interpreter should perform the dispatch as soon as it sees the server in the query string. This led to &lt;code&gt;ItqlInterpreter&lt;/code&gt; becoming a horrible mess, combining grammar parsing and automatic remote session management.&lt;br /&gt;&lt;br /&gt;I've seen this mess many times, and much as I'd like to have fixed it, the effort required was beyond my limited resources.&lt;br /&gt;&lt;br /&gt;But now Mulgara needs to support &lt;acronym title="SPARQL Protocol and RDF Query Language"&gt;&lt;a href="http://www.w3.org/TR/rdf-sparql-query/"&gt;SPARQL&lt;/a&gt;&lt;/acronym&gt;. While SPARQL is limited in its functionality, and imposes inefficiencies when interpreted literally (consider using &lt;code&gt;&lt;a href="http://www.w3.org/TR/rdf-sparql-query/#optionals"&gt;OPTIONAL&lt;/a&gt;&lt;/code&gt; on a variable and &lt;code&gt;&lt;a href="http://www.w3.org/TR/rdf-sparql-query/#termConstraint"&gt;FILTER&lt;/a&gt;&lt;/code&gt;ing on whether it is &lt;code&gt;bound&lt;/code&gt;), the fact that it is a standard makes it extremely valuable both to the community and to projects like Mulgara.&lt;br /&gt;&lt;br /&gt;SPARQL has its own &lt;a href="http://www.w3.org/TR/rdf-sparql-protocol/"&gt;communications protocol&lt;/a&gt; for the end user, but internally it makes sense for us to continue using our own systems, especially when we need to continue maintaining TQL compatibility. What we'd like to do, then, is create an interface like &lt;code&gt;ItqlInterpreter&lt;/code&gt;, which can parse SPARQL queries, and use the same session and connection code as the existing system. 
This means we can re-use the entire system between the two languages, with only the interpreter class differing, depending on the query language you want to use.&lt;br /&gt;&lt;br /&gt;Inderbir has built a query parser for me, so all we'd need to do would be to build a query &lt;acronym title="Abstract Syntax Tree"&gt;AST&lt;/acronym&gt; with this parser, and we have the new SPARQLInterpreter class. There are a couple of missing features (like &lt;code&gt;filter&lt;/code&gt; and &lt;code&gt;optional&lt;/code&gt; support), but these are entirely compatible with the code we have now, and easily feasible extensions.&lt;br /&gt;&lt;br /&gt;However, in order to make this happen, ItqlInterpreter finally had to be dragged (kicking and screaming) into the 21st century, by splitting up its functionality between parsing and connection management. This has been my personal project over the last few months, and I'm nearly there.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Connections&lt;/h3&gt; One of the things that I never liked about Mulgara is that I couldn't create connections for it like you do in databases &lt;a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-real-connect.html"&gt;like MySQL&lt;/a&gt;. There is the possibility of doing this with &lt;code&gt;Session&lt;/code&gt; objects, but this has never been a user interface, and there are some operations that are not handled easily with sessions.&lt;br /&gt;&lt;br /&gt;My solution was to create a &lt;code&gt;Connection&lt;/code&gt; class. The idea is to create a connection to a server, and to issue queries on this connection. This allows some important functionality. To start with, it permits client control over their connections, such as connection pooling, and multiple parallel connections. It also enables the user to send queries to any server, regardless of the URIs described in the query. 
This is important for SPARQL, as it may not include a model name in the query, and so the server has to be established when forming the connection. This is also how the SPARQL network protocol works.&lt;br /&gt;&lt;br /&gt;Sending queries to any server was not possible until recently, as a server always presumed that it held the queried graphs locally. However, serendipity lent a hand here, as I created the &lt;code&gt;DistributedQuery&lt;/code&gt; resolver just a few months ago, enabling this functionality.&lt;br /&gt;&lt;br /&gt;As an interesting aside, when I was creating the &lt;code&gt;Connection&lt;/code&gt; code I discovered an unused class called &lt;code&gt;Connection&lt;/code&gt;. This was written by Tom, and the accompanying documents explained that this was going to be used to clean up &lt;code&gt;ItqlInterpreter&lt;/code&gt;, deprecating the old code. It was never completed, but it looks like I wasn't the only one who decided to fix things this way. It's a shame Tom didn't make further progress, or else I could have merged my effort in with his (saving myself some time).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Compatibility&lt;/h3&gt; So I don't break any existing Mulgara clients, my goal has been to provide 100% compatibility with &lt;code&gt;ItqlInterpreterBean&lt;/code&gt;. To accomplish this, I've created a new &lt;code&gt;AutoInterpreter&lt;/code&gt; class, which internally delegates query parsing to the real interpreter. The resulting AST is then queried for the desired server, and a connection is made, using caching wherever possible. Once this is set up, the query can be sent across the connection.&lt;br /&gt;&lt;br /&gt;This took some work, but it now appears to mostly work. I initially had a few bugs where I forgot certain cases in the TQL syntax, such as backing up to the local client rather than the default of the remote server. 
But I have overcome many of these now.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Debugging&lt;/h3&gt; The main problem has been where &lt;code&gt;ItqlInterpreterBean&lt;/code&gt; was built to allow specified sessions to be set, or operations that were set to operate directly on specified sessions. I think I'm emulating most of this behavior correctly, but it's taken me a while to track them all down.&lt;br /&gt;&lt;br /&gt;I still have a number of failures and errors in my tests, but I'm working through them all quickly, so I'm pretty happy about the progress. Each time I fix a single bug, the number of problems drops by a dozen or more. The latest one I found was where the session factory is being directly invoked for graphs with a "file:" scheme in the URI. There is supposed to be a fallback to another session factory that knows how to handle protocols that aren't &lt;code&gt;rmi&lt;/code&gt;, &lt;code&gt;beep&lt;/code&gt;, or &lt;code&gt;local&lt;/code&gt;, but it seems that I've missed it. It shouldn't be too hard to find in the morning.&lt;br /&gt;&lt;br /&gt;One bug that has me concerned looks like some kind of resource leak. The test involved creates an &lt;code&gt;ItqlInterpreterBean&lt;/code&gt; 1000 times. On each iteration a query is invoked on the object, and then the object is closed before the loop is iterated again. For some reason this is consistently failing at about the 631&lt;sup&gt;st&lt;/sup&gt; iteration. I've added in some more logging, so I'm hoping to see the specific exception the next time I go through the tests.&lt;br /&gt;&lt;br /&gt;One thing that caught me out a few times was the set of references to the original &lt;code&gt;ItqlInterpreter&lt;/code&gt;, &lt;code&gt;ItqlSession&lt;/code&gt; and &lt;code&gt;ItqlSessionUI&lt;/code&gt; classes. So yesterday I removed all references to these classes. 
This necessitated a cleanup of the various Ant scripts which defined them as entry points, but everything appears to work correctly now, which gives me hope that I got it all. The new code is now called &lt;code&gt;TqlInterpreter&lt;/code&gt;, &lt;code&gt;TqlSession&lt;/code&gt; and &lt;code&gt;TqlSessionUI&lt;/code&gt;. While the names are similar, and the functionality is the same, most of it was re-written from the ground up. This gave me a more intimate view of the way these classes were built, leading to a few surprises.&lt;br /&gt;&lt;br /&gt;One of the things the old &lt;acronym title="User Interface"&gt;UI&lt;/acronym&gt; code used to do was to block on reading a pipe, while simultaneously handling UI events. Only this pipe was never set to anything! It was totally dead code, but would never have been caught by a code usage analyzer, as it got run all the time (it could just never do anything). I decided to address this by having it read from, and process, the standard input of the process (I suspect this was the initial intent, but I'm not sure). I don't know how useful it is, but it's sort of cute, as I can now send queries to the standard input while the UI is running, and not just paste into the UI.&lt;br /&gt;&lt;br /&gt;I've added a few little features like this as I've progressed, though truth be told I can't remember them all! :-)&lt;br /&gt;&lt;br /&gt;The other major thing I've been doing has been to fix up the bad code formatting that was imposed at some point in 2005, and to add &lt;a href="http://java.sun.com/j2se/1.5.0/docs/guide/language/generics.html"&gt;generics&lt;/a&gt;. Sometimes this has proven to be difficult, but in the end it's worth it. It's not such a big issue with already working code, but generics make updating significantly easier, both by documenting what goes in and out of collections, and by doing some checking on the types being used. 
Unfortunately, there are some strange structures that make generics difficult (trees built from maps with values that are also maps), so some of this work was time consuming. On the plus side, it's &lt;strong&gt;&lt;em&gt;much&lt;/em&gt;&lt;/strong&gt; easier to see what that code is now doing.&lt;br /&gt;&lt;br /&gt;I hope to be through this set of fixes by the end of the week, so I can get a preliminary version of SPARQL going by next week. That will then let me start on those features of the SPARQL AST that we don't yet support.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-6903146802567802423?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/6903146802567802423/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=6903146802567802423' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6903146802567802423'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6903146802567802423'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/10/road-to-sparql-long-time-ago-interface.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-98319761002084678</id><published>2007-10-25T23:22:00.000-05:00</published><updated>2007-10-25T23:32:47.146-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OCL'/><category scheme='http://www.blogger.com/atom/ns#' 
term='Web Services'/><category scheme='http://www.blogger.com/atom/ns#' term='BioMOBY'/><title type='text'></title><content type='html'>&lt;h3&gt;Web Service Descriptions&lt;/h3&gt; Whenever I design ontologies and the tools to use them, I find myself thinking that I should somehow be describing more, and having the computer do more of the work for me. Currently OWL is about describing general structure and relationships, but I keep feeling like ontologies should describe behavior as well.&lt;br /&gt;&lt;br /&gt;I'm not advocating the inclusion of an imperative style of operation description here. If that were the case, then we'd just be building programming languages in RDF (something I've been thinking of for a while - after all, it seems natural to represent ASTs or List structure in RDF). I'm thinking more along the lines of &lt;acronym title="Object Constraint Language"&gt;&lt;a href="http://en.wikipedia.org/wiki/Object_Constraint_Language"&gt;OCL&lt;/a&gt;&lt;/acronym&gt; for &lt;acronym title="Unified Modeling Language"&gt;&lt;a href="http://en.wikipedia.org/wiki/Unified_Modeling_Language"&gt;UML&lt;/a&gt;&lt;/acronym&gt;. This is a far more declarative approach, and describes what things are, rather than how they do it. Like we were taught in first-year programming, OCL is all about describing the pre-conditions and post-conditions of functions.&lt;br /&gt;&lt;br /&gt;The main problem I have with OCL is that it looks far too much like a programming language, which is the very thing I'd like to avoid.  All the same, it's on the right track. It would be interesting to see something like it in OWL, though OWL still has a long way to go before it is ready for this. The other problem is that OCL is about &lt;em&gt;constraining&lt;/em&gt; a description, which seems to be at odds with the open-world model in OWL.&lt;br /&gt;&lt;br /&gt;Still, the world appears to be ready for OWL to have something like this, even if OWL isn't ready to include it. 
Maybe I should be building an RDF/Lisp interpreter after all?  :-)&lt;br /&gt;&lt;br /&gt;The reason I'm thinking along these lines is that I'd like to be able to describe web services in a truly interoperable way. Today we have WSDL and OWL-S, which are very good at describing the names of services and how to interface to them, but do little to describe what those services do. If we could really describe a service correctly, then any client could connect to an unknown server and discover exactly what that server was capable of and how to talk to it, all without user interaction, and without any prior understanding of what that server could do.&lt;br /&gt;&lt;br /&gt;Ultimately, I think this is what we are striving for with the Semantic Web. It is a long way beyond us today, but the evolution of hardware and software in computers over the last 60 years has taught us that amazing things are achievable, so long as we take it one step at a time.&lt;br /&gt;&lt;br /&gt;To a limited extent, there are already systems like &lt;a href="http://biomoby.org/"&gt;BioMOBY&lt;/a&gt;, which use ontologies to query about services, and can work out for themselves how to connect to remote services and connect them together to create entirely new functions. There are still assumptions made about what kind of data is there and how to talk to it, but it includes a level of automation that is astounding for anyone familiar with WSDL standards.&lt;br /&gt;&lt;br /&gt;When I last saw BioMOBY nearly 3 years ago, they were using their own RDF-like data structures, and were considering moving to RDF to gain the benefit of using public standards. I should check them out again, and see where they went with that. 
They certainly had some great ideas that I'd like to see implemented in a more general context.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-98319761002084678?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/98319761002084678/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=98319761002084678' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/98319761002084678'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/98319761002084678'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/10/web-service-descriptions-whenever-i.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-2893611438438234788</id><published>2007-10-22T10:26:00.000-05:00</published><updated>2007-10-26T00:38:59.401-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OWL'/><category scheme='http://www.blogger.com/atom/ns#' term='Semantic Web'/><category scheme='http://www.blogger.com/atom/ns#' term='RDFS'/><category scheme='http://www.blogger.com/atom/ns#' term='RDF'/><title type='text'></title><content type='html'>&lt;h3&gt;Theory and Practice&lt;/h3&gt; The development of OWL comes from an extensive history of description logic research.  
This gives it a solid theoretical foundation for developing systems on, but that still doesn't make it practical.&lt;br /&gt;&lt;br /&gt;There are numerous practical systems out there which do &lt;em&gt;not&lt;/em&gt; have a solid theoretical foundation (and this can sometimes bite you when you really try to push a system), but can still be very useful for real world applications.  After all, G&amp;ouml;del showed us that every sufficiently expressive formal system will either be incomplete or inconsistent (maybe that explains why Quantum Physics and General Relativity cannot both be right.  Since G&amp;ouml;del's Theorem requires that there be an inconsistency &lt;em&gt;somewhere&lt;/em&gt; in our description of the universe, then maybe that's it).  :-)&lt;br /&gt;&lt;br /&gt;If theoretically solid systems are not an absolute requirement for practical applications, then are they really needed?  I'd like to think so, but I don't have any proof of this.  In fact, the opposite seems to be true.  Systems with obvious technical flaws become successful, while those with a good theoretical underpinning languish.  There are many reasons for failure, with social and marketing factors being among the more common.  Ironically, the problems can also be technical.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.jfsowa.com/pubs/index.htm"&gt;John Sowa's&lt;/a&gt; essay on &lt;a href="http://www.jfsowa.com/pubs/fflogic.pdf"&gt;Fads and Fallacies about Logic&lt;/a&gt; describes how logic was originally derived from trying to formalize statements made in natural language.  He also mentions that a common complaint made about modern logic is its unreadability.  But this innocuous statement doesn't do the evolution justice.  
Consider this example in classical logic:&lt;ul&gt;&lt;li&gt;All men are mortal.&lt;/li&gt;&lt;li&gt;Socrates is a man.&lt;/li&gt;&lt;li&gt;Therefore: Socrates is mortal.&lt;/li&gt;&lt;/ul&gt;Now look at the following definition of modal logic:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Given a set of &lt;em&gt;propositional letters p&lt;sub&gt;1&lt;/sub&gt;,p&lt;sub&gt;2&lt;/sub&gt;,...&lt;/em&gt;, the set of formulae of the modal logic &lt;strong&gt;K&lt;/strong&gt; is the smallest set that:&lt;ul&gt;&lt;li&gt;contains &lt;em&gt;p&lt;sub&gt;1&lt;/sub&gt;,p&lt;sub&gt;2&lt;/sub&gt;,...&lt;/em&gt;,&lt;/li&gt;&lt;li&gt;is closed under Boolean connectives, &amp;and;, &amp;or; and &amp;not;, and&lt;/li&gt;&lt;li&gt;if it contains &lt;em&gt;&amp;Phi;&lt;/em&gt;, then it also contains &amp;#x25a1;&lt;em&gt;&amp;Phi;&lt;/em&gt; and &amp;#x25c7;&lt;em&gt;&amp;Phi;&lt;/em&gt;.&lt;/li&gt;&lt;/ul&gt;The semantics of modal formulae is given by &lt;em&gt;Kripke structures&lt;/em&gt; of the form &lt;em&gt;M&lt;/em&gt;=&amp;lang;&lt;em&gt;S&lt;/em&gt;,&lt;em&gt;&amp;pi;&lt;/em&gt;,&lt;em&gt;K&lt;/em&gt;&amp;rang;, where &lt;em&gt;S&lt;/em&gt; is a set of &lt;em&gt;states&lt;/em&gt;, &lt;em&gt;&amp;pi;&lt;/em&gt; is a projection of propositional letters to sets of states, and &lt;em&gt;K&lt;/em&gt; is the &lt;em&gt;accessibility relation&lt;/em&gt; which is a binary relation on the states &lt;em&gt;S&lt;/em&gt;. Then, for a modal formula &lt;em&gt;&amp;Phi;&lt;/em&gt; and a state &lt;em&gt;s&lt;/em&gt;&amp;isin;&lt;em&gt;S&lt;/em&gt;, the expression &lt;em&gt;M&lt;/em&gt;, &lt;em&gt;s&lt;/em&gt;&amp;#x22a8;&lt;em&gt;&amp;Phi;&lt;/em&gt; is read as "&lt;em&gt;&amp;Phi;&lt;/em&gt; holds in &lt;em&gt;M&lt;/em&gt; in state &lt;em&gt;s&lt;/em&gt;". 
So:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;em&gt;M&lt;/em&gt;, &lt;em&gt;s&lt;/em&gt; &amp;#x22a8; &lt;em&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; iff &lt;em&gt;s&lt;/em&gt; &amp;isin; &lt;em&gt;&amp;pi;&lt;/em&gt;(&lt;em&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt;)&lt;/li&gt;&lt;li&gt;&lt;em&gt;M&lt;/em&gt;, &lt;em&gt;s&lt;/em&gt; &amp;#x22a8; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; &amp;and; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; &amp;nbsp;&amp;nbsp; iff &lt;em&gt;M,s&lt;/em&gt; &amp;#x22a8; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; and &lt;em&gt;M,s&lt;/em&gt; &amp;#x22a8; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;/li&gt;&lt;li&gt;&lt;em&gt;M&lt;/em&gt;, &lt;em&gt;s&lt;/em&gt; &amp;#x22a8; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; &amp;or; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; &amp;nbsp;&amp;nbsp; iff &lt;em&gt;M,s&lt;/em&gt; &amp;#x22a8; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; or &lt;em&gt;M,s&lt;/em&gt; &amp;#x22a8; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;/li&gt;&lt;li&gt;&lt;em&gt;M&lt;/em&gt;, &lt;em&gt;s&lt;/em&gt; &amp;#x22a8; &amp;not;&lt;em&gt;&amp;Phi;&lt;/em&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; iff &lt;em&gt;M,s&lt;/em&gt; &amp;#x22ad; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;/li&gt;&lt;li&gt;&lt;em&gt;M&lt;/em&gt;, &lt;em&gt;s&lt;/em&gt; &amp;#x22a8; &amp;#x25c7;&lt;em&gt;&amp;Phi;&lt;/em&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; iff there exists &lt;em&gt;s&amp;apos;&lt;/em&gt; &amp;isin; &lt;em&gt;S&lt;/em&gt; with (&lt;em&gt;s,s&amp;apos;&lt;/em&gt;) &amp;isin; &lt;em&gt;K&lt;/em&gt; and &lt;em&gt;M,s&amp;apos;&lt;/em&gt; &amp;#x22a8; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;/li&gt;&lt;li&gt;&lt;em&gt;M&lt;/em&gt;, &lt;em&gt;s&lt;/em&gt; &amp;#x22a8; &amp;#x25a1;&lt;em&gt;&amp;Phi;&lt;/em&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; iff for all &lt;em&gt;s&amp;apos;&lt;/em&gt; &amp;isin; 
&lt;em&gt;S&lt;/em&gt; if (&lt;em&gt;s,s&amp;apos;&lt;/em&gt;) &amp;isin; &lt;em&gt;K&lt;/em&gt;, then &lt;em&gt;M,s&amp;apos;&lt;/em&gt; &amp;#x22a8; &lt;em&gt;&amp;Phi;&lt;/em&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;(Courtesy of &lt;a href="http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521781760"&gt;The Description Logic Handbook&lt;/a&gt;.  Sorry if it doesn't render properly for you... I tried! The Unicode character &amp;amp;#x22a8; &lt;/em&gt;(&amp;#x22a8;)&lt;em&gt; is rendered in Safari, but not in Firefox on my desktop - though it shows up on my notebook.)&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;Being several generations removed, it can be hard to see how this formalism descended from classical logic. Modal logic is on a firm theoretical foundation, and it is apparent that the above is a precise description, yet it is not the sort of tool that professional programmers are ever likely to use. This is because the complexity of understanding the formalism is a significant barrier to entry.&lt;br /&gt;&lt;br /&gt;We see this time and again. Functional programming is superior to imperative programming in many instances, and yet the barrier to entry is too high for many programmers to use it. Many of the elements that were considered fundamental to Object Oriented Programming are avoided or ignored by many programmers, and are not even available in some supposedly Object Oriented languages (for instance, Java does not invoke methods by passing messages to objects). And many logic formalisms are overlooked, or simply not sufficiently understood for many of the applications for which they were intended.&lt;br /&gt;&lt;br /&gt;After working with RDF, RDFS and OWL for some years now, I've started to come to the conclusion that these systems suffer from these same problems with the barrier to entry. 
It took me long enough to understand the complexities introduced in an open world model without a unique name assumption. Contrary to common assumption, RDFS &lt;em&gt;domain&lt;/em&gt; and &lt;em&gt;range&lt;/em&gt; are descriptive rather than prescriptive. And cardinality restrictions rarely create inconsistencies.&lt;br /&gt;&lt;br /&gt;Part of the problem stems from the fact that non-unique names and an open world are a completely different set of assumptions from the paradigms that programmers have been trained to deal with. It takes a real shift in thinking to understand this. Also, computers are good at working with the data they have stored. Working with data that is &lt;em&gt;not&lt;/em&gt; stored is more the domain of mathematics: a field that has been receiving less attention in the industry in recent years, particularly as professionals have moved away from "Computer Science" and into "Information Technology". Even those of us who know better still resort to the expediency of using many closed world assumptions when storing RDF data.&lt;br /&gt;&lt;br /&gt;Giving RDF, RDFS, and OWL to the general world of programmers today seems like a recipe for implementations of varying correctness with little hope of interoperability - the very thing that these technologies were designed to enable.&lt;br /&gt;&lt;br /&gt;However, RDF, RDFS and OWL were designed the way they are for very sound reasons. The internet &lt;strong&gt;&lt;em&gt;is&lt;/em&gt;&lt;/strong&gt; an open world. New information is being asserted all the time (and some information is being retracted, meaning that facts on the web are both temporal and non-monotonic, neither of which is dealt with by semantic web technologies, but let's deal with one problem at a time). There are often many different ways of referring to the same things (IP addresses and hostnames are two examples). 
URIs are the mechanism for identifying things on the internet, and while URIs may not be unique for a single resource, they &lt;em&gt;do&lt;/em&gt; describe a single resource, and no other. All of these features were considered when RDF and OWL were developed, and the decisions made were good ones. Trying to build a system that caters to programmers' presumptions by ignoring these characteristics of the internet would be ignoring the world as it is.&lt;br /&gt;&lt;br /&gt;So I'm left thinking that the foundations of RDF and OWL are correct, but somehow we have to present them in such a way that programmers don't shoot themselves in the foot with them.  Sometimes I think I have some ideas, but it's easy to become disheartened.&lt;br /&gt;&lt;br /&gt;Certainly, I believe that education is required. To some extent this has been successful, as I've seen significant improvement in developers' understanding (my own included) in the last few years. We also need to provide tools which help guide developers along the right path, even if that means restricting some of their functionality in some instances. These have started to come online, but we have a long way to go.&lt;br /&gt;&lt;br /&gt;Overall, I believe in the vision of the semantic web, but the people who will make it happen are the people who will write software to use it. OWL seems to be an impediment to the understanding they require, and the tools for the task are still rudimentary. 
It leaves me wondering what can be done to help.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-2893611438438234788?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/2893611438438234788/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=2893611438438234788' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2893611438438234788'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2893611438438234788'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/10/theory-and-practice-development-of-owl.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-6577488961340440926</id><published>2007-09-17T21:45:00.000-05:00</published><updated>2007-09-18T00:20:47.175-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Java'/><category scheme='http://www.blogger.com/atom/ns#' term='Currying'/><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><title type='text'></title><content type='html'>&lt;h3&gt;Ruby&lt;/h3&gt; I've recently been making my way through the &lt;a href="http://www.pragmaticprogrammer.com/titles/ruby/index.html"&gt;Pickaxe book&lt;/a&gt;, otherwise known as &lt;em&gt;"Programming Ruby"&lt;/em&gt;. 
I've been avoiding &lt;a href="http://www.ruby-lang.org/en/"&gt;Ruby&lt;/a&gt; for a few years now, but I finally decided there was enough critical mass to make it worth my while.&lt;br /&gt;&lt;br /&gt;Initially I thought little of Ruby, as it was just another scripting language. Despite the power and fast turnaround that these languages provide, in the late 90's and early 2000's they had always failed to deliver compelling performance. However, hardware performance has largely overcome that limitation. Even when it became apparent some years ago that performance was no longer an issue, Ruby was not a language with a lot of mindshare. The popular languages were always &lt;a href="http://www.perl.org/"&gt;Perl&lt;/a&gt; for sysadmins and the Web, and &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt; for scientific applications, programs with GUIs, and some web sites. (I won't mention &lt;a href="http://www.tcl.tk/"&gt;Tcl&lt;/a&gt;. Anyone who used that got what they deserved). These are generalizations, but they serve as a picture of the general landscape. &lt;br /&gt;&lt;br /&gt;However, the importance of Ruby seemed to change with the advent of &lt;a href="http://www.rubyonrails.org/"&gt;Ruby on Rails&lt;/a&gt; (RoR). While often criticized as not providing the scalability of the enterprise frameworks, it boasts a pragmatism that gets a lot of very compelling sites up and running in record time. This is the perfect example of the advantages in avoiding &lt;a href="http://en.wikiquote.org/wiki/C._A._R._Hoare"&gt;premature optimization&lt;/a&gt;. The fact that people can just &lt;em&gt;make stuff work&lt;/em&gt; in Rails has been enough to see it expand into almost every Web 2.0 site I can think of. Just in case I wasn't paying attention, Sun recently decided this was worth looking at when &lt;a href="http://www.tbray.org/ongoing/When/200x/2006/09/07/JRuby-guys"&gt;they hired Thomas Enebo and Charles Nutter&lt;/a&gt;, two of the key developers in JRuby. 
Another indication (if I needed any) is all the interest in &lt;a href="http://www.activerdf.org/"&gt;ActiveRDF&lt;/a&gt;, which is a framework for accessing RDF in a way that is compatible with the existing RoR APIs.&lt;br /&gt;&lt;br /&gt;Now, while everyone I know who uses RoR tells me that I don't need to know Ruby to use it, I'm probably going to spend more time providing services (from &lt;a href="http://mulgara.org/"&gt;Mulgara&lt;/a&gt;) than using it directly. Not to say that I don't &lt;em&gt;want&lt;/em&gt; to use RoR... it's just that I usually find myself elsewhere in the programming stack. Besides, I love working at lower levels. So I've been thinking that I should make the time to properly learn this language.&lt;br /&gt;&lt;br /&gt;Actually, the thing that finally made me start learning Ruby was discovering that there is in-built support for &lt;a href="http://lambda-the-ultimate.org/"&gt;lambda&lt;/a&gt;s.  Well, why didn't anyone just say that before?&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Currying&lt;/h3&gt; I'm still only partway through the book, but I'm enjoying it a lot. In my "spare time" I'm finding myself torn between reading more, and writing code in Ruby. This is completely ignoring the fact that I'm doing a big refactor in Mulgara (one reason is to improve SOAP support - allowing Ruby to work better), and that I've dabbled with the &lt;a href="http://www.talis.com/platform/index.shtml"&gt;Talis Platform&lt;/a&gt; as well. Because of this I've been restricting myself to playing with language constructs, but I intend to start using it in earnest soon.&lt;br /&gt;&lt;br /&gt;One of the things I've been wondering about is &lt;a href="http://en.wikipedia.org/wiki/Currying"&gt;currying&lt;/a&gt;. I found &lt;a href="http://moonbase.rydia.net/mental/blog/programming/currying-in-ruby"&gt;a reference&lt;/a&gt; to doing this in Ruby, but the technique was limited to explicit currying of particular lambdas. 
It wasn't general, and does not refer to methods.&lt;br /&gt;&lt;br /&gt;So I had a look at doing it more generally, and learnt a little on the way.&lt;br /&gt;&lt;br /&gt;It would be great if methods in Ruby could be completely compatible with lambdas, but they are separate object types. Initially I despaired of getting a type for a method, as every way I could think of referring to them would call the method instead. This should have clued me into the fact that methods are found by name (as in a string) or by "label". Fortunately, both methods and lambdas respond to the "call" message, meaning they can be used the same way. Instance based &lt;code&gt;Method&lt;/code&gt; objects also need to be bound to an object in order to be called, which I also learnt through trial and error.  But in the end I found the simple invocation for currying a method:&lt;pre&gt;&lt;code&gt;def curry(fn, p)&lt;br /&gt;  lambda { |*args| fn.call(p, *args) }&lt;br /&gt;end&lt;/code&gt;&lt;/pre&gt;This doesn't do any checking that the parameters will work with it, but it works if called correctly. The thing I like about it is that it lets you curry down arbitrary lambdas iteratively.  
For instance, I can take a lambda that adds 3 parameters, and curry it down to a lambda that has no parameters:&lt;pre&gt;&lt;code&gt;add_xyz = lambda { |x,y,z| x+y+z }&lt;br /&gt;add_3yz = curry(add_xyz, 3)&lt;br /&gt;add_35z = curry(add_3yz, 5)&lt;br /&gt;add_357 = curry(add_35z, 7)&lt;br /&gt;&lt;br /&gt;puts "add_xyz(3,5,7) = #{add_xyz[3,5,7]}"&lt;br /&gt;puts "add_3yz(5,7) = #{add_3yz[5,7]}"&lt;br /&gt;puts "add_35z(7) = #{add_35z[7]}"&lt;br /&gt;puts "add_357() = #{add_357[]}"&lt;/code&gt;&lt;/pre&gt;All 4 invocations here return the same number (15).&lt;br /&gt;&lt;br /&gt;It also works on methods:&lt;pre&gt;&lt;code&gt;class Multiplier&lt;br /&gt;  def mult(x,y)&lt;br /&gt;    x * y&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;foo = Multiplier.new&lt;br /&gt;double = curry(foo.method(:mult), 2)&lt;br /&gt;&lt;br /&gt;puts "double(5) = #{double[5]}"&lt;/code&gt;&lt;/pre&gt;It's not as elegant as &lt;a href="http://www.haskell.org/"&gt;Haskell&lt;/a&gt;, but I'm pleased to see that it can be generalized. It gives me faith in the power of the language.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Language Life&lt;/h3&gt; In recent weeks I've been having discussions with friends about where I see languages like Ruby and Java, and what we think the future may hold.&lt;br /&gt;&lt;br /&gt;Ruby really impresses me with the eclectic approach it has taken to advanced techniques, without going overboard on multiple approaches like Perl did (Perl's infamous &lt;a href="http://en.wikipedia.org/wiki/There_is_more_than_one_way_to_do_it"&gt;&lt;acronym title="There's More Than One Way To Do It"&gt;TMTOWTDI&lt;/acronym&gt;&lt;/a&gt;). The &lt;a href="http://en.wikipedia.org/wiki/Open_world_assumption"&gt;Open World model&lt;/a&gt; that it brings via dynamic class extension is also refreshing, and a welcome relief to those programming in Java. 
Ruby is also heavily &lt;acronym title="Object Oriented"&gt;OO&lt;/acronym&gt; (much more so than Java - and very much like &lt;a href="http://www.smalltalk.org/"&gt;Smalltalk&lt;/a&gt;) and with lambdas it permits a very functional style of programming, which has also been gaining in popularity recently. (For some reason people often think that &lt;em&gt;functional&lt;/em&gt; and &lt;em&gt;OO&lt;/em&gt; are at odds with each other. This is not so. Most functional languages make significant use of objects. Functional is the opposite of &lt;a href="http://en.wikipedia.org/wiki/Imperative_programming"&gt;&lt;em&gt;imperative&lt;/em&gt;&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;As the engine for Ruby improves (maybe even the &lt;a href="http://jruby.codehaus.org/"&gt;JRuby&lt;/a&gt; engine), and computers become more capable, a language like this may become the standard in just a few years. Ruby has been getting a PR boost from people heavily involved in the &lt;acronym title="eXtreme Programming"&gt;XP&lt;/acronym&gt; development community, and the buzz around RoR continues to grow.&lt;br /&gt;&lt;br /&gt;Java, on the other hand, reminds me of C++ in the late 90's.&lt;br /&gt;&lt;br /&gt;Around that time there was a huge community around C++.  The ISO standard had &lt;a href="http://www.research.att.com/~bs/iso_release.html"&gt;been approved&lt;/a&gt;, and there were several large software houses getting the final elements of the spec into their compilers. Developers were using it for everything from operating systems and embedded devices through to financial applications and GUI front ends. Many of the major GUI libraries were in C++ (&lt;acronym title="Microsoft Foundation Classes"&gt;MFC&lt;/acronym&gt;, &lt;a href="http://trolltech.com/products/qt"&gt;Qt&lt;/a&gt;, among others). Text books written on obscure template constructs were selling well. There was a strong market for good C++ developers.&lt;br /&gt;&lt;br /&gt;By contrast, Java was a niche system. 
It had been publicly released in 1995, and had received quite a bit of criticism. Security flaws were found in the early "sandboxes". The GUIs that could be built were rudimentary. Performance was poor. The most compelling aspect of the system was the "demo" that Sun put together, called an "Applet", which allowed you to insert dynamic content into web pages shown in the browser they built to handle it. Ultimately, other systems were to overtake this one feature that was generating interest.&lt;br /&gt;&lt;br /&gt;But I'm sure that if you're reading this blog, then you know all this stuff.&lt;br /&gt;&lt;br /&gt;My point is that C++ seemed to be in a secure position, while Java occupied a niche. For a while there it looked like Java wouldn't make it, when Microsoft set out to "Embrace and Extend" this system, like they'd done to so many before it.&lt;br /&gt;&lt;br /&gt;And yet, here we are, a decade later. C++ has almost fallen off the map. Sure, it's still important in some areas, but it's largely been supplanted by more modern systems. Java holds pride of place in many systems, from financial, through GUIs, to embedded controllers. In fact, today Java seems to hold the place that C++ held a decade before. This alone should be an indication that Java has crested.&lt;br /&gt;&lt;br /&gt;Using history as my guide, I would say that in 10 years' time Java will be a very minor player, while something else that is a niche today will dominate the market.&lt;br /&gt;&lt;br /&gt;If Java is to have any significance in the future, I would guess it to be in the Virtual Machine (VM). While there are many VMs out there, the Java VM has had a lot of work go into it by clever people, and now that it's being open sourced, a lot more clever people will be able to help. 
It has already been successful enough to have spawned several new systems to run on it, including &lt;a href="http://www.jython.org/Project/index.html"&gt;Jython&lt;/a&gt;, &lt;a href="http://jruby.codehaus.org/"&gt;JRuby&lt;/a&gt; and &lt;a href="http://groovy.codehaus.org/"&gt;Groovy&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;If we look at Ruby as an example of a possible successor, then some may criticize Ruby for not doing OO as well as Smalltalk, or functional programming as well as Erlang and Haskell. Similar criticisms can be made of any of the other languages that are popular today. However, Java was hardly original in anything it did, and yet that didn't prevent it from the success it ultimately enjoyed. (Personally, I see Ruby's threading support to be an Achilles Heel, especially in today's hardware environment, and the direction the chip manufacturers are taking us.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-6577488961340440926?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/6577488961340440926/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=6577488961340440926' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6577488961340440926'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6577488961340440926'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/09/ruby-ive-recently-been-making-my-way.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-1555681698308878175</id><published>2007-08-31T08:03:00.000-05:00</published><updated>2007-08-31T08:12:06.646-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='astrophotography'/><category scheme='http://www.blogger.com/atom/ns#' term='eclipse'/><category scheme='http://www.blogger.com/atom/ns#' term='lunar'/><title type='text'></title><content type='html'>&lt;h3&gt;Lunar Eclipse&lt;/h3&gt; A friend of mine took a &lt;a href="http://web.mac.com/bernard.walsh/Berns_Site/Lunar_Eclipse.html"&gt;series of photos&lt;/a&gt; of the recent lunar eclipse, as seen from Brisbane. They're not &lt;a href="http://www.astropix.com/HTML/SHOW_DIG/SHOW_DIG.HTM"&gt;serious&lt;/a&gt; &lt;a href="http://en.wikipedia.org/wiki/Astrophotography"&gt;astrophotography&lt;/a&gt;, but for someone like me who doesn't have much of a view of the sky any more, they were nice to see. (I miss my telescope).&lt;br /&gt;&lt;br /&gt;I particularly like the bright exposure of the final crescent, and again when it re-emerges. 
It's also a nice example of how even the simplest of telescopic lenses is able to see the Galilean moons.&lt;br /&gt;&lt;br /&gt;I recommend using the "Slideshow" view, to watch the progression of the full-size photos.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-1555681698308878175?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/1555681698308878175/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=1555681698308878175' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/1555681698308878175'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/1555681698308878175'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/08/lunar-eclipse-friend-of-mine-took.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-9048899765614549763</id><published>2007-08-30T12:19:00.000-05:00</published><updated>2007-08-30T14:04:46.657-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='FOAF'/><category scheme='http://www.blogger.com/atom/ns#' term='303'/><category scheme='http://www.blogger.com/atom/ns#' term='GOPHER'/><category scheme='http://www.blogger.com/atom/ns#' term='RDF'/><title type='text'></title><content type='html'>&lt;h3&gt;&lt;acronym title="Friend Of A Friend"&gt;FOAF&lt;/acronym&gt;&lt;/h3&gt; Like &lt;a 
href="http://prototypo.blogspot.com/2007/08/returning-http-303s-for-semantic-web.html"&gt;David's post&lt;/a&gt; yesterday, there have been a number of discussions in recent months about the best practice for &lt;acronym title="Uniform Resource Identifier"&gt;URI&lt;/acronym&gt;s that identify people. I've typically stayed out of the public debates, but have been involved in a number of offline conversations.&lt;br /&gt;&lt;br /&gt;A popular approach to building these URIs is to configure an &lt;acronym title="Hypertext Transfer Protocol"&gt;HTTP&lt;/acronym&gt; server such that when it receives a request for this URI it responds with an HTTP 303 (which means &lt;em&gt;"See Other"&lt;/em&gt;).  This lets the server respond with a document pertaining to that URI, but at the same time informs the client that this document is NOT the resolution of that URI. After all, the resolution of the URI is a person, and one can hardly respond with that (for a start, you'd need all that quantum state, and I haven't yet seen the internet protocols for quantum teleportation).&lt;br /&gt;&lt;br /&gt;Another approach is to simply use the URI of a document describing the person, and tack an anchor onto the end. Typically this anchor is &lt;em&gt;#me&lt;/em&gt;. Like the 303 approach, you can get a retrievable document that can be found from the person's URI, and again that document has a different URI to the URI of the person. The main problem cited with this second approach is that a &lt;em&gt;#me&lt;/em&gt; anchor may exist in the document, meaning that the URI resolves to something other than the person (while I recently learned that URI ambiguity is not strictly illegal, it is a &lt;em&gt;really&lt;/em&gt; bad idea.  After all, we usually rely on these things to identify a unique thing). Other people suggest avoiding possible anchor ambiguity with a query (?key=value on the end of the &lt;acronym title="Uniform Resource Locator"&gt;URL&lt;/acronym&gt;). 
This is much less popular, and I'll let the public arguments against this stand for themselves.&lt;br /&gt;&lt;br /&gt;While looking at the "303" approach the other day, I realized that both &lt;a href="http://www.apple.com/macosx/leopard/features/safari.html"&gt;Safari&lt;/a&gt; and &lt;a href="http://www.mozilla.com/firefox/"&gt;Firefox&lt;/a&gt; respond to a 303 as if it were a redirection. This makes sense in several ways. If a user has asked for "something" by address, then they'd like to see whatever data is associated with that address (as opposed to a response of "not here"). Also, the &lt;a href="http://www.faqs.org/rfcs/rfc2616.html"&gt;HTTP RFC&lt;/a&gt; says that this link &lt;em&gt;should&lt;/em&gt; be followed. Even so, since the resulting document is NOT what was asked for, the user should at least be told that they are looking at the "Next Best Thing", rather than silently being redirected.&lt;br /&gt;&lt;br /&gt;I came to all of this while updating my &lt;a href="http://www.foaf-project.org/"&gt;FOAF&lt;/a&gt; file the other day. While it is possible to describe all of your friends in minute detail, the normal practice is to include just enough information to uniquely identify them (plus a couple of things that are useful to keep locally, like the friend's name). Then when you and your friend's FOAF files are brought into the same store together, all that information will get linked up. This sounds great, until you realize that there is no defined way to find your friends' files. The various FOAF browsers, surfers, etc, that I've tried are all terrible at tracking down people's FOAFs, so whatever they're trying isn't working very well either.&lt;br /&gt;&lt;br /&gt;Whether using anchor suffixes or 303s, the URI that people &lt;em&gt;often&lt;/em&gt; use for themselves just happens to lead you to their own FOAF files. This would be the solution to the problem of finding your friends' files... if your friends happened to use this approach. 
While useful, it can't be relied upon for automatic FOAF file gathering. Because of this, I decided that I should try to put explicit links to all of my friends' FOAF URLs that I know about. This led me to tracking down the files of each of the people in my FOAF file (fortunately not many, as most of the people I know don't have a FOAF file), which had me following various 303 links, like the one to &lt;a href="http://kmi.open.ac.uk/people/tom/"&gt;Tom's URI&lt;/a&gt;.  I was using &lt;code&gt;&lt;a href="http://www.gnu.org/software/wget/"&gt;wget&lt;/a&gt;&lt;/code&gt;, which doesn't follow a "See Other" link automatically, and this was how I discovered that Tom was using a 303. I'm sure if I'd followed his URI with Firefox then I wouldn't have noticed the new address.&lt;br /&gt;&lt;br /&gt;After following the links for all these people, I then wanted some way to describe the location of their FOAF in my own FOAF description of them. After some investigation of the &lt;a href="http://xmlns.com/foaf/spec/"&gt;FOAF namespace&lt;/a&gt;, I discovered that there is no specified way to do this. I suppose this is what led to the de facto standard that people have adopted where their person URI leads you (however indirectly) to their FOAF file. This actually makes perfect sense, as you don't want to invalidate people's links to you just because you chose to move the location of your file, but it's still annoying if you want to be able to link to other people's files. Perhaps everyone should get a &lt;a href="http://www.purl.org/"&gt;&lt;acronym title="Persistent URL"&gt;PURL&lt;/acronym&gt; address&lt;/a&gt;?&lt;br /&gt;&lt;br /&gt;The closest thing I could find to a property describing a FOAF file is the more general &lt;a href="http://xmlns.com/foaf/spec/#term_homepage"&gt;&amp;lt;foaf:homepage&amp;gt;&lt;/a&gt;. This property lets you link a resource (like a person) to some kind of document describing that resource. 
This meets the criteria of what I was looking for, but it is also more general than I was after, as it can also be used to point to non-FOAF pages, like a person's home page (the original intent of this property). All the same, I went with it, since it was a valid thing to do. At least it will help any applications that I write to look at my own file. It's a shame that it's so manual.&lt;br /&gt;&lt;br /&gt;While thinking about how to automate this process, it occurred to me that I could try the following:&lt;ul&gt;&lt;li&gt;If a person's URI ends in an anchor, then strip it off, and follow the URI. If the returned document is &lt;acronym title="Resource Description Framework"&gt;RDF&lt;/acronym&gt; then treat it as FOAF data (identifying RDF as being FOAF or not FOAF is another problem).&lt;/li&gt;&lt;li&gt;Follow the person's URI, and if the result is a 303, then follow that URI. If the resulting document is RDF, then treat it as FOAF.&lt;/li&gt;&lt;li&gt;Iterate through each URI associated with the person (such as &amp;lt;foaf:homepage&amp;gt;) and if any of these return an RDF file then treat it as FOAF.&lt;/li&gt;&lt;li&gt;On each of the &lt;acronym title="HyperText Markup Language"&gt;HTML&lt;/acronym&gt; pages returned from the previous iterations, check for &amp;lt;a href=...&amp;gt; tags to resources that don't end with .html, .jpg, .png, etc. If querying for any of these links returns an RDF file, then treat it as FOAF.&lt;/li&gt;&lt;/ul&gt;Incidentally, Tom's FOAF file would only be picked up via the last step. You have to follow his URI to get a 303, which then leads you to his home page. Then on that page you'll find links to his FOAF file. Frankly, it was just easier to manually add a &amp;lt;foaf:homepage&amp;gt; tag to his file.  :-)&lt;br /&gt;&lt;h3&gt;Anachronism&lt;/h3&gt; During the various conversations I've had (mostly with Tom), it occurred to me that there is an underlying assumption that all URIs will be HTTP. 
This is particularly true for 303 responses, as this is an HTTP response code. However, nothing in RDF suggests that the protocol (or &lt;em&gt;scheme&lt;/em&gt;, according to URI terminology) has to be HTTP. For instance, it isn't unheard of to find resources at the end of an &lt;code&gt;ftp://...&lt;/code&gt; URL. It got me wondering how much it would break existing systems if the URIs used for and in a FOAF file were not HTTP, but something different. If they handle anything else, then it's almost certain to be &lt;acronym title="File Transfer Protocol"&gt;FTP&lt;/acronym&gt; (and possibly even &lt;acronym title="HyperText Transfer Protocol - Secure"&gt;HTTPS&lt;/acronym&gt;), so these weren't going to really test things. No, the protocol I chose was &lt;a href="http://en.wikipedia.org/wiki/Gopher_(protocol)"&gt;Gopher&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://gofish.sourceforge.net/"&gt;GoFish&lt;/a&gt; server managed the details for me here, though it took me a bit of debugging to realize that it wasn't starting when it couldn't find a user/group of "gopher" on my system (Apple didn't retain that account on OS X.  Go figure). Once I'd found that problem, it then took me a few minutes to discover that addresses for text files in the root are prefixed with 00/.  But once that was done I was off and running.&lt;br /&gt;&lt;br /&gt;I'm not a huge fan of running services from my home PC, so I can't say that I'll keep it up for a long time. But at the same time, it gives me some perverse pleasure to hand out my FOAF file as a gopher address.  
:-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-9048899765614549763?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/9048899765614549763/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=9048899765614549763' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/9048899765614549763'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/9048899765614549763'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/08/foaf-like-davids-post-yesterday-there.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-6354093613034511866</id><published>2007-08-05T22:32:00.000-05:00</published><updated>2007-08-05T22:59:37.760-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OS X'/><category scheme='http://www.blogger.com/atom/ns#' term='Java'/><category scheme='http://www.blogger.com/atom/ns#' term='compatibility'/><category scheme='http://www.blogger.com/atom/ns#' term='JDBC'/><title type='text'></title><content type='html'>&lt;h3&gt;Mulgara on Java 6&lt;/h3&gt; I'm nearly at the point where I can announce that Mulgara runs on Java 6 (or JDK 1.6). 
Many of the problems were due to tests looking for an exact output, while the new hashtables iterate over their data in a different order.&lt;br /&gt;&lt;br /&gt;The remaining problems fall into two areas.&lt;br /&gt;&lt;br /&gt;The first seems to be internal to the implementation of the Java 6 libraries. In one case the &lt;acronym title="Hyper Text Transfer Protocol - Secure"&gt;HTTPS&lt;/acronym&gt; connection code is unable to find an internal class that occurs in the same place for both Java 5 and Java 6. In another case, a method whose javadoc explicitly says it does no caching claims that it has "already closed the file" for all but the first time you try to open a &lt;abbr title="Java ARchive"&gt;JAR&lt;/abbr&gt; file using a URL, even though the URL object is newly created for each attempt.&lt;br /&gt;&lt;br /&gt;These problems may involve some browsing of Java source code to properly track down. Fortunately, they only show up in rarely-used resolver modules (I've never used them myself).&lt;br /&gt;&lt;br /&gt;The other problem is that as of Java 6, the &lt;code&gt;&lt;a href="http://java.sun.com/javase/6/docs/api/java/sql/ResultSet.html"&gt;java.sql.ResultSet&lt;/a&gt;&lt;/code&gt; interface has a new series of methods on it. I'd rather that we didn't, but we implement this interface with one of our classes. This is a holdover from a time when we tried to implement &lt;a href="http://java.sun.com/products/jdbc/overview.html"&gt;&lt;acronym title="Java DataBase Connectivity"&gt;JDBC&lt;/acronym&gt;&lt;/a&gt;. While it mostly worked, there was a fundamental disconnect with the metadata requirements of JDBC, and so we eventually abandoned the interface. However, the internal implementation of this interface remains.&lt;br /&gt;&lt;br /&gt;Since we don't use any of the new methods, it is a trivial matter to implement these methods with empty stubs. 
&lt;a href="http://www.eclipse.org/"&gt;Eclipse&lt;/a&gt; does this with a couple of clicks, so it was very easy to do. Once this was done, the project compiled fine, and I could get onto tracking down the failures and errors in the tests, the causes of which I've described above.&lt;br /&gt;&lt;br /&gt;All this was going well, until someone pointed out that there were some issues under Windows. After spending some time getting the &lt;acronym title="Operating System"&gt;OS&lt;/acronym&gt; up and running again, I quickly found that the class implementing &lt;code&gt;ResultSet&lt;/code&gt; was missing some methods. How could this be? It had all the methods on &lt;a href="http://www.apple.com/macosx/tiger/"&gt;OS X&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The simple answer is to run &lt;em&gt;javap&lt;/em&gt; on the &lt;code&gt;java.sql.ResultSet&lt;/code&gt; interface, and compare the results.  Sure enough, on Windows (and Linux) the output contains 14 entries not found in OS X!&lt;br /&gt;&lt;br /&gt;&lt;em&gt;WTF?!?&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;This is easy enough to fix. Implementing the methods with stubs will make it work on Linux and Windows, and will have no effect on OS X. But why the difference? 
This meets my definition of broken.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-6354093613034511866?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/6354093613034511866/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=6354093613034511866' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6354093613034511866'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6354093613034511866'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/08/mulgara-on-java-6-im-nearly-at-point.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-2302050710872048604</id><published>2007-08-04T22:43:00.000-05:00</published><updated>2007-08-05T01:17:40.471-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OWL'/><category scheme='http://www.blogger.com/atom/ns#' term='pragmatic'/><category scheme='http://www.blogger.com/atom/ns#' term='description logic'/><category scheme='http://www.blogger.com/atom/ns#' term='logic'/><category scheme='http://www.blogger.com/atom/ns#' term='InverseFunctionalPredicate'/><title type='text'></title><content type='html'>&lt;h3&gt;OWL Inexpertise&lt;/h3&gt; One of my concerns about &lt;a href="http://talk.talis.com/"&gt;Talking to Talis&lt;/a&gt; yesterday (interesting pun between a verb and a noun there) was in making 
criticisms of some of the people working on &lt;a href="http://www.w3.org/2004/OWL/"&gt;&lt;acronym title="Web Ontology Language"&gt;OWL&lt;/acronym&gt;&lt;/a&gt;, when I'm really not enough of an expert to make such a call.&lt;br /&gt;&lt;br /&gt;I expressed concern over the "logicians" who have designed OWL as being out of touch with the practical concerns of the developers who have to use it. While I still believe there is basis for such an accusation, it is glossing over the very real need for a solid mathematical foundation for OWL, and is also disrespectful to several people in the field whom I respect.&lt;br /&gt;&lt;br /&gt;Knowing and understanding exactly what a language is capable of is vital in its development. Otherwise, it is very easy to introduce features that conflict, or don't make sense in certain applications. Conflicting or vague definitions &lt;a href="http://www.jfsowa.com/pubs/fflogic.htm"&gt;may work in human language&lt;/a&gt;, but are not appropriate when developing systems with the precision that computers require. I have to work hard to get to the necessary understanding of description logic systems, which is why I respect people like Ian Horrocks (or Pat Hayes, or Bijan Parsia, the list goes on...) for whom it all seems to come naturally. Without their work, we wouldn't know exactly what all the consequences of OWL are, meaning that OWL would be useless for reasoning, or describing much of anything at all.&lt;br /&gt;&lt;br /&gt;However, coming from a perspective of "correctness" and "tractability", there is a strong desire in this community to keep everything within the domain of OWL-DL (the computationally tractable variant of OWL). Any constructs which fall outside of OWL-DL (and into OWL Full) are often dismissed. Anyone building systems to perform reasoning on OWL seems to be limiting their domain to OWL-DL or less. 
There appears to be an implicit argument that since calculations for OWL Full cannot be guaranteed to complete, then there is no point in doing them. Use of many constructs is therefore discouraged, on the basis that it is OWL Full syntax.&lt;br /&gt;&lt;br /&gt;While this makes sense from a model-theoretic point of view, pragmatically it doesn't work. Turing machines are not tractable (for instance, one can create an infinite loop), and yet no one has suggested that Turing complete languages are not important! Besides, G&amp;ouml;del taught us that tractability is not all that it's cracked up to be.&lt;br /&gt;&lt;br /&gt;A practical example of an OWL Full construct is in trying to map a set of &lt;acronym title="Relational Database Management System"&gt;RDBMS&lt;/acronym&gt; tables into OWL. It is very common for such tables to be "keyed" on a single field, often a numeric identifier, but sometimes text (like a student number, or &lt;acronym title="Social Security Number"&gt;SSN&lt;/acronym&gt;). Even if these fields are not the primary key of the table, a good mapping into a language like OWL will need to capture this property of the field.&lt;br /&gt;&lt;br /&gt;The appropriate mapping of a key field on a record is to mark that field as a property of type &lt;em&gt;owl:InverseFunctionalProperty&lt;/em&gt;. However, it is not legal to use this property on a number or a string (an &lt;acronym title="Resource Description Framework"&gt;RDF&lt;/acronym&gt; &lt;em&gt;literal&lt;/em&gt;) in anything less than OWL Full.&lt;br /&gt;&lt;br /&gt;There are workarounds to stay within OWL-DL. However, this is one of many common use cases where workarounds are required to stay within the confines of OWL-DL. While it is theoretically possible that using &lt;em&gt;owl:InverseFunctionalProperty&lt;/em&gt; on a "literal" would cause intractability, most use cases will not lead to this. It would seem safe in many systems to permit this - with an understanding of the dangers involved. 
Instead, the unwillingness of the experts to let people work with OWL Full has caused onerous restrictions on many developers. This in turn leads to them simply not bothering with OWL, or going looking for alternatives.&lt;br /&gt;&lt;br /&gt;I can appreciate the need to prevent people from shooting themselves in the foot. On the other hand, preventing someone from taking aim and firing at their feet often leads to other difficulties, encouraging them to just remove the safety altogether.&lt;br /&gt;&lt;br /&gt;It's an argument with two sides. There may well be many logicians out there who agree that a practical approach is required for developers, in order to make OWL more accessible to them. However, I have not seen any concessions made on this point.&lt;hr/&gt;There. It reads much better here than the bald assertion I made for Talis.  :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-2302050710872048604?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/2302050710872048604/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=2302050710872048604' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2302050710872048604'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2302050710872048604'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/08/owl-inexpertise-one-of-my-concerns.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-5038670559351682769</id><published>2007-08-03T20:01:00.000-05:00</published><updated>2007-08-04T08:38:20.708-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OWL'/><category scheme='http://www.blogger.com/atom/ns#' term='Talis'/><category scheme='http://www.blogger.com/atom/ns#' term='Open Source'/><category scheme='http://www.blogger.com/atom/ns#' term='podcast'/><category scheme='http://www.blogger.com/atom/ns#' term='RDF'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><title type='text'></title><content type='html'>&lt;h3&gt;Nodalities&lt;/h3&gt; Last night Luc was determined to keep me up, and he did a pretty good job of it. This happens frequently enough that it shouldn't be worth a mention in this blog, except that today I had agreed to speak with &lt;a href="http://paulmiller.typepad.com/thinking_about_the_future/"&gt;Paul Miller&lt;/a&gt; from &lt;a href="http://www.talis.com/"&gt;Talis&lt;/a&gt;, for the &lt;a href="http://talk.talis.com/"&gt;Talking with Talis&lt;/a&gt; podcast.&lt;br /&gt;&lt;br /&gt;So that I'd be &lt;em&gt;compos mentis&lt;/em&gt;, I resorted to a little more coffee than usual (I typically have one in the morning, and sometimes have one in the afternoon. Today I had two in the morning).  While this had the desired effect of alertness, the ensuing pleonastic babble was a little unfortunate. 
Consequently, I feel like I've embarrassed myself eight ways to Sunday, though Paul has been kind enough to say that I did just fine.&lt;br /&gt;&lt;br /&gt;I was caught a little off guard by questions asking me to describe &lt;a href="http://www.w3.org/TR/rdf-schema/"&gt;&lt;acronym title="Resource Description Framework - Schema"&gt;RDFS&lt;/acronym&gt;&lt;/a&gt; and &lt;a href="http://www.w3.org/2004/OWL/"&gt;&lt;acronym title="Web Ontology Language"&gt;OWL&lt;/acronym&gt;&lt;/a&gt;. Rather than giving a brief description, as I ought to have, I digressed much too far into inane examples. I also said a few things which I thought at the time were &lt;em&gt;kind of&lt;/em&gt; wrong (by which, I mean that I was close, but did not hit the mark), but with the conversation being recorded it felt too awkward to go back and correct myself, particularly when I'd need a little time to think in order to get it right.&lt;br /&gt;&lt;br /&gt;Perhaps more frustratingly, my needless digressions and inaccurate descriptions stole from the time that could have been used to talk about things I believe to be more interesting.  In particular, I'm thinking of the &lt;a href="http://www.opensource.org/"&gt;Open Source&lt;/a&gt; process, and how it relates to a project like Mulgara. &lt;a href="http://talk.talis.com/archives/2007/05/david_wood_talk.html"&gt;David&lt;/a&gt; was able to give a lot of the history behind the project, but as an architect and developer, I have a different perspective that I think also has some value. I also think that open source projects are pivotal in the development of "software as a commodity", which is a notion that deserves serious consideration at the moment. 
I touched on it briefly, but I also ought to have elaborated on how open source commodity software is really needed as the fundamental infrastructure for enabling the semantic web, and hence the need for projects like &lt;a href="http://mulgara.org/"&gt;Mulgara&lt;/a&gt;, &lt;a href="http://www.openrdf.org/"&gt;Sesame&lt;/a&gt; and &lt;a href="http://jena.sourceforge.net/"&gt;Jena&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;But despite my missed opportunity to discuss these things today, I should not consider Talis's podcast to be a forum for expressing my own agenda. If I have a real desire to say these things, then I should be using my &lt;em&gt;own&lt;/em&gt; forum, and that is this blog.&lt;br /&gt;&lt;br /&gt;As always, time is against me, but I'll mention a few of these things, and perhaps I can have time to revisit the others in the coming weeks.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;People&lt;/h3&gt; I should also have mentioned some of the other names involved in Mulgara, from both the past and present. Fortunately, David already mentioned some of them (myself included) but since I'm in my own blog I can go into some more detail. Whether paid or not, these people all gave a great deal of commitment into making this a project with a lot to offer the community. However, since there are so many, I'll just stick to those who have some kind of ongoing connection to the project:&lt;ul&gt;&lt;li&gt;&lt;a href="http://prototypo.blogspot.com/"&gt;David Wood&lt;/a&gt;, who decided we could write Mulgara, made enormous sacrifices to pay for it out of his own pocket... and &lt;em&gt;THEN&lt;/em&gt; made it open source! 
His ongoing contributions to Mulgara are still valuable.&lt;/li&gt;&lt;li&gt;David Makepeace (a mentor early in my career, who I was fortunate to work with again at Tucana) who was the real genius behind the most complex parts of the system.&lt;/li&gt;&lt;li&gt;Tate Jones, who kept everyone focused on what we needed to do.&lt;/li&gt;&lt;li&gt;Simon Raboczi who drove us to use the standards, and ensured the underlying mathematical model was correct.&lt;/li&gt;&lt;li&gt;&lt;a href="http://morenews.blogspot.com/"&gt;Andrew Newman&lt;/a&gt; who knew everything there was to know in the semantic web community, and aside from writing important code, he was the one who wouldn't stop asking when we could overcome the commercial concerns and make the system Open Source.&lt;/li&gt;&lt;li&gt;&lt;a href="http://etymon.blogspot.com/"&gt;Andrae Muys&lt;/a&gt;, the last person to join the inner cabal, and the guy who restructured it all for greater modularity, and correctness. This contribution alone cannot be overstated, but since Tucana closed shop he has remained the most committed developer on the project.&lt;/li&gt;&lt;li&gt;Collectively, the guys at &lt;a href="http://www.topazproject.org/"&gt;Topaz&lt;/a&gt;, who have provided more support than anyone else since Tucana closed.&lt;/li&gt;&lt;/ul&gt;These were just some of the guys who made the project worthwhile, and Tucana a great place to work.&lt;br /&gt;&lt;br /&gt;&lt;small&gt;Sorry to those I didn't mention.&lt;/small&gt;&lt;br /&gt;&lt;br /&gt;Even if I move past Mulgara and into a new type of RDF store, then the open source nature of Mulgara will allow me to bring a lot of that intelligence and know-how forward with me. 
For this reason alone, I think that the Open Source process deserves some discussion.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Architecture&lt;/h3&gt; Back when Mulgara (or &lt;acronym title="Tucana Knowledge Store"&gt;TKS&lt;/acronym&gt;/Kowari) was first developed, it was interesting to see the &lt;a href="http://infolab.stanford.edu/~melnik/rdf/db.html"&gt;schemas&lt;/a&gt; being proposed. Looking at them, there was a clear influence from the underlying &lt;em&gt;Description Logic&lt;/em&gt; that RDF was meant to represent. However, I was not aware of description logics back then, and instead only knew about RDF as a &lt;em&gt;graph&lt;/em&gt;. Incidentally, I only considered &lt;a href="http://www.w3.org/TR/rdf-syntax-grammar/"&gt;RDF/XML&lt;/a&gt; to be a serialization of these graphs (a perspective that has been useful over the years), so a knowledge of this wasn't relevant to the work I was doing (though I did learn it).&lt;br /&gt;&lt;br /&gt;Since I was graph focused, and not logic focused, I didn't perceive predicates as having a distinct difference from subjects or objects (especially since it is possible to make statements where predicates appear as subjects). Also, while "objects" are different from "subjects" by the inclusion of literal values, this seemed to be a minor annotation, rather than a fundamental difference. Consequently, while considering the "triple" of &lt;em&gt;subject&lt;/em&gt;, &lt;em&gt;predicate&lt;/em&gt; and &lt;em&gt;object&lt;/em&gt;, I started wondering at the significance of their ordering. 
This led me to drawing them in a triangle, much as you can see in the &lt;a href="http://www.w3.org/RDF/icons/"&gt;RDF Icon&lt;/a&gt;.&lt;br /&gt;&lt;a href="http://www.w3.org/RDF/" title="RDF Resource Description&lt;br /&gt;Framework"&gt;&lt;img border="0" src="http://www.w3.org/RDF/icons/rdf_w3c_icon.48"/&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This then led naturally to the three way index we used for the first few months of the project, and is still the basis of our thinking today. Of course, in a commercial environment, we were acutely aware of the need for security, and it wasn't long before we introduced a fourth element to the mix. Initially this was supposed to provide individualized security for each statement (a requested feature), but it didn't take long to realize that we wanted to group statements together, and that security should be applied to groups of statements, rather than each individual statement (else security administration would be far too onerous, regardless of &lt;em&gt;who&lt;/em&gt; thought this feature would be a good idea). So the fourth element became our "model", though a little after that the name "graph" became more appropriate.&lt;br /&gt;&lt;br /&gt;Moving to 4 nodes in a statement led to an interesting discussion, where we tried to determine what the minimum number of indices would be, based on our previous 3-way design. This is what led to the 6 indices that Mulgara uses today. I explored this in much more depth some time later in this blog, with a &lt;a href="http://gearon.blogspot.com/2004/08/caveat-emptor-i-normally-try-and-take.html"&gt;couple&lt;/a&gt; of &lt;a href="http://gearon.blogspot.com/2004/08/proof-reading-once-again-its-way-too.html"&gt;entries&lt;/a&gt; back in 2004. In fact, it is this very structure that allows us to do very fast querying regardless of complexity (and if we don't, then it just needs re-work on the query optimizer, and not our data structures). 
More importantly, for my recent purposes (and my thesis), this allows for an interesting re-interpretation of the &lt;a href="http://en.wikipedia.org/wiki/Rete_algorithm"&gt;RETE algorithm&lt;/a&gt; for fast rule evaluation. This then is our basis for performing OWL inferences using rules.&lt;br /&gt;&lt;br /&gt;See? It's all tied together, from the lowest conceptual levels to the highest!&lt;br /&gt;&lt;br /&gt;I freely acknowledge that OWL can imply &lt;a href="http://www.dis.uniroma1.it/~nardi/Didattica/RC/dispense/dlhb-02-2pp.pdf"&gt;much more&lt;/a&gt; than can be determined with rules (actually, that's not strictly true, as an approach using &lt;a href="http://dblp.uni-trier.de/rec/bibtex/conf/pods/BancilhonMSU86"&gt;magic sets&lt;/a&gt; to temporarily generate &lt;em&gt;possible&lt;/em&gt; predicates can also get to the harder answers - but this is &lt;em&gt;not&lt;/em&gt; practical). To get to these other answers, the appropriate mechanism is a tableaux reasoner (such as Pellet). However, from experience I believe that most of what people need (and want) is covered quite well with a complete set of rule-based inferences.  This was reinforced for me when &lt;a href="http://kaon2.semanticweb.org/"&gt;KAON2&lt;/a&gt; came up with exactly the same approach (though I confess to having been influenced by KAON2 before it was released, in that I was already citing papers which formed the basis of that project).&lt;br /&gt;&lt;br /&gt;All the same, while I think rules will work for most situations, having a tableaux reasoner to fall back on will give Mulgara a more complete feature set. Hence, my desire to integrate &lt;a href="http://pellet.owldl.com/"&gt;Pellet&lt;/a&gt; (originally from &lt;a href="http://www.mindswap.org/"&gt;MIND Lab&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;I have yet to look at the internals of Pellet, to see how it stores and accesses its data. 
I'd love to think that I could use an indexing scheme to help it to scale out over large data sets like rules can, but my (limited) knowledge of the tableaux algorithm says that this is not likely.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Open Source&lt;/h3&gt; There are several reasons for liking Pellet over the other available reasoners. First, is that it is under a license that is compatible with Mulgara. Second, is that I saw the ontology debugger demonstrated at MIND Lab a couple of years ago, and have been smitten ever since. Third, the work that Christian Halaschek-Wiener presented at SemTech on &lt;a href="http://www.mindswap.org/~chris/publications/Syndication-OWL-WWW2007.pdf"&gt;OWL Syndication&lt;/a&gt;, convinced me that Pellet is really doing the right thing for scalability on &lt;a href="http://en.wikipedia.org/wiki/Tbox"&gt;TBox&lt;/a&gt; reasoning.&lt;br /&gt;&lt;br /&gt;Finally, Pellet is open source. Yes, that seems to be repeating my first point about licenses, but this time I have a different emphasis. The first point was about legal compatibility of the projects. The point I want to make here is that reasoning like this is something that everyone should be capable of doing, in the same way that storing large amounts of data should be something that everyone can do. Open source projects not only make this possible, but if the software is lacking in some way, then it can be debugged and/or expanded to create something more functional. Then the license point comes back again, allowing third party integration and collaboration. This lets people build something on top of all these open source commodities that is a gestalt of all the components. Open source projects enable this, allowing the community to rapidly create things that are conceptually far beyond the component parts.&lt;br /&gt;&lt;br /&gt;From experience, I've seen the same process in the commercial and open source worlds. In the commercial world, the growth is extraordinarily slow. 
This is because of limited budgets, and limited communication between those who can make these things happen. Ideas are duplicated between companies, and resources are spent trying to make one superior to all the others, sometimes ignoring customers' needs (and often trying to tell the customer what they need).&lt;br /&gt;&lt;br /&gt;In the open source world, everyone is free to borrow from everyone else's ideas (within license compatibility - a possible bugbear), to expand on them, or to use them as a part of a greater whole. Budgets are less of an issue, as projects have a variety of resources available to them, such as contributing sponsors, and hobbyists. Projects focus on the features that clients want, because often the client is contributing to the development team.&lt;br /&gt;&lt;br /&gt;Consider &lt;a href="http://www.microsoft.com/sql/default.mspx"&gt;MS-SQL&lt;/a&gt; and &lt;a href="http://www.oracle.com/"&gt;Oracle&lt;/a&gt;. Both are very powerful databases, which have competed now for many years. In a market dominated by these players, it is inconceivable that a new database could rival them. Yet &lt;a href="http://www.mysql.com/"&gt;MySQL&lt;/a&gt; has been steadily gaining ground for many years, first as a niche product for specialized use, and then more and more as a fully functional server. It still has a way to go to scale up to high end needs as the commercial systems do, but this is a conceivable target for MySQL. In the meantime, I would guess that there are more MySQL installations in the world than almost any other &lt;acronym title="Relational Database Management System"&gt;RDBMS&lt;/acronym&gt; available today. Importantly, it got here in a fraction of the time that it took the commercial players.&lt;br /&gt;&lt;br /&gt;Semantic Web software has a long way to go before reaching the maturity of products like those I just mentioned. We still have to take semantic web software a long way forward. 
But history has shown us that the way forward is to make the infrastructural software as open and collaborative as possible, enabling everyone to develop at a much higher level, without being concerned about the layers below them. Higher level development has happened with many layers of computing in the past (compilers, &lt;acronym title="Object Oriented"&gt;OO&lt;/acronym&gt; toolkits, spreadsheets, databases, web scripting languages for server-side and client-side scripting), and the cheaper and more open the lower levels were, the more rapid and functional the high level development became.&lt;br /&gt;&lt;br /&gt;It is at this top level that we can provide &lt;em&gt;real&lt;/em&gt; value for the world at large, and not just the &lt;acronym title="Information Technology"&gt;IT&lt;/acronym&gt; community. It is this that should be driving our development. We should not be striving to make computing better. Computers are just tools. We should be striving to make the &lt;em&gt;world&lt;/em&gt; better.&lt;br /&gt;&lt;br /&gt;Sounds pretty lofty, I know. Blame the caffeine from this morning wearing off and leaving me feeling light headed. But there has to be some point to it all. This all takes too much work if we indulge in navel gazing by only enabling IT. 
IT has to enable people outside of its own field or else there is no reason for it to exist, and we will all get caught in another .com bubble-burst.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-5038670559351682769?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/5038670559351682769/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=5038670559351682769' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/5038670559351682769'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/5038670559351682769'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/08/nodalities-last-night-luc-was.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-6336881867898030947</id><published>2007-07-30T21:46:00.000-05:00</published><updated>2007-07-30T23:21:21.329-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='EDGE'/><category scheme='http://www.blogger.com/atom/ns#' term='iPhone'/><category scheme='http://www.blogger.com/atom/ns#' term='WiFi'/><category scheme='http://www.blogger.com/atom/ns#' term='3G'/><title type='text'></title><content type='html'>&lt;h3&gt;WiFi in the Balkans&lt;/h3&gt; For some time I've known that there is &lt;a href="http://en.wikipedia.org/wiki/WiFi"&gt;WiFi&lt;/a&gt; all over the place here (such an easy-to-remember name when 
compared to &lt;a href="http://standards.ieee.org/getieee802/802.11.html"&gt;802.11&lt;/a&gt;). However, using an iPhone really brings it home.  Whenever I look something up while out of range of my usual networks (and let's face it, I wouldn't have bought an iPhone if I weren't going to be using it all the time) I get a list of anything from 3 to a dozen networks all within range.  And this doesn't consider the networks that aren't being broadcast (though I don't think many people use this option).  With the exception of the occasional commercial access system (give us your credit card details and we'll let you in), all of these access points are locked.&lt;br /&gt;&lt;br /&gt;Many people have unlimited, or virtually unlimited, high-speed internet access, and they're all attaching these wireless gateways to them.  These access points then overlap tremendously in range, causing interference with each other, and slowing each other down.  This seems like massive duplication to me.  Add to this the fact that most of these networks spend the majority of their time idle, and the pointlessness of the situation is even more frustrating.&lt;br /&gt;&lt;br /&gt;I'm not advocating grid networking (I'm skeptical that the technology has the algorithms to efficiently route the massive amount of data it would need to deal with).  However, it would seem that if the network were configured such that access point owners &lt;em&gt;could&lt;/em&gt; open up their access points and let everyone on, then everyone would benefit.  Some points would get more traffic than others, but overall it should even out.  Coming from this perspective I can understand why so many cities have looked at providing this service, and why Google decided to &lt;a href="https://wifi.google.com/"&gt;roll it out in Mountain View&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Many of the advantages are obvious. 
These include more efficient usage of the airwaves (fewer mid-air packet collisions), ubiquitous urban access, and less infrastructure cost to the community as a whole.&lt;br /&gt;&lt;br /&gt;Unfortunately, I can see the downside too.  It would have to be paid for by the community, rather than the individual.  The total cost would be much less than is being paid now (by each individual, with their own access point and their own internet connection), but the money still has to come from somewhere.  It may not be much in a city budget, but there are always those who don't feel they need to pay extra taxes for services they don't use.  I disagree with this view, but my opinion has no impact on voters, and by extension, my opinions have no influence with politicians.&lt;br /&gt;&lt;br /&gt;I can also see the authorities having a hissy fit over it.  It's trivial to use the internet anonymously (3 coffee shops within a block of here have free WiFi - not to mention &lt;a href="http://en.wikisource.org/wiki/Changing_MAC_addresses"&gt;more&lt;/a&gt; &lt;a href="http://www.google.com/search?q=anonymizer"&gt;technical&lt;/a&gt; &lt;a href="http://en.wikipedia.org/wiki/IP_address_spoofing"&gt;solutions&lt;/a&gt;), but the ignorance or laziness of most people still allows the police, and &lt;a href="http://en.wikipedia.org/wiki/RIAA#Efforts_against_file_sharing"&gt;others&lt;/a&gt;, to track down people of interest.  The fact that this kind of tracking can be circumvented, or even redirected to someone innocent, is of little consequence here.  Those who want to find people engaging in certain activities on the internet would not want to allow universal anonymous access, especially in this age of post-9/11 paranoia.  Authorized (and identified) access is not really feasible in this situation, as it would be nearly impossible to roll out and enforce, and easily circumvented. 
So the easy solution is just to prevent people from having ubiquitous community-sponsored WiFi.&lt;br /&gt;&lt;br /&gt;The legal framework for some of these restrictions is already being set up in &lt;a href="http://www.iht.com/articles/ap/2006/11/11/asia/AS_GEN_Singapore_Internet_Charges.php"&gt;some jurisdictions&lt;/a&gt;.  Many concerns currently center on accessing private (sometimes download-limited) networks, but as these concerns are removed by the promise of ubiquitous "free" access, other reasons will be cited.&lt;br /&gt;&lt;br /&gt;Even more influential than law enforcement (in this country) are the network providers.  These companies have already &lt;a href="http://www.macworld.com/news/2004/11/23/philadelphia/index.php/?lsrc=mcrss-1104"&gt;tried to prevent cities&lt;/a&gt; from rolling out ubiquitous WiFi.  They are obviously scared it will threaten their business model.  I don't really care too much, as they are already being paid a lot of money for underutilized service (all those redundant lines not being used to their capacity), and they &lt;a href="http://en.wikipedia.org/wiki/Network_neutrality"&gt;abuse their market&lt;/a&gt; in many other ways as well.  Like many other large companies, they are unwilling to try to keep up with their market, preferring to shape the market to their own desires.  This works in the short term, but history shows it is doomed to failure in the long run.&lt;br /&gt;&lt;br /&gt;In the meantime, I'll continue to use &lt;a href="http://en.wikipedia.org/wiki/EDGE"&gt;EDGE&lt;/a&gt; on my iPhone, and wish that my previous phone hadn't died before Apple brought out a model that included &lt;a href="http://en.wikipedia.org/wiki/3G"&gt;3G&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;small&gt;&lt;em&gt;I'm struggling to stay awake while I type. 
Does it show?&lt;/em&gt;&lt;/small&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-6336881867898030947?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/6336881867898030947/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=6336881867898030947' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6336881867898030947'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/6336881867898030947'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/07/wifi-in-balkans-for-some-time-ive-known.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-2821142233782373118</id><published>2007-07-29T16:17:00.000-05:00</published><updated>2007-07-29T21:40:37.562-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='indexing'/><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'></title><content type='html'>&lt;h3&gt;Conversations&lt;/h3&gt; I've just spent a week working in Novato, CA.  While I didn't get much programming done, I did manage a few very productive conversations.  
I spent the whole week working with &lt;a href="http://chronos-st.blogspot.com/"&gt;Alan&lt;/a&gt;, who is interested in Mulgara (for various reasons), and on Wednesday night I finally got to meet with &lt;a href="http://fotap.org/~osi/"&gt;Peter&lt;/a&gt; from Radar Networks.&lt;br /&gt;&lt;br /&gt;While I was describing the structure of Mulgara, and particularly the string pool, Alan had a number of astute observations.  First of all, our 64 bit gNodes don't have a full 64 bit address space to work in, since each node ID is multiplied by the size of an index entry to get a file offset.  This isn't an issue in terms of address space (we'd have to be allocating thousands of nodes a second for decades for this to be a problem), but it shows that several bits of each ID go unused, since the address space they would cover is unreachable.  This provides opportunities for storing more type information in the ID.&lt;br /&gt;&lt;br /&gt;This observation on the address space took on new relevance when Peter mentioned that another RDF datastore tries to store as much data as possible directly in the indexes, rather than redirecting everything (except blank nodes) through their local equivalent to the string pool.  This actually makes perfect sense to me, as the Mulgara string pool (really, it's a "URI and literal" pool) is able to fit a lot of data into less than 64 bits already.  We'll only fit in short strings (7 ASCII characters or fewer), but most numeric and date/time data types should fit in here easily.  Even if they can't, we could still map a reduced set of values into this space (how many DateTime values really need more than, say, 58 bits?).&lt;br /&gt;&lt;br /&gt;Indeed, I'm only considering the XA store here.  When the XA2 store starts to come online it will have run-length encoded sets of triples in the blocks. 
This means that we can really stretch the length of what gets encoded in the indexes without diverting to the string pool.&lt;br /&gt;&lt;br /&gt;The only thing that this approach might break would be some marginal uses of the Node Type and DataType resolvers.  These resolvers are usually used to test or filter for node type information, and this function would not be affected.  However, both resolvers are capable of being queried for all the contents of the string pool that meet the type criteria, and this function would be compromised.  I'm not too worried though, as these functions are really only useful for administrative processes (and marginally at that).  The only reason I allowed for this functionality in the first place was that I &lt;em&gt;could&lt;/em&gt;, and that it was the natural semantic extension of the required operations.  Besides, some of the other changes we might make to the string pool could invalidate this style of selection of "all uses of a given type".&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Permanent Strings&lt;/h3&gt; The biggest impediment to load speed at the moment appears to be the string pool.  It's not usually a big deal, but if you start to load a lot of string data (like from &lt;a href="http://labs.systemone.at/wikipedia3"&gt;enwiki&lt;/a&gt;) then it really shows.  Sure, we can cache pretty well (for future lookups), but when you are just writing a lot of string data then this isn't helping.&lt;br /&gt;&lt;br /&gt;The use cases I've seen for this sort of thing usually involve loading a lot of data permanently, or loading it, dropping it, and then re-loading the data in a similar form.  Either way, optimizing for writing/deleting strings seems pretty pointless. 
I'm thinking that we really need an index that lets us write strings quickly, at the expense of not being able to delete them (at least, not while the database is live).&lt;br /&gt;&lt;br /&gt;I'm not too concerned about over-optimizing for this usage pattern, as it can just be written as an alternative string pool, with selection made in the &lt;em&gt;mulgara-config.xml&lt;/em&gt; file.  It may also make more sense to make a write-once pool the default, as it seems that most people would prefer this.&lt;br /&gt;&lt;br /&gt;I've been discussing this write-once pool with a few people now, but it was only while talking with Alan that I realized that almost everything I've proposed is &lt;em&gt;already&lt;/em&gt; how Lucene works.  We already support Lucene as the backend for a resolver, so it wouldn't be a big step to move it up to taking on many of the string pool functions.  Factor in that many of the built-in data types (short, int, character, etc.) can be put into the indexes inline, and the majority of things we need to index in the string pool end up being strings after all, which of course is what Lucene is all about.  Lucene is a great system, and integration of projects like this is one of the big advantages of building open source projects.&lt;br /&gt;&lt;br /&gt;It's been a while since I wrote to the Lucene API. 
I ought to pull out the docs and read them again.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-2821142233782373118?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blogspot.com/feeds/2821142233782373118/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6848574&amp;postID=2821142233782373118' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2821142233782373118'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6848574/posts/default/2821142233782373118'/><link rel='alternate' type='text/html' href='http://gearon.blogspot.com/2007/07/conversations-ive-just-spent-week.html' title=''/><author><name>Quoll</name><uri>http://www.blogger.com/profile/03653112583629043593</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6848574.post-3282061045442744231</id><published>2007-07-21T15:07:00.000-05:00</published><updated>2007-07-21T17:24:26.070-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Java'/><category scheme='http://www.blogger.com/atom/ns#' term='HashSet'/><category scheme='http://www.blogger.com/atom/ns#' term='Mulgara'/><title type='text'></title><content type='html'>&lt;h3&gt;Java 1.6&lt;/h3&gt; Mulgara currently doesn't work with &lt;a href="http://java.sun.com/javase/6/"&gt;Java 6&lt;/a&gt; (also called &lt;abbr title="Java Development Kit"&gt;JDK&lt;/abbr&gt; 1.6). I knew I needed to enable this, but have been putting it off in favor of more important features. 
But this release made it very plain that Mulgara is in an awkward position between two Java releases: namely JDK 1.4 and JDK 1.6.&lt;br /&gt;&lt;br /&gt;The main problem going from Java 1.4 to Java 5 was the change in libraries included in the &lt;abbr title="Java Runtime Environment"&gt;JRE&lt;/abbr&gt;. Someone had taken advantage of the &lt;a href="http://www.apache.org/"&gt;Apache&lt;/a&gt; XML libraries that were in there, but now these had all changed packages, or were no longer available. The other issue was a few incompatibilities in the Unicode implementation - some of which were the reason for introducing the &lt;a href="http://web.mac.com/thegearons/code/CodePoint.java"&gt;CodePoint&lt;/a&gt; class &lt;a href="http://gearon.blogspot.com/2006/01/unicode-in-my-last-post-i-was.html"&gt;last year&lt;/a&gt;, which I published &lt;a href="http://gearon.blogspot.com/2007/07/codepoints-i-was-just-asked-about-code.html"&gt;8 days ago&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Going to Java 6 is relatively easy in comparison.  Sun learnt their lesson about dropping in third-party libraries that users may want to override with more recent versions, so this was not an issue.  The only real change has been to the classes in &lt;a href="http://java.sun.com/javase/6/docs/api/index.html"&gt;java.sql&lt;/a&gt;, in which new interfaces and extensions to old interfaces have prevented a few classes from compiling.  This is easily fixed with some stub methods to fulfill the interfaces, since we know these methods are not being called internally in Mulgara.&lt;br /&gt;&lt;br /&gt;I haven't gone through everything yet (like the failing HTTP tests), but the main problem for Mulgara seems to be in passing the tests, rather than in the code itself. The first test failure was a query that returned the correct data, but out of order. Now any queries whose results are to be tested should have an &lt;code&gt;ORDER BY&lt;/code&gt; directive, so this failure should not have been allowed to happen. 
It's easily resolved, but that made me wonder about the change in ordering, until I got to the next test failure.&lt;br /&gt;&lt;br /&gt;Initially, I was confused by this failure. The "bad output" contained an exception, which is usually a bad sign.  But when I looked at the query which caused the exception I realized that an exception was the correct response. So how could it have passed this test for previous versions of Java? Was it a &lt;a href="http://en.wikipedia.org/wiki/Heisenbug#Schroedinbugs"&gt;Schr&amp;ouml;dinbug&lt;/a&gt;?&lt;br /&gt;&lt;br /&gt;The first step was to see what the initial committer had expected the result to be.  That then led to a "&lt;a href="http://www.worldwidewords.org/topicalwords/tw-doh1.htm"&gt;Doh!&lt;/a&gt;" &lt;a href="http://www.fortunecity.com/lavendar/poitier/135/doh.wav"&gt;moment&lt;/a&gt;. The idea of this test was to specifically test that the result &lt;em&gt;would&lt;/em&gt; generate an exception, and this was the expected output.  Why, then, the failure?&lt;br /&gt;&lt;br /&gt;Upon careful inspection of the expected and actual outputs, I found the difference in the following line from the Java 6 run:&lt;br /&gt;&lt;samp&gt;Caused by: (QueryException) org.mulgara.query.TuplesException: No such variable $k0 in tuples [$v, $p, $s] (class org.mulgara.resolver.AppendAggregateTuples)&lt;/samp&gt;&lt;br /&gt;Whereas the expected line reads:&lt;br /&gt;&lt;samp&gt;Caused by: (QueryException) org.mulgara.query.TuplesException: No such variable $k0 in tuples [$p, $v, $s] (class org.mulgara.resolver.AppendAggregateTuples)&lt;/samp&gt;&lt;br /&gt;I immediately thought that the variables had been re-ordered due to the use of a hash table (where no ordering can be guaranteed).  So I checked the classes which create this message (&lt;code&gt;org.mulgara.resolver.SubqueryAnswer&lt;/code&gt; and &lt;code&gt;org.mulgara.store.tuples.AbstractTuples&lt;/code&gt;). 
In both cases, they use a List, but I was still convinced that the list must have been originally populated by a HashSet.  In fact, this also ties in with the first so-called "failure" that I saw, where data in a query was returned in a different order.  Some queries will use internal structures to maintain their temporary data, and this one must have been using a Set as well.&lt;br /&gt;&lt;br /&gt;To test this, I tried the following code in Java 5 and 6:&lt;pre&gt;&lt;code&gt;import java.util.HashSet;&lt;br /&gt;public class Order {&lt;br /&gt;  public static void main(String[] args) {&lt;br /&gt;    HashSet&amp;lt;String&amp;gt; s = new HashSet&amp;lt;String&amp;gt;();&lt;br /&gt;    s.add("p");&lt;br /&gt;    s.add("v");&lt;br /&gt;    s.add("s");&lt;br /&gt;    // HashSet makes no guarantee about iteration order&lt;br /&gt;    for (String x: s) System.out.print(x + " ");&lt;br /&gt;    System.out.println();&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;In Java 5 the output is: &lt;samp&gt;p s v&lt;/samp&gt;&lt;br /&gt;In Java 6 the output is: &lt;samp&gt;v s p&lt;/samp&gt;&lt;br /&gt;&lt;br /&gt;I checked on this, and the hash codes have not changed.  So it looks like &lt;code&gt;&lt;a href="http://java.sun.com/javase/6/docs/api/java/util/HashMap.html"&gt;HashMap&lt;/a&gt;&lt;/code&gt; (which backs HashSet) has changed its storage technique.&lt;br /&gt;&lt;h3&gt;Fix&lt;/h3&gt; I have two ways I can address my problem.  The first is to find the map where the data gets reorganized, and either use an ordered collection type, or else use a &lt;a href="http://java.sun.com/javase/6/docs/api/java/util/LinkedHashSet.html"&gt;LinkedHashSet&lt;/a&gt;. The latter is still a set, but also guarantees that iteration follows insertion order.  However, this is a patch, and a bad one at that.&lt;br /&gt;&lt;br /&gt;The real solution is to write some more modules for use in &lt;a href="http://jxunit.sourceforge.net/"&gt;JXUnit&lt;/a&gt;, to make it more flexible than the &lt;code&gt;equal&lt;/code&gt;/&lt;code&gt;not-equal&lt;/code&gt; comparisons currently done on strings. 
This seems like a distraction from writing actual functionality, but I think it's needed, despite it taking longer than the "hack" solution.&lt;br /&gt;&lt;br /&gt;Speaking of which... DavidW just asked if I could document the existing resolvers in Mulgara 1.1 (especially the Distributed Resolver).  He didn't disagree with my reasons for releasing without documentation, but he pointed out that not having it written up soon could result in a backlash.  Much as I hate to admit it (since I have other things to do), he's right.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6848574-3282061045442744231?l=gearon.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://gearon.blog
