<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:ent="http://www.purl.org/NET/ENT/1.0/" version="2.0">
  <channel>
    <title>Curiouser and Curiouser! on text-analysis</title>
    <link>http://matt.blogs.it/</link>
    <description>RSS feed for topic text-analysis</description>
    <copyright>Copyright 2006 Matt Mower</copyright>
    <generator>Squib/0.4.0.348</generator>
    <managingEditor>self@mattmower.com</managingEditor>
    <webMaster>self@mattmower.com</webMaster>
    <language>en-gb</language>
    <item>
      <title>Analysing new topics</title>
      <link>http://matt.blogs.it/entries/00000365.html</link>
      <pubDate>Sun, 08 Sep 2002 22:58:49 +0100</pubDate>
      <description>&lt;P&gt;I'm reading a paper [&lt;A href="http://citeseer.nj.nec.com/cache/papers/cs/1119/http:zSzzSzwww.cs.jhu.eduzSz~sheppardzSzcs.605.754zSzpaperszSzpaper3a.pdf/joachims96probabilistic.pdf"&gt;Joachims, 1996&lt;/A&gt;] about text analysis.&amp;nbsp; The idea is to produce a facility within liveTopics for suggesting topics based upon the text entered in a post.&amp;nbsp; At the moment there is a simple facility based upon a word search for existing topics; I'm keen to improve upon that in the future.&lt;/P&gt;
&lt;P&gt;Does anyone know of any good papers on this subject?&lt;/P&gt;</description>
      <guid isPermaLink="true">http://matt.blogs.it/entries/00000365.html</guid>
      <ent:cloud ent:href="http://matt.blogs.it/topics/">
      </ent:cloud>
    </item>
    <item>
      <title>Maybe I didn't need a network to predict this!</title>
      <link>http://matt.blogs.it/entries/00001755.html</link>
      <pubDate>Thu, 24 Mar 2005 18:34:17 +0000</pubDate>
      <description>&lt;p&gt;Neural networks are very cool but not suitable for all applications.  I've basically been stumped by the problems of trying to use a network for identifying interesting weblog posts.&lt;/p&gt;
&lt;p&gt;The first problem is the input problem.  How do you represent arbitrary chunks of text to the network in a meaningful way?&lt;/p&gt;
&lt;p&gt;The problem here is that you have a layer of &lt;em&gt;input neurons&lt;/em&gt;  which form the input to the network.  The inputs are driven by the environment (e.g. the text) and must consist of real values which can be fed to the next (hidden) layer of the network.  If you're measuring temperatures, voltages, water levels, and so on then you are working with real values already.  If you're working in image recognition you tend to have a fixed array of pixels (e.g. 640x480). But what about text?&lt;/p&gt;
&lt;p&gt;I find myself presented with two sub-problems:&lt;ol&gt;&lt;li&gt;&lt;p&gt;The length of the text is not fixed.  Measuring temperature you might have 2 or 3 sensors.  A weblog post could as easily be 5 words or 50,000 words depending upon the author's whim.  In practice you could say &lt;em&gt;"no post will ever be more than 1MB of text"&lt;/em&gt; and treat 1MB as a limit, but that can create its own problems.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Text doesn't neatly correspond to a real-valued input pattern.  How do you represent the specific text as a numeric value?&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/p&gt;
&lt;p&gt;An approach I formulated was to chop the input into keywords (doing appropriate stop-word rejection and so forth) and then feed the most relevant &lt;em&gt;n&lt;/em&gt; keywords to the network as input for that text.  The keywords could be uniquely numbered and then represented as a binary value.  Each bit of the binary value would correspond to an input cell raising a value of 0.0 or 1.0 depending upon whether the bit is set or not.  If we allowed a total of 4,096 possible keywords this can be represented in 12 bits (2^12=4096).  If we used the 10 most relevant keywords for each post, i.e. n=10, then the input layer would, therefore, be composed of 12&lt;em&gt;n&lt;/em&gt; or 120 cells.&lt;/p&gt;
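&lt;p&gt;A minimal sketch of that encoding in Python, assuming keywords have already been extracted and numbered (the function names and the most-significant-bit-first ordering are assumptions made for illustration, not part of any existing implementation):&lt;/p&gt;
&lt;pre&gt;
```python
BITS_PER_KEYWORD = 12   # 2**12 = 4096 possible keyword ids
KEYWORDS_PER_POST = 10  # n, the number of keywords fed to the network


def keyword_bits(keyword_id):
    """Return 12 input values (0.0 or 1.0) for one keyword id, most significant bit first."""
    if keyword_id not in range(2 ** BITS_PER_KEYWORD):
        raise ValueError("keyword id must be between 0 and 4095")
    # format(..., "012b") gives a zero-padded 12 character binary string.
    return [float(bit) for bit in format(keyword_id, f"0{BITS_PER_KEYWORD}b")]


def encode_post(keyword_ids):
    """Concatenate the bit patterns of exactly n keyword ids into one input vector."""
    if len(keyword_ids) != KEYWORDS_PER_POST:
        raise ValueError("expected exactly n keyword ids")
    cells = []
    for keyword_id in keyword_ids:
        cells.extend(keyword_bits(keyword_id))
    return cells


if __name__ == "__main__":
    inputs = encode_post([7, 42, 100, 512, 1024, 2048, 3000, 4000, 4095, 0])
    print(len(inputs))  # one value per input cell
```
&lt;/pre&gt;
&lt;p&gt;The sketch deliberately rejects posts that do not supply exactly n keyword ids, since it is not obvious what the network should be given in that case.&lt;/p&gt;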
&lt;p&gt;However, even having reached this point there are further problems to consider:&lt;ul&gt;&lt;li&gt;&lt;p&gt;In a large post there may well be more than 10 relevant keywords, which means losing relevant information.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;In a small post there may be fewer than 10 relevant keywords.  What input is provided for non-existent keywords?&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The relevance of the keyword to the item is not encoded.  This problem might be solved by adding further input cells for each keyword to express keyword relevance.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The network acts as a &lt;em&gt;feature detector&lt;/em&gt;.  If we consider a set of temperature sensors wired to the inputs, each sensor will be wired to a specific set of inputs; they won't change.  However, a keyword that is detected in one position (i.e. represented in one set of input cells) for one item may be detected in another position for a different item and won't be considered the same feature. This is likely to be problematic.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/p&gt;
&lt;p&gt;Basically, when it comes to free text, &lt;strong&gt;input is a mess&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;And then there are pragmatic problems like:&lt;ul&gt;&lt;li&gt;&lt;p&gt;"How big should the hidden layer be?"&lt;/p&gt;&lt;p&gt;Too small and the network won't learn; too big and the network will &lt;em&gt;mimic&lt;/em&gt; rather than learning to generalize properly, and will also be slow.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;"Should we have one hidden layer or two?"&lt;/p&gt;&lt;p&gt;As a general rule of thumb it appears that in 85% of cases 2 layers works best with 3 layers performing better in the rest.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;"What training rate should be used?"&lt;/p&gt;&lt;p&gt;Set too high and the network &lt;em&gt;bounces&lt;/em&gt; around unable to settle on a solution; set too low and it never converges on useful behaviour.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/p&gt;
&lt;p&gt;Each of these problems is solvable but usually involves trial and error searching by training the network, deciding whether it is effective and, if not, junking it and trying some different combination of parameters.  This is fine if you have a fixed training set with which you train the network repeatedly and then, having found the best parameter combination, just use it from there on.&lt;/p&gt;
&lt;p&gt;However, in my application, the training is done by the user interactively and the training set will be different for each of them, and will change over time.  Although the accumulated training data could be stored and the network retrained in the background I think this could mean that the network would never be &lt;strong&gt;useful&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;All of which is leading me to think (as others have advised me) that a Neural Network may not be the best solution to this particular problem.  I'm interested in whether anyone has any successful experience in this area.  Otherwise I'm probably going to start looking more closely at Bayesian classifiers.&lt;/p&gt;</description>
      <guid isPermaLink="true">http://matt.blogs.it/entries/00001755.html</guid>
      <ent:cloud ent:href="http://matt.blogs.it/topics/">
      </ent:cloud>
    </item>
    <item>
      <title>A cracking good read</title>
      <link>http://matt.blogs.it/entries/00001865.html</link>
      <pubDate>Fri, 17 Jun 2005 23:53:52 +0100</pubDate>
      <description>&lt;p&gt;Today I picked up a copy of &lt;a href="http://www.amazon.co.uk/exec/obidos/ASIN/0262531410/qid=1119048812/sr=8-1/ref=sr_8_xs_ap_i1_xgl/202-9212189-3436646"&gt;Statistical Language Learning&lt;/a&gt; by Eugene Charniak.  It seems like quite a thorough treatment of natural language processing using hidden Markov models and probabilistic context-free grammars.  I'm also halfway through Thomas Passin's &lt;a href="http://www.amazon.co.uk/exec/obidos/ASIN/1932394206/qid%3D1119048989/202-9212189-3436646"&gt;Explorer's Guide to the Semantic Web&lt;/a&gt; which is a very useful guide to the current state of the art in RDF w.r.t. agents, searching, logic, and ontology.  I'm especially motivated by the relevance to the cognitive psychology reading I did on judgement, reasoning, and decision making this year.&lt;/p&gt;
&lt;p&gt;Off the back of that I'm tinkering with a simple agent programming environment, written in Ruby, which I might release if it amounts to anything interesting.&lt;/p&gt;</description>
      <guid isPermaLink="true">http://matt.blogs.it/entries/00001865.html</guid>
      <ent:cloud ent:href="http://matt.blogs.it/topics/">
      </ent:cloud>
    </item>
    <item>
      <title>Corpus</title>
      <link>http://matt.blogs.it/entries/00002298.html</link>
      <pubDate>Sat, 15 Jul 2006 23:58:29 +0100</pubDate>
      <description>&lt;p&gt;If anyone knows of a free modern English corpus that's available online I'd be very grateful to hear about it. Or if they can suggest an alternative way to solve my problem:&lt;/p&gt;

&lt;p&gt;I have a database of terms that are used for making associations from English to a set of ideas. However, because the number of terms relating to each idea may be different, and because the popularity of each term can be different, it is very hard to compare how each idea is expressed.&lt;/p&gt;

&lt;p&gt;It occurred to me that using a corpus I could build a frequency map to highlight popular/unpopular terms and apply discounts appropriately, normalizing (to some extent) the associations expressed and making comparisons more meaningful.&lt;/p&gt;
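
&lt;p&gt;A rough sketch of the idea in Python, assuming the corpus is available as plain text. The function names and the logarithmic discount are hypothetical choices made purely for illustration; any function that shrinks as a term becomes more common would do:&lt;/p&gt;
&lt;pre&gt;
```python
import math
import re
from collections import Counter


def frequency_map(corpus_text):
    """Count how often each lowercased word occurs in the corpus text."""
    return Counter(re.findall(r"[a-z']+", corpus_text.lower()))


def discount(term, frequencies):
    """Hypothetical discount: 1.0 for unseen terms, shrinking as the term gets more common."""
    # A Counter returns 0 for missing terms, so log(1 + 0) = 0 and the discount is 1.0.
    return 1.0 / (1.0 + math.log(1 + frequencies[term]))


if __name__ == "__main__":
    frequencies = frequency_map("the cat and the dog and the bird")
    for term in ("the", "and", "cat", "zebra"):
        print(term, frequencies[term], round(discount(term, frequencies), 3))
```
&lt;/pre&gt;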

&lt;p&gt;I am aware of the Brown corpus, the LOB, and the BNC. However, each of these costs money, which I don't have.&lt;/p&gt;

&lt;p&gt;All ideas gratefully received.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: with &lt;a href="http://www.zedshaw.com/"&gt;Zed Shaw&lt;/a&gt;'s help I found a &lt;a href="http://www.bckelk.uklinux.net/menu.html"&gt;page&lt;/a&gt; that includes a &lt;a href="http://www.bckelk.uklinux.net/words/wlist.zip"&gt;lexicon of about 57,000 English words with relative frequencies&lt;/a&gt;. I'm hopeful that this might offer a way forward.&lt;/p&gt;</description>
      <guid isPermaLink="true">http://matt.blogs.it/entries/00002298.html</guid>
      <ent:cloud ent:href="http://matt.blogs.it/topics/">
      </ent:cloud>
    </item>
  </channel>
</rss>
