Building a personal version of the IBM Watson question-answering system

It is possible to build your own Watson Jr. question-answering system: something less fancy, less sophisticated, and scaled down for personal or business workgroup use.

As with any Do-It-Yourself (DIY) project, I am not responsible if you are not happy with your Watson Jr. I am basing the approach on what I have read in publicly available sources, and on my work with Linux, supercomputers, XIV, and SONAS. For our purposes, Watson Jr. is based entirely on commodity hardware, open source software, and publicly available sources of information. Your Watson Jr. will certainly not be as fast or as clever as the IBM Watson you saw on television.



Step 1: Buy the Hardware

Supercomputers are built as a cluster of identical compute servers lashed together by a network. You will be installing Linux on them, so if you can avoid paying extra for Microsoft Windows, that will save you some money.

Here is your shopping list:
* Three x86 hosts, with the following:
o 64-bit quad-core processor, either Intel-VT or AMD-V capable,
o 8GB of DRAM, or larger
o 300GB of hard disk, or larger
o CD or DVD Read/Write drive
o 1GbE Ethernet
* Computer Monitor, mouse and keyboard
* Ethernet 1GbE 4-port hub, and appropriate RJ45 cables
* Surge protector and Power strip
* Local Console Monitor (LCM) 4-port switch (formerly known as a KVM switch) and appropriate cables. This is optional, but will make it easier during the development. Once your Watson Jr. is operational, you will only need the monitor and keyboard attached to one machine. The other two machines can remain “headless” servers.

Step 2: Establish Networking
IBM Watson used Juniper switches running at 10Gbps Ethernet (10GbE) speeds, but was not connected to the Internet while playing Jeopardy! Instead, these Ethernet links were for the POWER7 servers to talk to each other, and to access files over the Network File System (NFS) protocol to the internal customized SONAS storage I/O nodes.

Watson Jr. will be able to run “disconnected from the Internet” as well. However, you will need Internet access to download the code and information sources. For our purposes, 1GbE should be sufficient. Connect your Ethernet hub to your DSL or Cable modem. Connect all three hosts to the Ethernet hub. Connect your keyboard, video monitor and mouse to the LCM, and connect the LCM to the three hosts.

Step 3: Install Linux and Middleware
For this project, you can use any modern Linux distribution that supports KVM. IBM Watson used Novell SUSE Linux Enterprise Server [SLES 11]. Alternatively, I can also recommend either Red Hat Enterprise Linux [RHEL 6] or Canonical [Ubuntu v10]. Download the 64-bit “ISO” files for the versions you need, and burn them to CDs. Each distribution of Linux comes in different orientations:

* Graphical User Interface (GUI) oriented, often referred to as “Desktop” or “HPC-Head”
* Command Line Interface (CLI) oriented, often referred to as “Server” or “HPC-Compute”
* Guest OS oriented, to run in a Hypervisor such as KVM, Xen, or VMware. Novell calls theirs “Just Enough Operating System” [JeOS].

For Watson Jr., I have chosen a [multitier architecture], sometimes referred to as an “n-tier” or “client/server” architecture.

Host 1 – Presentation Server

For the Human-Computer Interface [HCI], IBM Watson received categories and clues as text files via TCP/IP, had a [beautiful avatar] representing a planet with 42 circles streaking across in orbit, and used a text-to-speech synthesizer to respond in a computerized voice. Your Watson Jr. will not be this sophisticated. Instead, we will have a simple text-based Query Panel web interface accessible from a browser like Mozilla Firefox.

Host 1 will be your Presentation Server, the connection to your keyboard, video monitor and mouse. Install the “Desktop” or “HPC Head Node” version of Linux. Install [Apache Web Server and Tomcat] to run the Query Panel. Host 1 will also be your “programming” host. Install the [Java SDK] and the [Eclipse IDE for Java Developers]. If you always wanted to learn Java, now is your chance. There are plenty of books on Java if that is not the language you normally write code in.

While three little systems don’t constitute an “Extreme Cloud” environment, you might like to try out the “Extreme Cloud Administration Toolkit”, called [xCAT], which was used to manage the many servers in IBM Watson.

Host 2 – Business Logic Server

Host 2 will be driving most of the “thinking”. Install the “Server” or “HPC Compute Node” version of Linux. This will be running a server virtualization Hypervisor. I recommend KVM, but you can probably run Xen or VMware instead if you like.

Host 3 – File and Database Server

Host 3 will hold your information sources, indices, and databases. Install the “Server” or “HPC Compute Node” version of Linux. This will be your NFS server, which might come up as a question during the installation process.

Technically, you could run different Linux distributions on different machines. For example, you could run “Ubuntu Desktop” for host 1, “RHEL 6 Server” for host 2, and “SLES 11” for host 3. In general, Red Hat tries to be the best “Server” platform, and Novell tries to make SLES be the best “Guest OS”.

My advice is to pick a single distribution and use it for everything: Desktop, Server, and Guest OS. If you are new to Linux, choose Ubuntu. There are plenty of books on Linux in general, and Ubuntu in particular, and Ubuntu has a helpful community of volunteers to answer your questions.

Step 4: Download Information Sources

You will need some documents for Watson Jr. to process.

IBM Watson used a modified SONAS to provide a highly-available clustered NFS server. For Watson Jr., we won’t need that level of sophistication. Configure Host 3 as the NFS server, and Hosts 1 and 2 as NFS clients. See the [Linux-NFS-HOWTO] for details. To optimize performance, Host 3 will hold the “official master copy”, but we will use a Linux utility called rsync to copy the information sources over to Hosts 1 and 2. This allows the task engines on those hosts to access local disk resources during question-answer processing.
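The copy step could be scripted along these lines. This is only a sketch under my own assumptions: the host names `host1` and `host2` and the directory `/srv/watsonjr/sources/` are placeholders I made up, and it assumes rsync over SSH already works between the machines. The `-a` (archive) and `--delete` options are standard rsync flags.

```python
import subprocess

# Hypothetical names: adjust to your own hosts and directory layout.
MASTER_DIR = "/srv/watsonjr/sources/"
REPLICA_HOSTS = ["host1", "host2"]


def build_rsync_commands(source_dir=MASTER_DIR, hosts=REPLICA_HOSTS):
    """Return one rsync command (as an argument list) per replica host."""
    commands = []
    for host in hosts:
        # -a preserves permissions and timestamps; --delete removes files
        # on the replica that no longer exist in the master copy.
        commands.append(["rsync", "-a", "--delete", source_dir, f"{host}:{source_dir}"])
    return commands


def push_sources(dry_run=True):
    """Print each command, and run it only when dry_run is False."""
    for command in build_rsync_commands():
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    push_sources(dry_run=True)
```

Run it on Host 3 with `dry_run=True` first to see what would be copied, then switch to `dry_run=False` once the paths look right.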

We will also need a relational database. You won’t need a high-powered IBM DB2. Watson Jr. can do fine with something like [Apache Derby] which is the open source version of IBM CloudScape from its Informix acquisition. Set up Host 3 as the Derby Network Server, and Hosts 1 and 2 as Derby Network Clients. For more about structured content in relational databases, see my post [IBM Watson – Business Intelligence, Data Retrieval and Text Mining].

Linux includes a utility called wget which allows you to download content from the Internet to your system. What documents you decide to download is up to you, based on what types of questions you want answered. For example, if you like Literature, check out the vast resources at [FullBooks.com]. You can automate the download by writing a shell script or program that invokes wget for each place you want to fetch data from. Rename the downloaded files to something unique, as often they are just “index.html”. For more on the wget utility, see [IBM Developerworks].
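Here is one way the download-and-rename step might look. It is a sketch, not a finished crawler: the naming scheme (host name, a short hash of the URL, then the original file name) is my own choice, and the function names are made up for illustration. The only wget option used is `-O`, which writes the download to the named file.

```python
import hashlib
import subprocess
from urllib.parse import urlparse


def unique_filename(url):
    """Derive a readable, collision-resistant local file name from a URL."""
    parsed = urlparse(url)
    # Keep the last path component (often just "index.html") for readability.
    basename = parsed.path.rstrip("/").split("/")[-1] or "index.html"
    # A short hash of the full URL keeps two different "index.html" files apart.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    return f"{parsed.netloc}_{digest}_{basename}"


def build_wget_command(url, target_dir="."):
    """Return the wget argument list that saves url under a unique name."""
    return ["wget", "-O", f"{target_dir}/{unique_filename(url)}", url]


def fetch_all(urls, target_dir=".", dry_run=True):
    """Print each wget command, and run it only when dry_run is False."""
    for url in urls:
        command = build_wget_command(url, target_dir)
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Because the hash is computed from the whole URL, two pages that both end in “index.html” land in different files, and re-running the script overwrites the same file rather than creating a new copy.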

Step 5: The Query Panel – Parsing the Question

Next, we need to parse the question and get some sense of what is being asked for. For this we will use [OpenNLP] for Natural Language Processing, and [OpenCyc] for the conceptual logic reasoning. See Doug Lenat’s 75-minute video presentation [Computers versus Common Sense]. To learn more, see the [CYC 101 Tutorial]. Unlike Jeopardy!, where Alex Trebek provides the answer and contestants must respond with the correct question, we will do normal Question-and-Answer processing. To keep things simple, we will limit questions to the following formats:
* Who is …?
* Where is …?
* When did … happen?
* What is …?
* Which …?

Host 1 will have a simple Query Panel web interface. At the top is a place to enter your question and a “submit” button, and at the bottom is a place for the answer to be shown. When “submit” is pressed, this will pass the question to “main.jsp”, the Java servlet program that will start the question-answering analysis. Limiting the types of questions that can be posed will simplify hypothesis generation and reduce the candidate set and evidence evaluation, allowing the analytics processing to complete in reasonable time.
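To show how restricting the question formats simplifies things, here is a rough sketch of a front-end check that sorts a question into one of the five supported formats before any heavier analysis runs. It uses plain regular expressions rather than OpenNLP or OpenCyc, and the answer-type labels (`PERSON`, `PLACE`, and so on) are names I invented for illustration.

```python
import re

# One pattern per supported question format. The answer-type labels are
# illustrative names, not anything defined by OpenNLP or OpenCyc.
QUESTION_PATTERNS = [
    (re.compile(r"^who\s+is\s+(?P<topic>.+?)\?*$", re.IGNORECASE), "PERSON"),
    (re.compile(r"^where\s+is\s+(?P<topic>.+?)\?*$", re.IGNORECASE), "PLACE"),
    (re.compile(r"^when\s+did\s+(?P<topic>.+?)\s+happen\?*$", re.IGNORECASE), "DATE"),
    (re.compile(r"^what\s+is\s+(?P<topic>.+?)\?*$", re.IGNORECASE), "THING"),
    (re.compile(r"^which\s+(?P<topic>.+?)\?*$", re.IGNORECASE), "CHOICE"),
]


def classify_question(question):
    """Return (answer_type, topic) for a supported question, else None."""
    text = question.strip()
    for pattern, answer_type in QUESTION_PATTERNS:
        match = pattern.match(text)
        if match:
            return answer_type, match.group("topic").strip()
    return None
```

A question that returns `None` can be rejected right on the Query Panel, so the expensive processing only ever sees questions whose expected answer type is already known.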

Step 6: Unstructured Information Management Architecture

The “heart and soul” of IBM Watson is Unstructured Information Management Architecture [UIMA]. IBM developed this, then made it available to the world as open source. It is maintained by the [Apache Software Foundation], and overseen by the Organization for the Advancement of Structured Information Standards [OASIS].

(Figure: UIMA-bridge)

Basically, UIMA lets you scan unstructured documents, glean the important points, and put them into a database for later retrieval. In the graph above, DBs means ‘databases’ and KBs means ‘knowledge bases’. See the 4-minute YouTube video of [IBM Content Analytics], the commercial version of UIMA.

(Figure: UIMA-Collection-Processing)

Starting from the left, the Collection Reader selects each document to process, and creates an empty Common Analysis Structure (CAS) which serves as a standardized container for information. This CAS is passed to Analysis Engines, each composed of one or more Annotators which analyze the text and fill the CAS with the information found. The CAS is then passed to CAS Consumers which do something with the information found, such as adding an entry to a database, updating an index, or updating a vote count.
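To make the flow concrete, here is a toy imitation of that pipeline. It is emphatically not the UIMA API: the real framework is Java, and every class and function name below is my own stand-in. The two annotators use deliberately naive regular expressions, just to show how a CAS picks up annotations and how a consumer turns them into vote counts.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Toy stand-in for a Common Analysis Structure: text plus annotations."""
    document_text: str
    annotations: list = field(default_factory=list)  # (type, begin, end, covered text)


def collection_reader(documents):
    """Yield one empty CAS per document."""
    for text in documents:
        yield Cas(document_text=text)


def year_annotator(cas):
    """Annotate four-digit years (a deliberately naive pattern)."""
    for match in re.finditer(r"\b(1[0-9]{3}|20[0-9]{2})\b", cas.document_text):
        cas.annotations.append(("Year", match.start(), match.end(), match.group()))


def capitalized_name_annotator(cas):
    """Annotate runs of two or more capitalized words as candidate names."""
    for match in re.finditer(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", cas.document_text):
        cas.annotations.append(("Name", match.start(), match.end(), match.group()))


def analysis_engine(cas, annotators):
    """Run each annotator over the CAS in turn and return the filled CAS."""
    for annotate in annotators:
        annotate(cas)
    return cas


def vote_consumer(cases):
    """CAS consumer: tally how often each annotated string was seen."""
    votes = Counter()
    for cas in cases:
        for annotation_type, _begin, _end, covered in cas.annotations:
            votes[(annotation_type, covered)] += 1
    return votes


def run_pipeline(documents):
    annotators = [year_annotator, capitalized_name_annotator]
    processed = (analysis_engine(cas, annotators) for cas in collection_reader(documents))
    return vote_consumer(processed)
```

Calling `run_pipeline` on a list of document strings returns a `Counter` keyed by annotation type and text, which plays the role of the vote count mentioned above.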

(Note: This point requires what we in the industry call a small matter of programming, or [SMOP]. If you’ve always wanted to learn Java programming, XML, and JDBC, you will get to do plenty here.)

If you are not familiar with UIMA, consider this [UIMA Tutorial].

Step 7: Parallel Processing

People have asked me why IBM Watson is so big. Did we really need 2,880 cores of processing power? As a supercomputer, the 80 TeraFLOPs of IBM Watson would place it only in 94th place on the [Top 500 Supercomputers]. While IBM Watson may be the [Smartest Machine on Earth], the most powerful supercomputer at this time is the Tianhe-1A with more than 186,000 cores, capable of 2,566 TeraFLOPs.

To determine how big IBM Watson needed to be, the IBM Research team ran the DeepQA algorithm on a single core. It took 2 hours to answer a single Jeopardy question! Let’s look at the performance data:


Element                Number of cores Time to answer one Jeopardy question
Single core                       1      2 hours
Single IBM Power750 server       32      less than 4 minutes
Single rack (10 servers)        320      less than 30 seconds
IBM Watson (90 servers)       2,880      less than 3 seconds
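A little arithmetic on these figures shows how well the work scales. The sketch below treats each “less than” time as an upper bound, so the speedups it computes are lower bounds: at least 30x on 32 cores, 240x on 320 cores, and 2,400x on 2,880 cores. The efficiency column (speedup divided by cores) is my own addition and is only as precise as those rounded bounds.

```python
# (label, cores, upper bound on seconds per question), taken from the table above
MEASUREMENTS = [
    ("Single core", 1, 2 * 60 * 60),
    ("Single Power750 server", 32, 4 * 60),
    ("Single rack (10 servers)", 320, 30),
    ("IBM Watson (90 servers)", 2880, 3),
]


def speedup_table(measurements=MEASUREMENTS):
    """Return (label, cores, speedup, efficiency) relative to the first row."""
    baseline_seconds = measurements[0][2]
    rows = []
    for label, cores, seconds in measurements:
        speedup = baseline_seconds / seconds   # how many times faster than one core
        efficiency = speedup / cores           # 1.0 would be perfect scaling
        rows.append((label, cores, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for label, cores, speedup, efficiency in speedup_table():
        print(f"{label:26s} {cores:5d} cores  speedup >= {speedup:6.0f}x  efficiency >= {efficiency:.0%}")
```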

The old adage applies, [many hands make for light work]. The idea is to divide-and-conquer. For example, if you wanted to find a particular street address in the Manhattan phone book, you could dispatch fifty pages to each friend and they could all scan those pages at the same time. This is known as “Parallel Processing” and is how supercomputers are able to work so well. However, not all algorithms lend themselves well to parallel processing, and the phrase [nine women can’t have a baby in one month] is often used to remind us of this.
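The phone book example translates almost directly into code. In this sketch the “friends” are worker threads from Python’s standard `concurrent.futures` module, and the phone book is just a list of (page number, page text) pairs; both are my own simplifications. On a real cluster the chunks would go to separate cores or machines rather than threads in one process.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(pages, pages_per_friend=50):
    """Cut the phone book into consecutive chunks of pages_per_friend pages."""
    return [pages[i:i + pages_per_friend] for i in range(0, len(pages), pages_per_friend)]


def scan_chunk(chunk, wanted_address):
    """One friend's job: return the page numbers in this chunk that list the address."""
    return [page_number for page_number, text in chunk if wanted_address in text]


def parallel_search(pages, wanted_address, pages_per_friend=50):
    """Hand each chunk to a worker, then merge the partial results in order."""
    chunks = split_into_chunks(pages, pages_per_friend)
    workers = min(8, max(1, len(chunks)))  # arbitrary cap on the number of "friends"
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(lambda chunk: scan_chunk(chunk, wanted_address), chunks))
    hits = []
    for partial in partial_results:
        hits.extend(partial)
    return hits
```

The search parallelizes cleanly because no friend needs anything from another friend’s pages; the only shared step is merging the results at the end.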

Fortunately, UIMA is designed for parallel processing. You need to install UIMA-AS for Asynchronous Scale-out processing, an add-on to the base UIMA Java framework, supporting a very flexible scale-out capability based on JMS (Java Message Service) and ActiveMQ. We will also need Apache Hadoop, an open source framework used by the Yahoo search engine. Hadoop has a “MapReduce” engine that allows you to divide the work, dispatch pieces to different “task engines”, and then combine the results afterwards.

Host 2 will run Hadoop and drive the MapReduce process. Plan to have three KVM guests on Host 1, four on Host 2, and three on Host 3. That means you have 10 task engines to work with. These task engines can be deployed for Collection Readers, Analysis Engines, and CAS Consumers. When all processing is done, the resulting votes will be tabulated and the top answer displayed on the Query Panel on Host 1.
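The divide, dispatch, and combine pattern can be sketched in a few lines. This is not Hadoop code and does not use any Hadoop API: it runs in a single process, the function names are mine, and the “evidence” is just a list of text passages checked for candidate answers by substring match. It only illustrates how per-engine vote tallies are produced in the map step and merged in the reduce step.

```python
from collections import Counter
from functools import reduce


def map_phase(evidence_chunk, candidate_answers):
    """One task engine: count how many passages in its chunk mention each candidate."""
    votes = Counter()
    for passage in evidence_chunk:
        lowered = passage.lower()
        for candidate in candidate_answers:
            if candidate.lower() in lowered:
                votes[candidate] += 1
    return votes


def reduce_phase(partial_votes):
    """Merge the per-engine tallies into one overall tally."""
    return reduce(lambda total, part: total + part, partial_votes, Counter())


def top_answer(passages, candidate_answers, task_engines=10):
    """Split the passages across task engines, tally votes, and pick the winner."""
    # Deal the passages out round-robin, one share per task engine.
    chunks = [passages[i::task_engines] for i in range(task_engines)]
    # In a real deployment each map_phase call would run on its own task engine.
    partial_votes = [map_phase(chunk, candidate_answers) for chunk in chunks]
    totals = reduce_phase(partial_votes)
    return totals.most_common(1)[0][0] if totals else None
```

With 10 task engines, each `map_phase` call sees roughly a tenth of the passages, and only the small vote tallies need to travel back to Host 2 for the final count.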

Step 8: Testing

To simplify testing, use a batch processing approach. Rather than entering questions by hand in the Query Panel, generate a long list of questions in a file, and submit it for processing. This will allow you to fine-tune the environment, optimize for performance, and validate the answers returned.

There you have it. By the time you get your Watson Jr. fully operational, you will have learned a lot of useful skills, including Linux administration, Ethernet networking, NFS file system configuration, Java programming, UIMA text mining analysis, and MapReduce parallel processing. Hopefully, you will also gain an appreciation for how difficult it was for the IBM Research team to accomplish what they did for the Grand Challenge on Jeopardy! Not surprisingly, IBM Watson is making IBM [as sexy to work for as Apple, Google or Facebook], all of which started their business in a garage or a basement with a system as small as Watson Jr.
