Big Data in a Small Page - Cognitive Computing Simplified

Through various media like sight, sound, and touch, people receive and disseminate information in multiple formats - written and spoken language, facial expressions, intonations, body language like hugs, and so on. Each of them is highly nuanced. For example, it is the slightest twitch of an eye that differentiates surprise from sarcasm. This is why it takes a computing system with 100 billion nodes, i.e. our brain, to communicate successfully between two people. This kind of processing power is not yet available outside nature.

So we engineers do what we do best - pick a niche and assume that nothing else is important. If we pick the simplest of these formats, language, we still have to tackle insane complexity. Language is the simplest because it is consensus-based: two people cannot exchange information unless they agree on a language. Hence language has been codified. This is very useful for engineers trying to use structured, quantitative computing science to emulate human behavior.

As complicated as it may look, language is the easiest form of human communication to codify and have computers understand.

The trouble is that this consensus-based codification is highly local - geographically, temporally, professionally, demographically, or along any other axis you cut it. For example, a sentence like "Today is a good day." means completely different things to people living in San Francisco than to people living at Byrd Station, Antarctica. For people living in Washington, DC, it means different things in August than in February. And on the same day in Washington, DC - say, the day of the Supreme Court's decision on Obamacare - it means different things to Republicans than to Democrats.

To the ~80% of Earth's population that does not understand English, the sentence means nothing at all - but let us engineers work our niche magic again and ignore them for a while.

Ergo, to design a cognitive computing solution that can be applied to more than one datapoint, we have to allow for a lax codification. The same word can take multiple parts of speech. The same meaning can be conveyed by different orders of words in a sentence - try "A good day, today is." The same fact can be expressed using different words: "launch", "unveil", "release", "unlock", "announce", "beta test" and "introduce" may all refer to the same event in some contexts. The same fact can also be colored with different sentiments, or related to multiple other facts in multiple ways, depending on the opinions of different authors.
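To make the idea concrete, here is a minimal sketch of what such a lax codification could look like: many surface forms resolve to one canonical fact, and the lookup deliberately returns a set of readings rather than a single answer. The names EVENT_SYNONYMS and candidate_events are illustrative inventions for this post, not Coseer's actual API.

```python
# A minimal sketch of "lax codification": several surface forms map to one
# canonical fact, and a lookup returns every reading a token could carry.

EVENT_SYNONYMS = {
    "product_debut": {"launch", "unveil", "release", "unlock",
                      "announce", "beta test", "introduce"},
}

def candidate_events(token):
    """Return every canonical event the surface form could stand for."""
    return {event for event, forms in EVENT_SYNONYMS.items()
            if token.lower() in forms}

print(candidate_events("Unveil"))  # {'product_debut'}
print(candidate_events("table"))   # set() - no event reading survives
```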

Each datapoint in human language can have multiple meanings. Any attempt to codify language has to be able to handle multiple hypotheses.

Many of the current NLP solutions try to make a good guess as soon as they encounter these tokens. For example, Stanford's NLP kit assigns parts of speech using complex sentence-parsing models built on statistical dictionaries. Another approach, like Coseer's, is to treat every possibility as a hypothesis. As these hypotheses are processed against each other in increasingly constrained steps, most permutations simply perish. For example, in the sentence "A report is due.", we can make two hypotheses about "report" - verb or noun. When we inter-relate these with the hypotheses on the other words in the sentence, the verb hypothesis is voided. Either way, even the simplest decisions require immense computing power.
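Here is a toy version of that pruning step, under obviously simplified assumptions: a hand-written lexicon and a single grammatical constraint stand in for the real models. None of this is Coseer's or Stanford's actual code, just the shape of the idea.

```python
# Every token starts with all plausible part-of-speech hypotheses, and a
# simple grammatical constraint voids the impossible permutations.

LEXICON = {
    "a":      {"DET"},
    "report": {"NOUN", "VERB"},  # "a report" vs. "report for duty"
    "is":     {"VERB"},
    "due":    {"ADJ"},
}

def prune(tokens):
    hypotheses = [set(LEXICON[t]) for t in tokens]
    for i in range(len(tokens) - 1):
        # Constraint: a determiner is never followed directly by a verb.
        if hypotheses[i] == {"DET"}:
            hypotheses[i + 1].discard("VERB")
    return hypotheses

print(prune(["a", "report", "is", "due"]))
# [{'DET'}, {'NOUN'}, {'VERB'}, {'ADJ'}] - the verb reading of "report" perished
```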

Let's make this real. At 9:30 pm Pacific Time on March 6, 2015, we ran a basic Finder module on www.nytimes.com. This module focuses only on the spatial positions of text elements and creates hypotheses about the basic facts encountered in the text. Its output is then used by multiple more sophisticated algorithms to solve real problems. When we printed the JSON for the NY Times, the basic-mode output for a single page ran to 1.7 million lines. It is available here for the brave among you (~4,000 pages, ~9 MB). For our solutions for the Finance and eCommerce sectors, our servers process 1-1.5 million documents every day. In other words, for the simplest of our products we need to process trillions of datapoints.
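The trillions figure follows from simple arithmetic, if we assume (and it is only an assumption) that a typical document generates hypotheses on the order of the NY Times example above:

```python
# Back-of-envelope arithmetic for the scale claimed above. Both inputs are
# rough assumptions taken from the figures in this post, not measurements.

hypotheses_per_document = 1_000_000   # order of the 1.7M-line NYT output
documents_per_day       = 1_500_000   # upper end of the 1-1.5M figure

datapoints_per_day = hypotheses_per_document * documents_per_day
print(f"{datapoints_per_day:,}")      # 1,500,000,000,000 - 1.5 trillion a day
```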

A single page of written text can generate a million such hypotheses. Processing them takes incredible computing power.

With recent developments in Big Data technology, it has become possible to process this kind of information on commoditized hardware, so we can now attempt to emulate our brains and get computers to do things that have been exclusively human until now. Managing this ambiguity, with its explosive data permutations, within latencies that make sense for real-time applications is what cognitive computing is all about.
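One reason commodity hardware makes this tractable: the hypothesis sets for different documents are independent, so the pruning step parallelizes trivially. Here is a minimal sketch using only Python's standard library - a real deployment would use a distributed framework, and the placeholder function below is not Coseer's algorithm.

```python
from multiprocessing import Pool

def count_surviving_hypotheses(document):
    """Placeholder for the real pruning work on one document."""
    return len(document.split())

if __name__ == "__main__":
    documents = ["A report is due.", "Today is a good day."] * 1000
    # Each document is pruned independently, so work fans out across cores
    # here exactly as it would fan out across machines in a cluster.
    with Pool() as pool:
        survivors = pool.map(count_surviving_hypotheses, documents)
    print(sum(survivors), "hypotheses survived across", len(documents), "documents")
```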

To know more, follow us on social media.

Welcome.