Machine Learning or Machine Training?

There is a fundamental issue in statistics-based “learning” systems:
i) they regard new facts as outliers in their known models
ii) they regard new facts not by themselves but by similarities with their known models
 
This is not a problem, of course, when we want them to perform well and predictably in very specific scenarios which we already know by hand and which we believe will not change over time.
A face will still be a face, a road a road, a written letter a letter etc.
To this “learning” paradigm, “new different data” and “statistical outliers” are just the same thing. 
The problem here is precisely with Learning.
The statistical models do not learn with data — they are modeled or trained by data. And that only happens when we are building the model, before actually using it.
So, the first time something new happens, they will deal with it the same way they always have. Just as a cat cannot learn what a human is, he can only interact with humans according to his notion of “others”, which he will put into a pre-defined category in his mind such as “prey”, “provider”, etc.
The point is, we could train a cat to behave differently with humans than with everything else, but he would just be trained, not taught. He still could not learn anything new about people beyond what he already knows.
 
About 10 years ago, I tried to solve a specific and apparently simple problem related to this.
I wanted to understand better how we humans can acquire so much information and seamlessly integrate it into our own mental processes, without being bogged down by the exponential complexity of the relationships that our own behaviours required us to learn.
I tried to understand how simple questions could be answered with simple responses as soon as new data was acquired. I could not succeed. No matter how much I tried, the explosion of relationship combinations broke every heuristic, every time. I quit.
Remember, statistical models can build complex relationships or mappings between datasets, but their answers are probabilistic, and even the simplest questions will fail if they entail something not previously built into the model.

The explosion of relationships I mentioned happens because I am not restricting my data search to specific or preferential relationships. I wanted no heuristic guiding the search besides the data itself.
This assertion in an article about knowledge graphs from ForgeAI clearly states what I wanted to avoid, because it would condition every end result:
“Because our edges are interpreted as probabilities, it is possible to set a probability cutoff beyond which we are not interested in graph connections. This allows us to only consider graph algorithms over highly restricted subsets of the graph, which provides us with major algorithmic improvements.”
These probabilities, the cutoff, not knowing for sure whether excluding a subset would exclude a more meaningful result: this is precisely the problem I wanted to avoid. Basically, we would simplify the procedure by excluding “less relevant” information. This makes the process nondeterministic, which is a problem when you have to explain your results with anything other than a probability. At most, I wanted any heuristic to be applied on top of the general search method, to ensure I was not biasing the results beforehand.
I am not sure whether you (the reader) understand the scope of trouble we get ourselves into when we start doing cutoffs. Cutoffs are based on a judgement (e.g. some mathematical function someone chose), which in turn is based on relationship probabilities, which in turn are based on a heuristic, which in turn is based on reasoning made ultimately by someone, no matter how data-informed it is. On top of that, if those probabilities are static, you can be sure that at some point in time you will be making the wrong assessments of the data. If they are not fixed, then you have to build another machine to govern those probabilities over time.
All this happens, and you are never quite sure whether the best answer of all for any given query lies just two nodes beyond one of your cutoffs. You do not have exactly a major algorithmic improvement but a major algorithmic performance improvement, at the expense of the precision of the result.
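To make the cutoff problem concrete, here is a toy sketch (my own, not ForgeAI's code; the graph, names and probabilities are invented for illustration). A breadth-first search over a weighted relation graph, given a probability cutoff, prunes the low-probability edge that happens to start the shortest path, and returns a longer answer instead:

```python
# Toy illustration only: the graph, names and probabilities are invented,
# and this is not any real system's code. It shows how a probability
# cutoff can prune away the very edge that starts the best answer.
from collections import deque

# Hypothetical relation graph: node -> [(neighbour, edge probability)].
graph = {
    "A": [("B", 0.9), ("C", 0.2)],  # A->C is a "less relevant" edge
    "B": [("D", 0.9)],
    "C": [("E", 0.9)],
    "D": [("E", 0.9)],
    "E": [],
}

def shortest_path(src, dst, cutoff=0.0):
    """Breadth-first search for the shortest path, ignoring edges
    whose probability falls below `cutoff`."""
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt, prob in graph[path[-1]]:
            if prob >= cutoff and nxt not in path:
                queue.append(path + [nxt])
    return None

print(shortest_path("A", "E"))              # ['A', 'C', 'E']  (2 hops)
print(shortest_path("A", "E", cutoff=0.5))  # ['A', 'B', 'D', 'E']  (3 hops)
```

With the cutoff in place the search is faster on big graphs, but the two-hop answer through C is silently lost, and nothing in the result tells you it ever existed.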

Anyway, a few years ago, I got back to it. In the meantime I had learned a bit more about mathematics, and I already knew a good deal of neurophysiology and anatomy. I also had nothing else really interesting to do.
But still nothing to do with what we call machine “learning”. I never liked statistics, because it relies on downgrading each and every individual case, which for me is the first thing I would do if I had decided to quit learning anything at all.
As for mathematics, it gave me a framework for thinking and exploring, for abstract work. Just that.
Like I said above, statistical “learning” has nothing to do with learning. It's about training. I hope I am being crystal clear about this.

This time around (a few years ago) I solved my 10 year old problem. I will not share the details here but I can say that knowing geometry, embryology and having time to spare did help (call it Learning).
Let me remind you that the problem was simple, and that I was convinced it should have a simple solution, regardless of whether it was hard to find (I believe it was).
Solving the exponentiality of relations also meant that storing a huge database of facts should not impact its usage.
My point here is not about anything new, or innovative. It has already been done. My point is precisely that it can be done, without any prior knowledge of this field, and the end approach can be (at least technically) simple.
My point here is about “learning” and human versus machine capabilities. It's about the implications of the ways we approach problems and problem solving.

So, let's call this system George. George will show you that he can learn new things quickly.
Moreover, I am not telling George anything about the facts that he gets to know. This is relevant to those who are familiar with the usual heuristics, because they all require some “understanding” about the data and some previously defined way to use that “understanding”.
[As a note to those who are familiar with these subjects, I say that to weight a graph you need an understanding beforehand, which is not the case I am addressing here.]
Again, my point is that this is against the definition of “learning” in the first place.
George, I will say again, knows nothing, except one thing: he knows how to explore relationships without being overwhelmed in the process, whatever they may be (the problem I tried to solve in the first place).

I tell George “Alana knows Eve”.
I add that “Eve is Jay’s teacher”.
Now I can ask George how Alana and Jay are related (1).
He says:
Alana knows Eve.
Eve is Jay’s teacher

Simple, right? Of course. It's all George knows anyway.
Now George learns that “Jay was at Alana’s birthday”.
If I ask George again how Jay and Alana are related, he should give me a different answer, because in the meantime he “learned” a simpler answer, right?
Now he says:
Jay was at Alana’s birthday

Great, George!
But let's put George to the test. He knows better.
George learns that
“Leonardo is father to Alana”
“Eve bought her car from Alec”
“Leonardo plays soccer with Alec”

Now there is more than one answer to the question “How are Leonardo and Eve related?” (2).
George got it right again:
Eve bought her car from Alec
Leonardo plays soccer with Alec
and
Alana knows Eve
Leonardo is father to Alana

George also tells me that, for him, the difference between these answers lies in their context.
He says that the context for the first is “Alec” and for the second is “Alana”. Go figure out why.

This is what George knows up to this point:

  • Alana knows Eve
  • Eve is Jay’s teacher
  • Jay was at Alana’s birthday
  • Leonardo is father to Alana
  • Eve bought her car from Alec
  • Leonardo plays soccer with Alec

 
Remember, the only thing I taught George to do was to search for the answers by himself, aiming for the simplest ones.
The principle in the first example (1) still applies — if I teach George a lot more about Leonardo and Eve, he will only consider it if it simplifies his answer for a particular query.
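For readers who want something concrete, here is a naive baseline sketch of this behaviour. It is emphatically not my actual method, whose details I am not sharing; it is just the simplest way I can illustrate “aiming for the simplest answers”: treat each assertion as a link between the people it mentions, and breadth-first search for the shortest chains of assertions connecting two names.

```python
# A naive baseline, NOT the undisclosed method described in this article:
# treat each assertion as a link between the people it mentions, then
# breadth-first search for the shortest chains of assertions connecting
# two names. The facts are the ones George knows at this point.
from collections import deque

facts = [
    ("Alana knows Eve", {"Alana", "Eve"}),
    ("Eve is Jay's teacher", {"Eve", "Jay"}),
    ("Jay was at Alana's birthday", {"Jay", "Alana"}),
    ("Leonardo is father to Alana", {"Leonardo", "Alana"}),
    ("Eve bought her car from Alec", {"Eve", "Alec"}),
    ("Leonardo plays soccer with Alec", {"Leonardo", "Alec"}),
]

def simplest_chains(a, b):
    """All shortest chains of assertions linking person a to person b."""
    queue, results, best = deque([(a, [])]), [], None
    while queue:
        person, chain = queue.popleft()
        if best is not None and len(chain) > best:
            break  # BFS order: longer chains can no longer be simplest
        if person == b and chain:
            best = len(chain)
            results.append(chain)
            continue
        for text, people in facts:
            if person in people and text not in chain:
                for nxt in people - {person}:
                    queue.append((nxt, chain + [text]))
    return results

for chain in simplest_chains("Leonardo", "Eve"):
    print(" / ".join(chain))
```

On these six facts this baseline reproduces George's answers: question (1) yields the single direct fact, and question (2) yields the two two-assertion chains. The actual approach differs, in particular, in how it avoids the combinatorial explosion on large databases.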
But now you could think — if more knowledge doesn’t “degrade” the answer’s quality, does it degrade its performance?
Let's raise the bar for poor George.
So George goes to college. There he learns a lot more. Ready to put it into practice, just like we humans would like to be able to do!
So he learned a lot more about Leonardo and Eve. But also, many other not necessarily related things (but potentially related).
We ask him the same thing again and check his answers and how long he takes to get there.

Remember, each time the database changes George is unsure whether the old answers are still good enough.

Life is tough on George, and so are we!

Besides the previous information, I added 6000 random relations between another 99 people, including relations with the first people we used. So the previous names are now related in many more ways and to many more people.
Again I ask question (1), and the answer takes the same time (a few ms):
Jay was at Alana’s birthday
 
But this time, he has other answers he learned (each line is an answer):

Some random assertion relating Alana, Natalee, Asa, Lisa and Jay
Some random assertion relating Jairo, Erica, Jay, Jeramiah and Alana
Some random assertion relating Jay, Alana, Molly, Anne and Eve
Some random assertion relating Jay, Alana, Moriah, Charlee and Lindsay
Some random assertion relating Jay, Kingston, Alana, Aldo and Noemi
Some random assertion relating Alana, Natalee, Asa, Lisa and Jay
Some random assertion relating Jay, Alana, Molly, Anne and Eve
etc

He is equally fast giving them. Notice that all the answers are direct relations between Jay and Alana. There are many more kinds of relations, of course, but George wants to keep it simple, so he is not adding more than we wanted to know about.
Again the question (2) about Leonardo and Eve (each line is an answer):

Some random assertion relating Leonardo, Logan, Marcus, Destinee and Eve
Some random assertion relating Eve, Leonardo, Jairo and Dashawn
Some random assertion relating Dashawn, Hailey, Eve, Angel and Leonardo
Some random assertion relating Leonardo, Erica and Eve
Some random assertion relating Eve, Landen, Kelsie and Leonardo
etc

Remember that the initial answers were more complex (two related assertions). George still simplifies things without additional delay.

Unlike a statistical model, he adapted his behaviour to what he learned (without training), and moreover, knowing more made things easier for him.

Now let's find some more complicated relations.
I ask George about the following persons (3): Lillianna, Douglas, Allan, Rosemary, Libby and Kristina.
He answers (each 3.x is an answer):
3.1)
Some random assertion relating Natalee, Douglas, Anne, Charlee and Lillianna
[On the other hand]
Some random assertion relating Lillianna, Jamari, Libby, Allan and Jairo
[Following]
Some random assertion relating Rosemary and Lillianna
Some random assertion relating Kailee, Kristina, Lillianna, Campbell and Logan
 
The “on the other hand” indication means that the assertion is not related directly to the previous one, but via the assertion “following” it. It’s just how I chose to express it.
3.2)
Some random assertion relating Salvador, Douglas, Logan, Leonardo and Lillianna
[On the other hand]
Some random assertion relating Lillianna, Jamari, Libby, Allan and Jairo
[Following]
Some random assertion relating Rosemary and Lillianna
Some random assertion relating Elizabeth, Kristina, Lillianna, Bridget and Salvador
etc
 
There are other answers, of course, but George is quick to explore them.
However, there are no simpler relationships than these: four-assertion responses in a universe of over 6000. If fewer than four assertions formed a valid response, he would give it. Remove one of the assertions in any of those responses and you get a wrong answer.
I can tell you that George could know entire book libraries and still he would not stumble in giving short answers. Naturally, in that case, giving all the answers would take more time, but I claim it is supra-linear in complexity, not exponential.
You may argue that this is just some simple search over data relations. If that is your opinion, then can you guess which algorithm is used? In these examples I am not using asynchronous techniques and the implementation is not fully optimized. I am using mid-grade hardware.
 
Back to my point with all this.

If George were a statistical machine, he would make up an answer, the one that fits best with his model parameters.

No single learned fact could ever radically change his behaviour.
And his answer would not explain its own relevance to the problem we are trying to solve.
Everything outside his scope of training would be reduced to a probabilistic similarity, and therefore adulterated. He would not be learning; instead he would be changing the data to fit what he already knows.
But we do not want that. We want to know how George got to his answer, or why any answer is relevant to our query, not how related it is to what he was trained to do.
This is the trademark of Learning — the system changes its state and its behaviour in response to changes in its environment.
In the first place, we didn't give George any parameters about the data we fed him. If we gave George a heuristic, he could do the same thing he is doing but focus only on the information that best fitted that heuristic (again, the problem with statistical models). But anyway, filtering data is the easiest part, and it would make things easier for George. Not my concern here, as you might have guessed already.

Intelligent behaviour explains itself.

If you were on trial for an offense, and the judge were to give you a sentence according to machine learning, he would say:
— Since the N similar cases in our database were found guilty, I declare you guilty too. Because the “AI” “learned” from millions of similar cases and decided that way.
So it was not your case that was on trial, but the average case. Do you now understand better the difference between intelligent decision making and statistical decision making? You had better, before you start relying on statistical decision-making processes in your organization or in your life.


Update:
I made this update in case you, the reader, do not get directly to my point with “George”, so that you do not get “stuck” on the simplistic examples I gave (relations between assertions) and miss it.
If you want a machine that learns, it has to be able to gather information and relate it to answer queries or do something with it. For example, you would want an intelligent system to be able to relate the propositions “I bought my car from Fred” and “Fred sells cars at Bristol” to know (or to have learned) that “I bought my car in Bristol” without knowing it beforehand. And that system should tell me why it said that: because of those facts that it learned. This is a requirement for building knowledge and emulating rational or intelligent behaviour.
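A minimal sketch of this kind of inference (the triple format and the hand-written join rule are mine, purely for illustration): two learned propositions are joined on the shared entity “Fred”, and the derived answer carries the facts that justify it.

```python
# Minimal sketch of the inference described above. The triple format and
# the hand-written join rule are mine, purely for illustration; the point
# is that the derived answer carries the facts that justify it.
facts = [
    ("I", "bought_car_from", "Fred"),
    ("Fred", "sells_cars_at", "Bristol"),
]

def where_bought(buyer):
    """Derive where `buyer` bought their car, citing the facts used."""
    for s1, r1, seller in facts:
        if s1 == buyer and r1 == "bought_car_from":
            for s2, r2, place in facts:
                if s2 == seller and r2 == "sells_cars_at":
                    return place, [(s1, r1, seller), (s2, r2, place)]
    return None, []

place, evidence = where_bought("I")
print(f"I bought my car in {place}")  # I bought my car in Bristol
for fact in evidence:
    print("because:", fact)
```

The answer is explainable by construction: the chain of facts used to derive it is the explanation.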
Look at it this way: a statistical machine (today's “machine learning”) relies on the ability to derive a state from a great number of tried possibilities (losing a lot of information in the process). But a combinatorial kind of intelligence derives its power from the exponentiality of the number of combinations, which can be tried on the fly without previous training. The correlation between data that a statistical system builds during its training is done on the fly by a combinatorial system. Moreover, it is done not as a static framework but as part of a dynamic, evolving system. The knowledge in all those cases used in current ML trials is implicit in the combinations that can be inferred from a much simpler state.
Of course, once a system is built that can not only internalize new information but create new data from it, the process feeds itself and will surely get “out of hand”… at least out of human hands. I hope you do not feel as uneasy about it as I do!

The moral of this story is the following.

One of the things that makes us humans is this ability to really learn.

Not just to be taught or trained to do or know something, but to Learn something. That is, taking in something new and deriving value from it by relating it to what we previously knew.
I share this example with you to at least motivate you to think about this:

  • are you really learning something?
  • are you really producing knowledge or just reproducing it?
  • are you really solving problems or just following a guideline?
  • are you assuming things or are you questioning them?
  • do you know exactly what you know?

In the end…

  • are you being human?

If you do not think for yourself, be sure that someone, or something else, will eventually do it for you.*

As for feelings, of course machines cannot feel, but it is a fact that we cannot build knowledge from feelings (although we love to feel that we know something).
I wanted to share this case with you to help you understand this bottom line (*).
I hope I succeed and contribute to a more critical and informed mindset about the most recent technical advancements and where we are heading.

____________
This article was originally published on LinkedIn on December 31, 2018.