Significance of Compression Rate Method

Summary: The significance of the Compression Rate Method (CRM) is that it justifies a form of empirical inquiry into aspects of reality that have previously resisted systematic interrogation. Some examples of potential investigations are described. A key hypothesis is discussed, and the link between empirical science and lossless data compression is emphasized.

In my previous post, the protagonist Sophie developed a modified version of the scientific method. It consists of the following steps:

  1. Obtain a large database T related to a phenomenon of interest.
  2. Develop a theory of the phenomenon, and instantiate the theory as a compression program.
  3. Test the theory by invoking the compressor on T and measuring the net codelength achieved (encoded data plus length of compressor).
  4. Given two rival theories of the phenomenon, prefer the one that achieves a shorter net codelength.
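
Steps 3 and 4 can be sketched in a few lines, using two off-the-shelf compressors (zlib and bz2) as stand-ins for rival theories; the compressor sizes charged here are invented placeholder figures for illustration:

```python
import bz2
import zlib

def net_codelength(compress, data, compressor_size_bits):
    # Net codelength: the encoded data plus the length of the compressor itself.
    return len(compress(data)) * 8 + compressor_size_bits

# A toy "database" T with obvious regularity.
T = b"the quick brown fox jumps over the lazy dog. " * 500

# Two rival "theories", each instantiated as a compression program.
# The compressor-size figures are assumptions, not measured values.
bits_a = net_codelength(zlib.compress, T, compressor_size_bits=8_000)
bits_b = net_codelength(bz2.compress, T, compressor_size_bits=12_000)

# Step 4: prefer whichever theory achieves the shorter net codelength.
preferred = "theory A (zlib)" if bits_a < bits_b else "theory B (bz2)"
```

Counting the compressor's own length is what keeps the comparison honest: a "theory" that simply memorizes T pays for that memory in program size.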

This modified version preserves two of the essential attributes of the traditional method. First, it employs theoretical speculation, but guides and constrains that speculation using empirical observations. Second, it permits Strong Inference by allowing the field to make decisive comparisons between rival theories.

The key difference between the CRM and the traditional method is that the former does not depend on the use of controlled experiments. For that reason, it justifies inquiries into aspects of empirical reality that have never before been systematically interrogated. The kind of scientific theories tested by the CRM depends on the type of measurements in the target database T. If T contains measurements related to physical experiments, theories of physics will be necessary to compress it. Other types of data lead to other types of science. Consider the following examples:

  1. Set up a camera next to a highway, and record the stream of passing cars. To compress the resulting data, you will need to develop a computational understanding of the visual appearance of automobiles. You will need theories of hubcaps, windshields, license plates, car categories, and so on.
  2. Position some microphones in the tops of trees and start recording. A major source of variation in the resulting data will be bird vocalization. To compress the data, you will need to find ways to differentiate between bird songs and bird calls, tools to identify species-characteristic vocalizations, and maps showing the typical ranges of various species. In other words, this type of inquiry will be a computational version of the traditional study of bird vocalization carried out by ornithologists.
  3. Construct a database by using large quantities of English text. To compress this database you will need an advanced computational understanding of English. You will need dictionaries, rules of grammar, word-sense disambiguation tools, and, more generally, theories of linguistics.
  4. Convince Mark Zuckerberg to give you the Facebook image database. One obvious property of this dataset is that it contains an enormous number of faces. To compress it, you will need theories of the appearance of faces. These theories will be highly related to work on face modeling in graphics - see here for example.
  5. Generate a huge database of economic data such as home prices, interest and exchange rate fluctuations, business inventories and sales, unemployment and welfare applications, and so on. To compress this database, you will need theories of economics.

It should be emphasized that when the above list says "You will need theories of X", this simultaneously means "You can test and refine theories of X" and "You can prove the superiority of your pet theory of X" by demonstrating the codelengths it achieves on an appropriate dataset. So if you are a linguist and you want to demonstrate the validity of X-bar theory, you build an X-bar compressor and test it on the large text database. If you are an economist and you want to prove the truth of Austrian Business Cycle Theory, you build ABCT into a compressor and invoke it on the economics database. If a theory can't be packaged into a compressor for some real-world dataset, then it's probably not scientific anyway (more later on the problem of demarcation).

(It's also worth noting the dedication to truth, and the simultaneous contempt for petty academic affiliation games, indicated by a rigorous adherence to the compression principle. If you develop a new theory of linguistics and use it to set a new record on the benchmark text database, I will hail you as a great linguist. I will publish your papers in my journal, nominate you for awards, and approve your grant applications. It does not matter if you are a teenage college dropout living with your parents.)

The inquiries described in the above list make an important implicit assumption, which can be called the Reusability Hypothesis:

The abstractions useful for practical applications are also useful for compression.

Thus one very practical application in relation to the Facebook database is the detection and recognition of faces. This application depends on the existence of a "face" abstraction. So the hypothesis implies that this face abstraction will be useful for compression as well. Similarly, with regard to the ornithology example, one can imagine that the ability to recognize bird song would be very useful to bird-watchers and environmentalists, who might want to monitor the activity, population fluctuations, and migration patterns of certain species. Here the Reusability Hypothesis implies that the ability to recognize bird song will also be useful for compressing the treetop sound database.

The linguistics example is worth examining because of its connection to a point made in the preface about overmathematization and the distinction of complex deduction vs. complex induction. The field of computational linguistics is highly mathematized, and one can imagine that in principle some complex mathematics might be useful to achieve text compression. But by far the simplest, and probably the most powerful, tool for text compression is just a dictionary. Consider the following sentence:

John went to the liquor store and bought a bottle of _______ .

Now, if a compressor knows nothing about English text, it will have to encode the new word letter-by-letter, for a cost of N·log(26), where N is the length of the word (I assume N is encoded separately, at a basically fixed cost). But a compressor equipped with a dictionary will have to pay only log(W_N), where W_N is the number of words of length N in the dictionary. This is a substantial savings, since W_N is much smaller than 26^N. Of course, more advanced techniques, such as methods that take into account part-of-speech information, will lead to further improvement.
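
The arithmetic is easy to check directly. The dictionary size below is an assumed round figure, not a count from any real dictionary:

```python
from math import log2

N = 5        # length of the unknown word
W_N = 7000   # assumed number of 5-letter words in the dictionary (illustrative figure)

letter_by_letter = N * log2(26)  # no knowledge of English: about 23.5 bits
with_dictionary = log2(W_N)      # uniform code over length-5 dictionary words: about 12.8 bits
savings = letter_by_letter - with_dictionary
```

So even the crudest dictionary model saves roughly ten bits on a single five-letter word, before any grammar or context modeling is applied.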

The point of the above example is that the dictionary is highly useful, but it does not involve any kind of complex mathematics. Compare a dictionary to the theory of general relativity. Both can be used to make predictions, and so both should be viewed (under my definition) as legitimate scientific theories. And they are both complex, but complex in opposite ways. GR is deductively complex, since it requires sophisticated mathematics to use correctly, but inductively simple, because it requires only a few parameters to specify. In contrast the dictionary is deductively simple, since it can be used by anyone who can read, but inductively complex, since it requires many bits to specify.

Another point made in the preface was that this approach involves empirical science as a core component, as opposed to being primarily about mathematics and algorithm-design (as I consider modern AI to be). Some people may be confused by this, since most people consider data compression to be a relatively minor subfield of computer science. The key realization is that lossless data compression can only be achieved through empirical science. This is because data compression is impossible for arbitrary inputs: no compressor can ever achieve compression rates of less than M bits when averaged over all M-bit strings. Lossless compressors work because they contain an implicit assertion about the type of data on which they will be invoked, and they will fail to achieve compression if that assertion turns out to be false. In the case of image compressors like PNG, the assertion is that, in natural images, the values of adjacent pixels are highly correlated. PNG can exploit this structure to achieve compression, and conversely, the fact that PNG achieves compression for a given image means that it has the assumed structure. In other words, PNG contains an empirical hypothesis about the structure of visual reality, and the fact that it works is empirical evidence in favor of the hypothesis. Now, this pixel-correlation structure of natural images is completely basic and obvious. The proposal, then, is to go further: to develop increasingly sophisticated theories of visual reality and test those theories using the compression principle.
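
The pixel-correlation hypothesis can be tested in miniature. The sketch below uses a synthetic correlated signal (a slow random walk standing in for a scanline of a natural image) and a PNG-style Sub filter; if the correlation assumption were false, the filter would not help:

```python
import random
import zlib

random.seed(0)
# Synthetic "scanline" from a slow random walk, so adjacent samples are
# highly correlated -- the structural assumption PNG makes about natural images.
vals = [128]
for _ in range(4095):
    vals.append((vals[-1] + random.choice([-2, -1, 0, 1, 2])) % 256)
pixels = bytes(vals)

# Baseline: compress the raw samples directly.
raw = zlib.compress(pixels, 9)

# PNG-style "Sub" filter: encode each sample as a difference from its predecessor.
deltas = bytes([pixels[0]] + [(pixels[i] - pixels[i - 1]) % 256
                              for i in range(1, len(pixels))])
filtered = zlib.compress(deltas, 9)
```

The filtered stream draws from a tiny alphabet of differences and compresses far better than the raw stream, which is exactly the sense in which the filter encodes an empirical hypothesis about the data.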

Still, it may not be obvious why this kind of research would be any different from other work on data compression - people are, after all, constantly publishing new types of compression algorithms. The key difference is the emphasis on large-scale compression; this completely changes the character of the problem. To see why, consider the problem of building a car. If you only want to build a single car, then you just hack it together by hand. You build all the parts using machine tools and then fit them together. The challenge is to minimize the time- and dollar-cost of the manual labor. This is an interesting challenge, but the challenge of building ten million cars is entirely different. If you're going to build ten million cars, then it makes sense to start by building a factory. This will be a big up-front cost, but it will pay for itself by reducing the marginal cost of each additional car. Analogously, when attempting to compress huge databases, it becomes worthwhile to build sophisticated computational tools into the compressor. And ultimately the development of these advanced tools is the real goal.

Comments

I have some gripes with these two posts as being not nearly compressed enough; but I'm shocked by the number of comments disagreeing with the posts, or speculating that compression is not relevant to AI. This is not stuff you can rationally disagree with. Any model useful for AI does compression, period.

A valid criticism would be that we have a scientific methodology built on statistical tests for significance; and to ask for a way to translate compression results into significance tests. The easiest way would be to compress a large number of examples, and perform tests on the mean and standard deviation of the resulting compression ratios.
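
A rough sketch of what I mean, in Python. The chunking setup is just an assumption for illustration; any partition of the corpus into comparable samples would do:

```python
import random
import statistics
import zlib

random.seed(1)
# Assumed setup: split a corpus into fixed-size chunks and treat each chunk's
# compression ratio as one sample, so ordinary statistical tests apply.
chunks = [bytes(random.choices(b"abcdefg ", k=1024)) for _ in range(30)]
ratios = [len(zlib.compress(c)) / len(c) for c in chunks]

mean = statistics.mean(ratios)
stdev = statistics.stdev(ratios)
stderr = stdev / len(ratios) ** 0.5
# mean +/- 2 * stderr gives a rough confidence interval, which lets you
# compare two compressors on the same chunks with a standard test.
```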

Any model useful for AI does compression, period.

Any model does compression, period. What is the particular relevance to AI? And if that was your downvote on my other comment, how does thinking of AI in terms of compression help to develop AI?

I'm afraid it was my downvote. One example of using compression is Hinton's deep autoencoder networks, which work (although he doesn't say this, the math does) by fine-tuning each layer so as to minimize the entropy of the node activations when presented with the items to be learned. In other words: Instead of trying to figure out what features to detect, develop features that compress the original information well. Magically, these features are very good for performing categorization with.

AI was seldom thought of as compression until about 1986. Also, AI wasn't very good until 1986. Pre-1986, learning was ignored, and logic was king. All the pure-logic approaches suffer from combinatorial explosion, because they don't use entropy to enumerate possibilities in order of usefulness. The hard problems of compression were hidden by supplying AI programs with knowledge already compressed into symbols in the appropriate way; but they still didn't work, unless the number of possible actions/inferences was also restricted artificially.

There are people, like Rodney Brooks, who say logic isn't necessary at all. I wouldn't go that far. So, I overstated: There is work to be done in AI that isn't about compression, except in a very abstract way. Lots of work has been done without thinking of it as being compression. But I would say that the hard stuff that gives us problems (categorization, similarity; recognizing, recalling, and managing state-space trajectories) is closely tied to questions of compression.

I have a minor disagreement, which I think supports your general point. There is definitely a type of compression going on in the algorithm; it's just that the key insight in the compression is not to blindly "minimize entropy" but rather to make the outputs of the encoder behave like the observed data. Indeed, one of the major insights of information theory is that one wants the encoding scheme to capture the properties of the distribution over the messages (and hence over alphabets).

Namely, in Hinton's algorithm the outputs of the encoder are fed through a logistic function and then the cross-entropy is minimized (essentially the KL divergence). It seems that he's providing something like a reparameterization of a probability mass function for pixel intensities, which is a logistic distribution when conditioned on the "deeper" nodes. Minimizing that KL divergence means that the distribution is made to be statistically indistinguishable from the distribution over the data intensities (since minimizing the KL divergence minimizes the expected log likelihood ratio, which means minimizing the power of the uniformly most powerful test).

Minimizing entropy blindly would mean the neural network nodes would give constant output: which is very compressive but utterly useless.

[A text with some decent discussion on the topic](http://www.inference.phy.cam.ac.uk/mackay/itila/book.html). At least one group that has a shot at winning a major speech recognition benchmark competition uses information-theoretic ideas in the development of their speech recognizer. Another development has been the use of error-correcting codes to assist in multi-class classification problems ([google "error correcting codes machine learning"](http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=error+correcting+codes+machine+learning)); arguably this has been the clearest example of a paradigm shift, arising from thinking about compression, that had a big impact in machine learning. I don't know how many people think about these problems in terms of information theory (since I don't have much access to their thoughts), but I do know at least two very competent researchers who, although they never bring it outright into their papers, have an information-theory and compression-oriented way of posing and thinking about problems.

I often try to think of how humans process speech in terms of information theory (which is inspired by a couple of great thinkers in the area), and thus I think that it is useful for understanding and probing the questions of sensory perception.

There's also a whole literature on "sparse coding" (another compression-oriented idea, originally developed by biologists but since ported over by computer vision and a few speech researchers) whose promise in machine learning may not have been realized yet, but I have seen at least a couple of somewhat impressive applications of related techniques.

Thanks to you and PhilGoetz for those references. I have updated my estimate of the subject.

Downvoted for sheer repetitive drudgery, without genuine motivation for the reader. How many chapters before you show us a real-world instance of CRM achieving progress in research?

When Eliezer had a sequence to write, each post was (1) interesting to read on its own, (2) connected to some real event/interesting story/experimental data, and (3) clearly adding something new to the conversation. You would do well to work on those aspects of your writing.

For my own clarification, did you think the idea of using CRM to evaluate theories of linguistics was uninteresting, or did you just not believe me that it would work?

The latter. Until I see a real-world case where CRM has been very effective compared to other methods, I'm not going to give much credit to claims that it will achieve greatness in this, that and the other field.

And in particular, I find it extremely unlikely that current major theories of linguistics could in practice be coded into compressors, in a way that satisfies their proponents.

This isn't precisely what Daniel_Burfoot was talking about, but it's a related idea based on "sparse coding", and it has recently obtained good results in classification:

http://www.di.ens.fr/~fbach/icml2010a.pdf

Here the "theories" are hierarchical dictionaries (a discrete hierarchy index set plus a set of vectors) which perform a compression (by creating reconstructions of the data). Although they weren't developed with this in mind, support vector machines do this as well, since one finds a small number of "support vectors" that essentially allow you to compress the information about decision boundaries in classification problems (support vector machines are one of the very few things from machine learning that have had significant and successful impacts elsewhere since neural networks).

The hierarchical dictionaries learned do contain a "theory" of the visual world in a sense, although an important idea is that they do so in a way that is sensitive to the application at hand. There is much left out by Daniel_Burfoot about how people actually go about implementing this line of thought.

For me, along with everything orthonormal and I said before, the problem here is that your application of CRM to linguistics is just:

"CRM is a good epistemology (which the entire reader base already agrees with). Here's the obvious things you would do if you applied it to linguistics. Don't worry, something new is coming, just wait a few more articles."

Don't worry, something new is coming,

It's a new idea to use the CRM to approach computer vision. Nobody in computer vision thinks about large scale lossless compression of natural images. All work in CV is predicated on the idea that good methods can be obtained through pure theorizing and algorithm design. Consider for example this famous CV paper. Note the complex mathematics and algorithm design, and then note that the empirical verification consists of the application of the method to about four images (the reader is supposed to verify that the segmentation result agrees with intuition). Or check out this rival technique which actually uses MDL but doesn't report the actual compression rates, only the segmentation results.

If that was your point, you could have resolved this in one article whose thesis is just, "Hey, like you already know, MML is a superior method of rationality compared to traditional science. [Here's a review of MML.] Behold, the best work in computer vision ignores it! Look how much better you'd do, just by reading this site!"

You didn't need to go into all the fluff about AI's advances and failures, what's wrong with science, blah blah blah.

(Not to say this doesn't benefit you; you seem to get net positive karma, with the x10 modifier, each time you draw this out to yet another post, despite not presenting anything new.)

Of course, I thought your original motivation for this series was to explain what super-powerful epistemology babies must be using to go from a near-blank slate[1] to solving AI-complete problems on limited empirical data, and how we can replicate that. And you haven't even touched that one.

[1] which I, and the best scientific research, dispute is an accurate characterization of babies

Evaluation of AI research is important.

However, I doubt you can even figure out how to use a human-level explanation of the contents of an image to improve lossless image compression rate, so I'd suggest a different route.

There are already many small competitions in AI research - someone creates a dataset/task with a method of evaluating accuracy (usually agreement with human annotation), and others try to do better.

I'd like to see the use of open/common training and test sets eventually lead to a requirement that systems also be published and reproducible by others, or else the research papers about them are rejected. At that point, you can try whatever compression rate method you like on everybody's systems. I'd work toward that, rather than selling CRM.

CRM as you describe it here, if used to decide funding for AI research competitively, would encourage a lot of work in general data compression and not very much AI, at least for a few years.

However, I doubt you can even figure out how to use a human-level explanation of the contents of an image to improve lossless image compression rate, so I'd suggest a different route.

Who's interested in lossless compression? Not people interested in AI.

Daniel's CRM is about lossless compression, thus my response. I can't tell what you agree or disagree with.

Minimum description length and information theory are both part of the standard AI curriculum. I'd agree that most AI work suggests (and uses in its implementation, e.g. Bloom filters) lossy compression.

Daniel's CRM is about lossless compression, thus my response.

I don't think the distinction between lossy and lossless is as important in AI as you imply here:

I doubt you can even figure out how to use a human-level explanation of the contents of an image to improve lossless image compression rate, so I'd suggest a different route.

As you will note from my comment here, any lossy method can be trivially converted to a corresponding lossless one by adding on a list of "exceptions" to the "theory". So long as the exceptions are few, the compressed string can still be shorter than the source.

So if you notice a major regularity in the data, using it can still result in a shortening of the message length, even if not all of the data adheres to it. Thus, the constraint that the method be lossless is not significant in this context.
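
A toy sketch of the conversion (the data and quantization step are invented for illustration): the lossy "theory" quantizes each sample, and the small residuals are the appended "exceptions" that make the scheme lossless.

```python
# Lossy "theory": quantize each sample to the nearest multiple of 8.
data = [37, 120, 41, 250, 7, 199, 88, 16]

quantized = [8 * round(v / 8) for v in data]          # lossy approximation
residuals = [v - q for v, q in zip(data, quantized)]  # corrections, each in [-4, 4]

# The lossy model plus its residuals reconstructs the data exactly.
reconstructed = [q + r for q, r in zip(quantized, residuals)]
```

Because each residual fits in a few bits while a raw sample needs eight, the combined stream can still be shorter than the source whenever the quantization captures a real regularity.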

I don't disagree with your breaking a lossless compression into those three parts, but converting a lossy compression scheme to a lossless one seems a poor way to judge it: you condemn jpeg as doing no compression. Yes, you probably could use the same ideas to improve lossless compression, but the 90% compression rate of jpeg means something.

Lossless compression has the advantage of objective judgement, but for any particular purposes we want a lossy compression that throws out the data irrelevant to that purpose. Of course, that depends on the purpose; if you're debugging a camera, you don't want jpeg.

In other words, I endorse lossy compression over lossless compression as a measure of theory. (I think Jonathan Graehl and Phil Goetz agree, I'm not sure.)

But jpeg is being judged by a slightly different figure of merit, which changes the relevant considerations.

Like other lossy image compression schemes, jpeg is judged by how much of the source picture a human notices as being lost. So, for that problem, the relevant "source" to be encoded is really the human perception of the image, not the bitstream of the image file.

And really, since all phenomena pass through someone's senses before they're observed, this is the relevant metric for any MML formalism: whether you can come up with a way to briefly describe what you have experienced. It's just that in typical lossless tests, they've eliminated the impact of human cognition by making it possible to compare directly to a specific string and determine if that string can be completely reproduced.

So I don't condemn jpeg or regard it as meaningless; I'm saying it can be directly compared to lossless methods if you either a) penalize it by the amount of data needed to go from the jpeg to the source image, or b) penalize it by the amount of data needed, on average, to restore the image to something a human would not remember differently (which in many cases is very little if any data).

Not read the article yet, but:

Summary: The significance of the Compression Rate Method (CRM) is that it justifies a form of hard empirical science that does not depend on controlled experimentation.

Like astrophysics, cosmology, evolutionary biology, and geology?

I'm not sure that hard empirical science sans controlled experimentation needs justification.

I'll read the article, now.

Edit: I have nothing to say about the article.

Evolutionary biology is subject to a lot of controlled experiments. See my reply to Marks below.

I see your point and changed the sentence in the summary.

when attempting to compress huge databases, it becomes worthwhile to build sophisticated computational tools into the compressor. And ultimately the development of these advanced tools is the real goal.

I don't see why thinking of the problem as a compression task will cast any light on the problem of building these advanced tools. Especially when, as you already pointed out in your examples, the concepts needed to build these tools are not discoverable from the data -- or at least, you are envisaging them as not being discovered from the data.

Let's take the stock market as an example. Stock market prices are in principle predictable, only not from the price data itself but from additional data taken from newspapers or other sources. How does the CRM apply if the data does not in itself contain the necessary information?

Let's say I have a theory that cutting production costs will increase stock prices in relation to the amount of cost cut and the prominence of the company and the level of fear of a crash on the stock market and the level of a "bad news indicator" that is a weighted sum of bad press for the company in the past. How would I test my theory with CRM?

In the wider sense, MML still works on the dataset {stock prices, newspapers, market fear}. Regardless of what work has presently been done to compress newspapers and market fear, if your hypothesis is efficient then you can produce the stock price data for a very low marginal message length cost.

You'd write up the hypothesis as a compressor-of-data; the simplest way being to produce a distribution over stock prices and apply arithmetic coding, though in practice you'd tweak whatever state of the art compressors for stock prices exist.

Of course, the side effect of this is that your code references more data, and will likely need longer internal identifiers, so if you just split the cost of the code across the datasets being compressed, you'd punish the compressors of newspapers and market fear. I would suggest that the solution is to apply the Shapley value, with the value being the number of bits saved overall by a single compressor working on all the data sets in a given pool of cooperation.
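
To make the suggestion concrete, here is a small sketch. The characteristic function v(S), giving the bits saved by jointly compressing each coalition of datasets, uses entirely invented numbers:

```python
from itertools import combinations
from math import factorial

players = ("prices", "newspapers", "fear")
# Assumed characteristic function: v(S) = bits saved by jointly compressing
# the datasets in coalition S. All figures are invented for illustration.
v = {
    frozenset(): 0,
    frozenset({"prices"}): 100,
    frozenset({"newspapers"}): 300,
    frozenset({"fear"}): 50,
    frozenset({"prices", "newspapers"}): 600,  # shared structure: joint > sum of parts
    frozenset({"prices", "fear"}): 180,
    frozenset({"newspapers", "fear"}): 380,
    frozenset({"prices", "newspapers", "fear"}): 750,
}

def shapley(player):
    # Average marginal contribution of `player` over all coalition orderings.
    n = len(players)
    others = [p for p in players if p != player]
    total = 0.0
    for k in range(n):
        for S in combinations(others, k):
            S = frozenset(S)
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (v[S | {player}] - v[S])
    return total

credits = {p: shapley(p) for p in players}
```

The credits sum to the grand-coalition savings (the efficiency property), which is what makes this a fair way to split the pooled compression gains.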

First, infer the existence of people, emotions, stock traders, the press, factories, production costs, and companies. When that's done your theory should follow trivially from the source code of your compression algorithm. Just make sure your computer doesn't decay into dust before it gets that far.

1. Set up a camera next to a highway, and record the stream of passing cars. To compress the resulting data, you will need to develop a computational understanding of the visual appearance of automobiles. You will need theories of hubcaps, windshields, license plates, car categories, and so on.

How does that work? The data would be ridiculously complex. If you want to sacrifice quality, you could just convert everything to 3D models and compress that, but that would have a lot of error compared to the original video. If you were to compress just the video, knowledge about cars wouldn't do any good, because there are far too many outliers: the pixel values. Now there might be a complex way to do it: assume the general shapes of objects and limit the possible variation of pixels in those areas, but even then there are too many exceptions to be useful. And how do you compress faces? You might be able to create a method that can identify a face, or a specific face, from something else, but then it's not compression, it's object recognition. The same goes for your bird example, which might be even more complex than the car example. And if you decide not to overfit, this sentence could be compressed as "it's all made of letters and the most common letter is e, so therefore it is all E's."

As the OP says "the key difference is the emphasis on large-scale compression".

Something with 3D models eventually compresses a lot better than recording video, over a long period of time, for the same reason video games are made with 3D models rather than by recording every possible game scenario as a precomputed animation.

But there is always going to be error. Games don't look realistic because of this method. On the video, you could have a spot on the window where it instantly turns blue for no reason. What I'm trying to say is there's too much static, too many outliers. If you were to take a thousand pictures of someone, no two pictures would be the same, even though they appear to be.

Along with what rwallace said, it might be a good idea to point out that the total message length actually breaks down into three parts:

1) Specification of the encoding: in info-theoretical terms, this maps more common, long strings (observations) to shorter symbols. It plays the role of the theory, in that it implicitly says what observations are most likely. This part tells you how to "decompress" parts 2 and 3 into their original form.

2) The compressed string: this lists all the observed data, using the shortcuts given in 1). Common occurrences are easily described by short strings.

3) The "exceptions" to the theory -- the observations that aren't easily compressed into 2). The longer strings are reserved for these, since they occur the least.

So yes, even if you can't reach perfection, any strong regularity found in the data will allow you to compress it, even if it's a rule with part-3 exceptions.
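
A toy version of the three-part breakdown, with an invented regularity and deliberately planted outliers:

```python
# Toy message: a strong regularity ("ab" repeated) with a few exceptions.
data = list(b"ab" * 2000)
data[137], data[2049] = ord("X"), ord("Z")   # two planted outliers

# Part 1: the theory -- "the string is 'ab' repeated 2000 times".
# Part 2: the compressed string under the theory (here, just the repeat count).
# Part 3: the exceptions the theory gets wrong, stored explicitly.
exceptions = [(i, b) for i, b in enumerate(data) if b != ord("ab"[i % 2])]

# Decoding: expand the theory, then patch in the exceptions.
decoded = list(b"ab" * 2000)
for i, b in exceptions:
    decoded[i] = b
assert decoded == data   # lossless, despite the theory being "wrong" twice
```

Two (position, byte) pairs are far cheaper to transmit than 4000 raw symbols, so the regularity pays for itself even with exceptions.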

Daniel's Reusability hypothesis is correct: any insight you gain into a particular domain can be equivalently expressed as a way to compress descriptions of that domain.

However, let me again register my annoyance with this series, not on grounds that it's wrong (so far), but that it's just an exposition of the well-accepted minimum message length formalism. It may not be accepted by scientists, but it's accepted here, and sciences do implicitly use it when they can't do controlled experiments (which are not strictly necessary for science anyway), such as in astronomy and geology.

First of all, compression doesn't have to be limited to commonly occurring sets. For example, if I create a fractal, you don't have to find commonly used colors or pixels that occur near each other; you can just create a program which generates the exact same fractal. When you do this you add another element to the system: time. How much time do you spend trying to compress something, or, vice versa, decompressing it? And no compression system can guarantee that every observed instance can be compressed to less than its original size. In fact, there are always going to be cases where compressing a file actually increases its size, possibly significantly, but at least by a few bits.

First of all, compression doesn't have to be limited to commonly occurring sets. For example, if I create a fractal, you don't have to find commonly used colors or pixels that occur near each other; you can just create a program which generates the exact same fractal.

That kind of regularity can still be explained in the language of commonly occurring sets: For a fractal, you reserve a symbol for the overall structure, which is common (in that something like it occurs frequently), and you have a symbol to instantiate it, given a particular position, to what scale. Time is handled similarly.

Compression amounts to saying: "This is the structure of the data. Now, here's the remaining data to fill in any uncertainty not resolved by its structure." The structure tells you what long symbols you can substitute with short symbols.

And no compression system can guarantee that every observed instance can be compressed to less than its original size.

True, compression only attempts to guarantee that, for the function generating the data, you save, on average, by using a certain compression scheme. If the generating function has no regularity, then you won't be able to compress, even on average.

That kind of regularity can still be explained in the language of commonly occurring sets: For a fractal, you reserve a symbol for the overall structure, which is common (in that something like it occurs frequently), and you have a symbol to instantiate it, given a particular position, to what scale. Time is handled similarly.

Yes but if you have a symbol for every possible fractal, it defeats the purpose. The point was to use time as a memory resource. This also works in reverse with caching and other methods, you can use memory as if it were time. Both, however, only work up to a point.

Yes, but if your compressed data contains both a 3-D model and a stream of outlier pixels (to be applied as delta to the result of rendering the model), this is still going to be smaller than if you didn't have a 3-D model and just supplied the video as a pixel stream.

Mind you, if what you want is machine vision, I think in practice you will do better to work on machine vision directly than on video compression. But the compression approach is valid as a matter of mathematical principle.

Right - the delta correction has lower entropy than the raw pixel stream. Of course, you have to use a good 3-D model!

Mind you, if what you want is machine vision, I think in practice you will do better to work on machine vision directly than on video compression.

Are you familiar with the field of machine vision? If not, two quick points: 1) the evaluation metrics it employs are truly terrible. 2) Almost all computer vision tasks can be reformulated as specialized image compression tricks (e.g. the stereo correspondence problem can be reformulated as a way of compressing the second image in a stereo pair by predicting the pixels using the first image and a disparity function).

I'll admit I'm not particularly familiar with machine vision. I was extrapolating from other areas such as natural language processing, where trying to formulate problems in terms of compression isn't particularly helpful, but if an expert on machine vision says it is helpful in that field, fair enough.

natural language processing, where trying to formulate problems in terms of compression isn't particularly helpful

Why do you think that? Statistical natural language processing has been very successful, both for recognition and translation of text, and for speech recognition.

Statistical natural language processing yes, but that's not the same thing as text compression (again, there is a mathematical sense in which one could consider them related in principle, but doing text compression and doing natural language processing were still separate activities last I heard).