Organizing Our Thoughts About Cells – Finally

Since I was mentioning the wide variety of cell types yesterday (in the context of phenotypic assays), it might be a good time to highlight this new paper. The authors are looking out over the modern landscape of cell characterization and are quite worried about what they see.

That’s because recent years have seen a huge growth of single-cell profiling techniques, with DNA sequencing, mRNA profiling, and even single-cell proteomics moving along briskly. That is of course producing vast amounts of new data, but the paper points out that we are very far from dealing with it in a systematic manner, and says that we’re going to regret it if we don’t:

. . .although there are excellent algorithms for organizing single-cell profiles into manageable numbers of ‘‘clusters’’, researchers are largely left to their own devices with respect to naming clusters and relating them to other datasets.

This is unfortunate for several reasons. First, it leads to considerable repetition of effort, often in the form of time-intensive literature and web searches (e.g., the unsystematic practice of ‘‘googling’’ differentially expressed genes). Second, it is the wild west out there, with no widely accepted standards around annotation quality or nomenclature. Although we are increasingly adept at integrating datasets and transferring labels computationally, this risks simply propagating potentially suboptimal descriptors. Third, the resulting corpus is heavily biased toward the systems in which the data is being generated (a complex function of scientific interest, resource allocation, and technical factors), rather than being anchored in a natural distribution. Fourth, it represents a missed opportunity, as it doesn’t feel like we are moving toward any consensus or cohesion that mirrors Linnaeus’s index cards, where new information can simply be added to a stable backbone.

As more data is generated, the situation is becoming progressively worse. . .

They go on to say that not only do we lack any sort of unified system to classify cell types, we can’t even agree on what the best way to do that might be (what features are most important?), and we don’t even all agree about what’s meant by commonly used terms like “lineage” or “type”. As it is now, cells are named variously by their tissue location, by their physiological function, by their appearance under the microscope (individually or in reference to other tissue structures – it varies), by the names of their discoverers, or by whatever letter/number codes someone assigned to them in whatever lab described them. Overlay the various species involved (and the often-corresponding cells in each), the different stages of development and the multiple functions that cells can perform, the ways in which these classification schemes can converge and diverge (two unrelated cell types that do similar things, for example, or two morphologically similar cell types that have totally different functions), and you have a blueprint for complete confusion.

The authors consider the most likely of the organizing principles – historical (rejected out of hand as unsystematic), morphological (also rejected as containing too little useful data in itself), physiological function (definitely a contender, although we certainly don’t know those functions for many cells or have only hazy and incomplete ideas), evolutionary relationships (also a contender, but very challenging to realize in a systematic way, and one that sets you up for severe clashes with a developmental perspective), and molecular profiling. That one is given serious consideration – it’s already being used to classify immune cells (the “CD” nomenclature based on cell-surface proteins), and it does tend to classify cells into discrete related groups. But it has plenty of complications: you might end up with different classifications by focusing on mRNA versus focusing on expressed proteins, for one thing, and you also will still have to contend with major changes across cellular development pathways, not to mention incorporating derangements like tumor cell states. If you get fine-grained enough, there are temporal and spatial variations in mRNA and protein levels that might cause normal cell-cycle variations to slurry up your attempts at an organized view.

And then there are developmental relationships. The paper points out that for an organism like C. elegans, we have done enough work to nail down every single cell at every stage of the organism’s development, and there is a systematic nomenclature that incorporates this. And that is a great accomplishment, but it’s going to be hard to replicate that with organisms whose cell lineages might not be quite so invariant, whose bodies contain more than a thousand cells or so, and who are not tiny and transparent under the microscope. That last one matters quite a bit, to be honest. There are ways of tagging cells to find out their lineages in larger, more opaque creatures (like you and me), but you’d still miss an awful lot of steps along the way. And there’s the problem of applying such techniques to humans, of course, although the authors note that we can still get pretty far by homology to other animal species.

The paper ends up proposing a combination of molecular profiles and developmental pathways, and the authors suggest that some sort of cell-lineage-tree representation might capture the most data in a usable form. The paper goes into a great deal of detail on this proposal which I won’t recapitulate, but it’s very much worth going through if you’re into this sort of thing. The hope is that such a system could be a framework onto which new data could be appended in a systematic way instead of just piled up somewhere. The end proposal is for a three-dimensional tree representation where the axes are cell lineage, differentiation state, and time, with computational techniques applied to determine when you branch things off (i.e., setting things up for maximum information gain in the display). The hope is that the “tree thinking” that these representations would encourage could allow for new insights and a better mental picture of the cellular landscapes, which is something that the current data heaps do not encourage in the slightest.
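For the programmers in the audience, here is a minimal sketch of what "a stable backbone that new data can be appended to" might look like as a data structure. To be clear, this is my own toy illustration and not the paper's implementation: the `CellNode` class, its field names, the 0-to-1 differentiation scale, and the example lineage are all assumptions made for the sake of the example. The branching score uses textbook Shannon information gain as a stand-in, and the paper's actual criterion for when to split a branch may well differ.

```python
from dataclasses import dataclass, field
from math import log2
from typing import List, Optional


@dataclass(eq=False)
class CellNode:
    """One node in a hypothetical lineage tree (all names are illustrative)."""
    name: str
    differentiation: float  # assumed scale: 0.0 = stem-like, 1.0 = terminal
    time: float             # developmental time, arbitrary units
    children: List["CellNode"] = field(default_factory=list)
    # repr=False avoids infinite recursion when printing (parent <-> child)
    parent: Optional["CellNode"] = field(default=None, repr=False)

    def add_child(self, name, differentiation, time):
        """Append a new cell state under this node and return it."""
        child = CellNode(name, differentiation, time, parent=self)
        self.children.append(child)
        return child

    def lineage(self):
        """Return the names from the root down to this node."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path[::-1]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())


def information_gain(labels, split):
    """Entropy reduction from dividing `labels` by a boolean `split` mask."""
    left = [l for l, s in zip(labels, split) if s]
    right = [l for l, s in zip(labels, split) if not s]
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))


# A deliberately simplified example lineage.
root = CellNode("zygote", 0.0, 0.0)
mesoderm = root.add_child("mesoderm", 0.3, 1.0)
hsc = mesoderm.add_child("hematopoietic stem cell", 0.5, 2.0)
t_cell = hsc.add_child("T cell", 1.0, 3.0)
print(t_cell.lineage())

# A marker that separates two cell types cleanly justifies a new branch...
print(information_gain(["A", "A", "B", "B"], [True, True, False, False]))
# ...while one that tells you nothing about type does not.
print(information_gain(["A", "A", "B", "B"], [True, False, True, False]))
```

The point of the sketch is only that a newly described cell state gets attached to an existing node rather than being named from scratch, and that the decision to open a new branch can be scored rather than eyeballed.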

I think they’re right in that we have to start somewhere, because the current “system” just isn’t helping anyone. We’re leaving a lot of potential insights unexplored through a lack of systematization. Any such system will fall short of perfection, but we’re so far from that now that anything reasonably thoughtful (like this proposal) is bound to be a big step up.

Side note: one thing that I noticed while reading this paper is the free use of contractions (e.g., doesn’t, don’t, isn’t). I don’t recall seeing this in any other journal article, not to this extent, even in a perspective/overview like this. Anyone know of other examples, or if this is some recent editorial decision at Cell?