What Category Theory Teaches Us About DataFrames

(mchav.github.io)

73 points | by mchav 5 days ago

6 comments

  • rich_sasha 3 hours ago
    The article starts well, trying to condense pandas' gazillion inconsistent and continuously-deprecated functions with tens of keyword arguments into a small set of composable operations - but it lost me after that.

    The more interesting nugget for me is the project they mention, Modin (https://modin.readthedocs.io/en/latest/index.html), which apparently went to the effort of analysing common pandas usage and compressed the API into a mere handful of operations. Which sounds great!

    Sadly, the purpose seems rather to have been to then recreate the full pandas API on top of that core, only running much faster, backed by things like Ray and Dask. So it's the same API, just faster.

    To me it's a shame. Pandas is clearly quite ergonomic for various exploratory interactive analyses, but the API is, imo, awful. The speed is usually not a concern for me - slow operations often seem to be avoidable, and my data tends to fit in (a lot of) RAM.

    I can't see any sign that their more condensed API is public-facing and usable.

    • bbkane 27 minutes ago
      Check out Polars - I find it much more intuitive than pandas, as it reads closer to SQL (and I learned SQL first). Maybe you'll feel the same way!
  • few 3 hours ago
    I feel like one or two decades ago, all the rage was rewriting programs in terms of just two primitives: map and reduce.

    For example, filter can be expressed as:

      from functools import reduce  # reduce is not a builtin in Python 3

      data = [1, 2, 3, 4]  # example input; the original left `data` undefined
      is_even = lambda x: x % 2 == 0
      # wrap each keeper in a singleton list, map the rest to empty lists
      mapped = map(lambda x: [x] if is_even(x) else [], data)
      # concatenating the lists back together yields the filtered result
      filtered = reduce(lambda x, y: x + y, mapped, [])  # [2, 4]
    
    But then the world moved on from it because it was too rigid.
    • mememememememo 2 hours ago
      Performance aside, it seems you could do most, maybe all, of the ops with those three. I say three because your sneaky plus is a union operation. So: map, reduce, and union.

      But you are also allowing arbitrary code expressions. So it is less lego-like.

  • pavodive 58 minutes ago
    When I started reading about pandas' complexity and the smaller set of operations needed, I couldn't help but think of the simplicity of R's data.table.

    Granted, it's got more than 15 functions, but its simplicity seems to me very similar to what the author presented in the end.

  • jeremyscanvic 55 minutes ago
    It's very insightful how they explain the difference between dataframes and SQL tables / standard relational structures!
  • getnormality 1 hour ago
    Hmm. Folks trying to discover the elegant core of data frame manipulation by studying... pandas usage patterns. When R's dplyr solved this over a decade ago, mostly by respecting SQL and following its lead.

    The pandas API feels like someone desperately needed a wheel and had never heard of a wheel, so they made a heptagon, and now millions of people are riding on heptagon wheels. Because it's locked in now, everyone uses heptagon wheels, what can you do? And now a category theorist comes along, studies the heptagon, and says hey look, you could get by on a hexagon. Maybe even a square or a triangle. That would be simpler!

    No. Stop. Data frames are not fundamentally different from database tables [1]. There's no reason to invent a completely new API for them. You'll get within 10% of optimal just by porting SQL to your language. Which dplyr does, and then closes most of the remaining optimality gap by going beyond SQL's limitations.

    You found a small core of operations that generates everything? Great. Also, did you know Brainfuck is Turing-complete? Nobody cares. Not all "complete" systems are created equal. A great DSL is not just about getting down to a small number of operations. It's about getting down to meaningful operations that are grammatically composable. The relational algebra that inspired SQL already nailed this. Build on SQL. Don't make up your own thing.

    Like, what is "drop duplicates"? What are duplicates? Why would anyone need to drop them? That's a pandas-brained operation. You want the distinct keys defined by a select set of key columns, like SQL and dplyr provide.
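    As a pure-Python sketch over rows-as-dicts (a hypothetical illustration, not how SQL or dplyr actually implement it), "distinct keys defined by a select set of key columns" is just first-row-wins per key:

```python
# Hypothetical helper: keep the first row seen for each key, in the
# spirit of SQL's DISTINCT ON / dplyr's distinct(). Rows are dicts.
def distinct_on(rows, key_cols):
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 1, "name": "a", "score": 20},  # same (id, name) key
    {"id": 2, "name": "b", "score": 30},
]
print(distinct_on(rows, ["id", "name"]))  # keeps the score-10 and score-30 rows
```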

    Who needs a separate select and rename? They're just name management. One flexible select function can do it all. Again, like both SQL and dplyr.
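    As a sketch (a hypothetical pure-Python helper, not the actual SQL or dplyr machinery), one select that takes a new-name-to-old-name mapping covers both projection and renaming in a single operation:

```python
# Hypothetical combined select/rename: `mapping` is {new_name: old_name},
# so picking columns and renaming them happens in one call.
def select(rows, mapping):
    return [{new: row[old] for new, old in mapping.items()} for row in rows]

rows = [{"first_name": "Ada", "yr": 1815}]
print(select(rows, {"name": "first_name", "year": "yr"}))
# [{'name': 'Ada', 'year': 1815}]
```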

    Who needs a separate difference operation? There's already a type of join, the anti-join, that gets that done more concisely and flexibly, and without adding a new primitive, just a variation on the concept of a join. Again, like both SQL and dplyr.
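    Sketched in pure Python (hypothetical helper, for illustration only): an anti-join keeps the left rows whose key has no match on the right, and when the key is the whole row it reduces to a set difference:

```python
# Hypothetical anti-join: rows of `left` whose key tuple does not appear
# in `right`. With `on` = all columns, this is exactly a set difference.
def anti_join(left, right, on):
    right_keys = {tuple(r[c] for c in on) for r in right}
    return [r for r in left if tuple(r[c] for c in on) not in right_keys]

orders = [{"id": 1}, {"id": 2}, {"id": 3}]
shipped = [{"id": 2}]
print(anti_join(orders, shipped, ["id"]))  # [{'id': 1}, {'id': 3}]
```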

    Props to pandas for helping so many people who have no choice but to do tabular data analysis in Python, but the pandas API is not the right foundation for anything, not even a better version of pandas.

    [1] No, row labels and transposition are not a good enough reason to regard them as different. They are both just structures that support pivoting, which is vastly more useful, and again, implemented by both R and many popular dialects of SQL.

    • DangitBobby 35 minutes ago
      I guess I have pandas brain, because I definitely want to drop duplicates: 100% of the time I'm worried about duplicates, and 99% of the time the only thing I want to do with them is drop them. When you've got 19 columns it's _really fucking annoying_ if the tool you're using doesn't have an obvious way to say `select distinct on () from my_shit`. A close second, at say 98% of the time: I want to get a count of duplicates as a sanity check, because I know to expect a certain amount of them. Pandas makes that easy too, in a way SQL makes really fucking annoying. There are a lot of parts of pandas that made me stop using it long ago, but first-class duplicate handling is not among them.
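      The sanity check described above - count rows per key and flag the keys that repeat - can be sketched in pure Python (hypothetical helper, for illustration):

```python
from collections import Counter

# Hypothetical duplicate-count check: how many rows share each key,
# reporting only the keys that occur more than once.
def duplicate_counts(rows, key_cols):
    counts = Counter(tuple(r[c] for c in key_cols) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

rows = [{"id": 1}, {"id": 1}, {"id": 2}]
print(duplicate_counts(rows, ["id"]))  # {(1,): 2}
```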

      And the API is vastly superior to SQL in some respects from a user perspective, despite being all over the place in others. Dataframe select/filtering, e.g. df = df[df.duplicated(keep='last')], is simple, expressive, obvious, and doesn't result in bleeding fingers. The main problem is that the rest of the language around it - all the indentation, newlines, loops, functions and so on - can be too terse or too dense, and much harder to read than SQL.

      • gregw2 15 minutes ago
        You articulate your case well, thank you!

        I always warn people (particularly junior people) though that blindly dropping duplicates is a dangerous habit because it helps you and others in your organization ignore the causes of bad data quickly without getting them fixed at the source. Over time, that breeds a lot of complexity and inefficiency. And it can easily mask flaws in one's own logic or understanding of the data and its properties.

        • DangitBobby 8 minutes ago
          When I'm in pandas (or was, I don't use it anymore) I'm always downstream of some weird data process that ultimately exported to a CSV from a team that I know has very lax standards for data wrangling, or it is just not their core competency. I agree that duplicates are a smell but they happen often in the use-cases that I'm specifically reaching to pandas for.
      • getnormality 23 minutes ago
        Duplicates in source data are almost always a sign of bad data modeling, or of analysts and engineers disregarding a good data model. But I agree that this ubiquitous antipattern that nobody should be doing can still be usefully made concise. There should be a select distinct * operation.

        And FWIW I personally hate writing raw SQL. But the problem with the API is not the data operations available, it's the syntax and lack of composability. It's English rather than ALGOL/C-style. Variables and functions, to the extent they exist at all, are second-class, making abstraction high-friction.

        • DangitBobby 12 minutes ago
          Oooh buddy how's the view from that ivory tower??

          But seriously, I'm not always in control of upstream data. I get stuff thrown over to my side of the fence by an organization that just needs data jiggled around for one-off ops purposes. They are communicating to me via CSV file scraped from Excel files in their Shared Drive, kind of thing.

          • getnormality 4 minutes ago
            Do what you gotta do, but most of my job for the past decade has been replacing data pipelines that randomly duplicate data with pipelines that solve duplication at the source, and my users strongly prefer it.

            Of course, a lot of one-off data analysis has no rules beyond "get a quick answer that no one will complain about"!

            • DangitBobby 2 minutes ago
              I updated my OG comment for context. As an org we also help clients come up with pipelines but it's just unrealistic to do a top-down rebuild of their operations to make one-off data exports appeal to my sensibilities.
    • fn-mote 1 hour ago
      Amen.

      The author takes the 4 operations below and discusses some 3-operation thing from category theory. Not worth it, and not as clear as dplyr.

      > But I kept looking at the relational operators in that table (PROJECTION, RENAME, GROUPBY, JOIN) and thinking: these feel related. They all change the schema of the dataframe. Is there a deeper relationship?

  • jiehong 2 hours ago