* Tags:: #🗞️Articles, [[Data culture]]
* Author:: [[Benn Stancil]] and [[Mark Grover]] (Stemma.ai)
* Link:: [Good Data Citizenship Doesn’t Work | by Benn Stancil | Jan, 2022 | Towards Data Science](https://towardsdatascience.com/good-data-citizenship-doesnt-work-265f13a37fa5)
* Source date:: [[2022-01-21]]
* Read date:: [[2022-01-22]]

> Data, it turns out, doesn't so much drift towards entropy, but sprints at it. ^7e41a3

I have mixed feelings about this article. There are many things I like about it, but I disagree with its main premise: it is not that good data citizenship doesn't work, it is that, morals aside, **we don't have good data citizenship**. *I have to clean my mouth before* disagreeing with Benn Stancil, but I'll have a go at it.

I really liked the extension of the data democracy analogy: for a democracy to work, you need both good citizens (the people who use data) and good leaders (the data team, providing the right tools and advice). And, in a kind of [[📖 Brave New World]] dystopia, democracies can be buried under too much data.

More people are using more data than before (which doesn't mean that [[data literacy]] is growing at the same speed, see [[🗞️ What is analytics engineering]]):

>In recent years, this has all changed. The number of patrons has grown enormously, from executives and a few operational leaders to everyone at a company. The variety of orders has also exploded. Data isn’t used for just KPIs, but for answering complex questions across a wide range of topics, from product and engineering to operations, customer support and executive decision making. And the complexity of each order has increased too. Data that used to live in a few central systems is now spread across dozens (if not hundreds) of third-party SaaS applications. According to [some sources](https://www.statista.com/statistics/1233538/average-number-saas-apps-yearly), organizations now use more than ten times more SaaS products today than they did just five years ago.

Data teams resorted to self-service to avoid becoming a bottleneck. The authors use another analogy here: we built a data buffet for people. I don't think it is a good analogy:

>But it’s created a new problem: How can people know they can trust the data they’re seeing? To overextend the buffet analogy, how can people know the food they find in the corner station, the one that feels like it’s been unattended all night, is still safe to eat?

In a buffet, people don't prepare their own food, and the waiters are the ones making sure that the food doesn't become stale. If we applied this to data, we would be back to the centralized, bottleneck model of a central data team keeping up with all data management. So no, we are not providing a buffet (at least in these current immature stages), we are providing a camping kitchen:

![[Pasted image 20220123095953.png]]

Then, chaos comes:

>People will show up in a meeting to talk through a business problem, and different people will bring mismatched numbers. All too often, data slows decisions down rather than speeds them up, because people spend more time debating if the data is right than they do talking about what to do about it. And all too often, data teams will update core datasets or KPIs, and old versions will linger on for months. Incidents like these erode trust in data and data teams. It teaches people not to participate in the data democracy, and to make decisions entirely on their own, often without using data at all.

What do people do to try to avoid this? Documentation. However...

>But the good habits slip. Data citizens forget to update it, or, understandably busy with their day jobs, don’t have time to do it. New people join the company and are never trained to update the docs.

And here is where they lost me. Should we throw our hands up in the air, saying that "keeping up with good habits is impossible"? I think we can hold ourselves to higher standards.

This is a cliché (life is a disappointing succession of commonplaces): **this is not a technical problem, it is a culture problem**. It is the teams failing to understand that their data is also their product, as the [[Data Mesh]] philosophy prescribes. In fact, this [[Tragedy of the Commons]] in data comes from a myopic overfocus on your team's current tasks, and it is wrong not only for data. From [[📖 Extreme Programming Explained]]:

![[📖 Extreme Programming Explained#^da43d6]]

Not to mention that data helps with "thinking" (it is a tool for improving decisions), not with "doing", and many individuals and organizations value spending time "doing things right" more than "doing the right things".

It is also easy to understand why an organization ends up giving up on its efforts to work well with data. It is a vicious circle:

1. We don't work well with data.
2. We end up with bad data.
3. We don't trust data.
4. We won't devote more resources (time, people) to data.
5. Back to square one. ^5cd9cd

## Indulgent practical advice

The article then goes on to give practical advice on what to do about it. Of course, one of the tips is that you should make it easy for people to collaboratively document data, and require it as part of data creation. The authors also recommend auto documentation (ok) plus manual documentation of things that aren't what they seem, that is, *tribal knowledge*. This is also ok, although it shouldn't be needed if we reduced tribal knowledge in the first place, avoiding idiosyncratic stuff that only adds confusion, and making good use of a [[Ubiquitous Language]], at least within the teams.

But in the end, I strongly disagree again. The authors say:

> The aim shouldn't be to create a perfect source of truth, but a sketch of what appears to be true.

And, related to that, "Let there be mess": we should focus on important areas and leave others as sandboxed environments, an idea that also appears in [[📖 Data Management at Scale]]:

![](assets/1640757288_121.png)

I can't help but think that the abuse of these ideas leads us to indulgence. I agree that working with data is messy (particularly because the tools to properly manage it are not really there yet), but we can't simply assume this unilaterally: if we do, we won't have the trust of the organization. At the very least, we (the data people) should make conscious efforts to instill this vision: to make people understand that extracting insights from data is hard and not an exact science, and that caution is needed because data is full of pitfalls (see 3. Virtues for leading projects in [[📖 How to lead in Data Science]]).

And, again, I think we can hold ourselves to higher standards. As [[Andrew S. Grove]] states in [[📖 High Output Management]]:

>**Let chaos reign, then rein in chaos.**
>
>…anything that can be done will be done, if not by you, then by someone else (…) as a manager in such workplace, you need to develop a *higher tolerance for disorder*. Now, you **should still not accept disorder**. In fact, you should do your best to drive what's around you *to* order (p. XIV).