How do I convince my data engineer to not modify data before including it in our db?

Taringano@lemm.ee · 1 year ago

How do I convince my data engineer to not modify data before including it in our db?

JoeyJoeJoeJr@lemmy.ml · 1 year ago

Where are you getting the data from, and do you maintain access to the originals after ingestion?

Is the database used for anything other than Elasticsearch?

If you do not have access to it after ingestion, you should keep a perfect copy of the data because, as you noted, you lose information otherwise. This can be especially important to address bugs in normalization logic, or requirement changes. For example, if your normalization logic replaces “-” with “_”, and at some point in the future you need to distinguish between “this-phrase” and “this_phrase”, if you’ve lost the original data you’ve also lost the ability to fix your normalized data and indexes.
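To make the example above concrete, here is a minimal sketch of how that kind of normalization collapses distinct inputs. The `normalize` function is hypothetical and only stands in for whatever your pipeline actually does.

```python
def normalize(value: str) -> str:
    """Hypothetical normalization: lowercase, then replace '-' with '_'."""
    return value.lower().replace("-", "_")


# Three distinct original values.
originals = ["this-phrase", "this_phrase", "This-Phrase"]
normalized = [normalize(v) for v in originals]

distinct_before = len(set(originals))
distinct_after = len(set(normalized))

# Every input maps to the same string, so the originals cannot be
# recovered from the normalized values alone.
print(distinct_before, distinct_after, normalized)
```

Once only the normalized strings are stored, nothing in the data says which of the three originals a given row came from.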

Similarly, while the existing normalization logic might be better for Elasticsearch, you may not be using Elasticsearch forever, and you don’t know the requirements of the next system.

That all said, I’m also skeptical that there is any real Elasticsearch benefit to modifying your data as described, in particular converting to lowercase. You might want to ask your data engineer to tell you explicitly what the purported benefits are. If they tell you it’s for performance, ask for metrics, and weigh performance gains/costs against the usability gains/costs. If they can’t give you metrics, ask for the documentation supporting their claims. If they can’t give you metrics or docs, find a new data engineer.

Taringano@lemm.ee · 1 year ago

We build market analytics/reports out of the data from Elasticsearch.

Thank you for your suggestion. I’ll address this with them to see if I can get a better understanding of the reasoning behind it.

We don’t have access to all of the past data. Most of it, yes, but a lot of it we don’t.

CaptainBuckleroy@lemm.ee · edit-2 1 year ago

The answer to your question is extremely use-case specific, and sounds like something to discuss with others at your workplace.

Taringano@lemm.ee · 1 year ago

That’s fair.

When would that be useful?

Consider that we have no space restrictions and no need for absurd speeds. All our competitors store the data as it was originally entered (we share data sources; theirs displays nicely, while ours displays in all lowercase, etc., as mentioned).

CaptainBuckleroy@lemm.ee · edit-2 1 year ago

Got it, useful info.

I’m a software engineer, but here’s a bunch of stuff to consider, in no particular order.

Maybe the data engineer isn’t the one to convince?

If it saves time, how much time? Would the tools you use (I’m using the term “tools” broadly here) work differently? For example, would analytics count IBM, Ibm, and ibm as three different things?
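A rough sketch of the counting question, using made-up sample values. It only shows that a naive count treats the three spellings as separate keys, while folding case at analysis time merges them without touching the stored text.

```python
from collections import Counter

# Made-up sample of company mentions as they might appear in source data.
mentions = ["IBM", "Ibm", "ibm", "IBM"]

# Counting the raw strings keeps each spelling separate.
raw_counts = Counter(mentions)

# Folding case only for the count merges them; the originals are untouched.
folded_counts = Counter(m.casefold() for m in mentions)

print(raw_counts)
print(folded_counts)
```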

Is there a solution that’s the best of both worlds? If space isn’t an issue, can the original text be preserved and linked to each entry, so that the formatted text is used for Elasticsearch while the original is still available?
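One way this might look on the Elasticsearch side is sketched below as a plain Python dict holding an index-creation body. It assumes Elasticsearch 7 or later (typeless mappings); the field name `company_name`, the sub-field name `folded`, and the normalizer name `lowercase_normalizer` are all made up for illustration. The idea is that the document is indexed unmodified, so `_source` returns the original text, and the case folding happens only inside the index.

```python
import json

# Sketch of an index-creation body (assumed names, not your actual schema).
index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                # Custom normalizer that lowercases keyword values at index
                # and query time without changing the stored document.
                "lowercase_normalizer": {
                    "type": "custom",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "company_name": {
                # Full-text search; the default analyzer lowercases tokens.
                "type": "text",
                "fields": {
                    # Exact, case-insensitive matching and aggregations.
                    "folded": {
                        "type": "keyword",
                        "normalizer": "lowercase_normalizer",
                    }
                },
            }
        }
    },
}

# JSON you could send when creating the index.
body_json = json.dumps(index_body, indent=2)
print(body_json)
```

Whether this fits depends on how your reports query the index, so treat it as a starting point for the conversation with your data engineer rather than a drop-in answer.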

Maybe “convincing” isn’t the right approach, but learning is?

floppade [he/him]@lemm.ee · 1 year ago

If space is not an issue, you can keep both versions, one for display, one for search in your db. That way, you don’t need to figure out how to reformat it later.
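A minimal sketch of the two-version idea, using SQLite only because it ships with Python. The table and column names (`companies`, `name_raw`, `name_search`) are made up, and your real database will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies ("
    "  id INTEGER PRIMARY KEY,"
    "  name_raw TEXT NOT NULL,"      # exactly as received, used for display
    "  name_search TEXT NOT NULL"    # normalized copy, used for lookups
    ")"
)


def insert_company(conn: sqlite3.Connection, name: str) -> None:
    """Store the original text next to its case-folded search form."""
    conn.execute(
        "INSERT INTO companies (name_raw, name_search) VALUES (?, ?)",
        (name, name.casefold()),
    )


def find_company(conn: sqlite3.Connection, query: str) -> list[str]:
    """Match on the normalized column, return the original text."""
    rows = conn.execute(
        "SELECT name_raw FROM companies WHERE name_search = ? ORDER BY id",
        (query.casefold(),),
    )
    return [row[0] for row in rows]


insert_company(conn, "IBM")
insert_company(conn, "Acme Corp")
print(find_company(conn, "ibm"))
```

The search column can be rebuilt from the raw column whenever the normalization rules change, which is the part you lose if only the normalized text is kept.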

Side note: there is an underlying issue here, which is that you and your data engineer don’t communicate technological needs well. It’s a common challenge, so no judgment/condescension meant from me. Consider taking short courses on the technologies your team uses, so you can get better information and context from your meetings with them. I recognize that expecting you to organize that instead of your boss isn’t fair, but I hope it helps you avoid future friction and stress.

cryptiod137@lemmy.world · 1 year ago

Stack Exchange does say that matching in Elasticsearch can be case-sensitive (keyword fields match exactly, while text fields go through an analyzer), so that is probably why they do that.

greengnu@slrpnk.net · 1 year ago

It is fine if your database has _A tables (also called journal or audit tables), as the previous values would be stored in the _A table entries in case you ever want to get that data back.
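For anyone who has not seen the pattern, here is a small sketch of an audit table kept up to date by a trigger. It uses SQLite through Python for convenience; the names `companies`, `companies_A`, and `companies_audit` are invented, and trigger syntax differs between databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE companies (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );

    -- Audit ("_A") table holding the value each row had before a change.
    CREATE TABLE companies_A (
        id         INTEGER,
        old_name   TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Copy the old value into the audit table before every name update.
    CREATE TRIGGER companies_audit
    BEFORE UPDATE OF name ON companies
    BEGIN
        INSERT INTO companies_A (id, old_name) VALUES (OLD.id, OLD.name);
    END;
    """
)

conn.execute("INSERT INTO companies (name) VALUES ('IBM')")

# The kind of in-place normalization being discussed in this thread.
conn.execute("UPDATE companies SET name = lower(name)")

current = conn.execute("SELECT name FROM companies WHERE id = 1").fetchone()[0]
original = conn.execute(
    "SELECT old_name FROM companies_A WHERE id = 1"
).fetchone()[0]
print(current, original)
```

Note that this only helps if the data is inserted as received and normalized afterwards. If the text is lowercased before it ever reaches the database, the audit table never sees the original.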

But if your database is missing such good practices, tell them to just use lower() or upper() at query time and leave your data alone.
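A quick sketch of what that looks like, again with SQLite through Python and a made-up `companies` table. The stored value keeps its original casing and the comparison folds case on both sides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO companies (name) VALUES (?)",
    [("IBM",), ("Acme Corp",)],
)


def find_case_insensitive(conn: sqlite3.Connection, query: str) -> list[str]:
    """Compare lowercased values at query time; stored data is unchanged."""
    rows = conn.execute(
        "SELECT name FROM companies WHERE lower(name) = lower(?)",
        (query,),
    )
    return [row[0] for row in rows]


matches = find_case_insensitive(conn, "ibm")
print(matches)
```

One caveat: SQLite's built-in lower() only folds ASCII letters by default, and calling a function on the column can prevent a plain index from being used, so check how your own database handles both.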