this post was submitted on 23 Aug 2023
5 points (85.7% liked)

Data Engineering

Discussion on Data Engineering topics: data pipelines, tools and technologies, databases and DBMS, best practices.


Our data engineer insists on lowercasing everything and removing other formatting, such as newlines, from free-text fields.

They say it's "better for elastic search".

To me that makes no sense, and it loses information that can't be added back. But I couldn't really convince them otherwise. So far no real problem has come out of it, but it makes for a worse experience for the user: company names that are acronyms show up all lowercase (ibm, llc, etc.), and in free-text fields we can no longer tell where the user wrote in caps or added paragraphs.

What are your thoughts on this?

Disclaimer: I'm not a data engineer, just a PM on a data-related product.

top 8 comments
[–] [email protected] 6 points 1 year ago (1 children)

Where are you getting the data from, and do you maintain access to the originals after ingestion?

Is the database used for anything other than Elasticsearch?

If you do not have access to it after ingestion, you should keep a perfect copy of the data because, as you noted, you lose information otherwise. This can be especially important to address bugs in normalization logic, or requirement changes. For example, if your normalization logic replaces "-" with "_", and at some point in the future you need to distinguish between "this-phrase" and "this_phrase", if you've lost the original data you've also lost the ability to fix your normalized data and indexes.
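To make the "-" versus "_" example concrete, here is a minimal sketch. The two `normalize_*` functions are hypothetical stand-ins, not your engineer's actual logic; the point is only that a rule change is a simple re-run when the raw value is stored next to the normalized one.

```python
def normalize_v1(text: str) -> str:
    """Hypothetical original rule: lowercase and turn '-' into '_'."""
    return text.lower().replace("-", "_")


def normalize_v2(text: str) -> str:
    """Hypothetical revised rule: lowercase only, so '-' and '_' stay distinct."""
    return text.lower()


raw_values = ["this-phrase", "this_phrase"]

# Lossy: both inputs collapse to the same string, so the v1 output alone
# cannot tell you which one the user originally typed.
lossy = [normalize_v1(value) for value in raw_values]

# Recoverable: keep the raw value beside the normalized one, and a rule
# change becomes a re-run over the originals.
records = [{"raw": value, "normalized": normalize_v1(value)} for value in raw_values]
for record in records:
    record["normalized"] = normalize_v2(record["raw"])
```

If only `lossy` had been stored, there would be nothing left to re-run `normalize_v2` on.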

Similarly, while the existing normalization logic might be better for Elasticsearch, you may not be using Elasticsearch forever, and you don't know the requirements of the next system.

That all said, I'm also skeptical that there is any real Elasticsearch benefit to modifying your data as described, in particular converting to lowercase. You might want to ask your data engineer to tell you explicitly what the purported benefits are. If they tell you it's for performance, ask for metrics, and weigh performance gains/costs against the usability gains/costs. If they can't give you metrics, ask for the documentation supporting their claims. If they can't give you metrics or docs, find a new data engineer.

[–] [email protected] 2 points 1 year ago

We build market analytics/reports out of the data from Elasticsearch.

Thank you for your suggestion. I'll address this with them to see if I can get a better understanding of the reasoning behind it.

We don't have access to all of the past data. Most of it, yes, but a lot of it, no.

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago) (1 children)

The answer to your question is extremely use-case specific, and sounds like something to discuss with others at your workplace.

[–] [email protected] 1 points 1 year ago (1 children)

That's fair.

When would that be useful?

Consider that we have no space restrictions and no need for absurd speeds. All our competitors store the data as it was originally entered (we share data sources; theirs displays nicely, ours displays all lowercase, etc., as mentioned).

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago)

Got it, useful info.

I'm a software engineer, but here's a bunch of stuff to consider, in no particular order.

Maybe the data engineer isn't the one to convince?

If it saves time, how much time? Would the tools you use (I'm using the term "tools" broadly here) work differently, such as analytics counting IBM, Ibm, and ibm as separate values?
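On the counting point, a rough sketch of what I mean, with made-up sample data and a hypothetical `count_companies` helper: you can group case-insensitively at analysis time and still report the spelling people actually typed, without lowercasing what you store.

```python
from collections import Counter, defaultdict


def count_companies(names):
    """Count names case-insensitively, but label each group with its
    most frequently seen original spelling."""
    totals = Counter()
    spellings = defaultdict(Counter)
    for name in names:
        key = name.casefold()  # comparison key only, never stored as the value
        totals[key] += 1
        spellings[key][name] += 1
    return {
        spellings[key].most_common(1)[0][0]: count
        for key, count in totals.items()
    }


sample = ["IBM", "Ibm", "ibm", "IBM", "Acme LLC"]
naive = Counter(sample)            # treats IBM, Ibm and ibm as three companies
grouped = count_companies(sample)  # one IBM group, labelled with its top spelling
```

So the case problem can be handled where the counting happens, rather than at ingestion.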

Is there a solution that's the best of both worlds? If space isn't an issue, can the original text be preserved and linked to each entry, so that the formatted text is used for Elasticsearch while the original is kept for display?

Maybe "convincing" isn't the right approach, but learning is?

[–] [email protected] 1 points 1 year ago

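Stack Exchange does say that string matching can be case-sensitive in Elasticsearch, so that is probably why they do that. As far as I know, though, that applies to keyword fields and term-level queries; text fields run through the default standard analyzer are already lowercased at index time.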
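For reference, here is a sketch of an index definition that gets case-insensitive matching without touching the stored text, written as a Python dict so it can be sent as the JSON body when creating the index. The field name `company_name` and the normalizer name `lowercase_normalizer` are made up for illustration; the original document stays as submitted in `_source`.

```python
import json

# Hypothetical index definition: "company_name" and "lowercase_normalizer"
# are illustrative names, not taken from the original poster's setup.
index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_normalizer": {
                    "type": "custom",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "company_name": {
                # Full-text search: the standard analyzer lowercases tokens
                # at index time, so "IBM" and "ibm" match the same documents.
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    # Exact matching and aggregations: the keyword sub-field
                    # is lowercased by the normalizer in the index only.
                    "exact": {
                        "type": "keyword",
                        "normalizer": "lowercase_normalizer",
                    }
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

One caveat: aggregations on the normalized keyword sub-field return the lowercased values, so display strings should still come from the original document.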

[–] [email protected] 1 points 1 year ago

It is fine if your database has _A tables (also called journal or audit tables) as the previous values would be stored in the _A table entries in case you ever desired to get that data back.

But if your database is missing such good practices, tell them to just use lower() or upper() in their queries and leave your data alone.
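A minimal sketch of what that looks like, using SQLite from the Python standard library. The `companies` table and the `find_company` helper are made up for the example; the comparison is lowercased on both sides while the stored value is returned untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
conn.executemany(
    "INSERT INTO companies (name) VALUES (?)",
    [("IBM",), ("Acme LLC",), ("ibm Research",)],
)


def find_company(conn, query):
    """Case-insensitive exact match that returns the stored spelling."""
    rows = conn.execute(
        "SELECT name FROM companies WHERE lower(name) = lower(?)",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


matches = find_company(conn, "ibm")
```

Note that SQLite's built-in lower() only folds ASCII characters by default, so other databases or non-English data may need a different function or collation.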

[–] [email protected] 1 points 1 year ago

If space is not an issue, you can keep both versions, one for display, one for search in your db. That way, you don’t need to figure out how to reformat it later.

Side note: there is an underlying issue, which is that you and your data engineer don’t communicate technological needs well. It’s a common challenge, so no judgment/condescension meant from me. Consider taking short courses on the technologies your team uses, so you can get better information and context from your meetings with them. I recognize that expecting you to organize that instead of your boss isn’t fair, but I hope it helps you avoid future friction and stress.