Matching two Excel files

Dsolei · July 17, 2023, 1:41pm

Greetings. Can you suggest me the best way to use easymorph to do the follow:

I receive data from warehouse, in a raw format. I need to take that file (excel) and assign every product in it with the product code of my main database (also in excel). An example to make it clear:

Entry received in the raw format (FILE A): “birra xyz cl. 33”
In my core database excel (FILE B) “birra xyz vap cl. 33 birra chiara bassa gradazione - Cod. 1234”

I need a FILE C where the program match “birra xyz cl.33” (found in FILE A) with the “Cod.1234” (found in FILE B).

How can achieve that with easymorph?

Thanks in advance for any support (and forgive my bad english)

dgudkov · July 17, 2023, 7:25pm

Hello Davide and welcome to the Community!

You can load both files in EasyMorph and then use the “Match” action for instance in the “Starts with” or “Contains” mode.

Dsolei · July 19, 2023, 10:24am

Hi Dmitry, thx for you reply and the welcome first of all

Back on topic, that’s not what I need. The 2 records were just example, but I have a flow of data to match that I cannot do with a single rule for every one, too many (it’s like I do at this time, manually associate any product in (file A) with the code in the (file B).

I would like to know if easymorph can automatically do the match based on the affinity. Take this example: First table (File A) have nearly 2.000 records (product description), then the core database (File B) got over 24.000 records (every record has a product description, with a code associated). For every single record in the (File A), can the easymorph try to associate (or suggest) the right code, reading and comparing the description associated in every code present in File B, based on the best affinity?

Hope I was able to figure out my goal, as usual any support is very appreciate.

roberto · July 19, 2023, 1:16pm

Hi @Dsolei,

Are you familiar with iterations in EasyMorph? More info here: EasyMorph | Loops and iterations.

I think what you could do in this case is iterate through the 2000 records in file A, and for each record, check which description in file B is most similar, and add that one as the “suggested” match. To obtain the “most similar” description, you can use the “distance” function: syntax:functions:distance [EasyMorph Help]

I’ve created a sample project where there are two datasets - “File A” which includes new animal descriptions, and “File B” which is the animal core database which includes the animal code and description. Each new description in “File A” is compared with all descriptions in “File B” and the most similar is returned:

Fuzzy matching and iterations.morph (6.8 KB)

Let me know if you have any questions!

dgudkov · July 19, 2023, 1:17pm

What does “best affinity” mean in your case? How should the matching logic work exactly? The “Match” action can do fuzzy matching based on edit distance and is intended for matching non-exact values.

roberto · July 19, 2023, 1:21pm

@dgudkov I wasn’t aware that the “Match” action also did fuzzy lookup! Good to know.

@Dsolei Dmitry’s solution doesn’t require using iterations, so that would be much simpler than mine.

Dsolei · July 20, 2023, 10:32am

Thx Roberto, and thx Dmitry. I see the example provided by Roberto and he nailed it, that’s exactly what I need. Problem is don’t understand how you have done it

Dmitry solution on the other side, seems not fitting to solve my problem, or I have not understand well how the Matching action works. If i try to do the match, the program only take the first entry in the column (file A) and try to search and substitute every entry he found in the column “description” in the file B that match with that single entry in file A, ignoring any other record in the file A. Do I make some mistake? I need the result of Roberto, where every recod in file A is matched with the most fitting description in the file B.

Roberto, maybe I’m asking too much, but could you be so kindly to describe me every passage you made to achieve that result?

Test1.morph (3.6 KB)

In the above file I created a simulation of File A and File B, I would need a result like Roberto one, how can achieve that?

Sorry im very noob but if the program can solve this task I found the perfect solution for my work. Thanks in advance for your patience my friends

roberto · July 20, 2023, 11:04am

Hi @Dsolei,

I still think that Dmitry’s solution is better, because it doesn’t require iterations and when doing the fuzzy matching it checks all the descriptions in file B and gets the one with the least edit distance (with a maximum edit distance of 10, is that enough?).

And once you have the matching description from the File B (core database), you can also bring over the Database Code.

See sample project: Fuzzy matching.morph (4.6 KB)

(You will see that I have calculated a new column called “Database Match” that just brings over the same values as column “New Description”. This has been done because the “Match” action does not create a new column where the matches are stored, it replaces the existing column, so I “duplicated” the “New Descriptions” column before doing the match so that you can see both the “New Description” and the “Database Match” columns side by side).

Regards

Dsolei · July 25, 2023, 10:20am

Hi @roberto ,
thanks very much for your support, I was able to recreate every single step and is what i needed. The only problem left, but is pretty big, it’s that the Fuzzy match do too many failures, either do not found any match, or give it wrong when there are other with much better affinity.

I give you an example to be more clear. I just substitute the original brand with “fakename”:

File A
Description
ACQUA DI fakename PET L.1,5

File B Core database
Description
fakename EFF.NAT PET LT.1,5
fakename GAS PET LT.1
fakename NAT. VAP LT. 1,5
fakename GAS VAR CL.75

Easymorph doesn’t find any match, even if fakename is written the same, PET and 1,5 should pilot the program to do the right match. Instead the program surrender and doesn’t find any match. I set the max edit distance to 10. (any value under this gives me no match at all)

There is something more that I can do to get better results or it’s a program limit? I dont mind got some error to fix manually, but at least there is a way to force the match with better affinity? Keep in mind as you see in my example, that File B Core database has lots of product that are very similar but differ just for capacity, container, flavor ecc.

dgudkov · July 25, 2023, 12:54pm

How would EasyMorph understand that it should match “fakename” but not PET or 1.5? Match by names that don’t have any technical information. Reduce the edit distance to the minimal possible before matching.

Dsolei · July 25, 2023, 2:32pm

Hi @dgudkov , thx for your reply. I know it’s difficult for me try to explain what I would like as result and explain it to others in an understandable manner (plus englis is not my mother tongue). Thx for your patience, I really appreciate all of you.
In my previous example, I would like the program to “elaborate” like this: the most fitting description of the record in file A (ACQUA DI fakename PET L.1,5) among the records found in File B Core database, is fakename EFF.NAT PET LT.1,5. Reason why, it’s the only record in the whole File B that contains fakename, PET and 1,5 aswell, in this way is the “most fitting” description. I hope I was able to make you understand my desiderata, and if there is a way for Easymorph to produce a similar result.

I add more info to help understand what is my goal: I receive from different warehouses the extraction of their sells (mostly beer, water and wines). Every warehouse has is own way to describe the product, and I need to “uniform” that description with those (if present) in my core database, so I can use the data to make analysis ecc.

Again, thx for your patience and help.

roberto · July 25, 2023, 3:42pm

Hi @Dsolei,

Three questions:

A. Do you and the warehouses have the information contained in the description, in individual fields as well? So that there were a field for “Product Name”, “Bottle Type” and “Quantity”? If that would be the case you could compare by sub-sections instead of the whole description.

B. Does each warehouse always send the information the same way, and are new products being added all the time? If the warehouses always follow the same descriptions and products are not being added often, you could maintain a reference table that has two columns, one with the description of the product in the Warehouses, and one with the corresponding description in the core database, and use that to find the correct match instead of fuzzy matching.

C. The products in your core database have a product code, right? Would the warehouses be willing to include a column in their database to include your corresponding product code, to do the matching using this code?

Regards,

Roberto

Dsolei · July 26, 2023, 9:57am

Hi @roberto , thx for trying hard to help me, very appreciate

To answer your questions:

A. Sadly no. I have those in core database, but warehouse put all the informations in a single column and any warehouse has his own way to describe the same product. That’s why I try to find a workaround, otherwise a simple VLOOKUP in excel would do the trick.

B. Yes, once I “assigned” the product is done for good, but there are thousand of records and every months the can add new. Problem is, the matching is possible only using the description, and here we are.

C. Nope, cannot ask the warehouse to do this job, would be simply transfer the problem of matching to them. I would like to do that, but not possible

In the end Roberto, I have to understand if what i want is possible or not with easymorph. As I tried to explain to @dgudkov , I need the program to take the value of a table A. Let’s say is “Acqua fantastica Vap cl.75 pz.24”. Then take ALL the values of the table B (cored database) and match the one with more “affinity”, so the program should first search for all products with “Acqua”, then among all of them, keep only the product with the word “fantastica”, then among all of them, keep only the product with “VAP”, ecc untill he find the right match, or better, the record with most of the match.

Finally, I give you an example of it, and to keep as simple as possible, I limit the table A to just 1 record. There is a way for the program to “find” the right answer ? If you try yourself, you’ll have no problem to find the most fitting result (SAN FRATE GAS VAR PZ.24 CL.50), but how make the program achieve the same goal?
Test 3.morph (3.5 KB)

As usual, thx a lot for your time.