Data is often messy and rarely in perfect shape. This is especially true if the data comes from many different sources and the specifications are loosely defined. If you have access to data that is in great shape, it’s probably because someone else did the dirty work of validating it, cleaning it up, and normalizing it for you.
One particular type of data problem is matching data between data sources when exact linking identifiers are missing. This situation has come up for me quite often.
Matching techniques
When you need to match data, there are a number of techniques that can help. The best way to illustrate this is with a concrete example. To do this, we’ll pull data from the SEC’s EDGAR system. You can read this article to learn how to connect to EDGAR and fetch data, and another article to learn about stock index data in EDGAR. We’ll use some of the stock index data mentioned there to demonstrate some matching techniques.
In this article, we’ll look at multiple ways to attempt to match data between two data sources. The data in this case is a company name. We’ll use the following techniques to try to match as many values as we can:
- Merging with exact data
- Cleaning data with a regular expression (and then trying to match again)
- Picking the best data when there are duplicates, using groupby
- Fuzzy searching
- Good old manual verification and cleanup
A quick review
If you don’t want to read the earlier articles, I’ll give you a quick summary of how we got to this point. We found a filing on EDGAR that contains all the investments in an ETF. That filing has an XML document that lists the investments. If we choose an ETF that tracks a stock index, we can use the filing data to build the index holdings. At this point, we want to match the index members with the search terms used to find companies in EDGAR. This will allow us to fetch more data from EDGAR and build a rich database of information about all the stocks in an index, for free.
The problem
Once we dig into the data, we start to recognize the problem. The stocks held by the ETF are listed by name and CUSIP. We need to figure out how to match the investments in the N-PORT document to EDGAR’s internal system for tracking companies, using other identifiers. But all we have to match on is the name.
What we’ll learn
Along the way, we will learn a few things about dealing with messy data. If you happen to like financial data, you’ll also learn a bit more about how EDGAR works and its internal data. If your domain is outside of finance, the techniques will still work for matching data.
So let’s just go through the basic steps from the last article and get the stock index data. Note that I am putting a little bit of code in a Python file and calling it; you can always get that code (and the notebook used to generate this article) on GitHub.
import requests
import pandas as pd

from utils import elem2dict, get_nport_values

website = "your website"
email = "your email"

# the N-PORT data for SPY, the S&P 500, 500+ stocks
url = "https://www.sec.gov/Archives/edgar/data/884394/000175272422196968/primary_doc.xml"

nport = get_nport_values(url, website, email)
nport.head()
                                           name                   lei  \
0                   Honeywell International Inc  ISRPG12PN4EIEOEMW547
1                   Discover Financial Services  Z1YLO2USPORE63VVUL20
2                                      FMC Corp  CKDHZ2X64EEBQCSP7013
3                                  Nordson Corp  14OS6Q5N55N95WM84M53
4  Charles River Laboratories International Inc  549300BSQ0R4UZ5KX287

                                          title      cusip   identifiers  \
0                   Honeywell International Inc  438516106  {'isin': {}}
1                   Discover Financial Services  254709108  {'isin': {}}
2                                      FMC Corp  302491303  {'isin': {}}
3                                  Nordson Corp  655663102  {'isin': {}}
4  Charles River Laboratories International Inc  159864107  {'isin': {}}

     balance units curCd        valUSD    pctVal payoffProfile assetCat  \
0  7390330.0    NS   USD  1.284513e+09  0.370781          Long       EC
1  3050290.0    NS   USD  2.884964e+08  0.083276          Long       EC
2  1371493.0    NS   USD  1.467635e+08  0.042364          Long       EC
3   587512.0    NS   USD  1.189359e+08  0.034331          Long       EC
4   541951.0    NS   USD  1.159613e+08  0.033473          Long       EC

  issuerCat invCountry isRestrictedSec fairValLevel  \
0      CORP         US               N            1
1      CORP         US               N            1
2      CORP         US               N            1
3      CORP         US               N            1
4      CORP         US               N            1

                                     securityLending issuerConditional
0  {'isCashCollateral': 'N', 'isNonCashCollateral...               NaN
1  {'isCashCollateral': 'N', 'isNonCashCollateral...               NaN
2  {'isCashCollateral': 'N', 'isNonCashCollateral...               NaN
3  {'isCashCollateral': 'N', 'isNonCashCollateral...               NaN
4  {'isCashCollateral': 'N', 'isNonCashCollateral...               NaN
A first check
Looking at this data, you can see that there is a company name, something called an LEI, and a CUSIP. The ISIN is not supplied in this file. The CUSIP, ISIN, and LEI are standard financial industry identifiers.
We want to turn the name, LEI, or CUSIP into stock exchange tickers. (Note that I will use the term symbol interchangeably with ticker.) If we have the symbol, we can translate that to a CIK, the internal company code used by the SEC. If you use the EDGAR search page, you can enter a symbol, name, or CIK. But we don’t want to manually process all 500+ rows of the DataFrame. And even if we did submit each name to the form, the user interface might present multiple matches to choose from.
How do we automate the mapping? Ideally, we could just find a free source of CUSIP to ticker mapping. If you work in the financial industry, you surely have a system at work that does this for you, and this article would be pointless. When I worked for a large bank, there was an entire department that maintained cross references of data. I wasn’t able to find a free and reliable API that does CUSIP to ticker mappings, though perhaps one does exist. Sometimes, you can find free sources for the more popular stock indexes. For example, we’ll use Wikipedia later to verify how well we did. But we’d like to have this mapping work for any stock or index, even the more obscure ones. For the purposes of this article, let’s pretend that we have to figure this out with only what we can observe on EDGAR.
A starting point
The SEC does publish a cross reference for CIK to symbol, and that mapping also contains a company name. Let’s start there. The file is just a big JSON object, which we can convert to a pandas DataFrame.
symbol_to_cik = requests.get("https://www.sec.gov/files/company_tickers.json").json()
print(symbol_to_cik['0'])

symbol_to_cik = pd.DataFrame(symbol_to_cik).T
print(symbol_to_cik.shape)
symbol_to_cik.head(10)
{'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}
(11975, 3)
   cik_str ticker                   title
0   320193   AAPL              Apple Inc.
1   789019   MSFT          MICROSOFT CORP
2  1018724   AMZN          AMAZON COM INC
3  1067983  BRK-B  BERKSHIRE HATHAWAY INC
4   731766    UNH  UNITEDHEALTH GROUP INC
5    34088    XOM        EXXON MOBIL CORP
6   200406    JNJ       JOHNSON & JOHNSON
7   104169    WMT            Walmart Inc.
8    19617    JPM     JPMORGAN CHASE & CO
9    93410    CVX            CHEVRON CORP
As you can see, it has a lot of symbols, over 11K. Looking at the list, I suspect that it is sorted by search frequency on EDGAR or popularity. We see that Apple Inc. is the most popular and has a ticker of AAPL and a CIK of 320193.
Let’s look at the largest holdings in the S&P 500. If you recall from the previous article, the S&P 500 is a market cap weighted index, so the bigger companies make up a larger share of it.
nport.sort_values(by='pctVal', ascending=False).head(15)[['title', 'cusip', 'pctVal']]
                       title      cusip    pctVal
433                Apple Inc  037833100  6.587783
186           Microsoft Corp  594918104  6.019457
483           Amazon.com Inc  023135106  2.912308
140             Alphabet Inc  02079K305  2.054029
223             Alphabet Inc  02079K107  1.890388
290                Tesla Inc  88160R101  1.770921
268   Berkshire Hathaway Inc  084670702  1.547893
49    UnitedHealth Group Inc  91324P102  1.511444
44         Johnson & Johnson  478160104  1.463783
214              NVIDIA Corp  67066G104  1.189526
344       Meta Platforms Inc  30303M102  1.158956
250         Exxon Mobil Corp  30231G102  1.130543
170  Procter & Gamble Co/The  742718109  1.081133
481      JPMorgan Chase & Co  46625H100  1.036458
148                 Visa Inc  92826C839  1.015419
Our job is to find the ticker by searching for the title/name that matches. Could this work? By inspection, I spot a few issues:
- the CIK lookup has Apple Inc. whereas the N-PORT contains Apple Inc (without the period)
- Microsoft is capitalized in one and not the other
- Companies that start with The have it placed at the end in N-PORT
- There are duplicate rows for Alphabet Inc
So it’s not going to be very straightforward, but let’s give it a quick try. Our initial approach is to do a pandas merge, joining exactly on the title and doing a left join. This means we want all rows from the left DataFrame, and will get null values for the new merged columns where there’s no match. We’ll create a new DataFrame to hold these results.
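If left joins are new to you, here’s a tiny standalone illustration using toy data of my own (not the article’s):

import pandas as pd

left = pd.DataFrame({"title": ["Apple Inc", "Fox Corp"]})
right = pd.DataFrame({"title": ["Apple Inc"], "ticker": ["AAPL"]})

# how="left" keeps both rows of left; Fox Corp has no match, so its ticker is NaN
print(left.merge(right, on="title", how="left"))

Now, the real merge on our data: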
nport_tickers = nport.merge(symbol_to_cik, left_on='title', right_on='title', how='left')

nport_tickers.loc[~pd.isnull(nport_tickers['ticker'])].shape
(52, 20)
OK, we got 52 exact matches out of 505 rows. That’s a start, but not very good. What do the matches (the non-null rows) look like?
nport_tickers.loc[~pd.isnull(nport_tickers['ticker']), ['title', 'ticker']]
                                  title   ticker
1           Discover Financial Services      DFS
25                    General Motors Co       GM
35                          Viatris Inc     VTRS
61              Marathon Petroleum Corp      MPC
96                          Snap-on Inc      SNA
100                           Chubb Ltd       CB
110                        Allegion plc     ALLE
113                       Celanese Corp       CE
119                            Fox Corp     FOXA
120                            Fox Corp      FOX
121                            PPL Corp      PPL
145                         Phillips 66      PSX
150       Hewlett Packard Enterprise Co      HPE
157           Archer-Daniels-Midland Co      ADM
161                     Baker Hughes Co      BKR
164                        Vontier Corp      VNT
190                    Paramount Global     PARA
191                    Paramount Global    PARAA
192                    Paramount Global    PARAP
204           Constellation Energy Corp      CEG
220  Johnson Controls International plc      JCI
228                          Cigna Corp       CI
250                      Public Storage      PSA
251                      Public Storage   PSA-PH
252                      Public Storage   PSA-PK
253                      Public Storage   PSA-PL
254                      Public Storage   PSA-PF
255                      Public Storage   PSA-PQ
256                      Public Storage   PSA-PJ
257                      Public Storage   PSA-PG
258                      Public Storage   PSA-PO
259                      Public Storage   PSA-PS
260                      Public Storage   PSA-PR
261                      Public Storage   PSA-PN
262                      Public Storage   PSA-PI
263                      Public Storage   PSA-PM
264                      Public Storage   PSA-PP
283               Monster Beverage Corp     MNST
295                      Mastercard Inc       MA
302                   DXC Technology Co      DXC
325                 Synchrony Financial      SYF
326                 Synchrony Financial   SYF-PA
351                            APA Corp      APA
357           MGM Resorts International      MGM
374           Edwards Lifesciences Corp       EW
376                 Avery Dennison Corp      AVY
381                            Fox Corp     FOXA
382                            Fox Corp      FOX
431                 Otis Worldwide Corp     OTIS
449                        Fortive Corp      FTV
481                           Aptiv PLC     APTV
482                           Aptiv PLC  APTV-PA
Assessing the exact matches
I see several issues. First, some of the symbols show up more than once (FOXA, FOX), and some of the titles show up multiple times (Fox Corp, Public Storage, Synchrony Financial, Paramount Global). Why does this happen?
Well, in some cases, companies might have more than one publicly traded common stock listing. If that’s the case, and they are in the S&P 500, then both stocks will be in the index. In other cases, companies may have listed preferred shares that aren’t in the index. In still other cases, companies may have different classes of shares (like A and B), and only one class is part of the index.
So we will need a better solution than a simple match on title. Let’s consider two issues:
- The fact that the title doesn’t match exactly
- The fact that matching companies may have more than one matching symbol
There is also a bit of extra information available: the symbol to CIK mapping appears to be sorted by search popularity or market cap. We can use this to prefer the first match if we find multiple matches.
We’ll build a simple matching algorithm to solve this, but we’ll do it in several steps.
A first attempt at improving matching on title
Since this data is somewhat messy, let’s see if we can get better than 10% of the names to match. Consider two of the names we looked at above, Apple and Microsoft. For Microsoft, we could force the title to all upper or all lower case and get a match. For Apple, we could remove punctuation. It also makes sense to remove some common words found in many of the company names, including the trailing /The. We can do this using regular expressions with replace on string columns.
Now, regular expressions can be a confusing topic. Don’t look at this one and think I did this in one step. Instead, here’s the approach I used. First, I looked at the top 50 values in the DataFrame.
symbol_to_cik["title"]
Then, I started to chain the pandas string methods, starting with converting all the strings to upper case, and adding regular expressions to replace words one at a time, re-executing the notebook cell until it gave me the final result.
Even after doing that, I found it needed tweaking a few more times as I continued through the process in this article. This is what I ended up with after a few attempts. Below is a full explanation of the regex. (Note that you can’t run the code with these comments inline; run what is in the cells below instead. To follow along at home, you can try these lines one at a time to see what each changes.)
symbol_to_cik["title"].str # use the pandas string methods .upper() # convert to upper case so words match .replace(r"\.|\,|&", " ", regex=True) # replace punctuation with space, (,.&) .replace(r"\/(\w+)?(\/)?", " ", regex=True) # replace words like this /xx/ with a space .replace(r"\s(CO[M|S|R]?P?|INC|PLC|LP|NA|NV|CU|LTD)(?=\s|$)", "", regex=True) # remove common words .replace(r"\s+", " ", regex=True).str.strip() # remove all extra spaces
Let’s also break the second to last line down a bit more. This expression might be a bit confusing:
r"\s(CO[M|S|R]?P?|INC|PLC|LP|NA|NV|CU|LTD)(?=\s|$)"
This is saying: match a space, followed by the word COM or COS or COR or CORP or INC or PLC or LP or NA or NV or CU or LTD, followed by a space or the end of the line. The ?= is a lookahead, so it’s checking that the space or end of line exists in order to match, but it doesn’t consume it in the replacement. This allows us to replace multiple matching words in a row, like in Amazon.com Inc (which becomes AMAZON COM INC). Here’s an example:
import re

re.sub(r"\s(CO[M|S|R]?P?|INC|PLC|LP|NA|NV|CU|LTD)(?=\s|$)", " ", "FOO LTD PLC CORP")
'FOO '
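One side note on the bracketed part: [M|S|R] is a character class, so the | characters inside it are literal, not alternation, and the optional quantifiers mean shorter forms like CO and COP match too. A quick test of my own (not from the article) shows this:

import re

pattern = r"\s(CO[M|S|R]?P?|INC|PLC|LP|NA|NV|CU|LTD)(?=\s|$)"

# CO, COP, and CORP are all stripped; even a literal pipe matches inside the class
for probe in ["X CO", "X COP", "X CORP", "X CO|"]:
    print(repr(re.sub(pattern, "", probe)))

That’s harmless here (CO is a common abbreviation anyway), but worth knowing if you reuse the pattern.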
Applying the title changes
Now we’ll apply the title search changes. Note how the names now just contain the “main” part of their name, not all the extra adornments.
def make_title_search(df):
    return df["title"].str.upper() \
        .replace(r"\.|\,|&", " ", regex=True) \
        .replace(r"\/(\w+)?(\/)?", " ", regex=True) \
        .replace(r"\s(CO[M|S|R]?P?|INC|PLC|LP|NA|NV|CU|LTD)(?=\s|$)", "", regex=True) \
        .replace(r"\s+", " ", regex=True).str.strip()

make_title_search(symbol_to_cik).head(15)
0                                  APPLE
1                              MICROSOFT
2                                 AMAZON
3                     BERKSHIRE HATHAWAY
4                     UNITEDHEALTH GROUP
5                            EXXON MOBIL
6                        JOHNSON JOHNSON
7                                WALMART
8                         JPMORGAN CHASE
9                                CHEVRON
10                             ELI LILLY
11                SPDR S P 500 ETF TRUST
12                                NVIDIA
13                        PROCTER GAMBLE
14    TAIWAN SEMICONDUCTOR MANUFACTURING
Name: title, dtype: object
We will make a new column for merging, then do a new merge using that column on both DataFrames.
symbol_to_cik["title-search"] = make_title_search(symbol_to_cik) nport["title-search"] = make_title_search(nport) nport_tickers = nport.merge(symbol_to_cik, left_on='title-search', right_on='title-search', how='left') nport_tickers.loc[~pd.isnull(nport_tickers['ticker'])].shape
(629, 22)
Analyzing the new match results
Whoa, we now have more matches than we had original rows! Let’s look at cases where we have multiple matches. One quick way to do this is to group by the matching criteria (the title-search column in our case) and count the values. The result is the number of rows found for that match. We pick a random column (cusip) to make the output more readable.
nport_tickers.groupby("title-search").count().sort_values(by="cusip", ascending=False).head(10)['cusip']
title-search
BANK OF AMERICA          17
PUBLIC STORAGE           15
MORGAN STANLEY            9
FIRST REPUBLIC BANK       8
JPMORGAN CHASE            8
CAPITAL ONE FINANCIAL     6
VORNADO REALTY TRUST      6
GOLDMAN SACHS GROUP       6
KEYCORP                   5
ALLSTATE                  5
Name: cusip, dtype: int64
nport_tickers.loc[nport_tickers["title-search"] == 'BANK OF AMERICA', "ticker"]
362       BAC
363    BML-PG
364    BML-PH
365    BAC-PB
366    BAC-PK
367    BML-PL
368    BAC-PL
369    BAC-PE
370    BML-PJ
371    BAC-PM
372    BAC-PN
373    BAC-PP
374    BAC-PQ
375    BAC-PO
376     BACRP
377    MER-PK
378    BAC-PS
Name: ticker, dtype: object
This is similar to what we saw earlier with some companies having multiple matches for preferred stock listings. We can see that the first match is probably the best one, so we need to further enhance our matching algorithm. We benefit from the fact that our symbol_to_cik DataFrame is sorted in order of popularity. It’s very likely that the first match we encounter is the symbol we want.
A second attempt at improving matching with duplicates
Since our simple merge results in duplicate rows, we can clean up our data by only keeping the best row, or in our case the first one. There are a few approaches you can use when you want to remove duplicate data in pandas. You can read this article for some more details on how to remove duplicates and keep certain data. In our case, we are grouping by title-search, but since the data is sorted by “most popular”, we can choose the first match. So an approach using groupby with first() works well. This will group by our search title and pick the first match.
nport_second_attempt = nport_tickers.groupby("title-search").first()
print(nport_second_attempt.head()[['name', 'cusip', 'ticker']])
nport_second_attempt.shape
                                    name      cusip ticker
title-search
3M                                 3M Co  88579Y101    MMM
A O SMITH                 A O Smith Corp  831865209   None
ABBOTT LABORATORIES  Abbott Laboratories  002824100    ABT
ABBVIE                        AbbVie Inc  00287Y109   ABBV
ABIOMED                      ABIOMED Inc  003654100   ABMD

(502, 21)
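As an aside, a roughly equivalent approach (a sketch of mine, relying on the same popularity ordering) is drop_duplicates, which keeps whole rows rather than grouping:

# keep="first" retains the most popular match per title-search, since the lookup
# is sorted by popularity; unlike groupby().first(), it keeps each row intact
# instead of taking the first non-null value per column within each group
nport_second_alt = nport_tickers.drop_duplicates(subset="title-search", keep="first")

We’ll stick with the groupby result here.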
Looking at the results, I see several problems. The first is that the output loses 3 rows of data. You’ll recall our original nport DataFrame had 505 rows. Three of the companies have the same exact title-search, so we are only picking one of them when we do the groupby. The second issue is that not all tickers were matched, since there is a None in the second row. Let’s look at how many missing symbols we have at this stage, and see what some of these look like.
nport_second_attempt.loc[pd.isnull(nport_second_attempt["ticker"])].shape
(50, 21)
nport_second_attempt.loc[pd.isnull(nport_second_attempt["ticker"])].head()['name']
title-search
A O SMITH                                     A O Smith Corp
AIR PRODUCTS AND CHEMICALS    Air Products and Chemicals Inc
AMERICAN WATER WORKS             American Water Works Co Inc
BECTON DICKINSON AND                 Becton Dickinson and Co
BRISTOL-MYERS SQUIBB                 Bristol-Myers Squibb Co
Name: name, dtype: object
So at this point, we can maybe match 90% of our symbols, and we have an issue with three of our companies having two choices. We need a slightly better matching method. Let’s see how we might match the missing values, starting with A O Smith Corp. Let’s find rows that have the name Smith.
symbol_to_cik[symbol_to_cik["title"].str.contains("SMITH")]
      cik_str ticker                        title          title-search
748    845982    SNN           SMITH & NEPHEW PLC          SMITH NEPHEW
864     91142    AOS               SMITH A O CORP             SMITH A O
1752  1689796   JBGS         JBG SMITH Properties  JBG SMITH PROPERTIES
3210  1092796   SWBI  SMITH & WESSON BRANDS, INC.   SMITH WESSON BRANDS
4952   948708   SMSI   SMITH MICRO SOFTWARE, INC.  SMITH MICRO SOFTWARE
4990   924719   SMID           SMITH MIDLAND CORP         SMITH MIDLAND
Fuzzy matching
We need to find a way to match the closest name. For this first example, consider how we are searching for A O SMITH, but that doesn’t match the correct value, SMITH A O, with ticker AOS. We need a solution that would look through all the titles and find the closest match.
There is a metric in linguistics and computer science known as the Levenshtein distance. It gives a good approximation of the similarity of two strings. At a high level, it counts how many single-character changes would be needed to turn one string into the other. We don’t have to implement this algorithm from scratch (though you can do that!). There’s an existing Python library that makes it easy to do fuzzy searches. It’s called thefuzz (formerly known as fuzzywuzzy).
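If you do want to see the algorithm itself, here’s a minimal from-scratch sketch of the textbook dynamic-programming version (my own illustration; not exactly what thefuzz does internally):

def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("A O SMITH", "SMITH A O"))  # 8 edits: same letters, very different order

In practice, though, we’ll just use the library.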
You can install it using
pip install "thefuzz[speedup]"
or just
pip install thefuzz
The former will install a faster implementation of the Levenshtein distance calculation.
You can read the documentation for more examples, but here’s a quick look using our example:
from thefuzz import fuzz, process

print(fuzz.WRatio("A O SMITH", "SMITH A O"))
print(fuzz.WRatio("A O SMITH", "A. O. Smith"))
print(fuzz.WRatio("A O SMITH", "Apple"))
print(fuzz.WRatio("A O SMITH", "A O SMITH"))
95
95
18
100
We then use the process.extract function to search through a list of values.
process.extract("A O SMITH", symbol_to_cik["title-search"])
[('SMITH A O', 95, '864'),
 ('NOVO NORDISK A S', 86, '20'),
 ('ITAU UNIBANCO HOLDING S A', 86, '176'),
 ('O REILLY AUTOMOTIVE', 86, '189'),
 ('PERNOD RICARD S A', 86, '230')]
The results returned by process are the matched value (SMITH A O), along with the score (95) and the index ('864') of the matched value found in the data. We can use the index to find the row in the lookup DataFrame and get the symbol and CIK from it. You should also notice that the similarity score can actually be pretty high even when the two values don’t look very similar. We need to ensure that we don’t use the wrong value if the match isn’t good enough. To do this, we’ll add a threshold to our search.
Verifying that fuzzy searching does what we want
Since it’s possible for a single company to have multiple matches, we also want to make sure we only use a value once. We track our choices and ensure we don’t reuse them. This way, we’ll end up picking the two most popular symbols for those companies with two matches.
def find_title(title, used, lookup, threshold=90):
    # get all the matches
    results = process.extract(title, lookup["title-search"])
    for res in results:
        if res[1] < threshold:
            return
        idx = res[2]
        if idx not in used:
            # return first non-used result
            used.add(idx)
            row = lookup.loc[idx]
            return row['cik_str'], row['ticker'], res[0], res[1]
used = set()

find_title("APPLE", used, symbol_to_cik)
(320193, 'AAPL', 'APPLE', 100)
OK, fair enough. But how does it handle the symbols that were problems? Let’s see.
find_title("FOX", used, symbol_to_cik)
(1754301, 'FOXA', 'FOX', 100)
find_title("FOX", used, symbol_to_cik)
(1754301, 'FOX', 'FOX', 100)
OK, that seems to work. We see that both results have the same CIK, but we get two different stock symbols.
find_title("PARAMOUNT GLOBAL", used, symbol_to_cik)
(813828, 'PARA', 'PARAMOUNT GLOBAL', 100)
find_title("PUBLIC STORAGE", used, symbol_to_cik)
(1393311, 'PSA', 'PUBLIC STORAGE', 100)
Let’s make sure a fake company doesn’t work.
find_title("Vandalay Industries", set(), symbol_to_cik)
find_title("Vandalay Industries", set(), symbol_to_cik, threshold=50) # very low threshold
(1489393, 'LYB', 'LYONDELLBASELL INDUSTRIES N V', 86)
This seems to be working; we have to set the threshold below 90 to get a match. If multiple rows in the lookup match the name, it picks the first match, which happens to be the most popular due to the sorting of symbol_to_cik. If we wanted to choose a better algorithm for selecting the match, we could easily do that.
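For example, one possible variation (a sketch of mine, not the article’s code) asks thefuzz for just the single best match, using a scorer that ignores word order:

from thefuzz import fuzz, process

# token_sort_ratio sorts the words before comparing, so "A O SMITH" and
# "SMITH A O" score 100; extractOne returns only the single best candidate
process.extractOne("A O SMITH", symbol_to_cik["title-search"], scorer=fuzz.token_sort_ratio)

Swapping scorers changes the ranking, so you’d want to re-verify the problem cases above before committing to one.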
Applying fuzzy search to the full dataset
Now, before we try to run this on the full dataset, let’s see what we’re in for. (Note: when creating this article, I tried running the whole dataset first and saw that it was incredibly slow, as I expected.)
%timeit find_title("Public Storage", set(), symbol_to_cik)
614 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Running one title takes a bit less than a second. So we’d expect the full 505 symbols to take 5-7 minutes. While that’s not a crazy amount of time, this is not something we’d want to run multiple times or on a larger data set. The idea is, we’d run this one time to make a mapping, then use that in the future.
used.clear()  # we want to restart

cik_symbol = nport["title-search"].apply(lambda t: find_title(t, used, symbol_to_cik))
Note that cik_symbol is a Series with the same index as nport, where each row holds the tuple returned by find_title. Let’s put the CIK and symbol into our DataFrame. The reason this is slow is that each title is passed into the find_title function, and that function has to run the fuzzy search over every name in the lookup for each row (all 11K+ of them).
Checking the fuzzy results
Let’s first check if any of the returned values were not set.
print(cik_symbol.head())
print("There are ", cik_symbol[pd.isnull(cik_symbol)].shape[0], "missing values")
0         (773840, HON, HONEYWELL INTERNATIONAL, 100)
1    (1393612, DFS, DISCOVER FINANCIAL SERVICES, 100)
2                              (37785, FMC, FMC, 100)
3                         (72331, NDSN, NORDSON, 100)
4    (1100682, CRL, CHARLES RIVER LABORATORIES INTE...
Name: title-search, dtype: object
There are  6 missing values
This is pretty good, only 6 of our values didn’t find a match. We don’t know yet if the matches are correct, but matching almost 500 names with a fuzzy score of 90+ is a good start.
Since both nport and cik_symbol have the same index, we can just update the data directly. Since there were null rows above, we need to handle null in our update. There is a tuple of 4 values in each row of the Series, so we need to apply a lambda to the data, returning the first (CIK) and second (symbol) values appropriately. I use a little trick here with the lambda: if the row (passed into the lambda as x) is not None, the and part is returned with the correct value; otherwise, it returns the or part (which is None).
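Here’s the trick in isolation, with toy values of my own:

pick_first = lambda x: x and x[0] or x

print(pick_first((320193, 'AAPL', 'APPLE', 100)))  # tuple is truthy, so x[0] comes back
print(pick_first(None))                            # None is falsy, so the or part (x) comes back

# one caveat of the and/or idiom: if x[0] itself were falsy (say 0), the or branch
# would fire and return x instead; that can't bite us here because CIKs are never 0

Now the actual update: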
nport['cik'] = cik_symbol.apply(lambda x: x and x[0] or x)
nport['symbol'] = cik_symbol[~pd.isnull(cik_symbol)].apply(lambda x: x and x[1] or x)
Now let’s look at our missing values to see if we can figure out why we didn’t get a match.
nport.loc[pd.isnull(nport['symbol']), ['title', 'title-search', 'cusip', 'symbol']]
                              title                title-search      cusip  \
175                  Lowe's Cos Inc                      LOWE'S  548661107
203            Federal Realty OP LP           FEDERAL REALTY OP  313745101
327         Sherwin-Williams Co/The            SHERWIN-WILLIAMS  824348106
365    Jacobs Engineering Group Inc    JACOBS ENGINEERING GROUP  469814107
383  JB Hunt Transport Services Inc  JB HUNT TRANSPORT SERVICES  445658107
388            Nielsen Holdings PLC            NIELSEN HOLDINGS  000000000

    symbol
175    NaN
203    NaN
327    NaN
365    NaN
383    NaN
388    NaN
Final cleanup techniques
First, we will try Lowe’s without the apostrophe.
find_title("LOWES", set(), symbol_to_cik, threshold=50)
(60667, 'LOW', 'LOWES COMPANIES', 90)
The fix for this is to add a ' to our replacement regex earlier. Now what about Federal Realty OP?
find_title("FEDERAL REALTY OP", set(), symbol_to_cik, threshold=50)
(34903, 'FRT', 'FEDERAL REALTY INVESTMENT TRUST', 86)
find_title("FEDERAL REALTY", set(), symbol_to_cik, threshold=50)
(34903, 'FRT', 'FEDERAL REALTY INVESTMENT TRUST', 90)
The fix here is to add OP to our replacement words earlier. Then we’d find a match. What about the rest?
find_title("SHERWIN-WILLIAMS", set(), symbol_to_cik, threshold=50)
(89800, 'SHW', 'SHERWIN WILLIAMS', 100)
It looks like for Sherwin-Williams, the symbol was “stolen” by another company. Who has the value for SHW? That will need to be corrected. We can do another search for the result for that company and swap their tickers.
nport.loc[nport['symbol'] == 'SHW', 'title']
95    Williams Cos Inc/The
Name: title, dtype: object
# pass in our previously used values so it doesn't pick 'Sherwin-Williams' again.
# What's the second choice?
find_title("Williams", used, symbol_to_cik, threshold=50)
(107263, 'WMB', 'WILLIAMS COMPANIES', 90)
find_title("JACOBS", set(), symbol_to_cik, threshold=50)
(52988, 'J', 'JACOBS SOLUTIONS', 90)
Jacobs has a fairly different name, and is right on the threshold.
find_title("JB HUNT TRANSPORT SERVICES", set(), symbol_to_cik, threshold=50)
(728535, 'JBHT', 'HUNT J B TRANSPORT SERVICES', 89)
JB Hunt is just below the threshold due to the initials being after the name.
find_title("NIELSEN HOLDINGS", set(), symbol_to_cik, threshold=50)
(1585689, 'HLT', 'HILTON WORLDWIDE HOLDINGS', 86)
In 2022, Nielsen Holdings was taken private. It showed up in an earlier report, but is not active and thus can’t be mapped. Let’s just fix these all manually.
nport.loc[pd.isnull(nport['symbol']), 'symbol'] = ('LOW', 'FRT', 'SWH', 'J', 'JBHT', 'NLSN')
nport.loc[95, 'symbol'] = 'WMB'  # fix the "stolen" symbol
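Since the whole point is to run this expensive matching once and then reuse it, this is a good place to persist the finished mapping. A minimal sketch, with a hypothetical file name:

# save the finished mapping so we never have to re-run the slow fuzzy search
# ("cusip_to_symbol.csv" is a made-up name; pick whatever fits your project)
nport[['cusip', 'symbol', 'cik']].to_csv("cusip_to_symbol.csv", index=False)

# later, reload it as a lookup; dtype=str preserves the leading zeros in CUSIPs
mapping = pd.read_csv("cusip_to_symbol.csv", dtype={"cusip": str})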
How did we do?
Phew. If you stayed with me through this entire process, congratulations. That was a lot of work. At this point, we have a DataFrame that has all the CUSIP and symbol entries. How did we do?
As I said earlier, this mapping can be hard to find for free for all symbols, but we can grab the current symbol list of S&P 500 companies from a free source. This will tell us how much we overlap and give a rough idea of how we did. Pandas makes grabbing a table from Wikipedia pretty easy.
sp500 = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")[0]
sp500['Symbol']
0       MMM
1       AOS
2       ABT
3      ABBV
4      ABMD
       ...
498     YUM
499    ZBRA
500     ZBH
501    ZION
502     ZTS
Name: Symbol, Length: 503, dtype: object
set(nport['symbol']).difference(set(sp500['Symbol']))
{'ATI', 'BF-B', 'BRK-B', 'DRE', 'EMBC', 'NLSN', 'PVH', 'STEM', 'SVFA', 'SWH', 'TWTR', 'VNT'}
set(sp500['Symbol']).difference(set(nport['symbol']))
{'ACGL', 'BF.B', 'BRK.B', 'CSGP', 'EQT', 'INVH', 'PCG', 'SHW', 'TRGP', 'VFC'}
Not bad. The symbols BF.B and BRK.B are just different ways of representing the same symbols as BF-B and BRK-B (and are a continual pain to those of us in finance). We already know about NLSN. TRGP, PCG, EQT, CSGP, and INVH recently replaced NLSN, CTXS, DRE, PVH, and PENN in the index, respectively. With a little bit more manual updating, we could get to 100% accuracy.
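One way to reconcile the two share-class conventions before comparing (a small sketch of my own, not from the article):

# Wikipedia writes share classes with '.' (BF.B) while EDGAR's file uses '-' (BF-B);
# normalizing one side removes those false mismatches from the difference
wiki_symbols = {s.replace(".", "-") for s in sp500['Symbol']}
set(nport['symbol']).difference(wiki_symbols)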
Summary
This article covered several techniques for matching data between two data sources using Python and pandas. We tried a simple exact merge, then some data cleanup and normalization followed by another merge, then eliminated duplicate data, and then used fuzzy search. Finally, we did some manual verification and cleanup.