Data is often messy and rarely in perfect shape. This is especially true if the data comes from many different sources and the specifications are loosely defined. If you have access to data that is in great shape, it’s probably because someone else did the dirty work of validating it, cleaning it up, and normalizing it for you.
One particular type of data problem is matching data between data sources when exact linking identifiers are missing. This situation has come up for me quite often.
Matching techniques
When you need to match data, there are a number of techniques that can help. The best way to illustrate this is with a concrete example. To do this, we’ll pull data from the SEC’s EDGAR system. You can read this article to learn how to connect to EDGAR and fetch data, and another article to learn about stock index data in EDGAR. We’ll use some of the stock index data mentioned there to demonstrate some matching techniques.
In this article, we’ll look at multiple ways to attempt to match data between two data sources. The data in this case is a company name. We’ll use the following techniques to try to match as many values as we can:
- Merging with exact data
- Cleaning data with a regular expression (and then trying to match again)
- Picking the best data when there are duplicates, using groupby
- Fuzzy searching
- Good old manual verification and cleanup
A quick review
If you don’t want to read the earlier articles, I’ll give you a quick summary of how we got to this point. We found a filing on EDGAR that contains all the investments in an ETF. That filing has an XML document that lists the investments. If we choose an ETF that tracks a stock index, we can use the filing data to build the index holdings. At this point, we want to match the index members with the search terms used to find companies in EDGAR. This will allow us to fetch more data from EDGAR and build a rich database of information about all the stocks in an index, for free.
The problem
Once we dig into the data, we start to recognize the problem. The stocks held by the ETF are listed by name and CUSIP. We need to figure out how to match the investments in the N-PORT document to EDGAR’s internal system for tracking companies, using other identifiers. But all we have to match on is the name.
What we’ll learn
Along the way, we will learn a few things about dealing with messy data. If you happen to like financial data, you’ll also learn a bit more about how EDGAR works and its internal data. If your domain is outside of finance, the techniques will still work for matching data.
So let’s just go through the basic steps from the last article and get the stock index data. Note that I am putting a little bit of code in a Python file and calling it; you can always get that code (and the notebook used to generate this article) on GitHub.
import requests
import pandas as pd

from utils import elem2dict, get_nport_values

website = "your website"
email = "your email"

# the N-PORT data for SPY, the S&P 500, 500+ stocks
url = "https://www.sec.gov/Archives/edgar/data/884394/000175272422196968/primary_doc.xml"

nport = get_nport_values(url, website, email)
nport.head()
                                           name                   lei  \
0                   Honeywell International Inc  ISRPG12PN4EIEOEMW547
1                   Discover Financial Services  Z1YLO2USPORE63VVUL20
2                                      FMC Corp  CKDHZ2X64EEBQCSP7013
3                                  Nordson Corp  14OS6Q5N55N95WM84M53
4  Charles River Laboratories International Inc  549300BSQ0R4UZ5KX287

                                          title      cusip   identifiers  \
0                   Honeywell International Inc  438516106  {'isin': {}}
1                   Discover Financial Services  254709108  {'isin': {}}
2                                      FMC Corp  302491303  {'isin': {}}
3                                  Nordson Corp  655663102  {'isin': {}}
4  Charles River Laboratories International Inc  159864107  {'isin': {}}

     balance units curCd        valUSD    pctVal payoffProfile assetCat  \
0  7390330.0    NS   USD  1.284513e+09  0.370781          Long       EC
1  3050290.0    NS   USD  2.884964e+08  0.083276          Long       EC
2  1371493.0    NS   USD  1.467635e+08  0.042364          Long       EC
3   587512.0    NS   USD  1.189359e+08  0.034331          Long       EC
4   541951.0    NS   USD  1.159613e+08  0.033473          Long       EC

  issuerCat invCountry isRestrictedSec fairValLevel  \
0      CORP         US               N            1
1      CORP         US               N            1
2      CORP         US               N            1
3      CORP         US               N            1
4      CORP         US               N            1

                                     securityLending issuerConditional
0  {'isCashCollateral': 'N', 'isNonCashCollateral...               NaN
1  {'isCashCollateral': 'N', 'isNonCashCollateral...               NaN
2  {'isCashCollateral': 'N', 'isNonCashCollateral...               NaN
3  {'isCashCollateral': 'N', 'isNonCashCollateral...               NaN
4  {'isCashCollateral': 'N', 'isNonCashCollateral...               NaN
A first check
Looking at this data, you can see that there is a company name, something called an LEI, and a CUSIP. The ISIN is not supplied in this file. The CUSIP, ISIN, and LEI are standard financial industry identifiers.
We want to turn the name, LEI, or CUSIP into stock exchange tickers. (Note that I will use the term symbol interchangeably with ticker.) If we have the symbol, we can translate that to a CIK, the internal company code used by the SEC. If you use the EDGAR search page, you can enter a symbol, name, or CIK. But we don’t want to manually process all 500+ rows of the DataFrame. And even if we did submit each name to the form, the user interface might present multiple matches to choose from.
How do we automate the mapping? Ideally, we could just find a free source of CUSIP to ticker mapping. If you work in the financial industry, you surely have a system at work that does this for you, and this article would be pointless. When I worked for a large bank, there was an entire department that maintained cross references of data. I wasn’t able to find a free and reliable API that does CUSIP to ticker mappings, though perhaps one does exist. Sometimes, you can find free sources for the more popular stock indexes. For example, we’ll use Wikipedia later to verify how well we did. But we’d like to have this mapping work for any stock or index, even the more obscure ones. For the purposes of this article, let’s pretend that we have to figure this out with only what we can observe on EDGAR.
A starting point
The SEC does publish a cross reference for CIK to symbol, and that mapping also contains a company name. Let’s start there. The file is just a big JSON object, which we can convert to a pandas DataFrame.
symbol_to_cik = requests.get("https://www.sec.gov/files/company_tickers.json").json()
print(symbol_to_cik['0'])

symbol_to_cik = pd.DataFrame(symbol_to_cik).T
print(symbol_to_cik.shape)
symbol_to_cik.head(10)
{'cik_str': 320193, 'ticker': 'AAPL', 'title': 'Apple Inc.'}
(11975, 3)
   cik_str ticker                   title
0   320193   AAPL              Apple Inc.
1   789019   MSFT          MICROSOFT CORP
2  1018724   AMZN          AMAZON COM INC
3  1067983  BRK-B  BERKSHIRE HATHAWAY INC
4   731766    UNH  UNITEDHEALTH GROUP INC
5    34088    XOM        EXXON MOBIL CORP
6   200406    JNJ       JOHNSON & JOHNSON
7   104169    WMT            Walmart Inc.
8    19617    JPM     JPMORGAN CHASE & CO
9    93410    CVX            CHEVRON CORP
As you can see, it has a lot of symbols, over 11K. Looking at the list, I suspect that it is sorted by search frequency on EDGAR or popularity. We see that Apple Inc. is the most popular and has a ticker of AAPL and a CIK of 320193.
Let’s look at the largest holdings in the S&P 500. If you recall from the previous article, the S&P 500 is a market cap weighted index, so the bigger companies make up a larger share of it.
nport.sort_values(by='pctVal', ascending=False).head(15)[['title', 'cusip', 'pctVal']]
                       title      cusip    pctVal
433                Apple Inc  037833100  6.587783
186           Microsoft Corp  594918104  6.019457
483           Amazon.com Inc  023135106  2.912308
140             Alphabet Inc  02079K305  2.054029
223             Alphabet Inc  02079K107  1.890388
290                Tesla Inc  88160R101  1.770921
268   Berkshire Hathaway Inc  084670702  1.547893
49    UnitedHealth Group Inc  91324P102  1.511444
44         Johnson & Johnson  478160104  1.463783
214              NVIDIA Corp  67066G104  1.189526
344       Meta Platforms Inc  30303M102  1.158956
250         Exxon Mobil Corp  30231G102  1.130543
170  Procter & Gamble Co/The  742718109  1.081133
481      JPMorgan Chase & Co  46625H100  1.036458
148                 Visa Inc  92826C839  1.015419
Our job is to find the ticker by searching for the title/name that matches. Could this work? By inspection, I spot a few issues:
- the CIK lookup has Apple Inc. whereas the N-PORT contains Apple Inc (without the period)
- Microsoft is capitalized in one and not the other
- Companies that start with The have it placed at the end in N-PORT
- There are duplicate rows for Alphabet Inc
So it’s not going to be very straightforward, but let’s give it a quick try. Our initial approach is to do a pandas merge, joining exactly on the title and doing a left join. This means we want all rows from the left DataFrame, and will get null values for the new merged columns where there’s no match. We’ll create a new DataFrame to hold these results.
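If left joins are new to you, here’s a tiny standalone illustration using toy data of my own (not the article’s):

import pandas as pd

left = pd.DataFrame({"title": ["Apple Inc", "Fox Corp"]})
right = pd.DataFrame({"title": ["Apple Inc"], "ticker": ["AAPL"]})

# how="left" keeps both rows of left; Fox Corp has no match, so its ticker is NaN
print(left.merge(right, on="title", how="left"))

Now, the real merge on our data: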
nport_tickers = nport.merge(symbol_to_cik, left_on='title', right_on='title', how='left')

nport_tickers.loc[~pd.isnull(nport_tickers['ticker'])].shape
(52, 20)
OK, we got 52 exact matches out of 505 rows. That’s a start, but not very good. What do the matches (the non-null rows) look like?
nport_tickers.loc[~pd.isnull(nport_tickers['ticker']), ['title', 'ticker']]
                                  title   ticker
1           Discover Financial Services      DFS
25                    General Motors Co       GM
35                          Viatris Inc     VTRS
61              Marathon Petroleum Corp      MPC
96                          Snap-on Inc      SNA
100                           Chubb Ltd       CB
110                        Allegion plc     ALLE
113                       Celanese Corp       CE
119                            Fox Corp     FOXA
120                            Fox Corp      FOX
121                            PPL Corp      PPL
145                         Phillips 66      PSX
150       Hewlett Packard Enterprise Co      HPE
157           Archer-Daniels-Midland Co      ADM
161                     Baker Hughes Co      BKR
164                        Vontier Corp      VNT
190                    Paramount Global     PARA
191                    Paramount Global    PARAA
192                    Paramount Global    PARAP
204           Constellation Energy Corp      CEG
220  Johnson Controls International plc      JCI
228                          Cigna Corp       CI
250                      Public Storage      PSA
251                      Public Storage   PSA-PH
252                      Public Storage   PSA-PK
253                      Public Storage   PSA-PL
254                      Public Storage   PSA-PF
255                      Public Storage   PSA-PQ
256                      Public Storage   PSA-PJ
257                      Public Storage   PSA-PG
258                      Public Storage   PSA-PO
259                      Public Storage   PSA-PS
260                      Public Storage   PSA-PR
261                      Public Storage   PSA-PN
262                      Public Storage   PSA-PI
263                      Public Storage   PSA-PM
264                      Public Storage   PSA-PP
283               Monster Beverage Corp     MNST
295                      Mastercard Inc       MA
302                   DXC Technology Co      DXC
325                 Synchrony Financial      SYF
326                 Synchrony Financial   SYF-PA
351                            APA Corp      APA
357           MGM Resorts International      MGM
374           Edwards Lifesciences Corp       EW
376                 Avery Dennison Corp      AVY
381                            Fox Corp     FOXA
382                            Fox Corp      FOX
431                 Otis Worldwide Corp     OTIS
449                        Fortive Corp      FTV
481                           Aptiv PLC     APTV
482                           Aptiv PLC  APTV-PA
Assessing the exact matches
I see several issues. First, some of the symbols show up more than once (FOXA, FOX), and some of the titles show up multiple times (Fox Corp, Public Storage, Synchrony Financial, Paramount Global). Why does this happen?
Well, in some cases, companies might have more than one publicly traded common stock listing. If that’s the case, and they are in the S&P 500, then both stocks will be in the index. In other cases, companies may have listed preferred shares that aren’t in the index. In still other cases, companies may have different classes of shares (like A and B), and only one class is part of the index.
So we will need a better solution than a simple match on title. Let’s consider two issues:
- The fact that the title doesn’t match exactly
- The fact that matching companies may have more than one matching symbol
There is also a bit of extra information available: the symbol to CIK mapping appears to be sorted by search popularity or market cap. We can use this to prefer the first match if we find multiple matches.
We’ll build a simple matching algorithm to solve this, but we’ll do it in several steps.
A first attempt at improving matching on title
Since this data is somewhat messy, let’s see if we can get better than 10% of the names to match. Consider two of the names we looked at above, Apple and Microsoft. For Microsoft, we could force the title to all upper or all lower case and get a match. For Apple, we could remove punctuation. It also makes sense to remove some common words found in many of the company names, including the trailing /The. We can do this using regular expressions with replace on string columns.
Now, regular expressions can be a confusing topic. Don’t look at this one and think I did this in one step. Instead, here’s the approach I used. First, I looked at the top 50 values in the DataFrame.
symbol_to_cik["title"]
Then, I started to chain the pandas string methods, starting with converting all the strings to upper case, and adding regular expressions to replace words one at a time, re-executing the notebook cell until it gave me the final result.
Even after doing that, I found it needed tweaking a few more times as I continued through the process in this article. This is what I ended up with after a few attempts. Below is a full explanation of the regex. (Note that you can’t run the code with these comments inline; run what is in the cells below instead. To follow along at home, you can try these lines one at a time to see what each changes.)
symbol_to_cik["title"].str # use the pandas string methods .upper() # convert to upper case so words match .replace(r"\.|\,|&", " ", regex=True) # replace punctuation with space, (,.&) .replace(r"\/(\w+)?(\/)?", " ", regex=True) # replace words like this /xx/ with a space .replace(r"\s(CO[M|S|R]?P?|INC|PLC|LP|NA|NV|CU|LTD)(?=\s|$)", "", regex=True) # remove common words .replace(r"\s+", " ", regex=True).str.strip() # remove all extra spaces
Let’s also break the second to last line down a bit more. This expression might be a bit confusing:
r"\s(CO[M|S|R]?P?|INC|PLC|LP|NA|NV|CU|LTD)(?=\s|$)"
This is saying: match a space, followed by the word COM or COS or COR or CORP or INC or PLC or LP or NA or NV or CU or LTD, followed by a space or the end of the line. The ?= is a lookahead, so it’s checking that the space or end of line exists in order to match, but it doesn’t consume it in the replacement. This allows us to replace multiple matching words in a row, like in Amazon.com Inc (which becomes AMAZON COM INC). Here’s an example:
import re

re.sub(r"\s(CO[M|S|R]?P?|INC|PLC|LP|NA|NV|CU|LTD)(?=\s|$)", " ", "FOO LTD PLC CORP")
'FOO '
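One side note on the bracketed part: [M|S|R] is a character class, so the | characters inside it are literal, not alternation, and the optional quantifiers mean shorter forms like CO and COP match too. A quick test of my own (not from the article) shows this:

import re

pattern = r"\s(CO[M|S|R]?P?|INC|PLC|LP|NA|NV|CU|LTD)(?=\s|$)"

# CO, COP, and CORP are all stripped; even a literal pipe matches inside the class
for probe in ["X CO", "X COP", "X CORP", "X CO|"]:
    print(repr(re.sub(pattern, "", probe)))

That’s harmless here (CO is a common abbreviation anyway), but worth knowing if you reuse the pattern.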
Applying the title changes
Now we’ll apply the title search changes. Note how the names now just contain the “main” part of their name, not all the extra adornments.
def make_title_search(df):
    return df["title"].str.upper() \
        .replace(r"\.|\,|&", " ", regex=True) \
        .replace(r"\/(\w+)?(\/)?", " ", regex=True) \
        .replace(r"\s(CO[M|S|R]?P?|INC|PLC|LP|NA|NV|CU|LTD)(?=\s|$)", "", regex=True) \
        .replace(r"\s+", " ", regex=True).str.strip()

make_title_search(symbol_to_cik).head(15)
0                                  APPLE
1                              MICROSOFT
2                                 AMAZON
3                     BERKSHIRE HATHAWAY
4                     UNITEDHEALTH GROUP
5                            EXXON MOBIL
6                        JOHNSON JOHNSON
7                                WALMART
8                         JPMORGAN CHASE
9                                CHEVRON
10                             ELI LILLY
11                SPDR S P 500 ETF TRUST
12                                NVIDIA
13                        PROCTER GAMBLE
14    TAIWAN SEMICONDUCTOR MANUFACTURING
Name: title, dtype: object
We will make a new column for merging, then do a new merge using that column on both DataFrames.
symbol_to_cik["title-search"] = make_title_search(symbol_to_cik) nport["title-search"] = make_title_search(nport) nport_tickers = nport.merge(symbol_to_cik, left_on='title-search', right_on='title-search', how='left') nport_tickers.loc[~pd.isnull(nport_tickers['ticker'])].shape
(629, 22)
Analyzing the new match results
Whoa, we now have more matches than we had original rows! Let’s look at cases where we have multiple matches. One quick way to do this is to group by the matching criteria (the title-search column in our case) and count the values. The result is the number of rows found for that match. We pick a random column (cusip) to make the output more readable.
nport_tickers.groupby("title-search").count().sort_values(by="cusip", ascending=False).head(10)['cusip']
title-search
BANK OF AMERICA          17
PUBLIC STORAGE           15
MORGAN STANLEY            9
FIRST REPUBLIC BANK       8
JPMORGAN CHASE            8
CAPITAL ONE FINANCIAL     6
VORNADO REALTY TRUST      6
GOLDMAN SACHS GROUP       6
KEYCORP                   5
ALLSTATE                  5
Name: cusip, dtype: int64
nport_tickers.loc[nport_tickers["title-search"] == 'BANK OF AMERICA', "ticker"]
362       BAC
363    BML-PG
364    BML-PH
365    BAC-PB
366    BAC-PK
367    BML-PL
368    BAC-PL
369    BAC-PE
370    BML-PJ
371    BAC-PM
372    BAC-PN
373    BAC-PP
374    BAC-PQ
375    BAC-PO
376     BACRP
377    MER-PK
378    BAC-PS
Name: ticker, dtype: object
This is similar to what we saw earlier with some companies having multiple matches for preferred stock listings. We can see that the first match is probably the best one, so we need to further enhance our matching algorithm. We benefit from the fact that our symbol_to_cik DataFrame is sorted in order of popularity. It’s very likely that the first match we encounter is the symbol we want.
A second attempt at improving matching with duplicates
Since our simple merge results in duplicate rows, we can clean up our data by only keeping the best row, or in our case the first one. There are a few approaches you can use when you want to remove duplicate data in pandas. You can read this article for some more details on how to remove duplicates and keep certain data. In our case, we are grouping by title-search, but since the data is sorted by “most popular”, we can choose the first match. So an approach using groupby with first() works well. This will group by our search title and pick the first match.
nport_second_attempt = nport_tickers.groupby("title-search").first()
print(nport_second_attempt.head()[['name', 'cusip', 'ticker']])
nport_second_attempt.shape
                                    name      cusip ticker
title-search
3M                                 3M Co  88579Y101    MMM
A O SMITH                 A O Smith Corp  831865209   None
ABBOTT LABORATORIES  Abbott Laboratories  002824100    ABT
ABBVIE                        AbbVie Inc  00287Y109   ABBV
ABIOMED                      ABIOMED Inc  003654100   ABMD

(502, 21)
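As an aside, a roughly equivalent approach (a sketch of mine, relying on the same popularity ordering) is drop_duplicates, which keeps whole rows rather than grouping:

# keep="first" retains the most popular match per title-search, since the lookup
# is sorted by popularity; unlike groupby().first(), it keeps each row intact
# instead of taking the first non-null value per column within each group
nport_second_alt = nport_tickers.drop_duplicates(subset="title-search", keep="first")

We’ll stick with the groupby result here.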
Looking at the results, I see several problems. The first is that the output loses 3 rows of data. You’ll recall our original nport DataFrame had 505 rows. Three of the companies have the same exact title-search, so we are only picking one of them when we do the groupby. The second issue is that not all tickers were matched, since there is a None in the second row. Let’s look at how many missing symbols we have at this stage, and see what some of these look like.
nport_second_attempt.loc[pd.isnull(nport_second_attempt["ticker"])].shape
(50, 21)
nport_second_attempt.loc[pd.isnull(nport_second_attempt["ticker"])].head()['name']
title-search
A O SMITH                                     A O Smith Corp
AIR PRODUCTS AND CHEMICALS    Air Products and Chemicals Inc
AMERICAN WATER WORKS             American Water Works Co Inc
BECTON DICKINSON AND                 Becton Dickinson and Co
BRISTOL-MYERS SQUIBB                 Bristol-Myers Squibb Co
Name: name, dtype: object
So at this point, we can maybe match 90% of our symbols, and we have an issue with three of our companies having two choices. We need a slightly better matching method. Let’s see how we might match the missing values, starting with A O Smith Corp. Let’s find rows that have the name Smith.
symbol_to_cik[symbol_to_cik["title"].str.contains("SMITH")]
      cik_str ticker                        title          title-search
748    845982    SNN           SMITH & NEPHEW PLC          SMITH NEPHEW
864     91142    AOS               SMITH A O CORP             SMITH A O
1752  1689796   JBGS         JBG SMITH Properties  JBG SMITH PROPERTIES
3210  1092796   SWBI  SMITH & WESSON BRANDS, INC.   SMITH WESSON BRANDS
4952   948708   SMSI   SMITH MICRO SOFTWARE, INC.  SMITH MICRO SOFTWARE
4990   924719   SMID           SMITH MIDLAND CORP         SMITH MIDLAND
Fuzzy matching
We need to find a way to match the closest name. For this first example, consider how we are searching for A O SMITH, but that doesn’t match the correct value, SMITH A O, with ticker AOS. We need a solution that would look through all the titles and find the closest match.
There is a metric in linguistics and computer science known as the Levenshtein distance. It gives a good approximation of the similarity of two strings. At a high level, it counts how many single-character changes would be needed to turn one string into the other. We don’t have to implement this algorithm from scratch (though you can do that!). There’s an existing Python library that makes it easy to do fuzzy searches. It’s called thefuzz (formerly known as fuzzywuzzy).
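If you do want to see the algorithm itself, here’s a minimal from-scratch sketch of the textbook dynamic-programming version (my own illustration; not exactly what thefuzz does internally):

def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("A O SMITH", "SMITH A O"))  # 8 edits: same letters, very different order

In practice, though, we’ll just use the library.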
You can install it using
pip install "thefuzz[speedup]"
or just
pip install thefuzz
The former will install a faster implementation of the Levenshtein distance calculation.
You can read the documentation for more examples, but here’s a quick look using our example:
from thefuzz import fuzz, process

print(fuzz.WRatio("A O SMITH", "SMITH A O"))
print(fuzz.WRatio("A O SMITH", "A. O. Smith"))
print(fuzz.WRatio("A O SMITH", "Apple"))
print(fuzz.WRatio("A O SMITH", "A O SMITH"))
95
95
18
100
We then use the process.extract function to search through a list of values.
process.extract("A O SMITH", symbol_to_cik["title-search"])
[('SMITH A O', 95, '864'),
 ('NOVO NORDISK A S', 86, '20'),
 ('ITAU UNIBANCO HOLDING S A', 86, '176'),
 ('O REILLY AUTOMOTIVE', 86, '189'),
 ('PERNOD RICARD S A', 86, '230')]
The results returned by process are the matched value (SMITH A O), along with the score (95) and the index ('864') of the matched value found in the data. We can use the index to find the row in the lookup DataFrame and get the symbol and CIK from it. You should also notice that the similarity score can actually be pretty high even when the two values don’t look very similar. We need to ensure that we don’t use the wrong value if the match isn’t good enough. To do this, we’ll add a threshold to our search.
Verifying that fuzzy searching does what we want
Since it’s possible for a single company to have multiple matches, we also want to make sure we only use a value once. We track our choices and ensure we don’t reuse them. This way, we’ll end up picking the two most popular symbols for those companies with two matches.
def find_title(title, used, lookup, threshold=90):
    # get all the matches
    results = process.extract(title, lookup["title-search"])
    for res in results:
        if res[1] < threshold:
            return
        idx = res[2]
        if idx not in used:
            # return first non-used result
            used.add(idx)
            row = lookup.loc[idx]
            return row['cik_str'], row['ticker'], res[0], res[1]
used = set()

find_title("APPLE", used, symbol_to_cik)
(320193, 'AAPL', 'APPLE', 100)
OK, fair enough. But how does it handle the symbols that were problems? Let’s see.
find_title("FOX", used, symbol_to_cik)
(1754301, 'FOXA', 'FOX', 100)
find_title("FOX", used, symbol_to_cik)
(1754301, 'FOX', 'FOX', 100)
OK, that seems to work. We see that both results have the same CIK, but we get two different stock symbols.
find_title("PARAMOUNT GLOBAL", used, symbol_to_cik)
(813828, 'PARA', 'PARAMOUNT GLOBAL', 100)
find_title("PUBLIC STORAGE", used, symbol_to_cik)
(1393311, 'PSA', 'PUBLIC STORAGE', 100)
Let’s make sure a fake company doesn’t work.
find_title("Vandalay Industries", set(), symbol_to_cik)
find_title("Vandalay Industries", set(), symbol_to_cik, threshold=50) # very low threshold
(1489393, 'LYB', 'LYONDELLBASELL INDUSTRIES N V', 86)
This seems to be working; we have to set the threshold below 90 to get a match. If multiple rows in the lookup match the name, it picks the first match, which happens to be the most popular due to the sorting of symbol_to_cik. If we wanted to choose a better algorithm for selecting the match, we could easily do that.
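For example, one possible variation (a sketch of mine, not the article’s code) asks thefuzz for just the single best match, using a scorer that ignores word order:

from thefuzz import fuzz, process

# token_sort_ratio sorts the words before comparing, so "A O SMITH" and
# "SMITH A O" score 100; extractOne returns only the single best candidate
process.extractOne("A O SMITH", symbol_to_cik["title-search"], scorer=fuzz.token_sort_ratio)

Swapping scorers changes the ranking, so you’d want to re-verify the problem cases above before committing to one.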
Applying fuzzy search to the full dataset
Now, before we try to run this on the full dataset, let’s see what we’re in for. (Note: when creating this article, I tried running the whole dataset first and saw that it was incredibly slow, as I expected.)
%timeit find_title("Public Storage", set(), symbol_to_cik)
614 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Running one title takes a bit less than a second. So we’d expect the full 505 symbols to take 5-7 minutes. While that’s not a crazy amount of time, this is not something we’d want to run multiple times or on a larger data set. The idea is, we’d run this one time to make a mapping, then use that in the future.
used.clear()  # we want to restart

cik_symbol = nport["title-search"].apply(lambda t: find_title(t, used, symbol_to_cik))
Note that cik_symbol is a Series with the same index as nport, where each row holds the tuple returned by find_title. Let’s put the CIK and symbol into our DataFrame. The reason this is slow is that each title is passed into the find_title function, and that function has to run the fuzzy search over every name in the lookup for each row (all 11K+ of them).
Checking the fuzzy results
Let’s first check if any of the returned values were not set.
print(cik_symbol.head())
print("There are ", cik_symbol[pd.isnull(cik_symbol)].shape[0], "missing values")
0         (773840, HON, HONEYWELL INTERNATIONAL, 100)
1    (1393612, DFS, DISCOVER FINANCIAL SERVICES, 100)
2                              (37785, FMC, FMC, 100)
3                         (72331, NDSN, NORDSON, 100)
4    (1100682, CRL, CHARLES RIVER LABORATORIES INTE...
Name: title-search, dtype: object
There are  6 missing values
This is pretty good, only 6 of our values didn’t find a match. We don’t know yet if the matches are correct, but matching almost 500 names with a fuzzy score of 90+ is a good start.
Since both nport and cik_symbol have the same index, we can just update the data directly. Since there were null rows above, we need to handle null in our update. There is a tuple of 4 values in each row of the Series, so we need to apply a lambda to the data, returning the first (CIK) and second (symbol) values appropriately. I use a little trick here with the lambda: if the row (passed into the lambda as x) is not None, the and part is returned with the correct value; otherwise, it returns the or part (which is None).
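Here’s the trick in isolation, with toy values of my own:

pick_first = lambda x: x and x[0] or x

print(pick_first((320193, 'AAPL', 'APPLE', 100)))  # tuple is truthy, so x[0] comes back
print(pick_first(None))                            # None is falsy, so the or part (x) comes back

# one caveat of the and/or idiom: if x[0] itself were falsy (say 0), the or branch
# would fire and return x instead; that can't bite us here because CIKs are never 0

Now the actual update: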
nport['cik'] = cik_symbol.apply(lambda x: x and x[0] or x)
nport['symbol'] = cik_symbol[~pd.isnull(cik_symbol)].apply(lambda x: x and x[1] or x)
Now let’s look at our missing values to see if we can figure out why we didn’t get a match.
nport.loc[pd.isnull(nport['symbol']), ['title', 'title-search', 'cusip', 'symbol']]
                              title                title-search      cusip  \
175                  Lowe's Cos Inc                      LOWE'S  548661107
203            Federal Realty OP LP           FEDERAL REALTY OP  313745101
327         Sherwin-Williams Co/The            SHERWIN-WILLIAMS  824348106
365    Jacobs Engineering Group Inc    JACOBS ENGINEERING GROUP  469814107
383  JB Hunt Transport Services Inc  JB HUNT TRANSPORT SERVICES  445658107
388            Nielsen Holdings PLC            NIELSEN HOLDINGS  000000000

    symbol
175    NaN
203    NaN
327    NaN
365    NaN
383    NaN
388    NaN
Final cleanup techniques
First, we will try Lowe’s without the apostrophe.
find_title("LOWES", set(), symbol_to_cik, threshold=50)
(60667, 'LOW', 'LOWES COMPANIES', 90)
The fix for this is to add a ' to our replacement regex earlier. Now what about Federal Realty OP?
find_title("FEDERAL REALTY OP", set(), symbol_to_cik, threshold=50)
(34903, 'FRT', 'FEDERAL REALTY INVESTMENT TRUST', 86)
find_title("FEDERAL REALTY", set(), symbol_to_cik, threshold=50)
(34903, 'FRT', 'FEDERAL REALTY INVESTMENT TRUST', 90)
The fix here is to add OP to our replacement words earlier. Then we’d find a match. What about the rest?
find_title("SHERWIN-WILLIAMS", set(), symbol_to_cik, threshold=50)
(89800, 'SHW', 'SHERWIN WILLIAMS', 100)
It looks like for Sherwin-Williams, the symbol was “stolen” by another company. Who has the value for SHW? That will need to be corrected. We can do another search for the result for that company and swap their tickers.
nport.loc[nport['symbol'] == 'SHW', 'title']
95    Williams Cos Inc/The
Name: title, dtype: object
# pass in our previously used values so it doesn't pick 'Sherwin-Williams' again.
# What's the second choice?
find_title("Williams", used, symbol_to_cik, threshold=50)
(107263, 'WMB', 'WILLIAMS COMPANIES', 90)
find_title("JACOBS", set(), symbol_to_cik, threshold=50)
(52988, 'J', 'JACOBS SOLUTIONS', 90)
Jacobs has a fairly different name, and is right on the threshold.
find_title("JB HUNT TRANSPORT SERVICES", set(), symbol_to_cik, threshold=50)
(728535, 'JBHT', 'HUNT J B TRANSPORT SERVICES', 89)
JB Hunt is just below the threshold due to the initials being after the name.
find_title("NIELSEN HOLDINGS", set(), symbol_to_cik, threshold=50)
(1585689, 'HLT', 'HILTON WORLDWIDE HOLDINGS', 86)
In 2022, Nielsen Holdings was taken private. It showed up in an earlier report, but is not active and thus can’t be mapped. Let’s just fix these all manually.
nport.loc[pd.isnull(nport['symbol']), 'symbol'] = ('LOW', 'FRT', 'SWH', 'J', 'JBHT', 'NLSN')
nport.loc[95, 'symbol'] = 'WMB'  # fix the "stolen" symbol
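Since the whole point is to run this expensive matching once and then reuse it, this is a good place to persist the finished mapping. A minimal sketch, with a hypothetical file name:

# save the finished mapping so we never have to re-run the slow fuzzy search
# ("cusip_to_symbol.csv" is a made-up name; pick whatever fits your project)
nport[['cusip', 'symbol', 'cik']].to_csv("cusip_to_symbol.csv", index=False)

# later, reload it as a lookup; dtype=str preserves the leading zeros in CUSIPs
mapping = pd.read_csv("cusip_to_symbol.csv", dtype={"cusip": str})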
How did we do?
Phew. If you stayed with me through this entire process, congratulations. That was a lot of work. At this point, we have a DataFrame that has all the CUSIP and symbol entries. How did we do?
As I said earlier, this mapping can be hard to find for free for all symbols, but we can grab the current symbol list of S&P 500 companies from a free source. This will tell us how much we overlap and give a rough idea of how we did. Pandas makes grabbing a table from Wikipedia pretty easy.
sp500 = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")[0]
sp500['Symbol']
0       MMM
1       AOS
2       ABT
3      ABBV
4      ABMD
       ...
498     YUM
499    ZBRA
500     ZBH
501    ZION
502     ZTS
Name: Symbol, Length: 503, dtype: object
set(nport['symbol']).difference(set(sp500['Symbol']))
{'ATI', 'BF-B', 'BRK-B', 'DRE', 'EMBC', 'NLSN', 'PVH', 'STEM', 'SVFA', 'SWH', 'TWTR', 'VNT'}
set(sp500['Symbol']).difference(set(nport['symbol']))
{'ACGL', 'BF.B', 'BRK.B', 'CSGP', 'EQT', 'INVH', 'PCG', 'SHW', 'TRGP', 'VFC'}
Not bad. The symbols BF.B and BRK.B are just different ways of representing the same symbols as BF-B and BRK-B (and are a continual pain to those of us in finance). We already know about NLSN. TRGP, PCG, EQT, CSGP, and INVH recently replaced NLSN, CTXS, DRE, PVH, and PENN in the index, respectively. With a little bit more manual updating, we could get to 100% accuracy.
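One way to reconcile the two share-class conventions before comparing (a small sketch of my own, not from the article):

# Wikipedia writes share classes with '.' (BF.B) while EDGAR's file uses '-' (BF-B);
# normalizing one side removes those false mismatches from the difference
wiki_symbols = {s.replace(".", "-") for s in sp500['Symbol']}
set(nport['symbol']).difference(wiki_symbols)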
Summary
This article covered several techniques for matching data between two data sources using Python and pandas. We tried a simple exact merge, then some data cleanup and normalization followed by another merge, then eliminated duplicate data, and then used fuzzy search. Finally, we did some manual verification and cleanup.