The Jargonator T-800 Newsletter Entry
Hasta La Vista, Jargon! Plus, the best of the Data Score Newsletter so far
Welcome to the Data Score newsletter, composed by DataChorus LLC. The newsletter is your go-to source for insights into the world of data-driven decision-making. Whether you're an insight seeker, a unique data company, a software-as-a-service provider, or an investor, this newsletter is for you. I'm Jason DeRise, a seasoned expert in the field of data-driven insights. As one of the first 10 members of UBS Evidence Lab, I was at the forefront of pioneering new ways to generate actionable insights from alternative data. Before that, I successfully built a sell-side equity research franchise based on proprietary data and non-consensus insights. After moving on from UBS Evidence Lab, I’ve remained active in the intersection of data, technology, and financial insights. Through my extensive experience as a purchaser and creator of data, I have gained a unique perspective, which I am sharing through the newsletter.
There’s so much jargon in financial markets, data, and technology. At the end of each Data Score newsletter, there is a long list of jargon defined in (hopefully) simple terms. This entry in the Data Score will become a living document, capturing all jargon ever used in the Data Score.
As of October 2nd, 2023, there were already well over 100 terms defined in the previous articles, since first publishing The Data Score in April 2023. That’s a lot of jargon to demystify. Going forward, I’ll continue adding footnotes to future entries and also come back to this entry and update it as a living document.
Before the Jargonator begins terminating abbreviations and technical terms, I’d like to share the top 5 most viewed articles since we began the Data Score in early April 2023.
#1: A different Approach to Revenue Estimates Leveraging Alternative Data https://thedatascore.substack.com/p/a-different-approach-to-revenue-estimates
#2: NVIDIA: How Could Alternative Data Be Used to Assess Its Long-Term Potential? https://thedatascore.substack.com/p/nvidia-how-could-alternative-data
#3: 8 Point Approach to Evaluating Data Partners https://thedatascore.substack.com/p/8-point-approach-to-evaluating-data
#4: Data Deep Dive: Web Mining Shows Inflections in Apparel Sector Results https://thedatascore.substack.com/p/data-deep-dive-web-mining-shows-inflections
#5: Blending AI and Human Creativity: Generative AI and Content Strategy https://thedatascore.substack.com/p/blending-ai-and-human-creativity
Thanks for sharing, commenting, and liking each Data Score newsletter entry. I couldn’t have imagined getting to 600+ subscribers so quickly.
As I started the newsletter, I wasn’t sure about the level of depth needed to dig into the topics related to generating valuable insights from data and technology. The feedback has been positive about how deep I’ve gone into the topics, provided relevant examples, and explored the nuances associated with generating insight from data. The feedback has also been positive about demystifying what happens downstream when decision-makers use the data and how the data and technology come together to create valuable data products.
If you continue to find it valuable, please feel free to share the newsletter with your colleagues and network.
Also, please check out my recent appearance on The Alternative Data Podcast:
Mark Fleming-Williams asked me a wide range of questions about my experience in the financial markets, alternative data, how to structure a data team to generate high impact at scale, and my views on how the industry would evolve going forward. I enjoyed the conversation and hope you do too!
Let the jargon destruction begin:
“The Jargonator T-800” Hasta La Vista, Jargon: In “Blending AI and Human Creativity: Generative AI and Content Strategy,” I outlined how I use LLMs to help me as an editor after I create the draft content. I ask ChatGPT to identify jargon that needs to be defined to improve clarity. Not only does it list the jargon, but it also provides definitions. I typically rewrite the definitions to improve accuracy and make them more approachable to the audience, and I include them as footnotes.
https://thedatascore.substack.com/p/blending-ai-and-human-creativity
This is every term that has been defined in the Data Score Newsletter. It can be used as a reference anytime you hear a term in either the finance or data world and need a simple explanation.
If anything is confusing or you’d like more details, please reach out to let me know. If there’s a term you've been hearing a lot lately that's not on this list, feel free to let me know as well.
The complete list of Jargon defined in the Data Score Newsletter (initial publication October 2, 2023, but will be updated going forward).
A/B Testing: a way to compare two versions of something to figure out which performs better. While it’s most often associated with websites and apps, the method is almost 100 years old and it’s one of the simplest forms of a randomized controlled experiment. https://hbr.org/2017/06/a-refresher-on-ab-testing
Accelerated Computing: This term refers to using GPUs to perform tasks traditionally handled by CPUs. It's central to the functions of NVIDIA's products and the competitive landscape.
AIS (Automatic Identification System): A tracking system used on ships and by vessel traffic services for identifying and locating vessels by electronically exchanging data with other nearby ships and AIS Base stations.
Alpha: A term used in finance to describe an investment strategy's ability to beat the market or generate excess returns. A simple way to think about alpha is that it’s a measure of the outperformance of a portfolio compared to a pre-defined benchmark for performance. Investopedia has a lot more detail https://www.investopedia.com/terms/a/alpha.asp
Alternative Data Council at FISD: “Founded in January 2019, the Alternative Data Council is series of working groups and information-sharing forums within FISD. It was created as part of the FISD Executive Committee’s strategic initiative to engage the alternative data community. We establish best practices for the delivery of alternative data to the investment industry and provide opportunities for education, information sharing, and networking.” https://fisd.net/alternative-data-council/
Alternative data: Alternative data refers to data that is not traditional or conventional in the context of the finance and investing industries. Traditional data often includes factors like share prices, a company's earnings, valuation ratios, and other widely available financial data. Alternative data can include anything from transaction data, social media data, web traffic data, web mined data, satellite images, and more. This data is typically unstructured and requires more advanced data engineering and science skills to generate insights.
Alternative Investment Fund is a broad classification for institutional investors who are focused on private equity, private credit, and venture capital investments, but can include other types of non-traditional equity or fixed income market investments too.
Anchoring Bias: A cognitive bias that involves relying too heavily on the first piece of information encountered (the "anchor") when making decisions.
Anomaly Detection: In data analysis, a statistical technique used to identify unusual patterns that do not conform to expected behavior because they deviate materially from the historic distribution of the data.
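As a rough illustration of the idea, here is a minimal sketch that flags new data points sitting far from the historical mean. The function name, the sample numbers, and the three-standard-deviation threshold are all assumptions made for the example; real anomaly detection often uses more robust methods.

```python
from statistics import mean, stdev

def flag_anomalies(history, new_values, threshold=3.0):
    """Return new values more than `threshold` standard deviations
    away from the mean of the historical data."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation of the history
    return [x for x in new_values if abs(x - mu) > threshold * sigma]

# Hypothetical daily readings of some metric
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(flag_anomalies(history, [101, 130, 96]))  # [130]
```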
API (Application Programming Interface): Allows software programs to communicate with each other, pulling or pushing data between applications.
App analytics: This refers to the measurement of user engagement and usage patterns within a mobile app. It can help identify how users interact with the app, what features are most used, and where users are facing issues.
Artificial Intelligence (AI): The simulation of human intelligence processes by machines, especially computer systems, which include learning, reasoning, problem-solving, perception, and language understanding.
Availability Heuristic: A mental shortcut that relies on immediate examples that come to mind when evaluating a specific topic, concept, method, or decision.
Backtesting: The process of testing a trading strategy or model using historical data to evaluate its performance before applying it in production. The goal is to determine how well the model would have performed in the past and, by extension, how it might perform in the future.
Ballasting Pattern (AIS Data): Refers to the practice of loading ballast water into a ship to increase its stability and maneuverability.
Bayesian approach: The Bayesian approach is a statistical method that uses prior knowledge or beliefs to update and revise probabilities based on new evidence or data. It allows for the incorporation of prior information and the updating of probabilities as new information becomes available.
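One simple way to picture the updating step is the Beta-Binomial model, sketched below. The prior (roughly a 20% rate) and the observed counts are hypothetical numbers chosen for illustration, and this is only one of many ways to apply a Bayesian approach.

```python
def update_beta(prior_alpha, prior_beta, successes, failures):
    """Combine a Beta prior with observed successes and failures
    to get the parameters of the Beta posterior."""
    return prior_alpha + successes, prior_beta + failures

def beta_mean(alpha, beta):
    """Expected rate implied by a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (2, 8)  # hypothetical prior belief: about a 20% rate
posterior = update_beta(*prior, successes=30, failures=70)

print(beta_mean(*prior))      # 0.2
print(beta_mean(*posterior))  # 32 / 110, about 0.29
```

The new evidence (30 successes in 100 trials) pulls the estimate from 20% toward the observed 30%, without discarding the prior entirely.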
Berths (AIS Data): A location on the terminal used specifically for mooring vessels.
Best-Seller Share: Websites often provide a ranking of products by sales to help customers find items they would likely buy quickly (before they leave for another competitor website). By collecting this rank information and setting a threshold for being a best seller, the share of best sellers can be tracked as a proxy for demand.
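A minimal sketch of the calculation, assuming the collected data is a list of (brand, rank) pairs and that "best seller" means a rank at or above a chosen threshold. The brands, ranks, and threshold below are hypothetical.

```python
from collections import Counter

def best_seller_share(ranked_products, threshold=10):
    """Share of best-seller slots held by each brand, where a best
    seller is any product ranked at or above `threshold`."""
    best = [brand for brand, rank in ranked_products if rank <= threshold]
    counts = Counter(best)
    return {brand: n / len(best) for brand, n in counts.items()}

# Hypothetical ranking data collected from a retailer's website
products = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 12)]
print(best_seller_share(products, threshold=4))
# {'A': 0.5, 'B': 0.25, 'C': 0.25}
```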
Beta: In finance, beta is a measure of investment portfolio risk. It represents the sensitivity of a portfolio's returns to changes in the market's returns. A beta of 1 means the investment's price will move with the market, while a beta less than 1 means the investment will be less volatile than the market. In the context of data, beta refers to the data’s ability to explain the market’s movements because the data is widely available and therefore fully digested into the share price almost immediately. This level of market pricing efficiency means there’s not much alpha to be generated, but the data is still needed to understand why the market is moving.
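For the finance meaning of beta, a common way to estimate it is the covariance of the asset's returns with the market's returns divided by the variance of the market's returns. The sketch below assumes that approach, and the return series are made up for illustration.

```python
from statistics import mean

def beta(asset_returns, market_returns):
    """Estimate beta as cov(asset, market) / var(market)."""
    avg_asset = mean(asset_returns)
    avg_market = mean(market_returns)
    cov = sum((a - avg_asset) * (m - avg_market)
              for a, m in zip(asset_returns, market_returns))
    var = sum((m - avg_market) ** 2 for m in market_returns)
    return cov / var

# Hypothetical returns: the asset moves twice as much as the market
market = [0.01, -0.02, 0.03, 0.00]
asset = [2 * m for m in market]
print(round(beta(asset, market), 6))  # 2.0
```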
Bill of lading data: A bill of lading is a legal document between a shipper and carrier detailing the type, quantity, and destination of the goods being carried. The bill of lading also serves as a shipment receipt when the carrier delivers the goods to their predetermined destination.
Black box model: A model whose inner workings are hidden or not understandable to the user.
Bulker (AIS Data): A type of ship designed to transport unpackaged bulk commodities, such as grains, coal, ore, and cement, in its cargo holds.
Bullish and Bearish are financial market jargon for positive (Bullish) and negative (Bearish) opinions about what will happen next for an industry or investment.
Buyside typically refers to institutional investors (Hedge funds, mutual funds, etc.) who invest large amounts of capital, and Sellside typically refers to investment banking and research firms that provide execution and advisory services (research reports, investment recommendations, and financial analyses) to institutional investors.
CAC: Customer Acquisition Cost (CAC) is the total cost of acquiring a new customer, including marketing and sales expenses. It is an important metric for evaluating the efficiency of a company's customer acquisition efforts.
Calibration: In the context of AI, calibration refers to adjusting the model's predictions to align more closely with reality or with the user's needs. For instance, adjusting the model to generate more or less formal text based on feedback.
Capex (Capital Expenditure): This is the money a company spends on acquiring, maintaining, or improving physical assets such as buildings, equipment, or technology.
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart): Tools used by websites to distinguish human users from automated scripts or bots.
Catastrophic Forgetting: A phenomenon in machine learning where a model, after being trained on new tasks, completely forgets the old tasks it was trained on. This is a significant issue in neural networks and an ongoing area of research in the development of AI models that can retain knowledge from previous tasks while learning new ones.
Causal Analysis/Causal AI: A method of identifying relationships that suggest causation rather than mere correlation in statistical data, focusing on determining what affects an outcome.
Clean room technologies: A data processing method where two or more parties can analyze combined datasets without directly sharing the raw data. This ensures data privacy and compliance, especially in environments with strict data regulations like GDPR.
Clickstream data: Clickstream, or web traffic data, refers to the record of the web pages a user visits and the actions they take while navigating a website. Clickstream data can provide insights into user behavior, preferences, and interactions on a website or app.
COGS (Cost of Goods Sold): This is the cost of creating the goods or services a company sells, which will vary depending on the type of business.
Cohort: In the context of this article, a cohort refers to a group of users or customers that share a common characteristic, such as the time they started using a product or the type of product they use. Analyzing cohorts can help businesses understand user behavior, product adoption, and retention patterns.
Canary Deployment: A software deployment strategy that releases changes to a small subset of users or systems before rolling it out to the entire infrastructure, used to catch potential issues early.
Confirmation Bias: A tendency to search for, interpret, favor, and recall information in a way that confirms one's preexisting beliefs or hypotheses.
Configurable parameters: In the context of data, “configurable parameters” are aspects of a data product that can be easily adjusted or customized by users to meet their specific needs or preferences.
Conjoint analysis: A survey-based market research technique used to understand how consumers value different features of a product or service.
Consensus: “The consensus” is the average view of the sell-side for a specific financial measure. Typically, it refers to revenue or earnings per share (EPS), but it can be any financial measure. It is used as a benchmark for what is currently factored into the share price and for assessing if new results or news are better or worse than expected. However, it is important to know that sometimes there’s an unstated buyside consensus that would be the better benchmark for expectations.
Container (AIS Data): A container or cargo vessel is a type of ship designed to transport goods in large, standardized containers. These vessels are equipped with special infrastructure, such as cranes or guides, to load and unload the containers. The size of container ships can vary significantly, from small coastal feeder ships to massive vessels capable of carrying thousands of twenty-foot equivalent units (TEUs).
Content Interface Testing: A type of testing that ensures the interactions between different software components or systems function correctly and reliably.
Corporate Actions: Events initiated by a public company that bring changes to its securities, such as stock splits, dividends, mergers, and acquisitions.
Cross-dock cargo drops (AIS Data): The process of unloading goods from an incoming vessel and loading these goods directly onto outbound vessels, with little or no storage in between.
CPG (Consumer Packaged Goods): Refers to items used daily by average consumers that require routine replacement or replenishment. Examples include food, beverages, tobacco, and household products.
Crowded trade: When a trade is crowded, it means many investors have taken on the same investment, which can be seen in the trading flows data showing excess buying or selling of a specific investment asset.
C-suite: Top executives at a company - CEO, CFO, COO, etc.
Data Governance: The overall management of the availability, usability, integrity, and security of the data employed in an organization. “Data governance is everything you do to ensure data is secure, private, accurate, available, and usable. It includes the actions people must take, the processes they must follow, and the technology that supports them throughout the data life cycle.” (Google Cloud’s definition: https://cloud.google.com/learn/what-is-data-governance)
Data Harvesting: Also known as data extraction, this is the process of extracting large amounts of data from various sources for processing and analysis.
Data Lake: A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. https://aws.amazon.com/what-is/data-lake/
Data Monetization: The process by which corporations that accumulate large quantities of data through their day-to-day operations create an additional revenue stream. This is achieved by refining (cleansing and enriching) and structuring their data in such a way that it can be packaged and sold, often to external entities such as investment communities, for various purposes.
Data pipeline: A set of data processing elements or tasks connected in series, where the output of one element is the input of the next one, converting raw data into cleansed and enriched data, typically managed on an automated schedule.
Data Protection Laws: Legislation intended to protect individuals' personal data in the context of professional or commercial activity. An example is the GDPR (General Data Protection Regulation) in Europe.
Data schema: In the context of databases, a data schema is an outline or a blueprint of how data is organized and accessed. It describes both the structure of the data and the relationships between data entities.
Data steward: Person responsible for data governance, including quality, availability, and security.
DCF (Discounted Cash Flow): A method for company valuation based on projected future cash flows.
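The core arithmetic is discounting each projected cash flow back to today. The sketch below shows only that step, with hypothetical cash flows and a hypothetical 10% discount rate; a full valuation would usually also include a terminal value and other adjustments.

```python
def dcf_value(cash_flows, discount_rate):
    """Present value of projected cash flows, where cash_flows[0]
    arrives one year from now, cash_flows[1] in two years, and so on."""
    return sum(cf / (1 + discount_rate) ** year
               for year, cf in enumerate(cash_flows, start=1))

# Hypothetical projections: 110 next year and 121 the year after
print(round(dcf_value([110, 121], 0.10), 2))  # 200.0
```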
Deadweight tonnage (AIS Data): The carrying capacity of a ship, which includes cargo, fuel, passengers, crew, and their belongings.
DevOps: The combination of Development and Operations in a continuous cycle typically used in software development. The goal is for applications to be created quickly and deployed into production on an ongoing basis, where the operations team is involved in the process and provides consistent feedback to the development team to handle new feature requirements and the removal of bugs.
Diffusion Index: An economic indicator that represents the net number of positive signals or conditions occurring in a given set.
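Conventions differ between publishers of diffusion indexes. The sketch below assumes one simple version: the percentage of components improving minus the percentage deteriorating. The input values are hypothetical period-over-period changes.

```python
def diffusion_index(changes):
    """Percent of components rising minus percent falling.
    Ranges from -100 (all falling) to +100 (all rising)."""
    up = sum(1 for c in changes if c > 0)
    down = sum(1 for c in changes if c < 0)
    return 100 * (up - down) / len(changes)

# Hypothetical changes for five components: three up, one down, one flat
print(diffusion_index([1, 2, -1, 0, 3]))  # 40.0
```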
Direct Channel: The means by which a company sells its products directly to consumers, bypassing any third-party retailers, wholesalers, or any other intermediaries.
Distributed ledger technology is a platform that uses ledgers stored on separate, connected devices in a network to ensure data accuracy and security. Blockchains evolved from distributed ledgers to address growing concerns that too many third parties are involved in too many transactions. https://www.investopedia.com/terms/d/distributed-ledger-technology-dlt.asp
Draft/Draught (AIS Data): The vertical distance between the waterline and the bottom of the hull (keel), which indicates the deepest point of a vessel. It is used to determine the depth of water needed for a vessel to safely navigate and dock at a port.
Economies of Scale: Economies of scale are cost advantages reaped by companies when production becomes efficient. Companies can achieve economies of scale by increasing production and lowering costs. This happens because costs are spread over a larger number of goods. Costs can be both fixed and variable. https://www.investopedia.com/terms/e/economiesofscale.asp
Economies of Scope: Economies of scope refer to cost advantages that a business obtains due to a broader scope of operations, often achieved by producing a variety of products or services using the same operations or resources.
ELT (Extract, Load, Transform): A process for data integration. Similar to ETL, but in this process, raw data is loaded directly into the data warehouse. The transformation step happens later, whenever the data is needed. This is a more common approach when using cloud computing, which allows for data to be saved at a lower cost as well as the flexibility to expand the size of the storage.
EPS (Earnings per share): the net income of the company divided by the number of outstanding shares.
ESG: Environmental, Social, and Governance (ESG) refers to the three central factors in measuring the sustainability and societal impact of an investment in a company or business.
ETF (Exchange-Traded Fund): An ETF is an investment fund traded on stock exchanges, much like stocks. An ETF holds assets such as stocks, commodities, or bonds and generally operates with an arbitrage mechanism designed to keep it trading close to its net asset value, although deviations can occasionally occur. ETFs offer a cost-effective, liquid, and flexible way for investors to purchase a diversified portfolio that tracks a particular index, sector, commodity, or other asset classes. Unlike mutual funds, which are priced at the end of each trading day, ETFs are bought and sold throughout the day at market price, offering more flexibility for investors.
ETL (Extract, Transform, Load): Process for moving data from sources into a data warehouse and transforming the data before storing the transformed data on the server. This was the common approach when data was stored on physical servers, which limited the amount of data that could be saved.
Evergreen: Products that are consistently available for sale over the long-term.
Exhaust data: refers to the data generated as a by-product of regular organizational activities and processes. This data can sometimes be repurposed or sold, offering potential additional value or revenue streams.
Explainability: In the context of machine learning, this refers to the degree to which a machine learning model's behavior can be understood by humans.
F1 Score: The harmonic mean of precision (of the positive predictions made, what percentage were correct) and recall (of the actual positives, what percentage were predicted correctly). It is used as a way to generate a single measure of a model’s prediction accuracy.
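A short worked example of the calculation, using hypothetical counts of true positives, false positives, and false negatives:

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical model: 8 correct hits, 2 false alarms, 8 misses
# precision = 8 / 10 = 0.8, recall = 8 / 16 = 0.5
print(round(f1_score(8, 2, 8), 3))  # 0.615
```

Because it is a harmonic mean, the score is pulled toward the weaker of the two measures, so a model cannot score well by excelling at only one of them.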
FAB: Short for Fabrication plant, a factory where devices such as integrated circuits are manufactured.
Factor investing: an investment approach that involves targeting quantifiable firm characteristics or “factors” that can explain differences in stock returns. Security characteristics that may be included in a factor-based approach include size, low-volatility, value, momentum, asset growth, profitability, leverage, term and carry. https://en.wikipedia.org/wiki/Factor_investing
Fault Injection: A testing technique used to improve the robustness of a system by deliberately introducing errors or faults and observing how the system responds.
False Positive: An error in data interpretation in which a test result incorrectly indicates the presence of a condition (such as a pattern or trend) when it is not actually present.
FinBert: A variant of BERT (Bidirectional Encoder Representations from Transformers) adapted for financial contexts. BERT, developed by Google, was a breakthrough in the field of natural language processing (NLP): its models are designed to understand the context of a word in a sentence more effectively than previous NLP models. FinBert leverages the advanced capabilities of BERT models while being fine-tuned to address the specific language and analytical needs of the finance sector.
Fine-tuning: This is a process in machine learning where a pre-trained model (like an LLM) is further trained on a more specific dataset to adapt to the particular task at hand. For example, fine-tuning ChatGPT could involve training it on a specific author's writing style.
FISD: "the global forum of choice for industry participants to discuss, understand, and facilitate the evolution of financial information for the key players in the value chain, including consumer firms, third-party groups, and data providers. It is a dynamic environment in which members identify the trends that will shape the industry and create education opportunities and industry initiatives to address them.” https://fisd.net/about-us/#fisd-about-mission
FOMO: Fear Of Missing Out
Freedom of Information Act (FOIA) Requests: A legal process by which individuals or organizations can request access to government-held information, which must be released unless it falls under specific exemptions.
Fundamental analysis: Assessing investment assets based on underlying economic and financial factors, typically by creating a model that forecasts the financial statements of an entity, sector, or financial market. Valuation methodologies are then typically applied to the forecasted financial statements to derive the value of the entity.
Fundamental Discretionary Investors: Institutional investors that rely on portfolio managers’ judgment and decision-making to allocate capital (leveraging varying degrees of statistical, data-driven analysis).
Fuzzing: An automated software testing technique that involves providing invalid, unexpected, or random data to the inputs of a program to find security vulnerabilities and coding errors.
Generative AI: AI models that can generate data like text, images, etc. For example, a generative AI model can write an article, paint a picture, or even compose music.
Geospatial data: Information that has a geographical component, such as coordinates, addresses, or areas, related to points of interest and the movements of people in and around those points of interest. It is used in various analyses to examine trends spatially.
GIS (Geographic Information System): A system designed to capture, store, manipulate, analyze, manage, and present all types of geographical data.
GPUs: An acronym for "Graphics Processing Units." These are specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device.
Hallucination: In AI, hallucination refers to instances where the model generates information that wasn't in the training data, makes unsupported assumptions, or provides outputs that don't align with reality. Or as Marc Andreessen noted on the Lex Fridman podcast, “Hallucinations is what we call it when we don’t like it, and creativity is what we call it when we do like it.”
Heuristic: an experience-based approach to problem solving that relies on rules of thumb.
Hindsight bias: Hindsight bias is a psychological phenomenon that allows people to convince themselves after an event that they accurately predicted it before it happened. This can lead people to conclude that they can accurately predict other events. Hindsight bias is studied in behavioral economics because it is a common failing of individual investors. https://www.investopedia.com/terms/h/hindsight-bias.asp
hiQ Labs, Inc. v. LinkedIn Corp.: Reference to a landmark legal case that addressed the legality of web scraping publicly available data. Wikipedia page on the case: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
Human-in-the-loop: This is an approach to AI and machine learning where a human collaborates with the AI model during its operation, guiding its learning and correcting its output.
Human-on-the-Loop: A quality control setup where AI operates autonomously, but humans monitor the process and intervene only when necessary.
Impact Investing: Investments made into companies, organizations, and funds with the intention to generate a measurable, beneficial social or environmental impact alongside a financial return.
Institutional investors: professional investors, like mutual funds, pensions, and endowments (aka the Buyside), who invest the money of others on their behalf. This is different from a retail investor, who is an individual or nonprofessional investor who buys and sells securities through brokerage firms or retirement accounts like 401(k)s.
Interpolate: A method of constructing new data points within the range of a set of known data points to fill in the blanks of missing or bad data points in order to reduce the noise in the dataset.
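There are many interpolation methods. The sketch below assumes the simplest one, a straight line between the nearest known points on either side of a gap, with missing values marked as `None`. The function name and sample series are hypothetical.

```python
def fill_gaps(values):
    """Linearly interpolate None entries that sit between two known
    values. Leading or trailing gaps are left untouched."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

# Hypothetical series with two missing observations
print(fill_gaps([10, None, None, 16, 18]))  # [10, 12.0, 14.0, 16, 18]
```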
Investment Thesis: A clear, definable idea or set of ideas that outlines an investor's expectations and reasons for the potential outcome of an investment.
Jobs To Be Done: A theory and methodology for understanding customer motivations and needs in business and product development, based on the idea that customers "hire" products or services to fulfill specific jobs.
Kernel-level Driver: Software that operates at the core of the operating system, managing communication between the hardware and the operating system. Errors at this level can cause significant system crashes.
Key Performance Indicators (KPIs): These are quantifiable measures used to evaluate the success of an organization, employee, etc. in meeting objectives for performance.
Large Language Models (LLMs): These are machine learning models trained on a large volume of text data. LLMs, such as GPT-4 or ChatGPT, are designed to understand context, generate human-like text, and respond to prompts based on the input they're given. They can simulate human-like conversation and can be used in a range of applications, from drafting emails to writing Python code and more. An LLM analyzes the input it receives and then generates an appropriate response, all based on the vast amount of text data it was trained on.
LBO (Leveraged Buyout): An acquisition of a company using a significant amount of borrowed money, typically taking the company from the public market (with tradable shares on the stock exchange) to private ownership.
Long Only Fund: These are funds that only buy investment positions and do not take short positions.
Long/Short Equity Hedge Fund: Long/Short Equity funds buy positions (long) in stocks they believe will go up in value and sell short stocks (short) that they believe will go down in value. Typically, there is a risk management overlay that pairs the long and short positions to be “market neutral,” meaning it doesn’t matter if the market goes up or down; what matters is that the long position outperforms the short position. Short selling, by a simplistic definition, is when an investor borrows stock from an investor who owns it and then sells the stock. The short seller will eventually need to buy back the stock at a later date to return it to the owner of the stock (and will profit if they buy back the stock at a lower price than they sell it).
Loughran-McDonald Lexicon: The Loughran-McDonald Lexicon is a specialized financial dictionary developed for the analysis of financial documents using natural language processing (NLP). The lexicon addresses a key challenge in the field of financial text analysis: the fact that common words often have different meanings in financial contexts compared to general usage. General-purpose sentiment dictionaries might misinterpret the sentiment of financial texts (for example, treating "liability" as a negative term, whereas in finance it's a neutral term referring to debts or obligations). The lexicon instead classifies words according to how they are used in financial documents.
LTV: Lifetime Value (LTV) is a metric that represents the total net profit a company can expect to make from a customer throughout their entire relationship with the company. It helps businesses understand the long-term value of their customers and make informed decisions about customer acquisition and retention strategies.
Machine Learning (ML): An application of AI that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed.
Master Data Management (MDM): A discipline in which business and information technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise's official shared single source of data truth. https://en.wikipedia.org/wiki/Master_data_management
Macro Funds: As a simplified summary, these funds follow an investing strategy based on global macroeconomic views, typically executed by investing across entire asset classes like fixed income, currencies, derivatives, and equities. They are not focused on company-specific, bottom-up investing choices.
Margin of Error: In statistics, the margin of error describes the amount of random sampling error in a survey's results.
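As a rough illustration of the margin of error, here is a minimal Python sketch for a survey proportion. It assumes a simple random sample and a 95% confidence level (z-score of 1.96); the function name and the survey numbers are made up for the example.

```python
import math

def margin_of_error(sample_proportion, sample_size, z_score=1.96):
    """Margin of error for a proportion, assuming a simple random sample."""
    standard_error = math.sqrt(
        sample_proportion * (1 - sample_proportion) / sample_size
    )
    return z_score * standard_error

# Hypothetical survey: 1,000 respondents, 50% say they plan to buy the product.
moe = margin_of_error(sample_proportion=0.5, sample_size=1000)
print(f"Margin of error: +/- {moe * 100:.1f} percentage points")
```

With these inputs, the result is roughly plus or minus 3 percentage points. Real survey panels are rarely simple random samples, so treat this as a lower bound on the uncertainty.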
Material, Non-Public Information (MNPI): Information about a company that is not publicly available and could have a significant impact on the company's stock price if it were made public.
Mean Absolute Percentage Error (MAPE): A statistical measure used to determine the accuracy of a forecasting method in predictive analytics, calculated as the average of the absolute percentage errors of each entry in a dataset.
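To make MAPE concrete, here is a minimal Python sketch. The function name and the revenue figures are hypothetical, and the sketch assumes no actual value is zero (the percentage error is undefined in that case).

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, expressed in percent."""
    if len(actuals) != len(forecasts):
        raise ValueError("actuals and forecasts must have the same length")
    # Absolute values stop over- and under-forecasts from cancelling out.
    errors = [abs((a - f) / a) for a, f in zip(actuals, forecasts)]
    return 100 * sum(errors) / len(errors)

# Hypothetical reported revenue versus a model's forecasts.
actual_revenue = [100.0, 200.0, 400.0]
forecast_revenue = [110.0, 190.0, 400.0]
print(f"MAPE: {mape(actual_revenue, forecast_revenue):.1f}%")
```

The individual errors here are 10%, 5%, and 0%, so the MAPE is about 5%.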
Metadata: Metadata is data that provides information about other data. In other words, it's data about data. It can be used to index, catalog, discover, and retrieve data.
Minimum Viable Product (MVP): A product version with just enough features to be usable by early customers, who can then provide feedback for future product development. The key here is the viable part of the definition, which often gets missed in favor of the minimum description.
MLOps or Machine Learning Operations: a practice for collaboration and communication between data scientists and operations professionals to help manage the production machine learning (or deep learning) lifecycle. It aims to shorten the development cycle of machine learning systems, provide high-quality and reliable delivery, and innovate based on continuous feedback and monitoring.
Model: In the context of machine learning, a model is a representation of what a machine learning system has learned from its training data. Training is the process of teaching a machine learning model to make predictions by providing it with data.
Multi-Manager (MultiStrat) Hedge Funds: Hedge funds that allocate capital across multiple portfolio managers, each managing a distinct segment of the fund’s total assets, often employing diverse strategies.
Natural Language Generation (NLG): This is a subfield of artificial intelligence (AI) focused on generating natural language text by machine. This can be used to produce reports, write essays, or answer questions in a natural, human-like way.
Natural Language Processing (NLP): An AI technology that allows computers to understand, interpret, and respond to human language in a quantitative way, generating statistical measures of sentiment and importance of topics.
Neural Networks: A subset of machine learning, neural networks (also known as artificial neural networks) are computing systems inspired by the human brain's network of neurons. They're designed to 'learn' from numerical data. They can learn and improve from experience, adapting to new inputs without being explicitly programmed to do so.
Neurodiversity: A concept and movement acknowledging and respecting neurological differences among people, such as autism, ADHD, dyslexia, etc., as natural variations within the human population.
Nowcasting: In order to systematically forecast the next reported economic or company-specific financial result, multiple sources of high-frequency data are combined. The model continuously updates the forecast with increasing accuracy as the volume of data covering the unknown period increases.
Null Hypothesis: In scientific research, the null hypothesis is the claim that the effect being studied does not exist. Note that the term "effect" here is not meant to imply a causative relationship. The null hypothesis can also be described as the hypothesis in which no relationship exists between two sets of data or variables being analyzed. If the null hypothesis is true, any experimentally observed effect is due to chance alone, hence the term "null". In contrast with the null hypothesis, an alternative hypothesis is developed, which claims that a relationship does exist between two variables. If the sample data is consistent with the null hypothesis, then you do not reject the null hypothesis. https://en.wikipedia.org/wiki/Null_hypothesis - I would add that failing to reject the null hypothesis does not mean the null hypothesis is true, but it does hurt the case that our hypothesis is true. Likewise, rejecting the null hypothesis should only increase confidence that the relationship described by our hypothesis is real. As the Wikipedia definition states, this does not mean there is proof of a causal relationship. But even when we fail to reject the null hypothesis, our hypothesis could still prove to be true. Keep on setting up new hypothesis tests, as it's not the end of the process in practical terms, especially in investing.
OKR (Objectives and Key Results): A goal-setting framework focused on transparent, measurable outcomes that align the organization. For more details check out the book “Measure What Matters” https://www.amazon.com/Measure-What-Matters-Google-Foundation/dp/0525536221
Omni Channel: A retail strategy that offers customers a consistent, seamless shopping experience (including the same inventory at the same price) across all channels (in-store, online, mobile, etc.).
On the loop: This term refers to a situation where humans are monitoring an automated process and intervene only when necessary, as opposed to being directly involved in the process (i.e., "in the loop").
Onshoring and Near-Shoring: Onshoring refers to the practice of bringing back business operations that were previously outsourced to a foreign country to the company's home country. Near-shoring refers to moving business operations to neighboring countries or those in close proximity rather than to a domestic location or a distant country.
Operating Leverage: One way to think about operating leverage is the sensitivity of operating profit growth (profit before interest and taxes) to changes in volume or revenue growth of the business. The primary driver of operating leverage is the mix of variable costs, which change as volumes change, and fixed costs, which do not change as volumes change. The larger fixed costs are as a portion of the business's costs, the more sensitive operating profit is to changes in volume and revenue, whether growth or declines.
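A small worked example can make operating leverage easier to see. The Python sketch below compares two hypothetical companies with the same starting revenue and operating profit but a different mix of fixed and variable costs. All names and numbers are assumptions for illustration only.

```python
def operating_profit(revenue, variable_cost_ratio, fixed_costs):
    """Revenue minus variable costs (a share of revenue) minus fixed costs."""
    return revenue * (1 - variable_cost_ratio) - fixed_costs

def profit_growth(revenue, revenue_growth, variable_cost_ratio, fixed_costs):
    """Operating profit growth that results from a given revenue growth rate."""
    before = operating_profit(revenue, variable_cost_ratio, fixed_costs)
    after = operating_profit(
        revenue * (1 + revenue_growth), variable_cost_ratio, fixed_costs
    )
    return after / before - 1

# Both hypothetical companies start with revenue of 100 and operating profit of 20.
high_fixed = {"variable_cost_ratio": 0.20, "fixed_costs": 60.0}
low_fixed = {"variable_cost_ratio": 0.70, "fixed_costs": 10.0}

for change in (0.10, -0.10):
    high = profit_growth(100.0, change, **high_fixed)
    low = profit_growth(100.0, change, **low_fixed)
    print(f"Revenue {change:+.0%}: high fixed cost {high:+.0%}, low fixed cost {low:+.0%}")
```

In this setup, a 10% change in revenue moves operating profit by about 40% for the high-fixed-cost company and about 15% for the low-fixed-cost company, in both directions.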
Optimal control: A mathematical optimization method for determining a control policy that will achieve the best possible outcome in a dynamic system. It is used in various fields, including finance, to optimize decision-making processes.
Outcome-based research: A research approach focused on understanding the desired outcomes or goals of users or customers rather than just the features of a product or service.
Overfitting: When a model matches the training data very well when back-tested but fails in real-world use cases when the model is applied to new data.
P-value: In null-hypothesis significance testing, the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis. Even though reporting p-values of statistical tests is common practice in academic publications of many quantitative fields, misinterpretation and misuse of p-values is widespread and has been a major topic in mathematics and metascience. In 2016, the American Statistical Association (ASA) made a formal statement that "p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone" and that "a p-value, or statistical significance, does not measure the size of an effect or the importance of a result" or "evidence regarding a model or hypothesis." That said, a 2019 task force by ASA has issued a statement on statistical significance and replicability, concluding with: "p-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data." https://en.wikipedia.org/wiki/P-value
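As a minimal sketch of how a p-value is calculated, consider a hypothetical signal that correctly called the direction of a company's results in 15 of 20 quarters. The null hypothesis assumed here is that the signal is no better than a coin flip. The function name and the numbers are illustrative assumptions, and this is a one-sided exact binomial test rather than the only way to test the idea.

```python
from math import comb

def one_sided_binomial_p_value(successes, trials, null_probability=0.5):
    """P(X >= successes) when X ~ Binomial(trials, null_probability)."""
    return sum(
        comb(trials, k)
        * null_probability**k
        * (1 - null_probability) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical track record: 15 correct calls out of 20 quarters.
p_value = one_sided_binomial_p_value(successes=15, trials=20)
print(f"p-value: {p_value:.3f}")
```

The result is about 0.02: if the signal really were a coin flip, a record this good or better would show up roughly 2% of the time. In line with the ASA statement above, that is not the same as a 2% chance that the signal is useless.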
Panel: In data, this typically refers to a group of individuals, households, or businesses whose behavior is tracked over time for research purposes. In the case of credit and debit card panels, they consist of a representative sample of consumers whose transaction data is tracked and analyzed over a period of time. The data collected from these panels can then be used to understand purchasing behaviors, track spending trends, and make predictions about future consumer behavior. Similarly, receipt panels would involve tracking and analyzing data from consumer receipts at the item level. It's important to note that the quality and representativeness of these panels can significantly impact the accuracy of the insights derived from them. For instance, if a panel disproportionately represents a certain demographic or geographic area, the data derived from it might not accurately reflect broader market trends.
P/E (Price-to-Earnings ratio): A valuation measure that is calculated by dividing a company's share price by earnings per share. This reflects the company’s equity value in terms of $1 of earnings. More details here: https://www.investopedia.com/terms/p/price-earningsratio.asp.
PE firms: Short for Private Equity firms, which are investment management companies that provide financial backing to and invest in the private equity of existing or startup companies. These are either companies not yet listed on financial market exchanges or public companies that the firm takes ownership of and removes from the public market.
PEG ratios (Price/Earnings-to-Growth): Compare the P/E to the estimated earnings per share growth rate. This reflects that companies with higher growth typically have higher P/E ratios, which is useful when comparing the value of two or more companies.
Personally Identifiable Information (PII): Any information that can be used to identify an individual, such as a name, social security number, address, or phone number.
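Here is a minimal Python sketch of the P/E and PEG calculations for two hypothetical companies. The prices, earnings, and growth rates are made up, and the sketch assumes the common convention of expressing the growth rate in percentage points (25 for 25%).

```python
def pe_ratio(share_price, earnings_per_share):
    """Price-to-earnings ratio."""
    return share_price / earnings_per_share

def peg_ratio(share_price, earnings_per_share, eps_growth_pct):
    """P/E divided by expected EPS growth, with growth in percentage points."""
    return pe_ratio(share_price, earnings_per_share) / eps_growth_pct

# Hypothetical companies: (share price, EPS, expected EPS growth in %).
companies = {
    "FastGrower": (50.0, 2.0, 25.0),
    "SlowGrower": (30.0, 2.0, 10.0),
}

for name, (price, eps, growth) in companies.items():
    print(name, "P/E:", pe_ratio(price, eps), "PEG:", peg_ratio(price, eps, growth))
```

In this made-up case, the faster grower has the higher P/E (25 versus 15) but the lower PEG (1.0 versus 1.5), which is the kind of comparison the PEG ratio is meant to surface.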
Point in time-stamped history: This phrase refers to a dataset that includes timestamps showing how the data has been revised. So it includes not only the time period the data relates to but also the date when each version of the data was originally released or revised. This allows investors to use the data in back-testing models as if it were seen in real time before revisions.
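The idea of a point-in-time history can be sketched in a few lines of Python. The record layout, the function name `as_of`, and the values are all hypothetical; real vendor datasets will have their own schema.

```python
from datetime import date

# Hypothetical records: (period, value, date the value was published).
history = [
    ("2023-Q1", 100.0, date(2023, 4, 15)),  # first release
    ("2023-Q1", 104.0, date(2023, 7, 15)),  # later revision of the same period
    ("2023-Q2", 110.0, date(2023, 7, 15)),  # first release
]

def as_of(records, as_of_date):
    """Latest value per period that had been published on or before as_of_date."""
    known = {}
    # Process in publication order so later revisions overwrite earlier ones.
    for period, value, published in sorted(records, key=lambda r: r[2]):
        if published <= as_of_date:
            known[period] = value
    return known

print(as_of(history, date(2023, 5, 1)))  # what a model could have seen in May
print(as_of(history, date(2023, 8, 1)))  # what it could have seen in August
```

A back-test run as of May would see only the original Q1 value of 100.0, while one run as of August would see the revised 104.0 plus the first Q2 release. Without the publication dates, the back-test would quietly use revised numbers that were not available at the time.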
Portfolio Company (PortCo): A company in which a private equity firm or venture capital firm has invested.
Precision and Recall: Precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of the total number of relevant instances that were actually retrieved. Put another way, precision is the percentage of the predictions that were right. Recall is the percentage of the actual answer that was correctly predicted. Both are used to measure the quality of a machine learning model.
More info can be found on Wikipedia: https://en.wikipedia.org/wiki/Precision_and_recall
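As a minimal sketch, precision and recall can be computed from a list of actual outcomes and a list of model predictions. The labels below are made up (1 means "positive," such as a company that beat estimates), and the function name is an assumption for illustration.

```python
def precision_recall(actual, predicted):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    # Guard against division by zero when nothing was predicted or nothing was positive.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical outcomes and predictions for eight companies.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here the model made three positive predictions and two were right (precision of about 0.67), but it found only two of the four actual positives (recall of 0.50).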
Product/Market Fit: The ability of a product to meet the needs of customers, generating strong and sustainable demand for the product. This term refers to the point at which a product or service has been optimized to meet the needs and preferences of its target market, resulting in strong customer satisfaction and retention. Achieving product/market fit is considered essential for the success of a startup or new product.
Programmatic Access: Refers to the method of accessing data or software functionalities through code or automated scripts, rather than manual interaction.
Prompt Engineering: Prompt engineering is the process of iterating a generative AI prompt to improve its accuracy and effectiveness. https://www.coursera.org/articles/what-is-prompt-engineering
Psychological leadership: Leading others through emotional intelligence, relationships, and communication.
Psychological safety: A culture where people feel comfortable taking risks and speaking up.
Quant funds: Short for "quantitative funds," also referred to as systematic funds. Systematic refers to a quantitative (quant) approach to portfolio allocation based on advanced statistical models and machine learning (with varying degrees of human involvement “in the loop” or “on the loop” managing the programmatic decision making).
Question Bursts: For more info on the benefits of question bursts, check out: https://mitsloan.mit.edu/ideas-made-to-matter/heres-how-question-bursts-make-better-brainstorms
Redundancy (in software): The inclusion of extra components that are not strictly necessary to functioning, used to increase reliability and prevent failure.
Regex: Short for regular expressions, regex is a programming tool for pattern matching within text. It uses specific sequences of characters to find, replace, or manipulate strings in data. It's essential for search functions and data manipulation.
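For a quick feel of what regex looks like in practice, here is a minimal Python sketch using the standard `re` module. The sample text and the "cashtag" pattern (a dollar sign followed by one to five capital letters) are assumptions for illustration, not a complete ticker-matching rule.

```python
import re

# Hypothetical social media post mentioning tickers with a "$" prefix.
text = "Bought $NVDA and $AAPL, sold $TSLA at $250."

# "\$" matches a literal dollar sign; "[A-Z]{1,5}" matches 1 to 5 capital letters;
# "\b" requires the match to end at a word boundary.
ticker_pattern = re.compile(r"\$([A-Z]{1,5})\b")

tickers = ticker_pattern.findall(text)
print(tickers)
```

The pattern picks up NVDA, AAPL, and TSLA but skips "$250" because digits are not part of the allowed character set.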
Remote sensing: The process of detecting and monitoring the physical characteristics of an area by measuring its reflected and emitted radiation from a distance, typically from a satellite.
Retail Channel: The various routes that retailers use to sell products to consumers, including physical stores and online platforms.
RLHF (Reinforcement learning from human feedback): This is a machine learning technique in which the model's decisions are scored with a reward-based system: the various decisions the model can take are given rewards or penalties. The human is in the loop in the training process, where their direct feedback alters the future actions of the model.
ROI (Return on Investment): a performance measure used to evaluate the efficiency or profitability of an investment or compare the efficiency of a number of different investments. ROI tries to directly measure the amount of return on a particular investment, relative to the investment’s cost. https://www.investopedia.com/terms/r/returnoninvestment.asp
ROIC (Return on Invested Capital): This financial metric measures a firm's profitability and the efficiency with which its capital is employed.
Rollback Testing: The process of testing the ability to revert a system or application to a previous state after a failed update or deployment.
R-Squared: A statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.
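R-squared can be computed directly from actual values and a model's fitted values as 1 minus the ratio of unexplained variation to total variation. The sketch below uses made-up numbers and a hypothetical function name.

```python
def r_squared(actuals, predictions):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actuals) / len(actuals)
    ss_total = sum((a - mean_actual) ** 2 for a in actuals)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actuals, predictions))
    return 1 - ss_residual / ss_total

# Hypothetical actual values and fitted values from a regression.
actuals = [1.0, 2.0, 3.0, 4.0]
predictions = [1.1, 1.9, 3.2, 3.8]
print(f"R-squared: {r_squared(actuals, predictions):.2f}")
```

In this example the fitted values explain about 98% of the variance in the actual values. A high R-squared on historical data does not by itself rule out overfitting.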
Scalable: A solution designed to handle growth, ensuring performance remains consistent as demand or volume increases.
Scanner transaction data: Information collected when a product's barcode is scanned at a point of sale. This data can be used to analyze consumer purchasing behavior, inventory management, and more.
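Section 204A: Refers to a section of the Investment Advisers Act of 1940 in the United States that requires investment advisers to establish, maintain, and enforce written policies and procedures designed to prevent the misuse of material, non-public information. The related Rule 204A-1 covers investment adviser codes of ethics.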
SG&A (Selling, General, and Administrative Expenses): The category of selling, general, and administrative expenses (SG&A) in a company's income statement includes all general and administrative expenses (G&A) as well as the direct and indirect selling expenses of the business. This line item includes nearly all business costs not directly attributable to making a product or performing a service. SG&A includes the costs of managing the company and the expenses of delivering its products or services. https://www.investopedia.com/terms/s/sga.asp. SG&A Ratio: A financial metric used to measure the selling, general, and administrative expenses as a percentage of total sales.
Short selling: As a simplistic definition, this is when an investor borrows stock from an investor who owns it and then sells the stock. The short seller will eventually need to buy back the stock at a later date to return it to the owner of the stock (and will profit if they buy back the stock at a lower price than they sold it).
Sigmoid function: a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. S-curves are often used to describe the adoption of new technology.
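The most common sigmoid is the logistic function, 1 / (1 + e^(-x)). The Python sketch below uses it to trace a hypothetical technology adoption S-curve. The midpoint year and steepness are made-up parameters, and the simple formula is only suitable for moderate input values.

```python
import math

def sigmoid(x):
    """Logistic function: maps any moderate real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def adoption_share(year, midpoint_year, steepness):
    """Hypothetical adoption curve: share of the market that has adopted by a given year."""
    return sigmoid(steepness * (year - midpoint_year))

# Adoption is slow at first, fastest around the midpoint, then levels off.
curve = [round(adoption_share(y, midpoint_year=5, steepness=1.0), 3) for y in range(11)]
print(curve)
```

The values start near 0, cross 0.5 at the midpoint year, and flatten out near 1, which gives the characteristic "S" shape.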
SIIA: "the voice for the specialized information industry. Our members provide data, content and information that drives the global economy, informs financial networks and connects learners and educators. SIIA unites, defends and promotes our diverse membership. Learn more about our educational and networking opportunities, events and benefits helping you grow your business, your career and the industry at large." https://www.siia.net/about-us/
SKU (Stock Keeping Unit): a unique code consisting of letters and numbers that identifies each distinct product in a store's inventory.
Social Proof (or informational social influence): “a psychological and social phenomenon wherein people copy the actions of others in choosing how to behave in a given situation. The term was coined by Robert Cialdini in his 1984 book Influence: Science and Practice. Social proof is used in ambiguous social situations where people are unable to determine the appropriate mode of behavior and is driven by the assumption that the surrounding people possess more knowledge about the current situation.” - Wikipedia definition: https://en.wikipedia.org/wiki/Social_proof In the context of data products, social proofing means demonstrating the value and success of the product by showcasing the positive experiences and endorsements of satisfied users, which in turn can help convince potential clients to adopt the product.
SOTP (Sum Of The Parts): a valuation approach that values a company by estimating the value of its divisions separately.
Stability Testing: A type of testing that checks if a software application can run consistently over an extended period without crashing or failing.
Stress Testing (software): A testing methodology used to evaluate how a system behaves under extreme conditions, such as high traffic or heavy computational load.
Symbology-enriched entities: In the context of financial data, symbology refers to a system of symbols used to identify particular securities (like stocks or bonds). Symbology-enriched entities would mean data records include these identifying symbols as metadata.
Syndicated market research: Syndicated market research involves collecting data and conducting research studies on specific industries or markets by a third-party organization. The research findings are then made available to multiple subscribers or clients who are interested in understanding market trends and consumer behavior.
Synthetic Data: Artificially generated data that is created rather than obtained by direct measurement, used primarily to train machine learning models where real data may be incomplete or sensitive.
Systematic Fund: Systematic refers to a quantitative (quant) approach to portfolio allocation based on advanced statistical models and machine learning (with varying degrees of human involvement “in the loop” or “on the loop” managing the programmatic decision making).
Tanker (AIS Data): A tanker vessel is a ship designed for the specific purpose of transporting large volumes of liquids, particularly hydrocarbons, over long distances. This includes substances like crude oil, petroleum products, liquefied natural gas (LNG), chemicals, and even wine.
Teardown: A detailed disassembly and analysis of a product to understand its components, manufacturing process, and costs.
Tech Debt (Technical Debt): the cost of reworking previously implemented code that is no longer suitable for business needs, typically created when applying quick solutions in the development process. The replacement of the initial work with sustainably written code reduces the technical debt, improving the efficiency of the product.
Temperature (referring to large language models): A setting in AI language models that adjusts how much randomness the model uses when choosing its next word based on the probabilities it has assigned to the candidates. High temperatures make the AI choose with more variability (more creative); low temperatures make it more confident and predictable (less creative).
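One common way temperature is applied is by dividing the model's raw scores by the temperature before converting them to probabilities with a softmax. The Python sketch below shows that mechanism with made-up scores for three candidate words; it is a simplified illustration, not the implementation of any particular model.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    # Subtracting the max keeps the exponentials numerically stable.
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate next words.
scores = [2.0, 1.0, 0.0]
low_temp = softmax_with_temperature(scores, temperature=0.5)
high_temp = softmax_with_temperature(scores, temperature=2.0)
print([round(p, 2) for p in low_temp])
print([round(p, 2) for p in high_temp])
```

At the low temperature, most of the probability goes to the top-scoring word, so the output is more predictable. At the high temperature, the probabilities are more evenly spread, so less likely words get picked more often.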
Terminals (AIS Data): Facilities where cargo containers are transshipped between different transport vehicles for onward transportation.
Tick history (historical tick data): a record of every trade and quote in a financial market, including the price, volume, and time of each transaction.
Tokenization: segmenting text into smaller units that are analyzed individually. See https://www.coursera.org/articles/tokenization-nlp for more details.
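As a very simple illustration of tokenization, the Python sketch below splits a sentence into word and punctuation tokens with a regular expression. The sample sentence is made up, and this is an assumption-laden toy: large language models typically use subword tokenizers that split text differently.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Revenue grew 12% in Q2.")
print(tokens)
```

Each word, number, and punctuation mark becomes its own token, which can then be counted or analyzed individually.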
Total Addressable Market (TAM): Total addressable market (TAM), also called total available market, is a term that is typically used to refer to the revenue opportunity available for a product or service. TAM helps prioritize business opportunities by serving as a quick metric of a given opportunity's underlying potential. https://en.wikipedia.org/wiki/Total_addressable_market
Trade area: A geographic region where a business draws a majority of its customers from, usually defined by factors such as distance or drive time.
Underfitting: When a model fails to capture the relationships in the training dataset such that it introduces many errors when applied to real data inputs.
Unit Tests: In software development and data processing, these are tests that validate the functionality of specific sections of code, such as functions or methods, in isolation. This also applies to the coding of data pipelines to verify the correctness of a specific section of code.
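Here is a minimal sketch of unit tests using Python's built-in `unittest` module. The function under test, `yoy_growth`, is a hypothetical step in a data pipeline, and the test cases are examples rather than a complete test suite.

```python
import unittest

def yoy_growth(current, prior):
    """Year-over-year growth rate; a hypothetical data pipeline step."""
    if prior == 0:
        raise ValueError("prior period value must be non-zero")
    return current / prior - 1

class TestYoYGrowth(unittest.TestCase):
    def test_growth_is_calculated(self):
        self.assertAlmostEqual(yoy_growth(110.0, 100.0), 0.10)

    def test_zero_prior_is_rejected(self):
        with self.assertRaises(ValueError):
            yoy_growth(110.0, 0.0)

# Run the two tests above and collect the results.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestYoYGrowth)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test checks one behavior of one function in isolation, so a failure points directly at the section of code that broke.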
Value Investors: As a simplified definition, value investors prefer a style of seeking underappreciated, undervalued businesses that will be more appropriately valued in the future.
VBA: Visual Basic for Applications. Programming language embedded in Microsoft applications like Excel.
Vector Embedding: a technique in natural language processing that represents words, phrases, or documents as arrays of numbers to capture their semantic meanings and relationships.
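To show how arrays of numbers can capture relationships between words, here is a minimal Python sketch using cosine similarity. The three-number vectors are made up by hand for illustration; real embeddings come from a trained model and usually have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = similar, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, NOT output from a real embedding model.
embeddings = {
    "revenue": [0.9, 0.1, 0.0],
    "sales": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.2, 0.9],
}

print(round(cosine_similarity(embeddings["revenue"], embeddings["sales"]), 2))
print(round(cosine_similarity(embeddings["revenue"], embeddings["weather"]), 2))
```

In this toy setup, "revenue" and "sales" point in nearly the same direction (similarity close to 1), while "revenue" and "weather" are close to unrelated (similarity close to 0).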
Vehicle registration data: Data on the registration of motor vehicles, often including details like make, model, year, and owner demographics.
Waterfall versus Agile Methodologies: “Agile project management is an incremental and iterative practice, while waterfall is a linear and sequential project management practice.” - Atlassian definition: https://www.atlassian.com/agile/project-management/project-management-intro
Web-mined data: Data collected and extracted from the web. This could be from websites, social media platforms, forums, or other online sources. It is used for various purposes, including market analysis and competitive intelligence.
Web Mining (or Web Scraping): The process of using automated software to extract large amounts of data from websites.
Web Mining Governance: The set of policies, procedures, and standards that guide the ethical and legal collection, analysis, and use of data from websites.
Whales and minnows analysis: This analysis categorizes customers into two groups: "whales" (top spenders or users) and "minnows" (low spenders or users). It helps understand the distribution of revenue or usage among customers and assess the impact of these groups on business performance.
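A simple version of a whales and minnows analysis can be sketched in Python. The customer IDs and spend figures are made up, and the cutoff of the top 20% of customers as "whales" is an assumption; in practice, the threshold is a judgment call.

```python
def whale_revenue_share(spend_by_customer, whale_fraction=0.2):
    """Share of total spend that comes from the top whale_fraction of customers."""
    spends = sorted(spend_by_customer.values(), reverse=True)
    # Always treat at least one customer as a whale.
    n_whales = max(1, round(len(spends) * whale_fraction))
    return sum(spends[:n_whales]) / sum(spends)

# Hypothetical annual spend per customer.
spend_by_customer = {
    "c01": 500.0, "c02": 300.0, "c03": 50.0, "c04": 40.0, "c05": 30.0,
    "c06": 25.0, "c07": 20.0, "c08": 15.0, "c09": 10.0, "c10": 10.0,
}

share = whale_revenue_share(spend_by_customer)
print(f"Top 20% of customers account for {share:.0%} of spend")
```

In this example, the two largest customers account for 80% of total spend, so losing a single whale would matter far more to business performance than losing several minnows.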
Wireframe: A wireframe is a basic visual representation of a web page, app, or product layout, typically used during the design process to plan and communicate the structure and functionality of the product. Wireframes can be simple sketches or more detailed digital mockups.
Zero-based Budgeting: A budgeting method where all expenses must be justified and approved for each new period, starting from a "zero base," with no reference to past expenditures.
Wow, that’s a lot of jargon, and I’m sure more will be demystified in future newsletter entries! I will keep this list updated as I uncover more jargon in future Data Score Newsletter entries.