The Role of Machine Learning in Portfolio Optimization

Introduction:

The world of finance has long been dominated by traditional investment strategies, often based on rigid algorithms and manual data analysis. However, the advent of machine learning (ML) has revolutionized the industry, especially in portfolio optimization. By combining vast amounts of data with advanced algorithms, machine learning offers the ability to make smarter, faster, and more accurate investment decisions. In this article, I will explore how machine learning in portfolio optimization is reshaping the landscape of investment management, its benefits, challenges, and real-world applications.

Understanding Portfolio Optimization

Before diving into the role of machine learning, it’s essential to understand what portfolio optimization is. At its core, portfolio optimization aims to find the ideal balance between risk and return for an investment portfolio. The goal is to maximize returns while minimizing risk, often using mathematical models to achieve this balance.

Traditional Portfolio Optimization

Traditionally, portfolio optimization has relied on models such as Modern Portfolio Theory (MPT), which emphasizes diversification to reduce risk. The efficient frontier, a concept introduced by Harry Markowitz, helps investors balance risk and return by optimizing the allocation of assets. While these models have been instrumental in portfolio management, they often fall short in the face of complex market conditions and rapidly changing financial environments.
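
To make the mean-variance idea concrete, here is a minimal sketch in Python. The expected returns, covariance matrix, and target return below are hypothetical illustrative values, not data from this article; solving the same problem for a range of target returns traces out the efficient frontier.

# Minimal Markowitz-style sketch: find the minimum-variance weights that hit a
# target return. All numbers below are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import minimize

mu = np.array([0.08, 0.05, 0.03])            # hypothetical expected annual returns
cov = np.array([[0.10, 0.02, 0.01],
                [0.02, 0.06, 0.01],
                [0.01, 0.01, 0.03]])          # hypothetical covariance matrix
target_return = 0.06

def portfolio_variance(w):
    return w @ cov @ w

constraints = [
    {"type": "eq", "fun": lambda w: w.sum() - 1.0},            # weights sum to 1
    {"type": "eq", "fun": lambda w: w @ mu - target_return},   # hit the target return
]
bounds = [(0.0, 1.0)] * len(mu)                                # long-only portfolio

result = minimize(portfolio_variance, x0=np.full(len(mu), 1 / len(mu)),
                  bounds=bounds, constraints=constraints)
print("Weights on the efficient frontier for a 6% target:", result.x.round(3))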

The Need for Machine Learning

The limitation of traditional models is that they rely on static assumptions and human intervention. Machine learning offers a solution by enabling real-time data processing and adaptive decision-making. It can continuously learn from new market data and adjust investment strategies accordingly.

The Basics of Machine Learning

To fully appreciate how machine learning enhances portfolio optimization, we must first understand what machine learning is and how it works.

What is Machine Learning?

Machine learning is a subset of artificial intelligence (AI) that focuses on building systems that can learn from data, improve over time, and make predictions without being explicitly programmed. It involves the use of algorithms to analyze large sets of data, identify patterns, and make decisions based on that analysis.

Types of Machine Learning

There are three primary types of machine learning:

  • Supervised Learning: The model is trained using labeled data and learns to predict outcomes based on that data.
  • Unsupervised Learning: The model identifies hidden patterns in data without any prior labels.
  • Reinforcement Learning: The model learns by interacting with the environment and receiving feedback based on its actions.

Why Machine Learning is Crucial in Finance

In finance, machine learning allows for more accurate forecasting, more effective risk management, and a better understanding of market trends. The ability to process massive amounts of data in real time gives investors a competitive edge and helps optimize portfolios with precision.

Applications of Machine Learning in Portfolio Optimization

Machine learning is already making waves in portfolio optimization, bringing a wealth of benefits to asset managers and investors alike. Here’s how ML is applied:

Risk Assessment and Management: One of the most powerful applications of machine learning is in risk management. Traditional risk models are often based on historical data and static assumptions. In contrast, machine learning can process vast amounts of real-time data and predict potential risks with much greater accuracy. This enables portfolio managers to anticipate market shifts and make adjustments before risks materialize.

For example, ML algorithms can analyze patterns in financial markets to forecast volatility and adjust a portfolio’s exposure to different asset classes accordingly.

Asset Allocation: Machine learning is used to enhance asset allocation strategies. By analyzing historical data, economic indicators, and real-time market information, ML models can recommend optimal allocations for different asset types—equities, bonds, commodities, and more.

The algorithms continuously adapt to changing market conditions, ensuring that the portfolio stays aligned with the investor’s risk tolerance and objectives.

Predictive Analytics for Returns: Machine learning is also used to predict stock returns and market trends. By analyzing historical stock prices, economic data, and financial indicators, ML algorithms can identify patterns and correlations that traditional models may overlook. This predictive capability allows for more informed decision-making when selecting assets for a portfolio.

Moreover, natural language processing (NLP) applied to financial news allows machine learning algorithms to analyze unstructured data, such as news articles, earnings reports, and market sentiment, further improving the accuracy of predictions.
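
As a hedged sketch of what return prediction can look like in code, the example below fits a gradient-boosting model to lagged returns. The data is synthetic noise standing in for real market data, and the lag features and model choice are illustrative, so no real signal should be expected here.

# Illustrative sketch: predict the next day's return from three lagged returns.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
returns = rng.normal(loc=0.0, scale=0.01, size=1000)   # synthetic daily returns

# Lagged features: r[t-1], r[t-2], r[t-3] are used to predict r[t]
X = np.column_stack([returns[2:-1], returns[1:-2], returns[:-3]])
y = returns[3:]

# Keep the time order intact when splitting (no shuffling)
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

model = GradientBoostingRegressor().fit(X_train, y_train)
print("Out-of-sample R^2:", model.score(X_test, y_test))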

Rebalancing Portfolios: Portfolio rebalancing involves adjusting the composition of assets to maintain a desired level of risk and return. ML algorithms help automate this process by continuously monitoring market conditions and portfolio performance, making rebalancing decisions in real time based on pre-defined rules or goals.
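
The mechanics of rule-based rebalancing are simple; the sketch below uses hypothetical target weights and a drift threshold. An ML-driven system would replace the fixed rule with model-driven signals, but the step of computing and executing the trades stays the same.

# Minimal rule-based rebalancing sketch: trade back to target weights whenever
# any asset drifts more than a threshold. Weights and threshold are hypothetical.
import numpy as np

target_weights = np.array([0.60, 0.30, 0.10])    # e.g., equities, bonds, commodities
current_weights = np.array([0.68, 0.24, 0.08])   # weights after market moves
threshold = 0.05                                 # rebalance if any drift exceeds 5 pp

drift = current_weights - target_weights
if np.any(np.abs(drift) > threshold):
    trades = target_weights - current_weights    # fraction of portfolio to buy/sell
    print("Rebalance needed, trades as fraction of portfolio:", trades.round(3))
else:
    print("Within tolerance, no rebalancing needed")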

Portfolio Customization: Machine learning also enables customized portfolios tailored to individual investors. By analyzing an investor’s preferences, risk tolerance, and financial goals, ML models can create portfolios that are aligned with their unique requirements.

Benefits of Machine Learning in Portfolio Optimization

Machine learning’s impact on portfolio optimization is profound, offering several benefits that enhance both performance and efficiency:

Improved Decision-Making: Machine learning can process large datasets quickly and identify patterns that would take a human analyst years to uncover. This leads to more informed and accurate investment decisions.

Handling Large Datasets: Financial markets generate massive amounts of data every second. Machine learning can efficiently process and analyze this data, making it possible for portfolio managers to make decisions based on real-time information rather than relying on outdated data.

Real-Time Analysis: ML models can provide real-time analysis, which is crucial for staying ahead of market fluctuations. This enables investors to respond to changes quickly and adjust their portfolios accordingly.

Better Risk-Return Tradeoff: Machine learning’s ability to dynamically adjust portfolio allocations based on changing conditions ensures a better risk-return tradeoff. This can result in portfolios that achieve higher returns without taking on excessive risk.

Challenges and Limitations of Machine Learning in Portfolio Optimization

Despite its many benefits, machine learning in portfolio optimization is not without its challenges:

Data Quality and Availability: Machine learning algorithms rely heavily on high-quality data. The availability of clean, relevant data is essential for the accuracy of predictions. Inaccurate or incomplete data can lead to poor decision-making and losses.

Overfitting and Model Accuracy: One of the risks of machine learning models is overfitting, where a model is too closely aligned with historical data, making it less effective in predicting future trends. This is a critical issue in portfolio optimization, as market conditions can change rapidly.
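
One common guard against fitting too closely to history is walk-forward (time-series) cross-validation, where the model is only ever evaluated on data that comes after its training window. The sketch below uses synthetic data and an arbitrary model purely to illustrate the mechanic.

# Walk-forward validation sketch with synthetic data (illustrative only)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))   # synthetic features (e.g., factor exposures)
y = rng.normal(size=500)        # synthetic next-period returns

scores = cross_val_score(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=5), scoring="r2")
print("Walk-forward R^2 per fold:", scores.round(3))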

Complexity of Algorithms: The complexity of machine learning models requires specialized knowledge to implement and interpret. While the technology has made significant advances, the need for skilled professionals to manage these models is still high.

Market Uncertainty: Machine learning models are built on historical data, and while they are excellent at predicting patterns based on the past, they may struggle to adapt to sudden, unforeseen market changes or crises.

Real-World Examples of Machine Learning in Portfolio Optimization

Machine learning has already found practical applications in the investment world:

Hedge Funds and Institutional Investors: Many hedge funds and institutional investors have adopted machine learning models to optimize their portfolios. For example, firms like Two Sigma and Renaissance Technologies use ML algorithms to manage billions of dollars in assets.

Retail Investors and Robo-Advisors: Retail investors benefit from robo-advisors powered by machine learning. These platforms, such as Betterment and Wealthfront, use algorithms to create and manage personalized portfolios with little human intervention.

Innovative ML Models: Several innovative ML models are being used for portfolio optimization, such as reinforcement learning algorithms that continuously adapt and learn from new data.

The Future of Machine Learning in Portfolio Optimization

The future of machine learning in portfolio optimization is bright. We can expect advancements in AI technologies, including better predictive models, integration with big data, and real-time adaptation to changing market conditions. Successful AI investment strategies will become more precise, making it possible for investors to achieve their financial goals with greater efficiency.

Trends and Innovations: Expect the rise of AI in risk management tools that will integrate more advanced data sources, including real-time economic indicators and global news feeds. These innovations will provide investors with even greater insights into their portfolios and the market.

Integration with Other Technologies: The future will see further integration of machine learning with technologies such as blockchain and quantum computing. These advancements will help optimize portfolios even more efficiently, enabling a level of precision that we cannot yet fully predict.

Conclusion

Machine learning is fundamentally changing the landscape of portfolio optimization. From predictive analytics for returns to more efficient risk management, machine learning is driving smarter investment decisions. While challenges remain, the potential benefits—faster, more accurate predictions, and better risk-adjusted returns—are immense. As machine learning continues to evolve, its role in investment management will only grow, offering investors new opportunities for success.

Anatomy of a Parquet File

In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages:

  • Faster query execution when only a subset of columns is being processed
  • Quick calculation of statistics across all data
  • Reduced storage volume thanks to efficient compression

When combined with storage frameworks like Delta Lake or Apache Iceberg, it seamlessly integrates with query engines (e.g., Trino) and data warehouse compute clusters (e.g., Snowflake, BigQuery). In this article, the content of a Parquet file is dissected using mainly standard Python tools to better understand its structure and how it contributes to this level of performance.

Writing Parquet file(s)

To produce Parquet files, we use PyArrow, a Python binding for Apache Arrow that stores dataframes in memory in columnar format. PyArrow allows fine-grained parameter tuning when writing the file. This makes PyArrow ideal for Parquet manipulation (one can also simply use Pandas).

# generator.py

import pyarrow as pa
import pyarrow.parquet as pq
from faker import Faker

fake = Faker()
Faker.seed(12345)
num_records = 100

# Generate fake data
names = [fake.name() for _ in range(num_records)]
addresses = [fake.address().replace("\n", ", ") for _ in range(num_records)]
birth_dates = [
    fake.date_of_birth(minimum_age=67, maximum_age=75) for _ in range(num_records)
]
cities = [addr.split(", ")[1] for addr in addresses]
birth_years = [date.year for date in birth_dates]

# Cast the data to the Arrow format
name_array = pa.array(names, type=pa.string())
address_array = pa.array(addresses, type=pa.string())
birth_date_array = pa.array(birth_dates, type=pa.date32())
city_array = pa.array(cities, type=pa.string())
birth_year_array = pa.array(birth_years, type=pa.int32())

# Create schema with non-nullable fields
schema = pa.schema(
    [
        pa.field("name", pa.string(), nullable=False),
        pa.field("address", pa.string(), nullable=False),
        pa.field("date_of_birth", pa.date32(), nullable=False),
        pa.field("city", pa.string(), nullable=False),
        pa.field("birth_year", pa.int32(), nullable=False),
    ]
)

table = pa.Table.from_arrays(
    [name_array, address_array, birth_date_array, city_array, birth_year_array],
    schema=schema,
)

print(table)
pyarrow.Table
name: string not null
address: string not null
date_of_birth: date32[day] not null
city: string not null
birth_year: int32 not null
----
name: [["Adam Bryan","Jacob Lee","Candice Martinez","Justin Thompson","Heather Rubio"]]
address: [["822 Jennifer Field Suite 507, Anthonyhaven, UT 98088","292 Garcia Mall, Lake Belindafurt, IN 69129","31738 Jonathan Mews Apt. 024, East Tammiestad, ND 45323","00716 Kristina Trail Suite 381, Howelltown, SC 64961","351 Christopher Expressway Suite 332, West Edward, CO 68607"]]
date_of_birth: [[1955-06-03,1950-06-24,1955-01-29,1957-02-18,1956-09-04]]
city: [["Anthonyhaven","Lake Belindafurt","East Tammiestad","Howelltown","West Edward"]]
birth_year: [[1955,1950,1955,1957,1956]]

The output clearly reflects column-oriented storage, unlike Pandas, which usually displays a traditional “row-wise” table.

How is a Parquet file stored?

Parquet files are generally stored in inexpensive object storage services like Amazon S3 (AWS) or GCS (GCP) so that they are easily accessible to data processing pipelines. These files are usually organized with a partitioning strategy by leveraging directory structures:

# generator.py

num_records = 100

# ...

# Writing the parquet files to disk
pq.write_to_dataset(
    table,
    root_path='dataset',
    partition_cols=['birth_year', 'city']
)

If birth_year and city columns are defined as partitioning keys, PyArrow creates such a tree structure in the directory dataset:

dataset/
├─ birth_year=1949/
├─ birth_year=1950/
│  ├─ city=Aaronbury/
│  │  ├─ 828d313a915a43559f3111ee8d8e6c1a-0.parquet
│  │  ├─ …
│  ├─ city=Alicialand/
│  ├─ …
├─ birth_year=1951/
│  ├─ …

This strategy enables partition pruning: when a query filters on these columns, the engine can use folder names to read only the necessary files. This is why the partitioning strategy is crucial for limiting latency, I/O, and compute resources when handling large volumes of data (as has been the case for decades with traditional relational databases).
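
With PyArrow, the same pruning can be expressed directly through the dataset API by pushing the filter down. A minimal sketch (the path matches the dataset written above; the filter syntax is standard PyArrow):

# query_pyarrow.py - read only the birth_year=1949 partitions via a pushed-down filter
import pyarrow.dataset as pds

dataset = pds.dataset("dataset", format="parquet", partitioning="hive")
table_1949 = dataset.to_table(filter=pds.field("birth_year") == 1949)
print(table_1949.num_rows)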

The pruning effect can be easily verified by counting the files opened by a Python script that filters the birth year:

# query.py
import duckdb

duckdb.sql(
    """
    SELECT * 
    FROM read_parquet('dataset/*/*/*.parquet', hive_partitioning = true)
    where birth_year = 1949
    """
).show()
> strace -e trace=open,openat,read -f python query.py 2>&1 | grep "dataset/.*\.parquet"

[pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3

Only 23 files are read out of 100.

Reading a raw Parquet file

Let’s decode a raw Parquet file without specialized libraries. For simplicity, the dataset is dumped into a single file without compression or encoding.

# generator.py

# ...

pq.write_table(
    table,
    "dataset.parquet",
    use_dictionary=False,
    compression="NONE",
    write_statistics=True,
    column_encoding=None,
)

The first thing to know is that the binary file is framed by 4 bytes whose ASCII representation is “PAR1”. The file is corrupted if this is not the case.

# reader.py

with open("dataset.parquet", "rb") as file:
    parquet_data = file.read()

assert parquet_data[:4] == b"PAR1", "Not a valid parquet file"
assert parquet_data[-4:] == b"PAR1", "File footer is corrupted"

As indicated in the documentation, the file is divided into two parts: the “row groups” containing the actual data, and the footer containing metadata.

The footer

The size of the footer is indicated in the 4 bytes preceding the end marker, as an unsigned integer written in “little endian” format (denoted by the "<I" format string passed to the unpack function below).

# reader.py

import struct

# ...

footer_length = struct.unpack("
Footer size in bytes: 1088

The footer information is encoded in a cross-language serialization format called Apache Thrift. Using a human-readable but verbose format like JSON and then translating it into binary would be less efficient in terms of memory usage. With Thrift, one can declare data structures as follows:

struct Customer {
	1: required string name,
	2: optional i16 birthYear,
	3: optional list<string> interests
}

On the basis of this declaration, Thrift can generate Python code to decode byte strings with such a data structure (it also generates code to perform the encoding part). The thrift file containing all the data structures used in a Parquet file can be downloaded from the Apache parquet-format repository. After installing the thrift compiler, let’s run:

thrift -r --gen py parquet.thrift

The generated Python code is placed in the “gen-py” folder. The footer’s data structure is represented by the FileMetaData class – a Python class automatically generated from the Thrift schema. Using Thrift’s Python utilities, binary data is parsed and populated into an instance of this FileMetaData class.

# reader.py

import sys

# ...

# Add the generated classes to the python path
sys.path.append("gen-py")
from parquet.ttypes import FileMetaData, PageHeader
from thrift.transport import TTransport
from thrift.protocol import TCompactProtocol

def read_thrift(data, thrift_instance):
    """
    Read a Thrift object from a binary buffer.
    Returns the Thrift object and the number of bytes read.
    """
    transport = TTransport.TMemoryBuffer(data)
    protocol = TCompactProtocol.TCompactProtocol(transport)
    thrift_instance.read(protocol)
    return thrift_instance, transport._buffer.tell()

# The number of bytes read is not used for now
file_metadata_thrift, _ = read_thrift(footer_data, FileMetaData())

print(f"Number of rows in the whole file: {file_metadata_thrift.num_rows}")
print(f"Number of row groups: {len(file_metadata_thrift.row_groups)}")

Number of rows in the whole file: 100
Number of row groups: 1

The footer contains extensive information about the file’s structure and content. For instance, it accurately tracks the number of rows in the generated dataframe. These rows are all contained within a single “row group.” But what is a “row group?”

Row groups

Unlike purely column-oriented formats, Parquet employs a hybrid approach. Before writing column blocks, the dataframe is first split row-wise into row groups (the Parquet file we generated is too small to be split into multiple row groups).

This hybrid structure offers several advantages:

  • Parquet calculates statistics (such as min/max values) for each column within each row group. These statistics are crucial for query optimization, allowing query engines to skip entire row groups that don’t match filtering criteria. For example, if a query filters for birth_year > 1955 and a row group’s maximum birth year is 1954, the engine can efficiently skip that entire data section. This optimization is called “predicate pushdown”. Parquet also stores other useful statistics like distinct value counts and null counts.

# reader.py
# ...

first_row_group = file_metadata_thrift.row_groups[0]
birth_year_column = first_row_group.columns[4]

min_stat_bytes = birth_year_column.meta_data.statistics.min
max_stat_bytes = birth_year_column.meta_data.statistics.max

min_year = struct.unpack("
The birth year range is between 1949 and 1958
  • Row groups enable parallel processing of data (particularly valuable for frameworks like Apache Spark). The size of these row groups can be configured based on the available computing resources (via the row_group_size parameter of write_table in PyArrow):
# generator.py

# ...

pq.write_table(
    table,
    "dataset.parquet",
    row_group_size=100,
)

# /!\ Keep the default value of "row_group_size" for the next parts
  • Even if this is not the primary objective of a column format, Parquet’s hybrid structure maintains reasonable performance when reconstructing complete rows. Without row groups, rebuilding an entire row might require scanning the entirety of each column, which would be extremely inefficient for large files.

Data Pages

The smallest substructure of a Parquet file is the page. It contains a sequence of values from the same column and, therefore, of the same type. The choice of page size is the result of a trade-off:

  • Larger pages mean less metadata to store and read, which is optimal for queries with minimal filtering.
  • Smaller pages reduce the amount of unnecessary data read, which is better when queries target small, scattered data ranges.
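
The page size is controlled at write time. In PyArrow this is the data_page_size parameter of write_table, which acts as an approximate target in bytes rather than a hard limit; the output filename below is arbitrary:

# generator.py

# ...

# Target roughly 64 KiB per data page (an approximate threshold, not an exact limit)
pq.write_table(
    table,
    "dataset_small_pages.parquet",
    data_page_size=64 * 1024,
)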

Now let’s decode the contents of the first page of the column dedicated to addresses. Its location is given by the data_page_offset attribute of the corresponding ColumnMetaData in the footer. Each page is preceded by a Thrift PageHeader object containing some metadata; the PageHeader class can also be found in the gen-py directory.

💡 Between the PageHeader and the actual values contained within the page, there may be a few bytes dedicated to implementing the Dremel format, which allows encoding nested data structures. Since our data has a regular tabular format and the values are not nullable, these bytes are skipped when writing the file (https://parquet.apache.org/docs/file-format/data-pages/).

# reader.py
# ...

address_column = first_row_group.columns[1]
column_start = address_column.meta_data.data_page_offset
column_end = column_start + address_column.meta_data.total_compressed_size
column_content = parquet_data[column_start:column_end]

page_thrift, page_header_size = read_thrift(column_content, PageHeader())
page_content = column_content[
    page_header_size : (page_header_size + page_thrift.compressed_page_size)
]
print(column_content[:100])
b'6\x00\x00\x00481 Mata Squares Suite 260, Lake Rachelville, KY 874642\x00\x00\x00671 Barker Crossing Suite 390, Mooreto'

The generated values finally appear, in plain text and not encoded (as specified when writing the Parquet file). However, to optimize the columnar format, it is recommended to use one of the following encoding algorithms: dictionary encoding, run length encoding (RLE), or delta encoding (the latter being reserved for int32 and int64 types), followed by compression using gzip or snappy (available codecs are listed here). Since encoded pages contain similar values (all addresses, all decimal numbers, etc.), compression ratios can be particularly advantageous.
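
For comparison, writing the same table with dictionary encoding and Snappy compression only requires changing the write parameters (the output filename below is arbitrary). The rest of this walkthrough keeps the unencoded, uncompressed file so the raw bytes remain readable.

# generator.py

# ...

# Same table, but with dictionary encoding and Snappy compression enabled
pq.write_table(
    table,
    "dataset_compressed.parquet",
    use_dictionary=True,
    compression="snappy",
)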

As documented in the specification, when character strings (BYTE_ARRAY) are not encoded, each value is preceded by its size represented as a 4-byte little-endian integer. This can be observed in the previous output: the first four bytes, b'6\x00\x00\x00', encode the value 54, which is the length of the first address string.

To read all the values (for example, the first 10), the loop is rather simple:

idx = 0
for _ in range(10):
    str_size = struct.unpack("
481 Mata Squares Suite 260, Lake Rachelville, KY 87464
671 Barker Crossing Suite 390, Mooretown, MI 21488
62459 Jordan Knoll Apt. 970, Emilyfort, DC 80068
948 Victor Square Apt. 753, Braybury, RI 67113
365 Edward Place Apt. 162, Calebborough, AL 13037
894 Reed Lock, New Davidmouth, NV 84612
24082 Allison Squares Suite 345, North Sharonberg, WY 97642
00266 Johnson Drives, South Lori, MI 98513
15255 Kelly Plains, Richardmouth, GA 33438
260 Thomas Glens, Port Gabriela, OH 96758

And there we have it! We have successfully recreated, in a very simple way, how a specialized library would read a Parquet file. By understanding its building blocks including headers, footers, row groups, and data pages, we can better appreciate how features like predicate pushdown and partition pruning deliver such impressive performance benefits in data-intensive environments. I am convinced that knowing how Parquet works under the hood helps make better decisions about storage strategies, compression choices, and performance optimization.

All the code used in this article is available on my GitHub repository, where you can explore more examples and experiment with different Parquet file configurations.

Whether you are building data pipelines, optimizing query performance, or simply curious about data storage formats, I hope this deep dive into Parquet’s inner structures has provided valuable insights for your Data Engineering journey.

All images are by the author.

How to Build a RAG System Using LangChain, Ragas, and Neptune

LangChain provides composable building blocks to create LLM-powered applications, making it an ideal framework for building RAG systems. Developers can integrate components and APIs of different vendors into coherent applications.

Evaluating a RAG system’s performance is crucial to ensure high-quality responses and robustness. The Ragas framework offers a large number of RAG-specific metrics as well as capabilities for generating dedicated evaluation datasets.

neptune.ai makes it easy for RAG developers to track evaluation metrics and metadata, enabling them to analyze and compare different system configurations. The experiment tracker can handle large amounts of data, making it well-suited for quick iteration and extensive evaluations of LLM-based applications.

Imagine asking a chat assistant about LLMOps only to receive outdated advice or irrelevant best practices. While LLMs are powerful, they rely solely on their pre-trained knowledge and lack the ability to fetch current data.

This is where Retrieval-Augmented Generation (RAG) comes in. RAG combines the generative power of LLMs with external data retrieval, enabling the assistant to access and use real-time information. For example, instead of outdated answers, the chat assistant could pull insights from Neptune’s LLMOps article collection to deliver accurate and contextually relevant responses.

In this guide, we’ll show you how to build a RAG system using the LangChain framework, evaluate its performance using Ragas, and track your experiments with neptune.ai. Along the way, you’ll learn to create a baseline RAG system, refine it using Ragas metrics, and enhance your workflow with Neptune’s experiment tracking.

Part 1: Building a baseline RAG system with LangChain

In the first part of this guide, we’ll use LangChain to build a RAG system for the blog posts in the LLMOps category on Neptune’s blog.

Overview of a baseline RAG system. A user’s question is used as the query to retrieve relevant documents from a database. The documents returned by the search are added to the prompt that is passed to the LLM together with the user’s question. The LLM uses the information in the prompt to generate an answer. | Source

What is LangChain?

LangChain offers a collection of open-source building blocks, including memory management, data loaders for various sources, and integrations with vector databases—all the essential components of a RAG system.

LangChain stands out among the frameworks for building RAG systems for its composability and versatility. Developers can combine and connect these building blocks using a coherent Python API, allowing them to focus on creating LLM applications rather than dealing with the nitty-gritty of API specifications and data transformations.

Overview of the categories of building blocks provided by LangChain. The framework includes interfaces to models and vector stores, document loaders, and text processing utilities like output parsers and text splitters. Further, LangChain offers features for prompt engineering, like templates and example selectors. The framework also contains a collection of tools that can be called by LLM agents. | Source

Step 1: Setting up

We’ll begin by installing the necessary dependencies (I used Python 3.11.4 on Linux):

pip install -qU langchain-core==0.1.45 langchain-openai==0.0.6 langchain-chroma==0.1.4 ragas==0.2.8 neptune==1.13.0 pandas==2.2.3 datasets==3.2.0

For this example, we’ll use OpenAI’s models and configure the API key. To access OpenAI models, you’ll need to create an OpenAI account and generate an API key. Our usage in this blog should be well within the free-tier limits.

Once we have obtained our API key, we’ll set it as an environment variable so that LangChain’s OpenAI building blocks can access it:

import os
os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"

You can also use any of LangChain’s other embedding and chat models, including local models provided by Ollama. Thanks to the compositional structure of LangChain, all it takes is replacing OpenAIEmbeddings and ChatOpenAI in the code with the respective alternative building blocks.
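
For example, a local setup could look like the hedged sketch below, assuming the langchain-ollama integration package is installed and an Ollama server is running locally with the referenced models pulled (the model names are illustrative):

# Hypothetical drop-in replacements for the OpenAI building blocks
from langchain_ollama import ChatOllama, OllamaEmbeddings

llm = ChatOllama(model="llama3.1")
embeddings = OllamaEmbeddings(model="nomic-embed-text")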

Step 2: Load and parse the raw data

Source data for RAG systems is often unstructured documents. Before we can use it effectively, we’ll need to process and parse it into a structured format.

Fetch the source data

Since we’re working with a blog, we’ll use LangChain’s WebBaseLoader to load data from Neptune’s blog. WebBaseLoader reads raw webpage content, capturing text and structure, such as headings.

The web pages are loaded as LangChain documents, which include the page content as a string and metadata associated with that document, e.g., the source page’s URL.

In this example, we select 3 blog posts to create the chat assistant’s knowledge base:

import bs4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=[
        "https://neptune.ai/blog/llm-hallucinations",
        "https://neptune.ai/blog/llmops",
        "https://neptune.ai/blog/llm-guardrails"
    ],
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(name=["p", "h2", "h3", "h4"])
    ),
)
docs = loader.load()

Split the data into smaller chunks

To meet the embedding model’s token limit and improve retrieval performance, we’ll split the long blog posts into smaller chunks.

The chunk size is a trade-off between specificity (capturing detailed information within each chunk) and efficiency (reducing the total number of resulting chunks). By overlapping chunks, we mitigate the loss of critical information that occurs when a self-contained sequence of the source text is split into two incoherent chunks.

Visualization of the chunks created from the article LLM Hallucinations 101. The text is split into four chunks highlighted in blue, lime green, dark orange, and dark yellow. The overlaps between chunks are marked in olive green. | Created with ChunkViz

For generic text, LangChain recommends the RecursiveCharacterTextSplitter. We set the chunk size to a maximum of 1,000 characters with an overlap of 200 characters. We also filter out unnecessary parts of the documents, such as the header, footer, and any promotional content:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

header_footer_keywords = ["peers about your research", "deepsense", "ReSpo", "Was the article useful?", "related articles", "All rights reserved"]

splits = []
for s in text_splitter.split_documents(docs):
    if not any(kw in s.page_content for kw in header_footer_keywords):
        splits.append(s)

len(splits)

Step 3: Set up the vector store

Vector stores are specialized data stores that enable indexing and retrieving information based on vector representations.

Choose a vector store

LangChain supports many vector stores. In this example, we’ll use Chroma, an open-source vector store specifically designed for LLM applications.

By default, Chroma stores the collection in memory; once the session ends, all the data (embeddings and indices) are lost. While this is fine for our small example, in production, you’ll want to persist the database to disk by passing the persist_directory keyword argument when initializing Chroma.
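
For reference, a persistent setup mirrors the initialization used below in this step, with one extra keyword argument; the directory path here is arbitrary:

# Sketch: persist the Chroma collection to disk instead of keeping it in memory
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",   # arbitrary local path
)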

Specify which embedding model to use

Embedding models convert chunks into vectors. There are many embedding models to choose from. The Massive Text Embedding Benchmark (MTEB) leaderboard is a great resource for selecting one based on model size, embedding dimensions, and performance requirements.

The MTEB Leaderboard provides a standardized comparison of embedding models across diverse tasks and datasets, including retrieval, clustering, classification, and reranking. The leaderboard provides a clear comparison of model performance and makes selecting embedding models easier through filters and ranking.

For our example LLMOps RAG system, we’ll use OpenAIEmbeddings with its default model. (At the time of writing, this was text-embedding-ada-002.)

Create a retriever object from the vector store

A retriever performs semantic searches to find the most relevant pieces of information based on a user query. For this baseline example, we’ll configure the retriever to return only the top result, which will be used as context for the LLM to generate an answer.

Initializing the vector store for our RAG system and instantiating a retriever takes only two lines of code:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
   documents=splits,
   embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

In the last line, we have specified through search_kwargs that the retriever only returns the most similar document (top-k retrieval with k = 1).

Step 4: Bring it all together

Now that we’ve set up a vector database with the source data and initialized the retriever to return the most relevant chunk given a query, we’ll combine it with an LLM to complete our baseline RAG chain.

Define a prompt template

We need to set a prompt to guide the LLM in responding. This prompt should tell the model to use the retrieved context to answer the query.

We’ll use a standard RAG prompt template that specifically asks the LLM to use the provided context (the retrieved chunk) to answer the user query concisely:

from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

Create the full RAG chain

We’ll use the create_stuff_documents_chain utility function to set up the generative part of our RAG chain. It combines an instantiated LLM and a prompt template with a {context} placeholder into a chain that takes a set of documents as its input, which are “stuffed” into the prompt before it is fed into the LLM. In our case, the LLM is OpenAI’s GPT-4o-mini.

from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain

llm = ChatOpenAI(model="gpt-4o-mini")
question_answer_chain = create_stuff_documents_chain(llm, prompt)

Then, we can use the create_retrieval_chain utility function to finally instantiate our complete RAG chain: 

from langchain.chains import create_retrieval_chain

rag_chain = create_retrieval_chain(retriever, question_answer_chain)

Get an output from the RAG chain

To see how our system works, we can run a first inference call. We’ll send a query to the chain that we know can be answered using the contents of one of the blog posts:

response = rag_chain.invoke({"input": "What are DOM-based attacks?"})
print(response["answer"])

The response is a dictionary that contains “input,” “context,” and “answer” keys:

{
  "input": 'What are DOM-based attacks?',
  'context': [Document(metadata={'source': 'https://neptune.ai/blog/llm-guardrails'}, page_content='By prompting the application to pretend to be a chatbot that “can do anything” and is not bound by any restrictions, users were able to manipulate ChatGPT to provide responses to questions it would usually decline to answer.Although “prompt injection” and “jailbreaking” are often used interchangeably in the community, they refer to distinct vulnerabilities that must be handled with different methods.DOM-based attacksDOM-based attacks are an extension of the traditional prompt injection attacks. The key idea is to feed a harmful instruction into the system by hiding it within a website’s code.Consider a scenario where your program crawls websites and feeds the raw HTML to an LLM on a daily basis. The rendered page looks normal to you, with no obvious signs of anything wrong. Yet, an attacker can hide a malicious key phrase by matching its color to the background or adding it in parts of the HTML code that are not rendered, such as a style Tag.While invisible to human eyes, the LLM will')],
  "answer": "DOM-based attacks are a type of vulnerability where harmful instructions are embedded within a website's code, often hidden from view. Attackers can conceal malicious content by matching its color to the background or placing it in non-rendered sections of the HTML, like style tags. This allows the malicious code to be executed by a system, such as a language model, when it processes the website's HTML."}

We see that the retriever appropriately identified a snippet from the LLM Guardrails: Secure and Controllable Deployment article as the most relevant chunk.

Define a prediction function

Now that we have a fully functioning end-to-end RAG chain, we can create a convenience function that enables us to query our RAG chain. It takes a RAG chain and a query and returns the chain’s response. We’ll also implement the option to pass just the stuff documents chain and provide the list of context documents via an additional input parameter. This will come in handy when evaluating the different parts of our RAG system.

Here’s what this function looks like:

from langchain_core.runnables.base import Runnable
from langchain_core.documents import Document

def predict(chain: Runnable, query: str, context: list[Document] | None = None)-> dict:
    """
    Accepts a retrieval chain or a stuff documents chain. If the latter, context must be passed in.
    Return a response dict with keys "input", "context", and "answer"
    """
    inputs = {"input": query}
    if context:
        inputs.update({"context": context})

    response = chain.invoke(inputs)

    result = {
        response["input"]: {
            "context": [d.page_content for d in response['context']],
            "answer": response["answer"],
        }
    }
    return result

Part 2: Evaluating a RAG system using Ragas and neptune.ai

Once a RAG system is built, it’s important to evaluate its performance and establish a baseline. The proper way to do this is by systematically testing it using a representative evaluation dataset. Since such a dataset is not available in our case yet, we’ll have to generate one.

To assess both the retrieval and generation aspects of the system, we’ll use Ragas as the evaluation framework and neptune.ai to track experiments as we iterate.

What is Ragas?

Ragas is an open-source toolkit for evaluating RAG applications. It offers both LLM-based and non-LLM-based metrics to assess the quality of retrieval and generated responses. Ragas works smoothly with LangChain, making it a great choice for evaluating our RAG system.

Step 1: Generate a RAG evaluation dataset

An evaluation set for RAG tasks is similar to a question-answering task dataset. The key difference is that each row includes not just the query and a reference answer but also reference contexts (documents that we expect to be retrieved to answer the query).

Thus, an example evaluation set entry looks like this:

Query: How can users trick a chatbot to bypass restrictions?

Reference context: [‘By prompting the application to pretend to be a chatbot that “can do anything” and is not bound by any restrictions, users were able to manipulate ChatGPT to provide responses to questions it would usually decline to answer.’]

Reference answer: Users trick chatbots to bypass restrictions by prompting the application to pretend to be a chatbot that ‘can do anything’ and is not bound by any restrictions, allowing it to provide responses to questions it would usually decline to answer.

Ragas provides utilities to generate such a dataset from a list of reference documents using an LLM.

As the reference documents, we’ll use the same chunks that we fed into the Chroma vector store in the first part, which is precisely the knowledge base from which our RAG system is drawing.

To test the generative part of our RAG chain, we’ll need to generate example queries and reference answers using a different model. Otherwise, we’d be testing our system’s self-consistency. We’ll use the full-sized GPT-4o model, which should outperform the GPT-4o-mini in our RAG chain.

As in the first part, it is possible to use a different LLM. The LangchainLLMWrapper and LangChainEmbeddingsWrapper make any model available via LangChain accessible to Ragas.

What happens under the hood?

Ragas’ TestSetGenerator builds a knowledge graph in which each node represents a chunk. It extracts information like named entities from the chunks and uses this data to model the relationship between nodes. From the knowledge graph, so-called query synthesizers derive scenarios consisting of a set of nodes, the desired query length and style, and a user persona. This scenario is used to populate a prompt template instructing an LLM to generate a query and answer (example). For more details, refer to the Ragas Testset Generation documentation.

Creating an evaluation dataset with 50 rows for our RAG system should take about a minute. We’ll generate a mixture of abstract queries (“What is concept A?”) and specific queries (“How often does subscription plan B bill its users?”):

from ragas.llms import LangChainLLMWrapper
from ragas.embeddings import LangChainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import AbstractQuerySynthesizer, SpecificQuerySynthesizer

generator_llm = LangChainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangChainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

dataset = generator.generate_with_langchain_docs(
    splits,
    testset_size=50,
    query_distribution=[
        (AbstractQuerySynthesizer(llm=generator_llm), 0.1),
        (SpecificQuerySynthesizer(llm=generator_llm), 0.9),
    ],
)

Filtering unwanted data

We want to focus our evaluation on cases where the reference answer is helpful. In particular, we don’t want to include test samples with responses containing phrases like “the context is insufficient” or “the context does not contain.” Duplicate entries in the dataset would skew the evaluation, so they should also be omitted.

For filtering, we’ll use the ability to easily convert Ragas datasets into Pandas DataFrames or Hugging Face Datasets:


# Keep only unique queries
unique_indices = set(dataset.to_pandas().drop_duplicates(subset=["user_input"]).index)

# Find samples whose reference answer indicates missing or insufficient context
not_helpful = set(dataset.to_pandas()[dataset.to_pandas()["reference"].str.contains("does not contain|does not provide|context does not|is insufficient|is incomplete", case=False, regex=True)].index)

unique_helpful_indices = unique_indices - not_helpful

ds = dataset.to_hf_dataset().select(unique_helpful_indices)

This leaves us with unique samples that look like this:

User input: What role does reflection play in identifying and correcting hallucinations in LLM outputs?

Reference contexts: [‘After the responseCorrecting a hallucination after the LLM output has been generated is still beneficial, as it prevents the user from seeing the incorrect information. This approach can effectively transform correction into prevention by ensuring that the erroneous response never reaches the user. The process can be broken down into the following steps:This method is part of multi-step reasoning strategies, which are increasingly important in handling complex problems. These strategies, often referred to as “agents,” are gaining popularity. One well-known agent pattern is reflection. By identifying hallucinations early, you can address and correct them before they impact the user.’]

Reference answer: Reflection plays a role in identifying and correcting hallucinations in LLM outputs by allowing early identification and correction of errors before they impact the user.

User input: What are some examples of LLMs that utilize a reasoning strategy to improve their responses?

Reference contexts: [‘Post-training or alignmentIt is hypothesized that an LLM instructed not only to respond and follow instructions but also to take time to reason and reflect on a problem could largely mitigate the hallucination issue—either by providing the correct answer or by stating that it does not know how to answer.Furthermore, you can teach a model to use external tools during the reasoning process,\xa0 like getting information from a search engine. There are a lot of different fine-tuning techniques being tested to achieve this. Some LLMs already working with this reasoning strategy are Matt Shumer’s Reflection-LLama-3.1-70b and OpenAI’s O1 family models.’]

Reference answer: Some examples of LLMs that utilize a reasoning strategy to improve their responses are Matt Shumer’s Reflection-LLama-3.1-70b and OpenAI’s O1 family models.

User input: What distnguishes ‘promt injecton’ frm ‘jailbraking’ in vulnerabilties n handling?

Reference contexts: [‘Although “prompt injection” and “jailbreaking” are often used interchangeably in the community, they refer to distinct vulnerabilities that must be handled with different methods.’]

Reference answer: ‘Prompt injection’ and ‘jailbreaking’ are distinct vulnerabilities that require different handling methods.

In the third sample, the query contains a lot of typos. This is an example of the “MISSPELLED” query style.

💡 You can find a full example evaluation dataset on Hugging Face.

Step 2: Choose RAG evaluation metrics

As mentioned earlier, Ragas offers both LLM-based and non-LLM-based metrics for RAG system evaluation.

For this example, we’ll focus on LLM-based metrics. LLM-based metrics are more suitable for tasks requiring semantic and contextual understanding than quantitative metrics while being significantly less resource-intensive than having humans evaluate each response. This makes them a reasonable tradeoff despite concerns about reproducibility.

From the wide range of metrics available in Ragas, we’ll select five:

  1. LLM Context Recall measures how many of the relevant documents are successfully retrieved. It uses the reference answer as a proxy for the reference context and determines whether all claims in the reference answer can be attributed to the retrieved context.
  2. Faithfulness measures the generated answer’s factual consistency with the given context by assessing how many claims in the generated answer can be found in the retrieved context.
  3. Factual Correctness evaluates the factual accuracy of the generated answer by assessing whether claims are present in the reference answer (true and false positives) and whether any claims from the reference answer are missing (false negatives). From this information, precision, recall, or F1 scores are calculated.
  4. Semantic Similarity measures the similarity between the reference answer and the generated answer.
  5. Noise Sensitivity measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents.

Each of these metrics requires specifying an LLM or an embedding model for its calculations. We’ll again use GPT-4o for this purpose:

from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity, NoiseSensitivity
from ragas import EvaluationDataset
from ragas import evaluate

evaluator_llm = LangChainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangChainEmbeddingsWrapper(OpenAIEmbeddings())

metrics = [
    LLMContextRecall(llm=evaluator_llm),
    FactualCorrectness(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    SemanticSimilarity(embeddings=evaluator_embeddings),
    NoiseSensitivity(llm=evaluator_llm),
]

Step 3: Evaluate the baseline RAG system’s performance

To evaluate our baseline RAG system, we’ll generate predictions and analyze them with the five selected metrics.

To speed up the process, we’ll use a concurrent approach to handle the I/O-bound predict calls from the RAG chain. This allows us to process multiple queries in parallel. Afterward, we can convert the results into a data frame for further inspection and manipulation. We’ll also store the results in a CSV file.

Here’s the complete performance evaluation code:

from concurrent.futures import ThreadPoolExecutor, as_completed
from datasets import Dataset

def concurrent_predict_retrieval_chain(chain: Runnable, dataset: Dataset):
    results = {}
    threads = []
    with ThreadPoolExecutor(max_workers=5) as pool:
        for query in dataset["user_input"]:
            threads.append(pool.submit(predict, chain, query))
        for task in as_completed(threads):
            results.update(task.result())
    return results

predictions = concurrent_predict_retrieval_chain(rag_chain, ds)

# Attach the generated responses and retrieved contexts to the evaluation dataset
ds_k_1 = ds.map(lambda example: {"response": predictions[example["user_input"]]["answer"], "retrieved_contexts": predictions[example["user_input"]]["context"]})

results = evaluate(dataset=EvaluationDataset.from_hf_dataset(ds_k_1), metrics=metrics)

# Convert the results to a DataFrame and save them for later inspection
df = results.to_pandas()
df.to_csv("eval_results.csv", index=False)

Part 3: Iteratively refining the RAG performance

With the evaluation setup in place, we can now start to improve our RAG system. Using the initial evaluation results as our baseline, we can systematically make changes to our RAG chain and assess whether they improve performance.

While we could make do with saving all evaluation results in cleanly named files and taking notes, we’d quickly be overwhelmed with the amount of information. To efficiently iterate and keep track of our progress, we’ll need a way to record, analyze, and compare our experiments.

What is neptune.ai?

Neptune is a machine-learning experiment tracker focused on collaboration and scalability. It provides a centralized platform for tracking, logging, and comparing metrics, artifacts, and configurations.

Neptune can track not only single metrics values but also more complex metadata, such as text, arrays, and files. All metadata can be accessed and analyzed through a highly versatile user interface as well as programmatically. All this makes it a great tool for developing RAG systems and other LLM-based applications.

Step 1: Set up neptune.ai for experiment tracking

To get started with Neptune, sign up for a free account at app.neptune.ai and follow the steps to create a new project. Once that’s done, set the project name and API token as environment variables and initialize a run:

os.environ["NEPTUNE_PROJECT"] = "YOUR_PROJECT"
os.environ["NEPTUNE_API_TOKEN"] = "YOUR_API_TOKEN"

import neptune

run = neptune.init_run()

In Neptune, each run corresponds to one tracked experiment. Thus, every time we execute our evaluation script, we start a new experiment.
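
It can also help to log the run’s configuration up front so experiments are easy to tell apart later. A minimal sketch; the field names and values below are arbitrary examples, not required by Neptune:

# Log basic configuration for this experiment (field names are arbitrary)
run["config/chat_model"] = "gpt-4o-mini"
run["config/embedding_model"] = "text-embedding-ada-002"
run["config/chunk_size"] = 1000
run["config/chunk_overlap"] = 200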

Logging Ragas metrics to neptune.ai

To make our lives easier, we’ll define a helper function that stores the Ragas evaluation results in the Neptune Run object, which represents the current experiment.

We’ll track the metrics for each sample in the evaluation dataset as well as overall performance, which in our case is the average of each metric across all samples in the dataset:

import io

import neptune
import pandas as pd

def log_detailed_metrics(results_df: pd.DataFrame, run: neptune.Run, k: int):
    run["eval/k"].append(k)

    # Log the metric values, input, response, and reference for each sample
    for i, row in results_df.iterrows():
        for m in metrics:
            val = row[m.name]
            run[f"eval/q{i}/{m.name}"].append(val)

        run[f"eval/q{i}/user_input"] = row["user_input"]
        run[f"eval/q{i}/response"].append(row["response"])
        run[f"eval/q{i}/reference"] = row["reference"]

        # Store the retrieved and reference contexts side by side as a CSV file
        context_df = pd.DataFrame(
            zip(row["retrieved_contexts"], row["reference_contexts"]),
            columns=["retrieved", "reference"],
        )
        context_stream = io.StringIO()
        context_df.to_csv(context_stream, index=True, index_label="k")
        context_stream.seek(0)
        run[f"eval/q{i}/contexts/{k}"].upload(
            neptune.types.File.from_stream(context_stream, extension="csv")
        )

    # Log the average of each metric across all samples as the overall performance
    overall_metrics = results_df[[m.name for m in metrics]].mean(axis=0).to_dict()
    for name, value in overall_metrics.items():
        run[f"eval/overall/{name}"].append(value)

log_detailed_metrics(df, run, k=1)


run.stop()

Once we run the evaluation and switch to Neptune’s Experiments tab, we see our currently active run and the first round of metrics that we’ve logged.

Step 2: Iterate over a retrieval parameter

In our baseline RAG chain, we only use the first retrieved document chunk in the LLM context. But what if there are relevant chunks ranked lower, perhaps in the top 3 or top 5? To explore this, we can experiment with using different values for k, the number of retrieved documents.

We’ll start by evaluating k = 3 and k = 5 to see how the results change. For each experiment, we instantiate a new retrieval chain, run the prediction and evaluation functions, and log the results for comparison:

for k in [1, 3, 5]:
    retriever_k = vectorstore.as_retriever(search_kwargs={"k": k})
    rag_chain_k = create_retrieval_chain(retriever_k, question_answer_chain)
    predictions_k = concurrent_predict_retrieval_chain(rag_chain_k, ds)

    # Attach the new predictions to the evaluation dataset
    ds_k = ds.map(lambda example: {
        "response": predictions_k[example["user_input"]]["answer"],
        "retrieved_contexts": predictions_k[example["user_input"]]["context"]
    })

    results_k = evaluate(dataset=EvaluationDataset.from_hf_dataset(ds_k), metrics=metrics)
    df_k = results_k.to_pandas()

    # Save the raw evaluation data and upload it to Neptune
    df_k.to_csv("eval_results.csv", index=False)
    run[f"eval/eval_data/{k}"].upload("eval_results.csv")

    log_detailed_metrics(df_k, run, k)


run.stop()

Once the evaluation is complete (this should take between 5 and 10 minutes), the script displays “Shutting down background jobs,” followed by “Done!” when the process has finished.

Results overview

Let’s take a look at the results. Navigate to the Charts tab. The graphs all share a common x-axis labeled “step.” The evaluations for k = [1, 3, 5] are recorded as steps [0, 1, 2].


Comparison of metrics values over three different values of k: The averaged metrics values over all samples (top row) and the metric values for the first sample question (bottom row) indicate that the third step (k = 5) yielded the best outcome.

Looking at the overall metrics, we can see that increasing k improved most of them. Factual correctness decreased slightly, and noise sensitivity, where a lower value is preferable, increased. This is expected, since a larger k leads to more irrelevant chunks being included in the context. However, as both context recall and answer semantic similarity have gone up, this seems to be a worthwhile tradeoff.

Step 3: Iterate further

From here on, there are numerous possibilities for further experimentation, for example:

  • Trying different chunking strategies, such as semantic chunking, which determines the breakpoints between chunks based on semantic similarity rather than strict token counts.
  • Leveraging hybrid search, which combines keyword search algorithms like BM25 with embedding-based semantic search (a minimal sketch follows this list).
  • Trying other models that excel at question-answering tasks, like the Anthropic models, which are also available through LangChain.
  • Adding support components for dialogue systems, such as chat history.
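
To make the hybrid search option concrete, here is a minimal sketch using LangChain’s BM25Retriever and EnsembleRetriever. It assumes the chunked documents from earlier in the tutorial are available in a list (called docs here purely for illustration) and reuses the vectorstore and question_answer_chain objects from the experiments above; the weights and k values are arbitrary starting points, not tuned settings:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package

# Keyword-based retriever over the same chunks that were embedded into the vector store
bm25_retriever = BM25Retriever.from_documents(docs)  # `docs` is assumed to hold the chunked documents
bm25_retriever.k = 5

# Embedding-based retriever backed by the existing vector store
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine both retrievers; results are merged via weighted reciprocal rank fusion
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)

rag_chain_hybrid = create_retrieval_chain(hybrid_retriever, question_answer_chain)

The resulting chain can be evaluated with the same concurrent_predict_retrieval_chain and log_detailed_metrics functions as before, so the hybrid configuration shows up in Neptune right next to the k-sweep runs.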

Looking ahead

In the three parts of this tutorial, we’ve used LangChain to build a RAG system based on OpenAI models and the Chroma vector database, evaluated it with Ragas, and analyzed our progress with Neptune. Along the way, we explored essential foundations of developing performant RAG systems, such as:

  • How to efficiently chunk, store, and retrieve data to ensure our RAG system consistently delivers relevant and accurate responses to user queries.
  • How to generate an evaluation dataset for our particular RAG chain and use RAG-specific metrics like faithfulness and factual correctness to evaluate it.
  • How Neptune makes it easy to track, visualize, and analyze RAG system performance, allowing us to take a systematic approach when iteratively improving our application.

As we saw at the end of part 3, we’ve barely scratched the surface when it comes to improving retrieval performance and response quality. Using the triplet of tools we introduced and our evaluation setup, any new technique or change applied to the RAG system can be assessed and compared with alternative configurations. This allows us to confidently assess whether a modification improves performance and detect unwanted side effects.


New OBSCURE#BAT Malware Targets Users with Fake Captchas


OBSCURE#BAT malware campaign exploits social engineering & fake software downloads to evade detection, steal data and persist on systems. Learn how to stay safe.

Cybersecurity researchers at Securonix Threat Labs have spotted a new malware campaign called OBSCURE#BAT. This campaign uses social engineering tactics and fake software downloads to trick users into executing malicious code, enabling attackers to infect systems and avoid detection.

The attack begins with a user executing a malicious batch file, which is often disguised as a legitimate security feature or software download. Once executed, the malware establishes itself by creating scheduled tasks and modifying the Windows Registry so that it continues to operate after the system reboots.

The malware then uses a user-mode rootkit to hide its presence on the system, making it difficult for users and security tools to detect. The rootkit can hide files, registry entries, and running processes, allowing the malware to embed further into legitimate system processes and services.

Fake Captchas and Malicious Software Downloads

As seen in recent similar campaigns, hackers have been leveraging typosquatting and social engineering tactics to present fake products as legitimate within their supply chains. This includes:

Masquerading Software: Attackers also disguise their malicious files as trustworthy applications, such as Tor Browser, SIP (VoIP) software or Adobe products, increasing the chances that users will execute them.

Fake Captchas: Users may encounter a fake captcha, often one mimicking Cloudflare’s captcha, that tricks them into executing malicious code. These captchas typically originate from typosquatted domains that resemble legitimate sites. When users attempt to pass the captcha, they are prompted to execute code that has been copied to their clipboard.

Fake captcha used in the attack (Screenshot Securonix)

Evasion Techniques

The OBSCURE#BAT malware campaign is a major cybersecurity threat to both individuals and organizations, primarily due to its ability to compromise sensitive data through advanced evasion techniques. These include:

API Hooking: By using user-mode API hooking, the malware can hide files, registry entries, and running processes. This means that common tools like Windows Task Manager and command-line commands cannot see certain files or processes, particularly those that fit a specific naming scheme (e.g., those starting with “$nya-“).

Registry Manipulation: It registers a fake driver (ACPIx86.sys) in the registry to ensure further persistence. This driver is linked to a Windows service, allowing it to execute malicious code without raising suspicion.

Stealthy Logging: The malware monitors user interactions, such as clipboard activity, and regularly writes this data to encrypted files, further complicating detection and analysis.

Countries Targeted in the OBSCURE#BAT Attack

According to Securonix’s detailed technical report, shared with Hackread.com before its official release on Thursday, the malware appears to be financially motivated or aimed at espionage, targeting users primarily in the following countries:

  • Canada
  • Germany
  • United States
  • United Kingdom

How to Protect Yourself from the OBSCURE#BAT Attack

While common sense is a must when downloading software or clicking on unknown links, users and organizations should also follow these key security measures to protect their systems from OBSCURE#BAT and similar threats:

  • Clean downloads: Only download software from legitimate websites, and be wary of fake captchas and other social engineering tactics.
  • Use endpoint logging: For organizations, deploy endpoint logging tools, such as Sysmon and PowerShell logging, to enhance detection and response capabilities.
  • Monitor for suspicious activity: Regularly monitor systems for suspicious activity, such as unusual network connections or process behaviour.
  • Use threat detection tools: Consider using threat detection tools, such as behavioural analysis and machine learning-based systems, to detect and respond to threats like OBSCURE#BAT.


MIWIC25 – Eva Benn, Chief of Staff, Strategy – Microsoft Red Team


Organised by Eskenzi PR in media partnership with the IT Security Guru, the Most Inspiring Women in Cyber Awards aim to shed light on the remarkable women in our industry. The following is a feature on one of 2024’s Top 20 women selected by an esteemed panel of judges. Presented in a Q&A format, the nominee’s answers are written in their own words.

In 2025, the awards were sponsored by BT, KnowBe4, Mimecast, Varonis, Bridewell, Certes, Pentest Tools and AI Dionic. Community partners included WiCyS UK & Ireland Affiliate, Women in Tech and Cybersecurity Hub (WiTCH), CyBlack and Inclusive InCyber (LT Harper). 

What does your job role entail?

As Chief of Staff for Microsoft Red Team, I drive the strategy behind how we innovate and evolve red teaming—transforming it from purely technical operations into a strategic security pillar that directly shapes Microsoft’s overall security direction. My role is to define what modern red teaming looks like—not just for Microsoft, but for the industry—as this space rapidly evolves.
Microsoft is pioneering and reimagining how red teams operate in this new era, ensuring that every finding leads to measurable, lasting fixes. With the rise of AI and the Security Graph, we are shifting from product-based technical assessments to precision-driven security, uncovering micro-level vulnerabilities that could have massive impact. I also lead the vision for extending red teaming beyond human limitations, not only within Microsoft but across our customer ecosystem, helping to shape the future of collective defense. This includes driving strategies to accelerate remediation and push toward a self-healing security model—where threats are dynamically identified, understood, and resolved at scale.

How did you get into the cybersecurity industry?

I never set out to work in cybersecurity—I stumbled into it by accident. But looking back, I realize I’d been preparing for it all along. After leaving my small town in Bulgaria with just $50 and a dream for something bigger, I spent years relentlessly building my tech skills, working nights in restaurants and weekends for free on small tech projects just to prove myself. I was exhausted, broke, and doubted myself constantly. There were so many moments I almost gave up. But I didn’t.
And then, one day, cybersecurity found me. An unexpected opportunity appeared, and even though I felt unqualified and terrified, I took the leap. That leap changed everything.
Cybersecurity became the perfect place for my grit and curiosity to collide—a field where I could protect people, solve complex problems, and make a real impact. Today, I lead strategy for the Microsoft Red Team, helping shape the future of red teaming not just for Microsoft, but for the entire industry.
If my story proves anything, it’s this: you don’t have to see the whole path. You just have to keep going. Keep building. Keep believing. Because sometimes the thing you never planned for becomes the thing you were born to do.

What is one of the biggest challenges you have faced as a woman in the tech/cyber industry and how did you overcome it?

One of the biggest challenges I faced as a woman in cybersecurity was overcoming deep imposter syndrome—believing I didn’t belong in the room. Coming from a small town in Bulgaria with no role models in tech, I carried years of conditioning that told me success in this field wasn’t meant for people like me. Early in my career, I often felt like I had to blend in—dressing, speaking, and acting like the men around me just to be taken seriously.
What helped me overcome it was realizing that my unique story, my perspective, and my authenticity are exactly what make me strong. I found inspiration through the few women ahead of me who owned their space unapologetically, and they helped me see what was possible. Now, I make it my mission to be that example for others—showing women that we don’t have to change who we are to succeed in cybersecurity. We belong here exactly as we are.

What are you doing to support other women, and/or to increase diversity, in the tech/cyber industry?

I’m deeply committed to helping women and underrepresented groups break into cybersecurity and thrive. Over the years, I have served—and continue to serve—on various leadership boards and advisory groups to help shape the future of the industry and drive meaningful community impact. This includes organizations like OWASP Seattle, the EC-Council Certified Ethical Hacker (CEH) Advisory Board, Women in Cybersecurity (WiCyS), and ISACA Puget Sound.
As Co-Founder of Women in Tech Global and a leader in Microsoft Women in Security, I’ve helped build global communities that give women access to career opportunities, speaking platforms, and technical growth.
I also actively mentor young women, guiding them through career transitions, helping them overcome self-doubt, and supporting them as they step into leadership roles they may not have thought possible.
Beyond mentorship, I’m passionate about modernizing cybersecurity education. Through projects like The Hacking Games, I’m helping inspire the next generation of diverse talent by reimagining how we teach ethical hacking to Gen Z.
For me, this work is personal. I know how hard it is to build a path where none exists. That’s why I’m committed to being the example I wish I’d had—and ensuring no woman feels like she has to do it alone.

Who has inspired you in your life/career? 

I’ve been most inspired by the women who dared to take up space in rooms where they were never expected to belong—and did it unapologetically. Seeing strong women lead in cybersecurity with both confidence and authenticity showed me that we don’t have to trade our uniqueness to succeed in this industry. Their example helped me realize that my story, my background, and even my struggles are my power.
But beyond individual people, I’m inspired by the millions of women who haven’t yet been told they belong here. I think of the little girls staring out of windows in small towns, just like I once did, wondering if there’s more to life than what’s been handed to them. They inspire me to keep going, keep building, and keep showing up—because if I can be proof for even one of them that a different future is possible, then every challenge I’ve faced was worth it.
We need more examples to emulate—more women leading, succeeding, and owning their space—so others can see themselves in us. That’s why what we’re doing here is so important. Visibility creates possibility. And together, we’re redefining what’s possible for the next generation.

The post MIWIC25 – Eva Benn, Chief of Staff, Strategy – Microsoft Red Team appeared first on IT Security Guru.

This AI Paper Introduces BD3-LMs: A Hybrid Approach Combining Autoregressive and Diffusion Models for Scalable and Efficient Text Generation


Traditional language models rely on autoregressive approaches, which generate text sequentially, ensuring high-quality outputs at the expense of slow inference speeds. In contrast, diffusion models, initially developed for image and video generation, have gained attention in text generation due to their potential for parallelized generation and improved controllability. However, existing diffusion models struggle with fixed-length constraints and inefficiencies in likelihood modeling, limiting their effectiveness in generating flexible-length text.

A major challenge in language modeling is balancing efficiency and quality. Autoregressive models capture long-range dependencies effectively but suffer from slow token-by-token generation. Diffusion models, while promising, require multiple inference steps and typically generate fixed-length outputs. This limitation prevents them from being practical for real-world applications where variable-length sequences are necessary. The research addresses this issue by proposing a method that combines the strengths of both autoregressive and diffusion models, ensuring efficient and high-quality text generation without compromising flexibility.

Current methods primarily involve autoregressive models, which generate text one token at a time based on previously generated tokens. While these models achieve high fluency and coherence, they are inherently slow due to their sequential processing nature. Diffusion-based approaches have been explored as an alternative, offering parallel generation. However, existing diffusion models generate fixed-length sequences and lack efficient means of extending beyond predefined contexts. Despite the inefficiencies of autoregressive generation, the lack of scalability in diffusion models has led to continued reliance on autoregressive methods.

Cornell Tech and Stanford University researchers introduced Block Discrete Denoising Diffusion Language Models (BD3-LMs) to overcome these limitations. This new class of models interpolates between autoregressive and diffusion models by employing a structured approach that supports variable-length generation while maintaining inference efficiency. BD3-LMs use key-value caching and parallel token sampling to reduce computational overhead. The model is designed with specialized training algorithms that minimize gradient variance through customized noise schedules, optimizing performance across diverse language modeling benchmarks.

BD3-LMs operate by structuring text generation into blocks rather than individual tokens. Unlike traditional autoregressive models, which predict the next token sequentially, BD3-LMs generate a block of tokens simultaneously, significantly improving efficiency. A diffusion-based denoising process within each block ensures high-quality text generation while preserving coherence. The model architecture integrates transformers with a block-causal attention mechanism, allowing each block to condition on previously generated blocks. This approach enhances both contextual relevance and fluency. The training process includes a vectorized implementation that enables parallel computations, reducing training time and resource consumption. Researchers introduced data-driven noise schedules that stabilize training and improve gradient estimation to address the high variance issue in diffusion models.
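
To illustrate the block-causal attention pattern described above, here is a minimal sketch (not taken from the paper) that builds a mask in which every token can attend to all tokens in its own block and in earlier blocks, but not to later blocks:

import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    # Block index of every position in the sequence
    block_ids = np.arange(seq_len) // block_size
    # Position i may attend to position j iff j's block is not later than i's block
    return block_ids[:, None] >= block_ids[None, :]

mask = block_causal_mask(seq_len=8, block_size=4)  # two blocks of four tokens each

Within a block, attention is unrestricted, which is what allows the diffusion-based denoiser to generate all of the block’s tokens in parallel, while the causal dependence on earlier blocks preserves left-to-right coherence across blocks.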

Performance evaluations of BD3-LMs demonstrate substantial improvements over existing discrete diffusion models. The model achieves state-of-the-art perplexity scores among diffusion-based language models while enabling the generation of arbitrary-length sequences. In experiments conducted on language modeling benchmarks, BD3-LMs reduce perplexity by up to 13% compared to previous diffusion models. On the LM1B dataset, BD3-LMs achieved a perplexity of 28.23 when using a block size of four, outperforming previous models such as MDLM, which had a perplexity of 31.78. On OpenWebText, BD3-LMs attained a perplexity of 20.73, significantly better than other discrete diffusion models. Further, BD3-LMs generated sequences up to 10 times longer than those produced by traditional diffusion methods, demonstrating superior scalability. The proposed model also reduced the number of function evaluations required for inference, achieving improved sample efficiency and generation speed.

The introduction of BD3-LMs presents a significant advancement in language modeling by integrating autoregressive and diffusion-based methodologies. By addressing key challenges related to inference efficiency, likelihood estimation, and sequence flexibility, this research offers a practical and scalable solution for text generation. BD3-LMs improve training stability and computational efficiency, providing a framework that can be extended to future language modeling developments. The results highlight the effectiveness of BD3-LMs in bridging the gap between autoregressive and diffusion-based approaches, offering an optimized balance between quality and speed in text generation.


Check out the Paper, Project, and GitHub Page. All credit for this research goes to the researchers of this project.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.


How AI is Changing the Landscape of Digital Relationships


How AI is Changing the Landscape of Digital Relationships

Introduction:

Digital relationships have grown beyond text messages and video calls. With advancements in artificial intelligence (AI), connections are being shaped by technology that not only enhances communication but also mimics human emotions. From personalized matchmaking to AI-powered companions, AI is revolutionizing how we form and sustain relationships.

In this article, I’ll explore the fascinating world of AI in digital relationships and dive deep into its potential, challenges, and ethical implications. Let’s discover how AI is creating new possibilities for human connections.

The Evolution of Digital Relationships

The shift from traditional forms of communication to digital platforms has been swift and transformative. Social media, dating apps, and virtual communities have bridged geographical gaps, allowing people to connect globally.

Initially, digital relationships were limited to email or instant messaging, but AI-powered tools now play a major role in creating more meaningful interactions. Early AI technologies, such as chatbots and recommendation systems, laid the foundation for today’s advancements, enabling everything from personalized matchmaking to tailored communication.

This evolution has been critical in addressing modern challenges, including loneliness, busy lifestyles, and even social anxiety, as AI tools adapt to the unique needs of users.

AI-Powered Matchmaking and Dating Platforms

One of the most significant impacts of AI is in modern matchmaking. Dating platforms like Tinder, Bumble, and Hinge use advanced algorithms to analyze user behavior, preferences, and interactions.

How AI Enhances Matchmaking:

  • Behavioral Analysis: AI observes patterns, such as likes and swipes, to recommend compatible matches.
  • Profile Optimization: AI assists users in crafting appealing profiles by suggesting photos or taglines that align with popular trends.
  • Real-Time Adjustments: AI learns from user feedback to fine-tune recommendations.

Despite the many benefits, challenges persist. For instance, biases in AI algorithms can skew results, and privacy concerns arise as sensitive user data is analyzed. Yet, the potential to revolutionize digital matchmaking is undeniable.

AI in Communication: Chatbots and Virtual Companions

AI has transformed digital communication with innovations like chatbots and virtual companions. These tools, designed to simulate human conversation, cater to various needs, from casual chats to emotional support.

  • AI Girlfriend Chatbots in the AI Ecosystem: These chatbots mimic romantic or platonic interactions, providing users with an alternative to traditional relationships. They’re especially appealing to individuals seeking companionship without the emotional complexities of real-life connections.
  • AI Sexting as a Growing Trend: With AI’s ability to craft personalized and engaging text, some users are exploring AI sexting tools to navigate intimate interactions digitally. This trend raises ethical questions about the boundaries of AI’s role in personal interactions.

While these tools offer companionship and entertainment, they also pose ethical challenges, particularly regarding the authenticity of such relationships. Are we at risk of becoming overly reliant on AI for emotional fulfillment?

AI’s Role in Long-Distance Relationships

Long-distance relationships (LDRs) have always faced unique challenges, including communication gaps and the absence of physical presence. AI has stepped in to address these obstacles, offering tools that make LDRs more manageable.

Key AI Tools for LDRs:

  • Sentiment analysis to gauge emotions in conversations.
  • Predictive AI that suggests activities or conversations based on shared interests.
  • Augmented reality (AR) and virtual reality (VR) applications to create immersive experiences, simulating physical closeness.

These advancements allow couples to connect on a deeper level, even when miles apart. However, ethical concerns about AI’s potential to intrude on private moments remain a topic of discussion.

Future Trends in AI and Digital Relationships

The future of AI in relationships looks incredibly promising. As technology advances, we can expect more hyper-personalized interactions powered by natural language processing (NLP) and machine learning.

Potential Developments:

  • AI-driven matchmaking apps that predict relationship longevity based on data patterns.
  • Enhanced virtual companions with lifelike personalities and emotional intelligence.
  • Improved tools for navigating complex emotions, such as breakups or reconciliation.

However, alongside these innovations, we must address the ethics of AI in personal interactions. Transparency, consent, and accountability will be vital as AI continues to blur the lines between human and digital connections.

Ethical Concerns and Limitations

While the possibilities are exciting, the ethical landscape is complex. Questions about privacy, data security, and emotional manipulation arise as AI becomes more integrated into our personal lives.

  • Privacy Issues: AI tools often require access to sensitive information, raising concerns about how this data is stored and used.
  • Emotional Manipulation: AI’s ability to simulate emotions can lead to unintended consequences, such as users forming attachments to AI entities.
  • Balancing Innovation with Responsibility: Developers must prioritize ethical considerations, ensuring that AI tools enhance relationships without exploiting vulnerabilities.

By addressing these challenges proactively, we can harness AI’s potential responsibly.

Conclusion

AI is undeniably reshaping the way we form and maintain relationships. From matchmaking algorithms to virtual companions, the technology offers exciting possibilities for connection and emotional support.

However, the journey is not without its challenges. By addressing ethical concerns, prioritizing transparency, and staying mindful of the balance between human and digital interaction, we can navigate this evolving landscape with confidence.

The future of digital relationships lies at the intersection of innovation and responsibility, and I, for one, am excited to see where this journey takes us.

Essential Review Papers on Physics-Informed Neural Networks: A Curated Guide for Practitioners


Staying on top of a fast-growing research field is never easy.

I face this challenge firsthand as a practitioner in Physics-Informed Neural Networks (PINNs). New papers, be they algorithmic advancements or cutting-edge applications, are published at an accelerating pace by both academia and industry. While it is exciting to see this rapid development, it inevitably raises a pressing question:

How can one stay informed without spending countless hours sifting through papers?

This is where I have found review papers to be exceptionally valuable. Good review papers are effective tools that distill essential insights and highlight important trends. They are big-time savers guiding us through the flood of information.

In this blog post, I would like to share with you my personal, curated list of must-read review papers on PINNs, that are especially influential for my own understanding and use of PINNs. Those papers cover key aspects of PINNs, including algorithmic developments, implementation best practices, and real-world applications.

In addition to what’s available in existing literature, I’ve included one of my own review papers, which provides a comprehensive analysis of common functional usage patterns of PINNs — a practical perspective often missing from academic reviews. This analysis is based on my review of around 200 arXiv papers on PINNs across various engineering domains in the past 3 years and can serve as an essential guide for practitioners looking to deploy these techniques to tackle real-world challenges.

For each review paper, I will explain why it deserves your attention by explaining its unique perspective and indicating practical takeaways that you can benefit from immediately.

Whether you’re just getting started with PINNs, using them to tackle real-world problems, or exploring new research directions, I hope this collection makes navigating the busy field of PINN research easier for you.

Let’s cut through the complexity together and focus on what truly matters.

1️⃣ Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and what’s next

📄 Paper at a glance

🔍 What it covers

  • Authors: S. Cuomo, V. Schiano di Cola, F. Giampaolo, G. Rozza, M. Raissi, and F. Piccialli
  • Year: 2022
  • Link: arXiv

This review is structured around key themes in PINNs: the fundamental components that define their architecture, theoretical aspects of their learning process, and their application to various computing challenges in engineering. The paper also explores the available toolsets, emerging trends, and future directions.

Fig 1. Overview of the #1 review paper. (Image by author)

✨ What’s unique

This review paper stands out in the following ways:

  • One of the best introductions to PINN fundamentals. This paper takes a well-paced approach to explaining PINNs from the ground up. Section 2 systematically dissects the building blocks of a PINN, covering various underlying neural network architectures and their associated characteristics, how PDE constraints are incorporated, common training methodologies, and learning theory (convergence, error analysis, etc.) of PINNs.
  • Putting PINNs in historical context. Rather than simply presenting PINNs as a standalone solution, the paper traces their development from earlier work on using deep learning to solve differential equations. This historical framing is valuable because it helps demystify PINNs by showing that they are an evolution of previous ideas, and it makes it easier for practitioners to see what alternatives are available.
  • Equation-driven organization. Instead of just classifying PINN research by scientific domains (e.g., geoscience, material science, etc.) as many other reviews do, this paper categorizes PINNs based on the types of differential equations (e.g., diffusion problems, advection problems, etc.) they solve. This equation-first perspective encourages knowledge transfer as the same set of PDEs could be used across multiple scientific domains. In addition, it makes it easier for practitioners to see the strengths and weaknesses of PINNs when dealing with different types of differential equations.

🛠 Practical goodies

Beyond its theoretical insights, this review paper offers immediately useful resources for practitioners:

  • A complete implementation example. In section 3.4, this paper walks through a full PINN implementation to solve a 1D Nonlinear Schrödinger equation. It covers translating equations into PINN formulations, handling boundary and initial conditions, defining neural network architectures, choosing training strategies, selecting collocation points, and applying optimization methods. All implementation details are clearly documented for easy reproducibility. The paper compares PINN performance by varying different hyperparameters, which could offer immediately applicable insights for your own PINN experiments.
  • Available frameworks and software tools. Table 3 compiles a comprehensive list of major PINN toolkits, with detailed tool descriptions provided in section 4.3. The considered backends include not only Tensorflow and PyTorch but also Julia and Jax. This side-by-side comparison of different frameworks is especially useful for picking the right tool for your needs.

💡Who would benefit

  • This review paper benefits anyone new to PINNs and looking for a clear, structured introduction.
  • Engineers and developers looking for practical implementation guidance would find the realistic, hands-on demo, and the thorough comparison of existing PINN frameworks most interesting. Additionally, they can find relevant prior work on differential equations similar to their current problem, which offers insights they can leverage in their own problem-solving.
  • Researchers investigating theoretical aspects of PINN convergence, optimization, or efficiency can also greatly benefit from this paper.

2️⃣ From PINNs to PIKANs: Recent Advances in Physics-Informed Machine Learning

📄 Paper at a glance

  • Authors: J. D. Toscano, V. Oommen, A. J. Varghese, Z. Zou, N. A. Daryakenari, C. Wu, and G. E. Karniadakis
  • Year: 2024
  • Link: arXiv

🔍 What it covers

This paper provides one of the most up-to-date overviews of the latest advancements in PINNs. It emphasises enhancements in network design, feature expansion, optimization strategies, uncertainty quantification, and theoretical insights. The paper also surveys key applications across a range of domains.

Fig 2. Overview of the #2 review paper. (Image by author)

✨ What’s unique

This review paper stands out in the following ways:

  • A structured taxonomy of algorithmic developments. One of the most original contributions of this paper is its taxonomy of algorithmic advancements. This new taxonomy scheme elegantly categorizes all the advancements into three core areas: (1) representation model, (2) handling governing equations, and (3) optimization process. This structure provides a clear framework for understanding both current developments and potential directions for future research. In addition, the illustrations used in the paper are top-notch and easily digestible.
Fig 3. The taxonomy of algorithmic developments in PINNs proposed by the #2 paper. (Image by author)
  • Spotlight on Physics-informed Kolmogorov–Arnold Networks (KAN). KAN, a new architecture based on the Kolmogorov–Arnold representation theorem, is currently a hot topic in deep learning. In the PINN community, some work has already been done to replace the multilayer perceptron (MLP) representation with KANs to gain more expressiveness and training efficiency. The community has lacked a comprehensive review of this new line of research; this review paper (section 3.1) fills exactly that gap.
  • Review on uncertainty quantification (UQ) in PINNs. UQ is essential for the reliable and trustworthy deployment of PINNs when tackling real-world engineering applications. In section 5, this paper provides a dedicated section on UQ, explaining the common sources of uncertainty in solving differential equations with PINNs and reviewing strategies for quantifying prediction confidence.
  • Theoretical advances in PINN training dynamics. In practice, training PINNs is non-trivial. Practitioners are often puzzled by why PINNs training sometimes fail, or how they should be trained optimally. In section 6.2, this paper provides one of the most detailed and up-to-date discussions on this aspect, covering the Neural Tangent Kernel (NTK) analysis of PINNs, information bottleneck theory, and multi-objective optimization challenges.

🛠 Practical goodies

Even though this review paper leans towards the theory-heavy side, two particularly valuable aspects stand out from a practical perspective:

  • A timeline of algorithmic advances in PINNs. In Appendix A Table, this paper tracks the milestones of key advancements in PINNs, from the original PINN formulation to the most recent extensions to KANs. If you’re working on algorithmic improvements, this timeline gives you a clear view of what’s already been done. If you’re struggling with PINN training or accuracy, you can use this table to find existing methods that might solve your issue.
  • A broad overview of PINN applications across domains. Compared to all the other reviews, this paper strives to give the most comprehensive and updated coverage of PINN applications in not only the engineering domains but also other less-covered fields such as finance. Practitioners can easily find prior works conducted in their domains and draw inspiration.

💡Who would benefit

  • For practitioners working in safety-critical fields that need confidence intervals or reliability estimates on their PINN predictions, the discussion on UQ would be useful. If you are struggling with PINN training instability, slow convergence, or unexpected failures, the discussion on PINN training dynamics can help unpack the theoretical reasons behind these issues.
  • Researchers may find this paper especially interesting because of the new taxonomy, which allows them to see patterns and identify gaps and opportunities for novel contributions. In addition, the review of cutting-edge work on PI-KAN can also be inspiring.

3️⃣ Physics-Informed Neural Networks: An Application-Centric Guide

📄 Paper at a glance

  • Authors: S. Guo (this author)
  • Year: 2024
  • Link: Medium

🔍 What it covers

This article reviews how PINNs are used to tackle different types of engineering tasks. For each task category, the article discusses the problem statement, why PINNs are useful, and how PINNs can be implemented to address the problem, followed by a concrete use case published in the literature.

Fig 4. Overview of the #3 review paper. (Image by author)

✨ What’s unique

Unlike most reviews that categorize PINN applications either based on the type of differential equations solved or specific engineering domains, this article picks an angle that practitioners care about the most: the engineering tasks solved by PINNs. This work is based on reviewing papers on PINN case studies scattered in various engineering domains. The outcome is a list of distilled recurring functional usage patterns of PINNs:

  • Predictive modeling and simulations, where PINNs are leveraged for dynamical system forecasting, coupled system modeling, and surrogate modeling.
  • Optimization, where PINNs are commonly employed to achieve efficient design optimization, inverse design, model predictive control, and optimized sensor placement.
  • Data-driven insights, where PINNs are used to identify the unknown parameters or functional forms of the system, as well as to assimilate observational data to better estimate the system states.
  • Data-driven enhancement, where PINNs are used to reconstruct the field and enhance the resolution of the observational data.
  • Monitoring, diagnostic, and health assessment, where PINNs are leveraged to act as virtual sensors, anomaly detectors, health monitors, and predictive maintainers.

🛠 Practical goodies

This article places practitioners’ needs at the forefront. While most existing review papers merely answer the question, “Has PINN been used in my field?”, practitioners often seek more specific guidance: “Has PINN been used for the type of problem I’m trying to solve?”. This is precisely what this article tries to address.

By using the proposed five-category functional classification, practitioners can conveniently map their problems to these categories, see how others have solved them, and what worked and what did not. Instead of reinventing the wheel, practitioners can leverage established use cases and adapt proven solutions to their own problems.

💡Who would benefit

This review is best for practitioners who want to see how PINNs are actually being used in the real world. It can also be particularly valuable for cross-disciplinary innovation, as practitioners can learn from solutions developed in other fields.

4️⃣ An Expert’s Guide to Training Physics-informed Neural Networks

📄 Paper at a glance

  • Authors: S. Wang, S. Sankaran, H. Wang, P. Perdikaris
  • Year: 2023
  • Link: arXiv

🔍 What it covers

Even though it doesn’t market itself as a “standard” review, this paper goes all in on providing a comprehensive handbook for training PINNs. It presents a detailed set of best practices for training physics-informed neural networks (PINNs), addressing issues like spectral bias, unbalanced loss terms, and causality violations. It also introduces challenging benchmarks and extensive ablation studies to demonstrate these methods.

Fig 5. Overview of the #4 review paper. (Image by author)

✨ What’s unique

  • A unified “expert’s guide”. The main authors are active researchers in PINNs, working extensively on improving PINN training efficiency and model accuracy for the past years. This paper is a distilled summary of the authors’ past work, synthesizing a broad range of recent PINN techniques (e.g., Fourier feature embeddings, adaptive loss weighting, causal training) into a cohesive training pipeline. This feels like having a mentor who tells you exactly what does and doesn’t work with PINNs.
  • A thorough hyperparameter tuning study. This paper conducts various experiments to show how different tweaks (e.g., different architectures, training schemes, etc.) play out on different PDE tasks. Their ablation studies show precisely which methods move the needle, and by how much.
  • PDE benchmarks. The paper compiles a suite of challenging PDE benchmarks and offers state-of-the-art results that PINNs can achieve.

🛠 Practical goodies

  • A problem-solution cheat sheet. This paper thoroughly documents various techniques addressing common PINN training pain-points. Each technique is clearly presented using a structured format: the why (motivation), how (how the approach addresses the problem), and what (the implementation details). This makes it very easy for practitioners to identify the “cure” based on the “symptoms” observed in their PINN training process. What’s great is that the authors transparently discussed potential pitfalls of each approach, allowing practitioners to make well-informed decisions and effective trade-offs.
  • Empirical insights. The paper shares valuable empirical insights obtained from extensive hyperparameter tuning experiments. It offers practical guidance on choosing suitable hyperparameters, e.g., network architectures and learning rate schedules, and demonstrates how these parameters interact with the advanced PINN training techniques proposed.
  • Ready-to-use library. The paper is accompanied by an optimized JAX library that practitioners can directly adopt or customize. The library supports multi-GPU environments and is ready for scaling to large-scale problems.

💡Who would benefit

  • Practitioners who are struggling with unstable or slow PINN training can find many practical strategies to fix common pathologies. They can also benefit from the straightforward templates (in JAX) to quickly adapt PINNs to their own PDE setups.
  • Researchers looking for challenging benchmark problems and aiming to benchmark new PINN ideas against well-documented baselines will find this paper especially handy.

5️⃣ Domain-Specific Review Papers

Beyond general reviews in PINNs, there are several nice review papers that focus on specific scientific and engineering domains. If you’re working in one of these fields, these reviews could provide a deeper dive into best practices and cutting-edge applications.

1. Heat Transfer Problems

Paper: Physics-Informed Neural Networks for Heat Transfer Problems

The paper provides an application-centric discussion on how PINNs can be used to tackle various thermal engineering problems, including inverse heat transfer, convection-dominated flows, and phase-change modeling. It highlights real-world challenges such as missing boundary conditions, sensor-driven inverse problems, and adaptive cooling system design. The industrial case study related to power electronics is particularly insightful for understanding the usage of PINNs in practice.

2. Power Systems

Paper: Applications of Physics-Informed Neural Networks in Power Systems — A Review

This paper offers a structured overview of how PINNs are applied to critical power grid challenges, including state/parameter estimation, dynamic analysis, power flow calculation, optimal power flow (OPF), anomaly detection, and model synthesis. For each type of application, the paper discusses the shortcomings of traditional power system solutions and explains why PINNs could be advantageous in addressing those shortcomings. This comparative summary is useful for understanding the motivation for adopting PINNs.

3. Fluid Mechanics

Paper: Physics-informed neural networks (PINNs) for fluid mechanics: A review

This paper explores three detailed case studies that demonstrate PINN applications in fluid dynamics: (1) 3D wake flow reconstruction using sparse 2D velocity data, (2) inverse problems in compressible flow (e.g., shock wave prediction with minimal boundary data), and (3) biomedical flow modeling, where PINNs infer thrombus material properties from phase-field data. The paper highlights how PINNs overcome limitations in traditional CFD, e.g., mesh dependency, expensive data assimilation, and difficulty handling ill-posed inverse problems.

4. Additive Manufacturing

Paper: A review on physics-informed machine learning for monitoring metal additive manufacturing process

This paper examines how PINNs address critical challenges specific to additive manufacturing process prediction or monitoring, including temperature field prediction, fluid dynamics modeling, fatigue life estimation, accelerated finite element simulations, and process characteristics prediction.

6️⃣ Conclusion

In this blog post, we went through a curated list of review papers on PINNs, covering fundamental theoretical insights, the latest algorithmic advancements, and practical application-oriented perspectives. For each paper, we highlighted unique contributions, key takeaways, and the audience that would benefit the most from these insights. I hope this curated collection can help you better navigate the evolving field of PINNs.

Multimodal Large Language Models


Multimodal Large Language Models (MLLMs) process data from different modalities like text, audio, image, and video.

Compared to text-only models, MLLMs achieve richer contextual understanding and can integrate information across modalities, unlocking new areas of application. Prime use cases of MLLMs include content creation, personalized recommendations, and human-machine interaction.

Examples of MLLMs that process image and text data include Microsoft’s Kosmos-1, DeepMind’s Flamingo, and the open-source LLaVA. Google’s PaLM-E additionally handles information about a robot’s state and surroundings.

Combining different modalities and dealing with different types of data comes with some challenges and limitations, such as alignment of heterogeneous data, inherited biases from pre-trained models, and lack of robustness.

How would you translate the sentence “The glasses are broken.” into French: “Les verres sont cassés.” or “Les lunettes sont cassées.”? What if you have an image? Will you be able to choose the correct translation? As humans, we use different modalities daily to enhance communication. Machines can do the same.

Access to visual context can resolve ambiguity when translating between languages. In this example, the image of drinking glasses resolves the ambiguity in the meaning of “glasses” when translating the sentence from English to French. | Modified based on: source

While Large Language Models (LLMs) have shown impressive capabilities in understanding complex text, they are limited to a single data modality. However, many tasks span several modalities.

This article explores Multimodal Large Language Models, examining their core functionalities, challenges, and potential for various machine-learning domains.

What is a multimodal large language model?

Let’s break down the concept of Multimodal Large Language Models (MLLMs) by first understanding the terms “modal” and “multimodal:”

“Modal” refers to a particular way of communicating or perceiving information. It’s like a channel through which we receive and express ourselves. Some of the common modalities are: 

  • Visual: Sight, including images, videos, and spatial information.
  • Auditory: Hearing, including sounds, music, and speech.
  • Textual: Written language, including words, sentences, and documents.
  • Haptic: Touch, including sensations of texture, temperature, and pressure.
  • Olfactory: Smell

“Multimodal” refers to incorporating various modalities to create a richer understanding of the task, e.g., as on a website or in a blog post that integrates text with visuals.

MLLMs can process not just text but other modalities as well. They are trained on samples containing different modalities, which allows them to develop joint representations and utilize multimodal information to solve tasks.

Why do we need multimodal LLMs?

Many industries heavily rely on multimodality, particularly those that handle a blend of data modalities. For example, MLLMs can be used in a healthcare setting to process patient reports comprising doctor notes (text), treatment plans (structured data), and X-rays or MRI scans (images).

Example of a multi-modal model. The model is trained on X-rays, medical reports, actions, and texts describing the diagnosis and outcome. This way, the model learns to use visual and textual information to predict potential diagnoses. | Modified based on: source

MLLMs process and integrate information from different modalities (i.e., text, image, video, and audio), essential to solving many tasks. Some prominent applications are:

  1. Content creation: MLLMs can generate image captions, transform text into visually descriptive narratives, or create multimedia presentations, making them valuable tools for creative and professional industries.
  2. Enhanced human-machine interaction: By understanding and responding to inputs from diverse modalities such as text, speech, and images, MLLMs enable more natural communication. This can enrich the user experience in applications like virtual assistants, chatbots, and smart devices.
  3. Personalized recommendations: MLLMs contribute to refining recommendation systems by analyzing user preferences across diverse modalities. Whether suggesting movies based on textual reviews, recommending products through image recognition, or personalizing content recommendations across varied formats, these models elevate the precision and relevance of recommendations.
  4. Domain-specific problem solving: MLLMs are adaptable and invaluable in addressing challenges across various domains. In healthcare, their capability to interpret medical images aids in diagnostics, while in education, they enhance learning experiences by providing enriched materials that seamlessly combine text and visuals.

How do multimodal LLMs work?

A typical multimodal LLM has three primary modules:

  • The input module comprises specialized neural networks for each specific data type that output intermediate embeddings.
  • The fusion module converts the intermediate embeddings into a joint representation.
  • The output module generates outputs based on the task and the processed information. An output could be, e.g., a text, a classification (like “dog” for an image), or an image. Some MLLMs, like Google’s Gemini family, can produce outputs in more than one modality.
Basic structure of a multimodal LLM. Different modalities are processed by separate input modules. Then, the extracted information is joined in the fusion module. The output module (in this case, a classifier) generates the output in the desired modality.
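
A minimal, illustrative sketch of this three-module structure is shown below. It is not the implementation of any particular model; the encoders, dimensions, and classifier head are toy placeholders chosen only to make the data flow explicit:

import numpy as np

rng = np.random.default_rng(0)

# Input modules: one encoder per modality, each producing an intermediate embedding
def encode_text(token_ids, embed_dim=64):
    table = rng.normal(size=(10_000, embed_dim))  # toy embedding table
    return table[token_ids].mean(axis=0)          # mean-pooled text embedding

def encode_image(pixels, embed_dim=64):
    flat = pixels.reshape(-1)
    projection = rng.normal(size=(flat.size, embed_dim))  # toy linear projection
    return flat @ projection

# Fusion module: join the intermediate embeddings into one representation
def fuse(text_emb, image_emb):
    return np.concatenate([text_emb, image_emb])

# Output module: here, a toy classifier head over the fused representation
def classify(fused, num_classes=3):
    weights = rng.normal(size=(fused.size, num_classes))
    return int(np.argmax(fused @ weights))

label = classify(fuse(encode_text(np.array([1, 2, 3])), encode_image(rng.normal(size=(8, 8)))))

In a real MLLM, the encoders are pre-trained neural networks (e.g., a vision transformer for images), the fusion step is typically learned (for example, via cross-attention), and the output module is a language-model decoder rather than a fixed classifier.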

Examples of multimodal LLMs

Microsoft: Kosmos-1

Kosmos-1 (GitHub) is a multimodal LLM created by Microsoft for natural language and perception-intensive tasks. It can perform visual dialogue, visual explanation, visual question answering, image captioning, math equation solving, OCR, and zero-shot image classification with and without descriptions.

Architecture and training

Kosmos-1 processes inputs consisting of text and encoded image embeddings. Image embeddings are obtained through the pre-trained CLIP ViT-L/14 (GitHub) model. An embedding module processes this input before feeding it into a transformer-based decoder based on Magneto.

Kosmos-1 used the same initialization as the Magneto transformer for better optimization stability. To capture position information more precisely and better generalize to different sequence lengths (short sequences for training, long ones during testing), Kosmos-1 used xPOS as a relative position encoder.

Kosmos-1 has about 1.6 billion parameters in total, which is smaller than rival models like Flamingo, LLaVA, or GPT-4o. It was trained from scratch on web-scale multimodal corpora (text corpora, image-caption pairs, and interleaved image-text data).

A main limitation of Kosmos-1 is the limited number of input tokens (2,048) across text and image modalities.

Performance

The creators of Kosmos-1 proposed the Raven IQ test dataset to evaluate the nonverbal reasoning capabilities of MLLMs. This was the first time a model was tested on nonverbal reasoning. The experimental results from the Kosmos-1 paper show that although Kosmos-1 performs slightly better than random choice (randomly choosing one of the options), it is still far from the average results of adults on the same test. Nevertheless, this shows that MLLMs are capable of nonverbal reasoning by aligning perception with language models.

Experimental results published in the Kosmos-1 paper show that MLLMs benefit from performing cross-modal transfer, i.e., learning from one modality and transferring the knowledge to other modalities is more beneficial than using only one modality.

Microsoft published promising results for Kosmos-1 on the OCR-free language understanding task. In this task, the model reads and comprehends the meaning of words and sentences directly from the images. Microsoft also demonstrated that providing descriptions in the context improves the accuracy of zero-shot image classification tasks.

Examples of different Kosmos-1 tasks. The model can explain an image (1, 2) or answer questions based on an image (3, 4). Kosmos-1 can also extract information from text in an image (5) or answer math questions (6). The model is able to combine these capabilities to answer questions that require locating specific information in an image (7, 8) | Source
Chain-of-thoughts prompting with Kosmos-1. In the first stage, given an image, a prompt is used to guide the model in generating a rationale. The model is then fed the rationale and a task-aware prompt to produce the final results. | Source

DeepMind: Flamingo

Flamingo architecture overview. Visual data is processed through a pretrained, frozen image encoder to extract image embeddings. These embeddings are passed through a Perceiver Resampler, trained from scratch, which outputs a fixed number of embeddings. The fixed image embeddings and text tokens are fed into gated cross-attention dense blocks, inserted between the frozen LLM blocks and trained from scratch. The model produces free-form text as output. | Source

Flamingo, a vision language model (VLM) developed by DeepMind, can perform various multimodal tasks, including image captioning, visual dialogue, and visual question answering (VQA). Flamingo models take interleaved image data and text as input and generate free-form text.

Flamingo consists of pre-trained vision and language models connected by a “Perceiver Resampler.” The Perceiver Resampler takes as input a variable number of image or video features from the pre-trained vision encoder and returns a fixed number of visual outputs. A pre-trained and frozen Normalizer-Free ResNet (NFNet) is used as the vision encoder, and a frozen Chinchilla is used as the language model. Gated cross-attention dense blocks (GATED XATTN-DENSE) are inserted between the frozen LLM blocks and trained from scratch. The largest Flamingo model has 80B parameters and is trained on three types of datasets scraped from the web: interleaved image-text data, image-text pairs, and video-text pairs.
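
The minimal PyTorch sketch below illustrates the Perceiver Resampler idea: a fixed set of learned latent queries cross-attends to a variable number of visual features and returns a fixed number of visual tokens. The dimensions and the single attention block are illustrative assumptions and omit details of DeepMind's actual implementation:

```python
# Sketch of a Perceiver-Resampler-style module: learned latent queries
# cross-attend to a variable number of visual features and return a fixed
# number of visual tokens. Sizes are illustrative, not Flamingo's.
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))   # learned queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, visual_features):            # (batch, n_features, dim), n_features varies
        batch = visual_features.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(queries, visual_features, visual_features)
        return attended + self.ff(attended)        # (batch, num_latents, dim), fixed length

resampler = PerceiverResamplerSketch()
out = resampler(torch.randn(2, 197, 1024))         # any number of input features
print(out.shape)                                    # torch.Size([2, 64, 1024])
```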

Experimental results on 16 multimodal image/video and language tasks show that the Flamingo 80B model is more effective than models fine-tuned for specific tasks. However, as Flamingo focuses more on open-ended tasks, its performance on classification tasks is not as good as that of contrastive models like BASIC, CLIP, and ALIGN.

Some limitations that Flamingo inherits from the pre-trained LLM include hallucinations, poor sample efficiency during training, and poor generalization to sequences longer than those used during training. Other limitations that many VLMs struggle with are outputting offensive language, toxicity, propagating social biases and stereotypes, and leaking private information. One way to mitigate these issues is to filter such content out of the training data and exclude it during evaluation.

LLaVA

The Large Language and Vision Assistant (LLaVA) is an end-to-end trained multimodal LLM that integrates the CLIP ViT-L/14 vision encoder and the Vicuna (a chat model created by fine-tuning Llama 2) for general-purpose visual and language understanding.

Given an input image, the pre-trained CLIP ViT-L/14 vision encoder extracts the vision features, which are transformed into the word embedding space using a simple linear layer. Vicuna was chosen as the LLM because, at the time, it offered the best instruction-following capabilities among open-source language models.

Overview of LLaVA architecture. The pretrained CLIP ViT-L/14 vision encoder extracts visual features from input images Xv, which are then mapped into the word embedding space using a linear projection W. | Source

LLaVA is trained using a two-stage instruction-tuning process. In the first pre-training stage for feature alignment, both the vision encoder and LLM weights are frozen, and the projection matrix is updated to align image features with the pre-trained LLM word embedding. In the second stage, end-to-end fine-tuning is performed to optimize the model for multimodal chatbot interactions and reasoning within the science domain.
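
A hedged sketch of this first, feature-alignment stage is shown below: the vision encoder and the LLM are frozen, and only the projection matrix is trainable. The small GPT-2 stand-in for Vicuna and the single linear layer are assumptions made to keep the example self-contained; the real training code lives in the LLaVA GitHub repository:

```python
# Sketch of stage-1 feature alignment: freeze the vision encoder and the LLM,
# train only the linear projection W. GPT-2 is a small stand-in for Vicuna.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
llm = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for Vicuna

for p in vision_encoder.parameters():
    p.requires_grad = False                          # vision encoder frozen
for p in llm.parameters():
    p.requires_grad = False                          # LLM frozen

# Trainable projection W: vision feature space -> LLM word-embedding space
projection = nn.Linear(
    vision_encoder.config.hidden_size,               # 1024 for CLIP ViT-L/14
    llm.get_input_embeddings().embedding_dim,
)
optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-3)  # only W is updated
```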

Experimental results show that LLaVA 7B has better instruction-following capabilities than GPT-4 and Flamingo 80B despite having fewer parameters. LLaVA can follow user instructions and give a more comprehensive answer than GPT-4. LLaVA also outperforms GPT-4 on the ScienceQA dataset, which contains multimodal multiple-choice questions from the natural, social, and language sciences.

LLaVA has some limitations, including its perception of images as a “bag of patches,” failing to grasp the complex semantics within them. Like Flamingo, it inherits biases from both the vision and language encoders and is prone to hallucinations and misinformation. Unlike Flamingo, LLaVA cannot handle multiple images, as such instructions are missing from its training data.

This example shows LLaVA’s capabilities of visual reasoning and chat. LLaVA accurately follows the user’s instructions instead of simply describing the scene and offers a comprehensive response. Even when merely asked to describe the image, LLaVA identifies atypical aspects of the image. | Source

Google: PaLM-E

Google developed an embodied language model, PaLM-E, to incorporate continuous sensor modalities into language models and establish the link between words and perceptions.

PaLM-E is a general-purpose MLLM for embodied reasoning, visual language, and language tasks. PaLM-E uses multimodal sentences, where inputs from different modalities (i.e., images in blue, state estimate of a robot in green) are inserted alongside text tokens (in orange) as input to an LLM and are trained end-to-end. PaLM-E can perform different tasks like robotic planning, visual question answering (VQA), and image captioning. | Source

Architecture and training

PaLM-E is a decoder-only LLM that auto-regressively generates text using a multimodal prompt consisting of text, tokenized image embeddings, and state estimates representing quantities like a robot’s position, orientation, and velocity.

PaLM-E combines PaLM, a decoder-only LLM with 540 billion parameters, and the ViT vision transformer by projecting the latter’s image representations into the former’s input token space. The same approach, which relies on a learned transformation function, is used for projecting state estimates.
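
The snippet below is an illustrative sketch of how a continuous state estimate (position, orientation, velocity) could be mapped into an LLM's token-embedding space through a learned transformation and interleaved with text tokens to form a "multimodal sentence." All dimensions, the MLP projector, and the number of soft tokens per state are assumptions, not PaLM-E's actual design:

```python
# Illustrative sketch: project a continuous robot state estimate into an LLM's
# token-embedding space and interleave it with text-token embeddings.
import torch
import torch.nn as nn

d_model = 4096                      # assumed LLM embedding size
state_dim = 9                       # e.g., 3D position + 3D orientation + 3D velocity
tokens_per_state = 4                # how many "soft tokens" one state estimate becomes

state_projector = nn.Sequential(    # learned transformation
    nn.Linear(state_dim, d_model),
    nn.GELU(),
    nn.Linear(d_model, tokens_per_state * d_model),
)

state = torch.randn(1, state_dim)                              # one state estimate
state_tokens = state_projector(state).view(1, tokens_per_state, d_model)

# The soft tokens are interleaved with ordinary text-token embeddings to form a
# "multimodal sentence" that the decoder-only LLM consumes.
text_tokens = torch.randn(1, 12, d_model)                      # placeholder text embeddings
multimodal_sentence = torch.cat([text_tokens[:, :6], state_tokens, text_tokens[:, 6:]], dim=1)
print(multimodal_sentence.shape)   # torch.Size([1, 16, 4096])
```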

Performance

Experimental results show that PaLM-E outperforms other baselines like SayCan and PaLI in different robotic domains and tasks. This shows that combining pre-trained PaLM and ViT with the full mixture of robotics and general visual-language data increases performance compared to training individual models on individual tasks. Moreover, PaLM-E outperforms Flamingo in VQA tasks and PaLM in language tasks.

PaLM-E 562B has many capabilities, including zero-shot multimodal chain-of-thought (CoT) reasoning, multi-image reasoning, OCR-free math reasoning, image captioning, VQA, and few-shot prompting.

Challenges, limitations, and future directions of MLLMs

Expanding LLMs to other modalities comes with challenges regarding data quality, interpretation, safety, and generalization. In a survey paper, Paul Liang et al. proposed a new taxonomy to characterize the challenges and limitations of large multimodal language models:

  1. Representation: How can one represent different modalities in a meaningful and comprehensive manner?

    Fusion, i.e., integrating two or more modalities and reducing the number of separate representations, is a closely related challenge. Fusion can happen after unimodal encoders capture unique representations of different modalities or directly using raw modalities, which is more challenging as data is heterogeneous.

    Representation coordination aims to organize different modalities in a shared coordinate space, such as a Euclidean space. The objective is to position similar modalities close together and modalities that are not equivalent far apart. For instance, the goal is that the representation of the text “a bike” and an image of a bike are placed close together in cosine distance but far away from an image of a cat.

    Human cognition offers valuable insights into developing and further improving multimodal models. Understanding how the brain processes different modalities and combining them can be a promising direction for proposing new approaches to multimodal learning and enabling more effective analysis of complex data.

  2. Alignment: Another challenge is identifying cross-modal connections and interactions between elements of different modalities. For instance, how can we align gestures with speech when a person is talking? Or how can we align an image with a description?

    When the elements of multiple modalities are discrete (i.e., there is a clear segmentation between elements, like words in a text) and supervised data exists, contrastive learning is used. It matches the representations of the same concepts expressed in different modalities (e.g., the word “car” with an image of a car).

    If the ground truth is unavailable, the alignment is done across all the elements of the modalities to learn the necessary connections and matchings between them. For example, aligning video clips with text descriptions when there are no ground-truth labels linking descriptions with video clips requires comparing each video embedding with each text embedding. A similarity score (e.g., cosine similarity) is calculated for all pairs and used to align the modalities (see the sketch after this list).

    Alignment is more challenging when elements of a modality are continuous (like time-series data) or data does not contain clear semantic boundaries (e.g., MRI images). Clustering can be used to group continuous data based on semantic similarity to achieve modality alignment.

    Further, current multimodal models struggle with long-range sequences and cannot learn interactions over long periods. For instance, aligning the text “After 25 minutes in the oven, the cupcakes are golden brown” with the correct scene in a video requires understanding that “25 minutes in the oven” corresponds to a specific scene later in the video. Capturing and aligning long-term interactions that happen very far in time and space is challenging and complex, but it is an important and promising future direction that needs to be explored.

  3. Reasoning: Reasoning is a complex process that involves drawing conclusions from knowledge through multiple logical steps and observations.

    One reasoning-related challenge in MLLMs is structure modeling, which involves learning and representing the relationships over which reasoning happens. Understanding hierarchical relationships where smaller components (atoms) are combined to create larger ones (molecules) is essential for complex reasoning. 

    Another challenge is encoding or representing multimodal concepts during reasoning so that they are interpretable and effective using attention mechanisms, language, or symbols. It is very important to understand how to go from low-level representations (e.g., pixels of an image or words) to high-level concepts (e.g., “What color is the jacket?”) while still being interpretable by humans.

    Understanding the reasoning process of the trained models and how they combine elements from different modalities (i.e., text, vision, audio) is very important for their transparency, reliability, and performance. This will help to discover potential biases and limitations in the reasoning process of MLLMs, enabling the development of robust models to overcome these challenges.

  4. Generation: Research is ongoing on generating meaningful outputs that reflect cross-modal interaction and are structured and coherent.

    Generative models focus on generating raw modalities (text, images, or videos) and capturing the relationships and interactions between different modalities. For instance, guided text summarization uses input modalities such as images, video, or audio to compress the data and summarize the most relevant and important information from the original content.

    Multimodal translation maps one modality to another while respecting semantic connections and information content. Generating novel high-dimensional data conditioned on initial inputs is extremely challenging. It has to preserve semantics, be meaningful and coherent, and capture many possible generations (different styles, colors, and shapes of the same scene).

    One of the main challenges of multimodal generation is the difficulty of evaluating the generated content, primarily when ethical issues (e.g., generating deepfakes, hate speech, and fake news) are involved. Evaluation through user studies is time-consuming, costly, and potentially biased.

    An insightful direction for future work is to study whether the risk of the above ethical issues is reduced or increased when using a multimodal dataset and whether there are ethical issues specific to multimodal generation. Multimodal datasets may reduce ethical issues as they are more diverse and contextually complete and may improve model fairness. On the other hand, the biases from one modality can interact with and amplify biases in other modalities, leading to complex ethical issues (e.g., combining video with text metadata may reveal sensitive information).

  5. Transference: In multimodal modeling, transference refers to transferring knowledge from a secondary modality to a primary modality whose resources are limited (e.g., lack of annotated data, unreliable labels, noisy inputs). By leveraging information from the secondary modality, the primary modality can enhance performance and learn new capabilities that would not be possible without the shared information.

    In cross-modal transfer settings, large-scale pre-trained models are fine-tuned for specific downstream tasks with a focus on the primary modality, for example, fine-tuning pre-trained frozen large language models for image captioning. Multimodal co-learning, on the other hand, aims to transfer the learned information by sharing intermediate spaces between modalities. In this case, a single joint model is used across all modalities, for instance, training on both image and text modalities and then using the model for image classification. In contrast, model induction, exemplified by co-training, promotes independent training of models and only exchanges their predictions (outputs) to enable information transfer while maintaining separation.

Learning from many modalities increases data heterogeneity and complexity challenges during data processing. Dealing with modalities that aren’t all present simultaneously is a direction that needs further exploration to enhance multimodal models’ performance.

  6. Quantification: Quantification aims to better understand and improve multimodal models’ reliability, interpretability, and robustness. Understanding the dimensions of heterogeneity and their effect on multimodal learning and modeling is very important. Exploring the interactions and connections between modalities enhances the understanding of how trained models combine them. Improving how multimodal models are trained and optimized is crucial to achieving better generalization, usability, and efficiency.

    Having formal guidelines and theories for evaluating which modalities are beneficial or harmful (e.g., under adversarial attacks) is a critical challenge. Understanding which modalities to select and how to compare them systematically is very important for improving multimodal models. Furthermore, it is essential to interpret and explain the complex relationships and patterns of multimodal models before employing them in real-world applications. For instance, recognizing social biases in the data (text or images) is key to ensuring fairness while guaranteeing the robustness of the model against noisy or out-of-distribution modalities. These unresolved core challenges require thorough analysis to ensure that multimodal models can be reliably applied across different domains.
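
To make the alignment point above concrete, here is a minimal sketch of similarity-based alignment without ground-truth pairs: every video-clip embedding is compared against every text embedding using cosine similarity, and each clip is matched to its closest description. The shapes and random embeddings are placeholders:

```python
# Minimal sketch of similarity-based alignment without ground-truth pairs:
# compute a cosine-similarity matrix between clip and text embeddings, then
# align each clip with its best-matching description.
import torch
import torch.nn.functional as F

video_embeddings = torch.randn(5, 512)     # 5 video clips, 512-d embeddings
text_embeddings = torch.randn(7, 512)      # 7 candidate text descriptions

video_norm = F.normalize(video_embeddings, dim=-1)
text_norm = F.normalize(text_embeddings, dim=-1)

similarity = video_norm @ text_norm.T      # (5, 7) cosine-similarity matrix
best_match = similarity.argmax(dim=-1)     # index of the closest description per clip
print(best_match)
```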

As this extensive list of open research questions and practical challenges shows, multimodal LLMs are still in their early stages. The LLaVA GitHub repository and the unit on multi-modal models in the Hugging Face Community Computer Vision Course are excellent resources to dive deeper and get hands-on experience training and fine-tuning MLLMs.

HealthTech Database Exposed 108GB Medical and Employment Records

A misconfigured database exposed 108.8 GB of sensitive data, including information on over 86,000 healthcare workers affiliated with ESHYFT, a New Jersey-based HealthTech company operating across 29 states. ESHYFT provides a mobile platform that connects healthcare facilities with qualified nursing professionals.

The exposed database was not password-protected or encrypted and contained a treasure trove of personally identifiable information (PII), including SSNs, scans of identification documents, salary details, work history, and more.

The database was discovered by cybersecurity researcher Jeremiah Fowler, who shared their report with Hackread.com, revealing that the exposed data included profile images, facial images, professional certificates, work assignment agreements, CVs, and resumes.

Additionally, one spreadsheet document contained over 800,000 entries detailing nurses’ internal IDs, facility names, time and date of shifts, hours worked, and more. What’s worse, medical documents, including medical reports containing information on diagnoses, prescriptions, or treatments, were also exposed.

The exposure of such sensitive data could potentially fall under HIPAA regulations. It can also expose vulnerable users to online and physical risks, including identity theft, employment fraud, financial fraud, and targeted phishing campaigns.

The good news is that Fowler immediately notified ESHYFT. The bad news is that it took the company over a month after being alerted to restrict public access to the database. However, according to Fowler, the exposed database was not owned or directly managed by ESHYFT.

It remains unclear whether a third-party contractor was responsible for its management. Additionally, the duration of the exposure and whether unauthorized parties accessed the data are unknown.

Nevertheless, cybercriminals could use the exposed data to commit crimes in the victims’ names or deceive them into revealing additional personal or financial information. Therefore, HealthTech companies must implement proper cybersecurity measures, including:

  • Implement mandatory encryption protocols for sensitive data.
  • Use multi-factor authentication to prevent unauthorized access.
  • Conduct regular security audits to identify potential vulnerabilities.
  • Segregate sensitive data and assign expiration dates for data that is no longer in use.
  • Have a data breach response plan in place and a dedicated communication channel for reporting potential security incidents.
  • Provide timely responsible disclosure notices to affected individuals and educate them on how to recognize phishing attempts.

