Because Lance is built on top of Apache Arrow, LanceDB integrates tightly with the Python data ecosystem, including Pandas and PyArrow. The sequence of steps in a typical Pandas workflow is shown below.
Create dataset
Let’s first import LanceDB:

```python
import lancedb
```

Next, we’ll import Pandas:

```python
import pandas as pd
```

Sync API
We’ll first connect to LanceDB.
```python
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
```

We can create a LanceDB table directly from a Pandas DataFrame by passing it as the `data` parameter:
```python
data = pd.DataFrame(
    {
        "vector": [[3.1, 4.1], [5.9, 26.5]],
        "item": ["foo", "bar"],
        "price": [10.0, 20.0],
    }
)
table = db.create_table("pd_table", data=data)
```

Similar to the `pyarrow.dataset.write_dataset()` method, LanceDB’s `db.create_table()` accepts data in a variety of forms, including PyArrow tables and datasets.
Async API
Connect to LanceDB:
```python
uri = "data/sample-lancedb"
async_db = await lancedb.connect_async(uri)
```

We can create a LanceDB table directly from a Pandas DataFrame by passing it as the `data` parameter:
```python
data = pd.DataFrame(
    {
        "vector": [[3.1, 4.1], [5.9, 26.5]],
        "item": ["foo", "bar"],
        "price": [10.0, 20.0],
    }
)
await async_db.create_table("pd_table_async", data=data)
```

Larger-than-memory data
If you have a dataset that is larger than memory, you can create a table from an `Iterable[pyarrow.RecordBatch]` that loads the data lazily:
```python
from typing import Iterable

import pyarrow as pa


def make_batches() -> Iterable[pa.RecordBatch]:
    for _ in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1], [5.9, 26.5]]),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )
```

You can then pass the `make_batches()` function to the `data` parameter, while specifying the PyArrow schema in `create_table()`.
Sync API
```python
schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32())),
        pa.field("item", pa.utf8()),
        pa.field("price", pa.float32()),
    ]
)
table = db.create_table("iterable_table", data=make_batches(), schema=schema)
```

Async API
```python
schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32())),
        pa.field("item", pa.utf8()),
        pa.field("price", pa.float32()),
    ]
)
await async_db.create_table(
    "iterable_table_async", data=make_batches(), schema=schema
)
```

You will find detailed instructions for creating a LanceDB dataset in the Getting Started and API sections.
Vector search
We can now perform similarity search via the LanceDB Python API.
Sync API
```python
# Open the table previously created.
table = db.open_table("pd_table")

query_vector = [100, 100]
# Return the nearest neighbor as a Pandas DataFrame
df = table.search(query_vector).limit(1).to_pandas()
print(df)
```

Async API
```python
# Open the table previously created.
async_tbl = await async_db.open_table("pd_table_async")

query_vector = [100, 100]
# Return the nearest neighbor as a Pandas DataFrame
df = await (await async_tbl.search(query_vector)).limit(1).to_pandas()
print(df)
```

This returns a Pandas DataFrame as follows:

```
        vector item  price    _distance
0  [5.9, 26.5]  bar   20.0  14257.05957
```

If you have a simple filter, it’s faster to provide a `where` clause to LanceDB’s `search` method.
For more complex filters or aggregations, you can always resort to using the underlying DataFrame methods after performing a search.
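As an aside, the `_distance` column above is the squared L2 distance between the query and the stored vector (assuming LanceDB's default L2 metric), which can be verified by hand up to float32 rounding:

```python
# Recompute the _distance reported for the row [5.9, 26.5]
query = [100.0, 100.0]
row = [5.9, 26.5]

# Squared Euclidean (L2) distance: sum of squared per-dimension differences
dist = sum((q - r) ** 2 for q, r in zip(query, row))
print(dist)  # approximately 14257.06 (the table stores float32, hence 14257.05957)
```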
Sync API
```python
# Apply the filter via LanceDB
results = table.search([100, 100]).where("price < 15").to_pandas()
assert len(results) == 1
assert results["item"].iloc[0] == "foo"

# Apply the filter via Pandas after the search
df = table.search([100, 100]).to_pandas()
results = df[df.price < 15]
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
```

Async API
```python
# Apply the filter via LanceDB
results = await (await async_tbl.search([100, 100])).where("price < 15").to_pandas()
assert len(results) == 1
assert results["item"].iloc[0] == "foo"

# Apply the filter via Pandas after the search
df = await (await async_tbl.search([100, 100])).to_pandas()
results = df[df.price < 15]
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
```
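To illustrate the point about aggregations, here is a sketch using ordinary Pandas on a small DataFrame standing in for search output (the values are illustrative, not real results):

```python
import pandas as pd

# Illustrative stand-in for a search-result DataFrame
df = pd.DataFrame(
    {
        "item": ["foo", "bar", "foo"],
        "price": [10.0, 20.0, 12.0],
        "_distance": [1.2, 3.4, 2.0],
    }
)

# Complex post-filtering and aggregation via plain Pandas
cheap = df[df.price < 15]
mean_price = cheap.groupby("item")["price"].mean()
print(mean_price.loc["foo"])  # 11.0
```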