mirror of
https://gitee.com/infiniflow/ragflow.git
synced 2025-12-06 07:19:03 +08:00
Fix: Correct pagination and early termination bugs in chunk_list() (#11692)
## Summary
This PR fixes two critical bugs in `chunk_list()` method that prevent
processing large documents (>128 chunks) in GraphRAG and
other workflows.
## Bugs Fixed
### Bug 1: Incorrect pagination offset calculation
**Location:** `rag/nlp/search.py` lines 530-531
**Problem:** The loop variable `p` was used directly as offset, causing
incorrect pagination:
```python
# BEFORE (BUGGY):
for p in range(offset, max_count, bs): # p = 0, 128, 256, 384...
es_res = self.dataStore.search(..., p, bs, ...) # p used as offset
Fix: Use page number multiplied by batch size:
# AFTER (FIXED):
for page_num, p in enumerate(range(offset, max_count, bs)):
es_res = self.dataStore.search(..., page_num * bs, bs, ...)
Bug 2: Premature loop termination
Location: rag/nlp/search.py lines 538-539
Problem: Loop terminates when any page returns fewer than 128 chunks,
even when thousands more remain:
# BEFORE (BUGGY):
if len(dict_chunks.values()) < bs: # Breaks at 126 chunks even if 3,000+
remain
break
Fix: Only terminate when zero chunks returned:
# AFTER (FIXED):
if len(dict_chunks.values()) == 0:
break
Enhancement: Add max_count parameter to GraphRAG
Location: graphrag/general/index.py line 60
Added max_count=10000 parameter to chunk loading for both LightRAG and
General GraphRAG paths to ensure all chunks are
processed.
Testing
Validated with a 314-page legal document containing 3,207 chunks:
Before fixes:
- Only 2-126 chunks processed
- GraphRAG generated 25 nodes, 8 edges
After fixes:
- All 3,209 chunks processed ✅
- GraphRAG processing complete dataset
Impact
These bugs affect any workflow using chunk_list() with large documents,
particularly:
- GraphRAG knowledge graph generation
- RAPTOR hierarchical summarization
- Document processing pipelines with >128 chunks
Related Issue
Fixes #11687
Checklist
- Code follows project style guidelines
- Tested with large documents (3,207+ chunks)
- Both bugs validated by Dosu bot in issue #11687
- No breaking changes to API
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
This commit is contained in:
committed by
GitHub
parent
a3c9402218
commit
3c224c817b
@@ -57,7 +57,7 @@ async def run_graphrag(
|
||||
start = trio.current_time()
|
||||
tenant_id, kb_id, doc_id = row["tenant_id"], str(row["kb_id"]), row["doc_id"]
|
||||
chunks = []
|
||||
for d in settings.retriever.chunk_list(doc_id, tenant_id, [kb_id], fields=["content_with_weight", "doc_id"], sort_by_position=True):
|
||||
for d in settings.retriever.chunk_list(doc_id, tenant_id, [kb_id], max_count=10000, fields=["content_with_weight", "doc_id"], sort_by_position=True):
|
||||
chunks.append(d["content_with_weight"])
|
||||
|
||||
with trio.fail_after(max(120, len(chunks) * 60 * 10) if enable_timeout_assertion else 10000000000):
|
||||
@@ -174,13 +174,19 @@ async def run_graphrag_for_kb(
|
||||
chunks = []
|
||||
current_chunk = ""
|
||||
|
||||
for d in settings.retriever.chunk_list(
|
||||
# DEBUG: Obtener todos los chunks primero
|
||||
raw_chunks = list(settings.retriever.chunk_list(
|
||||
doc_id,
|
||||
tenant_id,
|
||||
[kb_id],
|
||||
max_count=10000, # FIX: Aumentar límite para procesar todos los chunks
|
||||
fields=fields_for_chunks,
|
||||
sort_by_position=True,
|
||||
):
|
||||
))
|
||||
|
||||
callback(msg=f"[DEBUG] chunk_list() returned {len(raw_chunks)} raw chunks for doc {doc_id}")
|
||||
|
||||
for d in raw_chunks:
|
||||
content = d["content_with_weight"]
|
||||
if num_tokens_from_string(current_chunk + content) < 1024:
|
||||
current_chunk += content
|
||||
|
||||
@@ -537,7 +537,8 @@ class Dealer:
|
||||
doc["id"] = id
|
||||
if dict_chunks:
|
||||
res.extend(dict_chunks.values())
|
||||
if len(dict_chunks.values()) < bs:
|
||||
# FIX: Solo terminar si no hay chunks, no si hay menos de bs
|
||||
if len(dict_chunks.values()) == 0:
|
||||
break
|
||||
return res
|
||||
|
||||
|
||||
Reference in New Issue
Block a user