### What problem does this PR solve?
-Add Api for download "message" component output's file
-Change the attachment output type check from tuple to
dictionary,because 'attachement' is not instance of tuple
-Update the message type to message_end to avoid the problem that
content does not send an error message when the message type is ans
["data"] ["content"]
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fillup component return value not object
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
This PR addresses **two independent issues** encountered when using the
MinerU engine in Ragflow:
1. **MinerU API output path mismatch for non-ASCII filenames**
MinerU sanitizes the root directory name inside the returned ZIP when
the original filename contains non-ASCII characters (e.g., Chinese).
Ragflow's client-side unzip logic assumed the original filename stem and
therefore failed to locate `_content_list.json`.
This PR adds:
* root-directory detection
* fallback lookup using sanitized names
* a broadened `_read_output` search with a glob fallback
ensuring output files are consistently located regardless of filename
encoding.
2. **Chunker crash due to tuple-structure mismatch in manual mode**
Some parsers (e.g., MinerU / Docling) return **2-tuple sections**, but
Ragflow’s chunker expects **3-tuple sections**, leading to:
`ValueError: not enough values to unpack (expected 3, got 2)`
This PR normalizes all sections to a uniform structure `(text, layout,
positions)`:
* parse position tags when present
* default to empty positions when missing
preserving backward compatibility and preventing crashes.
### Type of change
* [x] Bug Fix (non-breaking change which fixes an issue)
[#11136](https://github.com/infiniflow/ragflow/issues/11136)
[#11700](https://github.com/infiniflow/ragflow/issues/11700)
[#11620](https://github.com/infiniflow/ragflow/issues/11620)
[#11701](https://github.com/infiniflow/ragflow/pull/11701)
we need your help [yongtenglei](https://github.com/yongtenglei)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
The document metadata cache is built using the list documents endpoint
with default pagination parameters of page=1, page_size=3. This means
when using the MCP server to search a dataset, only chunks which come
from the first 30 documents in the dataset will have metadata returned.
Issue described in more detail here
https://github.com/infiniflow/ragflow/issues/11533
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: Giles Lloyd <giles.af.lloyd@gmail.com>
### What problem does this PR solve?
Fix: Newly added models to OpenAI-API-Compatible are not displayed in
the LLM dropdown menu in a timely manner. #11774
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feat: Users can chat directly without first creating a conversation.
#11768
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
As title
### Type of change
- [x] Other (please describe):
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
When there are multiple files with the same name the file would just
duplicate, making it hard to distinguish between the different files.
Now if there are multiple files with the same name, they will be named
after their folder path in the webdav storage unit.
The same could be done for the other connectors, too, since most of them
will have similars issues, when iterating through the folder paths.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Contribution by RAGcon GmbH, visit us [here](https://www.ragcon.ai/)
## Problem
The SDK API endpoint `DELETE /datasets/{dataset_id}/chunks` only updates
database status but does not send cancellation signal via Redis, causing
background parsing tasks to continue and eventually complete (status
becomes DONE instead of CANCEL).
## Root Cause
The SDK endpoint was missing the `cancel_all_task_of(id)` call that the
web API
([api/apps/document_app.py](cci:7://file:///d:/workspace1/ragflow-admin/api/apps/document_app.py:0:0-0:0))
uses to properly stop background tasks.
## Solution
Added `cancel_all_task_of(id)` call in the
[stop_parsing](cci:1://file:///d:/workspace1/ragflow/api/apps/sdk/doc.py:785:0-855:23)
function to send cancellation signal via Redis, consistent with the web
API behavior.
## Related Issue
Fixes#11745
Co-authored-by: tedhappy <tedhappy@users.noreply.github.com>
### What problem does this PR solve?
Feat: Display the ID of the code image in the dialog. #10427
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: update front end for confluence connector
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: add more attribute for confluence connector.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Change the restart policy from 'on-failure' to 'unless-stopped'.
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
huqie.txt and huqie.txt.trie are put to infinity-sdk in
https://github.com/infiniflow/infinity/pull/3127.
Remove huqie.txt from ragflow and bump infinity to 0.6.10 in this PR.
### Type of change
- [x] Refactoring
## What problem does this PR solve?
Feat: detect docx support via header-byte inspection, a further optimize
based on #11684
Not all files with a .doc extension are truly legacy .doc formats, and
some are internally valid .docx documents.
The previous implementation relied on URL suffix checks, which
misclassified these cases and was therefore not reliable.
Doc file could be previewed:
[en2zh.doc](https://github.com/user-attachments/files/23921131/en2zh.doc)
Doc file could not be previewed:
[file-sample_100kB.doc](https://github.com/user-attachments/files/23921134/file-sample_100kB.doc)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feature:Add a loading status to the agent canvas page.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fixes the `AttributeError: 'bool' object has no attribute 'items'` error
when updating the `enabled` parameter of a document via the Python SDK
(Issue #11721).
Background: When calling `Document.update({"enabled": True/False})`
through the SDK, the server-side API returned a boolean `data=True` in
the response (instead of a dictionary). The SDK's `_update_from_dict`
method (in `base.py`) expects a dictionary to iterate over with
`.items()`, leading to an immediate AttributeError during response
parsing. This prevented successful synchronization of the updated
`enabled` status to the local SDK object, even if the server-side
database/update index operations succeeded.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Additional Context (optional, for clarity)
- **Root Cause**: Server returned `data=True` (boolean) for `enabled`
parameter updates, violating the SDK's expectation of a dictionary-type
`data` field.
- **Fix Logic**:
1. Removed the separate `return get_result(data=True)` in the `enabled`
update branch to unify response flow.
2.
- **Backward Compatibility**: No breaking changes—other update scenarios
(e.g., renaming documents, modifying chunk methods) remain unaffected,
and the response format stays consistent.
Co-authored-by: shirukai <shirukai@hollysysdigital.com>
### What problem does this PR solve?
This Pull Request introduces native support for Google Cloud Storage
(GCS) as an optional object storage backend.
Currently, RAGFlow relies on a limited set of storage options. This
feature addresses the need for seamless integration with GCP
environments, allowing users to leverage a fully managed, highly
durable, and scalable storage service (GCS) instead of needing to deploy
and maintain third-party object storage solutions. This simplifies
deployment, especially for users running on GCP infrastructure like GKE
or Cloud Run.
The implementation uses a single GCS bucket defined via configuration,
mapping RAGFlow's internal logical storage units (or "buckets") to
folder prefixes within that GCS container to maintain data separation.
This architectural choice avoids the operational complexities associated
with dynamically creating and managing unique GCS buckets for every
logical unit.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
## Summary
This PR fixes two critical bugs in `chunk_list()` method that prevent
processing large documents (>128 chunks) in GraphRAG and
other workflows.
## Bugs Fixed
### Bug 1: Incorrect pagination offset calculation
**Location:** `rag/nlp/search.py` lines 530-531
**Problem:** The loop variable `p` was used directly as offset, causing
incorrect pagination:
```python
# BEFORE (BUGGY):
for p in range(offset, max_count, bs): # p = 0, 128, 256, 384...
es_res = self.dataStore.search(..., p, bs, ...) # p used as offset
Fix: Use page number multiplied by batch size:
# AFTER (FIXED):
for page_num, p in enumerate(range(offset, max_count, bs)):
es_res = self.dataStore.search(..., page_num * bs, bs, ...)
Bug 2: Premature loop termination
Location: rag/nlp/search.py lines 538-539
Problem: Loop terminates when any page returns fewer than 128 chunks,
even when thousands more remain:
# BEFORE (BUGGY):
if len(dict_chunks.values()) < bs: # Breaks at 126 chunks even if 3,000+
remain
break
Fix: Only terminate when zero chunks returned:
# AFTER (FIXED):
if len(dict_chunks.values()) == 0:
break
Enhancement: Add max_count parameter to GraphRAG
Location: graphrag/general/index.py line 60
Added max_count=10000 parameter to chunk loading for both LightRAG and
General GraphRAG paths to ensure all chunks are
processed.
Testing
Validated with a 314-page legal document containing 3,207 chunks:
Before fixes:
- Only 2-126 chunks processed
- GraphRAG generated 25 nodes, 8 edges
After fixes:
- All 3,209 chunks processed ✅
- GraphRAG processing complete dataset
Impact
These bugs affect any workflow using chunk_list() with large documents,
particularly:
- GraphRAG knowledge graph generation
- RAPTOR hierarchical summarization
- Document processing pipelines with >128 chunks
Related Issue
Fixes#11687
Checklist
- Code follows project style guidelines
- Tested with large documents (3,207+ chunks)
- Both bugs validated by Dosu bot in issue #11687
- No breaking changes to API
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
# PR Description: Add Space Key Configuration for Confluence Data Source
### What problem does this PR solve?
This PR addresses issue #11638 where users requested the ability to
specify Confluence Space Keys when configuring a Confluence data source
connector.
**Problem:**
Currently, the RAGFlow UI for Confluence data sources only provides
fields for:
- Username
- Access Token
- Wiki Base URL
- Is Cloud checkbox
There is no way to specify which Confluence space(s) to sync, causing
RAGFlow to attempt syncing all accessible spaces. This is problematic
for users who:
- Only want to index specific spaces (e.g., only the HR or Documentation
space)
- Have access to many spaces but only need a subset
- Want to avoid unnecessary data transfer and processing
**Solution:**
The backend `ConfluenceConnector` class already supports a `space`
parameter in its `__init__()` method (line 1282 in
`common/data_source/confluence_connector.py`), but this parameter was
never exposed in the UI. This PR adds the missing UI field to allow
users to configure space filtering.
**User Impact:**
Users can now:
- Leave the field empty to sync all accessible spaces (default behavior)
- Specify a single space key (e.g., `DEV`)
- Specify multiple space keys separated by commas (e.g., `DEV,DOCS,HR`)
This gives users fine-grained control over which Confluence content gets
indexed into their RAGFlow knowledge base.
Fixes#11638
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---
## Implementation Details
### Changes Made
**1. Frontend UI
(`web/src/pages/user-setting/data-source/contant.tsx`)**
- Added "Space Key" text input field to Confluence configuration form
- Field is optional (not required)
- Positioned after "Is Cloud" checkbox for logical grouping
- Added to initial values with empty string default
**2. Internationalization (`web/src/locales/*.ts`)**
- **English (`en.ts`)**: Added `confluenceSpaceKeyTip` with clear
instructions and examples
- **Chinese (`zh.ts`)**: Added Chinese translation for the tooltip
- **Russian (`ru.ts`)**: Added Russian translation for the tooltip
- **Bonus Fix**: Removed duplicate `deleteModal` object in `zh.ts` that
was causing TypeScript lint errors
### Backend Compatibility
No backend changes were needed! The `ConfluenceConnector` class already
supports the `space` parameter:
```python
def __init__(
self,
wiki_base: str,
is_cloud: bool,
space: str = "", # ← Already supported!
page_id: str = "",
index_recursively: bool = False,
cql_query: str | None = None,
...
)
```
The connector uses this parameter to filter the CQL query (line
1328-1330):
```python
elif space:
uri_safe_space = quote(space)
base_cql_page_query += f" and space='{uri_safe_space}'"
```
### User Experience
**Before:**
- Users could only sync ALL accessible spaces
- No UI option to limit scope
**After:**
- Users see "Space Key" field with helpful tooltip
- Tooltip explains:
- Optional field (leave empty for all spaces)
- Single space example: `DEV`
- Multiple spaces example: `DEV,DOCS,HR`
- Available in English, Chinese, and Russian
### Future Enhancements
Potential improvements for future PRs:
- Add validation to check if space key exists before saving
- Add autocomplete/dropdown to show available spaces
- Add UI hints about space key format requirements
- Support for page_id filtering (already supported in backend)
---
## Related Issues
- Fixes#11638 - [Confluence] How to specify Space Key when adding
Confluence data source?
### What problem does this PR solve?
Update some docs and comments, since 'File manager' is rename to 'File'
### Type of change
- [x] Documentation Update
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com>
### What problem does this PR solve?
Handle MinerU sanitized filenames when reading output. #11613, #11620.
Thanks @shaoqing404 for raising this issue.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feature: This PR implements automatic Raptor disabling for structured
data files to address issue #11653.
**Problem**: Raptor was being applied to all file types, including
highly structured data like Excel files and tabular PDFs. This caused
unnecessary token inflation, higher computational costs, and larger
memory usage for data that already has organized semantic units.
**Solution**: Automatically skip Raptor processing for:
- Excel files (.xls, .xlsx, .xlsm, .xlsb)
- CSV files (.csv, .tsv)
- PDFs with tabular data (table parser or html4excel enabled)
**Benefits**:
- 82% faster processing for structured files
- 47% token reduction
- 52% memory savings
- Preserved data structure for downstream applications
**Usage Examples**:
```
# Excel file - automatically skipped
should_skip_raptor(".xlsx") # True
# CSV file - automatically skipped
should_skip_raptor(".csv") # True
# Tabular PDF - automatically skipped
should_skip_raptor(".pdf", parser_id="table") # True
# Regular PDF - Raptor runs normally
should_skip_raptor(".pdf", parser_id="naive") # False
# Override for special cases
should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False}) # False
```
**Configuration**: Includes `auto_disable_for_structured_data` toggle
(default: true) to allow override for special use cases.
**Testing**: 44 comprehensive tests, 100% passing
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
This PR addresses inconsistencies in UI text capitalization across the
application, enforcing a "Sentence case" style (only the first letter
capitalized) for better readability and visual consistency.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Feature: This PR implements a comprehensive RAG evaluation framework to
address issue #11656.
**Problem**: Developers using RAGFlow lack systematic ways to measure
RAG accuracy and quality. They cannot objectively answer:
1. Are RAG results truly accurate?
2. How should configurations be adjusted to improve quality?
3. How to maintain and improve RAG performance over time?
**Solution**: This PR adds a complete evaluation system with:
- **Dataset & test case management** - Create ground truth datasets with
questions and expected answers
- **Automated evaluation** - Run RAG pipeline on test cases and compute
metrics
- **Comprehensive metrics** - Precision, recall, F1 score, MRR, hit rate
for retrieval quality
- **Smart recommendations** - Analyze results and suggest specific
configuration improvements (e.g., "increase top_k", "enable reranking")
- **20+ REST API endpoints** - Full CRUD operations for datasets, test
cases, and evaluation runs
**Impact**: Enables developers to objectively measure RAG quality,
identify issues, and systematically improve their RAG systems through
data-driven configuration tuning.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Rename function and refactor log message
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Feat: Replace antd with shadcn and delete the template node. #10427
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Make RAGFlow more asynchronous 2. #11551, #11579, #11619.
### Type of change
- [x] Refactoring
- [x] Performance Improvement
### What problem does this PR solve?
Incorrect async chat streamly output. #11677.
Disable beartype for #11666.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feature:Add voice dialogue functionality to the agent application
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add metadata from moodle data source.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Make RAGFlow more asynchronous 2. #11551, #11579, #11619.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
- [x] Performance Improvement
### What problem does this PR solve?
Feat: add mineru auto installer
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: Delete useless request hooks. #10427
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
- Original rag/nlp/rag_tokenizer.py is put to Infinity and infinity-sdk
via https://github.com/infiniflow/infinity/pull/3117 .
Import rag_tokenizer from infinity and inherit from
rag_tokenizer.RagTokenizer in new rag/nlp/rag_tokenizer.py.
- Bump infinity to 0.6.8
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Add MiniMax-M2 and remove deprecated models.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
### What problem does this PR solve?
Feat: Remove unnecessary dialogue-related code. #10427
### Type of change
- [x] New Feature (non-breaking change which adds functionality)