The Apple Intelligence Foundation Models represent a significant advancement in large-scale language model deployment, introducing novel architectural paradigms for efficient on-device and server-based inference. This technical review examines the mathematical foundations, architectural innovations, and empirical validation of these models.
**Figure 2:** System architecture of Apple Intelligence, illustrating the comprehensive integration of foundation models across the computational stack. The architecture implements:
1. **Hierarchical Deployment Strategy**
- Systemwide experiences through high-level APIs
- App-specific integrations via semantic indexing and intent tooling
- Specialized on-device and server model deployments
2. **Multi-tier Compute Infrastructure**
- Hardware acceleration via Apple Neural Engine
- Secure computation through dedicated enclaves
- Private Cloud Compute OS for distributed processing
3. **Adapter-based Specialization**
- Dynamic task adaptation through lightweight parameter updates
- Efficient model sharing across diverse applications
- Memory-optimized inference paths
This architectural design enables state-of-the-art performance while maintaining strict privacy guarantees and computational efficiency.
## Abstract
This technical review examines the mathematical foundations and architectural innovations presented in Apple's recent work on foundation language models. The paper introduces two significant models: AFM-on-device (∼3B parameters) and AFM-server, designed for efficient on-device execution and Private Cloud Compute respectively. We analyze the theoretical frameworks, optimization techniques, and practical implementations that enable these models to achieve state-of-the-art performance while maintaining efficiency.
## Introduction
The Apple Intelligence Foundation Language Models represent a significant advancement in efficient, responsible, and capable language models designed to power Apple Intelligence features across iOS 18, iPadOS 18, and macOS Sequoia. The development of these models is guided by four core principles:
1. **Empowering Users**
- Responsible AI tool creation for specific needs
- User autonomy in tool utilization
- Task-specific optimization
2. **Authentic Representation**
- Global user representation
- Bias mitigation strategies
- Stereotype avoidance mechanisms
3. **Careful Design**
- Multi-stage safety precautions
- Continuous quality evaluation
- Proactive improvement cycles
4. **Privacy Protection**
- On-device processing capabilities
- Private Cloud Compute infrastructure
- Zero personal data utilization
These principles are implemented through two primary models:
- AFM-on-device: ∼3B parameters optimized for edge deployment
- AFM-server: Larger model designed for Private Cloud Compute
**Figure 1:** Modeling overview for the Apple foundation models. The pipeline illustrates the end-to-end development process, from data preparation through model deployment. Each stage is guided by Responsible AI principles, culminating in specialized adapters that enable task-specific optimization while maintaining model efficiency and safety. The modular architecture facilitates dynamic adaptation across diverse tasks while preserving core model capabilities.
## 1. Architectural Foundations
### 1.1 Core Architecture
The foundation models build upon the Transformer architecture with several key mathematical optimizations. The core architecture implements a dense decoder-only design with the following theoretical improvements:
- RoPE positional embeddings with the base frequency set to 500k for long-context support.
**Table 1:** AFM-on-device architectural dimensions. The model employs a balanced configuration optimized for mobile deployment, with a model dimension of 3072 and an efficient attention mechanism using 24 query heads mapped to 8 key/value heads. This architecture achieves a favorable trade-off between computational efficiency and model capacity, requiring only 2.73B total parameters (2.58B non-embedding + 0.15B embedding parameters).
1. **Shared Input/Output Embeddings**
The models utilize a shared embedding matrix for both input and output embeddings, following Press & Wolf (2016). This optimization reduces the parameter count by eliminating redundant embedding matrices. Mathematically, this can be expressed as:
$
\mathbf{E}_{\text{input}} = \mathbf{E}_{\text{output}} = \mathbf{E} \in \mathbb{R}^{|V| \times d}
$
where $|V|$ is the vocabulary size and $d$ is the embedding dimension.
2. **Pre-Normalization with RMSNorm**
The architecture employs RMSNorm for improved training stability. Given an input vector $\mathbf{x}$, RMSNorm computes:
$
\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}} \cdot \gamma
$
where $\gamma$ represents learnable scale parameters.
3. **Query/Key Normalization**
To enhance training stability, the attention mechanism implements normalized query and key vectors:
$
\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\text{normalize}(\mathbf{Q})\text{normalize}(\mathbf{K})^T}{\sqrt{d_k}}\right)\mathbf{V}
$
4. **Grouped-Query Attention (GQA)**
The models implement GQA with 8 key-value heads to optimize memory usage. For a given layer with $h_q$ query heads and $h_{kv}$ key-value heads:
$
\text{GQA}(\mathbf{X}) = \text{Concat}(\text{head}_1, ..., \text{head}_{h_q})\mathbf{W}^O
$
where each head computes:
$
\text{head}_i = \text{Attention}(\mathbf{XW}_i^Q, \mathbf{XW}_{g(i)}^K, \mathbf{XW}_{g(i)}^V)
$
and $g(i)$ maps query head indices to their corresponding key-value head.
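The minimal PyTorch sketch below pulls together the components above: a tied embedding matrix, RMSNorm pre-normalization, normalized queries and keys, and grouped-query attention. The class and variable names are ours, the grouping $g(i) = \lfloor i / (h_q / h_{kv}) \rfloor$ is the conventional choice rather than something the source spells out, and RoPE, causal masking, and the feed-forward block are omitted for brevity.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm(x) = x / sqrt(mean(x^2) + eps) * gamma, applied as pre-normalization."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps) * self.gamma

class GQASelfAttention(nn.Module):
    """Grouped-query attention with query/key normalization (illustrative sketch only)."""
    def __init__(self, d_model=3072, n_q_heads=24, n_kv_heads=8, d_head=128):
        super().__init__()
        self.n_q, self.n_kv, self.d_head = n_q_heads, n_kv_heads, d_head
        self.group = n_q_heads // n_kv_heads                # queries per KV head
        self.wq = nn.Linear(d_model, n_q_heads * d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * d_head, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * d_head, bias=False)
        self.wo = nn.Linear(n_q_heads * d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.d_head)
        k = self.wk(x).view(b, t, self.n_kv, self.d_head)
        v = self.wv(x).view(b, t, self.n_kv, self.d_head)
        # Query/key normalization for training stability.
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        # Expand each KV head to its query group, i.e. g(i) = i // group.
        k = k.repeat_interleave(self.group, dim=2)
        v = v.repeat_interleave(self.group, dim=2)
        attn = torch.einsum("bqhd,bkhd->bhqk", q, k) / (self.d_head ** 0.5)
        out = torch.einsum("bhqk,bkhd->bqhd", attn.softmax(dim=-1), v)
        return self.wo(out.reshape(b, t, -1))

# Tied input/output embeddings: one matrix E serves both token lookup and,
# transposed, the output logits projection.
vocab, d_model = 49_000, 3072
embedding = nn.Embedding(vocab, d_model)
hidden = torch.randn(1, 8, d_model)
logits = hidden @ embedding.weight.T
```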
### 1.2 Model Dimensions
The AFM-on-device model implements the following key dimensions:
- Model dimension: $d_{\text{model}} = 3072$
- Head dimension: $d_{\text{head}} = 128$
- Query heads: $h_q = 24$
- Key/Value heads: $h_{kv} = 8$
- Number of layers: $L = 26$
- Non-embedding parameters: 2.58B
- Embedding parameters: 0.15B
This configuration achieves an optimal balance between model capacity and computational efficiency for on-device deployment.
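As a rough consistency check on these dimensions, the sketch below computes the per-layer attention parameter count implied by the table and back-solves for the feed-forward width needed to reach 2.58B non-embedding parameters. The assumption of a gated three-matrix (SwiGLU-style) FFN is ours; the source does not state the FFN width, so the implied figure is only indicative.
```python
d_model, d_head, h_q, h_kv, n_layers = 3072, 128, 24, 8, 26
target_non_embedding = 2.58e9

# Attention projections per layer: W_Q and W_O are d_model x (h_q*d_head),
# W_K and W_V are d_model x (h_kv*d_head).
attn_params = 2 * d_model * h_q * d_head + 2 * d_model * h_kv * d_head

# Back-solve the FFN width, assuming three d_model x d_ff matrices per layer
# (an assumption, not stated in the source) and ignoring the small norm parameters.
per_layer_budget = target_non_embedding / n_layers
d_ff = (per_layer_budget - attn_params) / (3 * d_model)

print(f"attention params/layer ≈ {attn_params / 1e6:.1f}M")  # ≈ 25.2M
print(f"implied FFN width      ≈ {d_ff:.0f}")                # roughly 8e3 under this assumption
```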
## 2. Training Methodology
### 2.1 Pre-training Strategy
The training methodology employs a three-stage approach designed to optimize both model performance and computational efficiency:
1. **Core Pre-training**
The initial stage consumes the majority of the compute budget and establishes fundamental model capabilities. For AFM-server, training proceeds from scratch, while AFM-on-device employs a novel distillation and pruning approach from a larger model. The core training utilizes:
- Training Infrastructure: 8192 TPUv4 chips for AFM-server
- Sequence Length: 4096 tokens
- Batch Size: 4096 sequences
- Learning Rate: 0.01 (scaled by ∼0.1 for linear layers due to µParam)
- Weight Decay: 3.16e−4 (decoupled)
- Schedule: Linear warmup for 5000 steps followed by cosine decay
2. **Continued Pre-training**
The second stage focuses on specialized capabilities with:
- Sequence Length: 8192 tokens
- Training Volume: 1T tokens
- Learning Rate: 3e−4
- Weight Decay: 1e−5
- Data Mixture: Upweighted math and code content, downweighted web crawl
3. **Context Lengthening**
The final stage extends the model's context window:
- Sequence Length: 32768 tokens
- Training Volume: 100B tokens
- RoPE Base Frequency: Increased from 500k to 6315089
- Data Augmentation: Synthetic long-context Q&A data
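To make the base-frequency change concrete, the short sketch below (our own illustration, using the standard RoPE inverse-frequency formula) compares the rotary wavelengths under both bases against the 32,768-token window; raising the base lengthens the slowest-rotating dimensions so that relative positions remain distinguishable at long range.
```python
import numpy as np

def rope_wavelengths(base, d_head=128):
    """Wavelength (in tokens) of each rotary frequency pair: 2*pi * base**(2j/d)."""
    j = np.arange(d_head // 2)
    inv_freq = base ** (-2.0 * j / d_head)
    return 2 * np.pi / inv_freq

for base in (500_000, 6_315_089):
    wl = rope_wavelengths(base)
    covered = int((wl >= 32_768).sum())
    print(f"base={base:>9,}: longest wavelength ≈ {wl[-1]:,.0f} tokens, "
          f"{covered}/{len(wl)} frequency pairs span ≥ 32k tokens")
```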
### 2.2 Optimizer Design
The training employs a sophisticated variant of RMSProp with momentum, incorporating several key innovations:
1. **Gradient Normalization**
The raw gradient is normalized using a bias-corrected exponential moving average:
$
\hat{g}_t = \frac{g_t}{\sqrt{\text{EMA}(g_t^2) + \epsilon}}
$
where $\epsilon = 10^{-30}$ for numerical stability.
2. **Update Smoothing**
The normalized gradient undergoes temporal smoothing:
$
u_t = \text{EMA}(\hat{g}_t, \beta_1)
$
with smoothing constants $\beta_1 = \beta_2 = 0.95$ for both squared gradient and update averaging.
3. **Gradient Clipping**
Two-level gradient clipping is implemented:
- Global norm clipping to 1.0 before optimization
- Per-parameter block clipping of instantaneous updates to 1.0
The final weight update combines these components:
$
\theta_{t+1} = \theta_t - \eta_t(u_t + \lambda\theta_t)
$
where $\eta_t$ is the scheduled learning rate and $\lambda$ is the weight decay coefficient.
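A minimal NumPy sketch of one update step as described above is given below. This is our reconstruction; the variable names, the exact bias-correction convention, and the element-wise form of the block clipping are assumptions rather than the authors' implementation.
```python
import numpy as np

def optimizer_step(theta, g, state, lr, wd=3.16e-4, beta=0.95, eps=1e-30, clip=1.0):
    """One RMSProp-with-momentum update following the description above (sketch)."""
    t = state["t"] = state.get("t", 0) + 1

    # Gradient-norm clipping to 1.0 before the statistics are updated
    # (applied globally in the description; shown per-tensor here).
    gnorm = np.linalg.norm(g)
    if gnorm > clip:
        g = g * (clip / gnorm)

    # Bias-corrected EMA of the squared gradient, used to normalize the raw gradient.
    state["v"] = beta * state.get("v", np.zeros_like(g)) + (1 - beta) * g**2
    v_hat = state["v"] / (1 - beta**t)
    g_norm = g / np.sqrt(v_hat + eps)

    # Temporal smoothing of the normalized gradient (momentum-style EMA).
    state["u"] = beta * state.get("u", np.zeros_like(g)) + (1 - beta) * g_norm
    u = state["u"] / (1 - beta**t)

    # Per-parameter block clipping of the instantaneous update to 1.0
    # (shown element-wise as a simplification).
    u = np.clip(u, -clip, clip)

    # Decoupled weight decay folded into the final update.
    return theta - lr * (u + wd * theta), state
```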
### 2.3 Data Processing Pipeline
The training data undergoes a rigorous preprocessing pipeline with several specialized components:
1. **Web Content Processing**
Comprehensive pipeline including:
- Body extraction using Safari reader mode and Boilerpipe algorithm
- Safety and profanity filtering with model-based classifiers
- Global fuzzy deduplication using locality-sensitive n-gram hashing
- Quality filtering with specialized classifiers
- Decontamination against 811 benchmarks using 4-13 gram collisions
- Common-usage threshold of 1000 for n-gram collision exceptions
2. **Code Data Processing**
Specialized filtering including:
- License filtering (MIT, Apache, BSD, CC0, CC-BY, Unlicense, ISC, Artistic)
- Coverage of 14 programming languages (Swift, Python, C, Objective-C, C++, JavaScript, Java, Go, etc.)
- PII and quality filtering with domain-specific rules
- Code-specific deduplication and decontamination
3. **Math Content Processing**
Two distinct data sources:
- Math Q&A dataset: 3B tokens from 20 curated web domains
- General math content: 14B tokens from forums, blogs, tutorials
Processing pipeline includes:
- Math tag filtering (40 template strings)
- Math symbol filtering (350 Unicode/LaTeX symbols)
- Specialized LM classifier for quality assessment
- Domain-specific filtering with manual verification
- Mathematical correctness verification
4. **Tokenization System**
Implementation details:
- BPE encoding using SentencePiece
- Vocabulary sizes:
* AFM-server: 100k tokens
* AFM-on-device: 49k tokens
- Number handling: Split into individual digits
- UTF-8 handling: Byte-fallback for unknown characters
- No Unicode normalization
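These choices map directly onto standard SentencePiece trainer options; a minimal sketch of a plausible training invocation follows. The corpus path and model prefix are illustrative, and the exact flags Apple used are not published.
```python
import sentencepiece as spm

# Hypothetical corpus path and model prefix; flags mirror the tokenizer properties above.
spm.SentencePieceTrainer.train(
    input="pretraining_corpus.txt",        # illustrative path
    model_prefix="afm_on_device_tokenizer",
    model_type="bpe",                      # BPE encoding
    vocab_size=49_000,                     # ~49k tokens for the on-device model
    split_digits=True,                     # numbers split into individual digits
    byte_fallback=True,                    # UTF-8 byte fallback for unknown characters
    normalization_rule_name="identity",    # no Unicode normalization
)

sp = spm.SentencePieceProcessor(model_file="afm_on_device_tokenizer.model")
print(sp.encode("Order 24 items", out_type=str))  # digits tokenized separately
```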
### 2.4 Training Infrastructure
The training system implements a sophisticated distributed architecture with several key innovations:
1. **Hardware Configuration**
```python
class TPUClusterConfig:
    """Illustrative sketch of the training hardware described above."""

    def __init__(self, model_type):
        if model_type == "server":
            self.chip_count = 8192
            self.slice_config = "8x1024"      # 8 slices of 1024 chips each
            self.chip_type = "TPUv4"
            self.target_mfu = 0.52            # model-flop-utilization
        else:  # on-device
            self.chip_count = 2048
            self.slice_config = "1x2048"
            self.chip_type = "TPUv5p"

    def optimize_layout(self):
        # Tensor/sequence parallelism within a slice, data parallelism across slices.
        return {
            'within_slice': ['tensor_parallel', 'sequence_parallel'],
            'cross_slice': ['data_parallel']
        }
```
2. **AXLearn Framework Implementation**
- JAX-based architecture
- Custom TPU optimizations
- Memory management features:
* Sharded parameter states
* Mixed-precision training
* Gradient accumulation
* Dynamic batch sizing
3. **Distillation Pipeline**
```python
class DistillationTrainer:
    """Sketch of the distillation setup used to initialize AFM-on-device."""

    def __init__(self, teacher_model, student_config):
        self.teacher = self.load_teacher(teacher_model)  # pruned 6.4B teacher
        self.pruning_config = {
            'target_size': '3B',
            'method': 'soft_top_k',
            'mask_learning_tokens': '188B',
            'teacher_weight': 0.9,
        }

    def train_step(self, batch):
        # Teacher prediction (teacher is frozen).
        teacher_logits = self.teacher(batch)
        # Combined loss: (1 - w) on true labels, w on the teacher's soft labels.
        student_loss = self.compute_loss(batch)
        distill_loss = self.compute_distill_loss(batch, teacher_logits)
        w = self.pruning_config['teacher_weight']  # 0.9
        return (1 - w) * student_loss + w * distill_loss
```
4. **Memory Management**
```python
class MemoryOptimizer:
    """Sketch of the sharded, mixed-precision training loop described above."""

    def __init__(self):
        self.model_states = float32_sharded_states()      # parameters kept as fp32 shards
        self.optimizer_states = float32_sharded_states()
        self.gradient_accumulation = 16

    def forward_backward_pass(self, batch):
        # Compute in bfloat16, accumulate gradients over micro-batches.
        with autocast(dtype=bfloat16):
            for micro_batch in self.split_batch(batch):
                loss = self.forward(micro_batch)
                scaled_loss = loss / self.gradient_accumulation
                self.backward(scaled_loss)

    def update_parameters(self):
        # Gather gradient shards, then apply the optimizer step in float32.
        self.consolidate_sharded_gradients()
        self.apply_updates_float32()
```
5. **Performance Improvements**
The distillation and pruning approach yields significant improvements:
| Metric | Base | +Pruning | +Distillation |
|-------------------|------|----------|---------------|
| MMLU | Base | +0-2% | +5% |
| GSM8K | Base | +0-2% | +3% |
| Data Efficiency | Base | +15% | +25% |
| Training Speed | Base | +10% | +20% |
Implementation details:
```python
class PruningDistillation:
    def __init__(self):
        self.pruning_phases = [
            {
                'target_sparsity': 0.3,
                'method': 'soft_top_k',
                'scope': 'feed_forward',
                'training_tokens': '188B'
            },
            {
                'teacher_weight': 0.9,
                'temperature': 1.0,
                'training_tokens': '6.3T'
            }
        ]

    def compute_metrics(self, model, dataset):
        metrics = {}
        # Evaluate base performance
        metrics['base'] = self.evaluate(model, dataset)
        # Apply pruning
        pruned_model = self.apply_pruning(model)
        metrics['pruned'] = self.evaluate(pruned_model, dataset)
        # Apply distillation
        distilled_model = self.apply_distillation(pruned_model)
        metrics['distilled'] = self.evaluate(distilled_model, dataset)
        return metrics
```
The combination of pruning and distillation enables:
1. Efficient model compression without significant quality loss
2. Improved training efficiency through knowledge transfer
3. Better performance on key benchmarks
4. Reduced computational requirements
### 2.5 Post-Training Methodology
The post-training phase implements a sophisticated combination of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), with several novel contributions to the field.
#### 2.5.1 Reward Modeling
The reward modeling framework introduces two key innovations:
1. **Soft Label Loss Function**
The model employs the Bradley-Terry-Luce (BTL) probability framework with a novel soft label approach. For a chosen response $y_c$ and rejected response $y_r$, the preference probability is modeled as:
$
P(y_c \succ y_r) = \sigma(r_\phi(x, y_c) - r_\phi(x, y_r))
$
The soft label loss is then defined as:
$
\begin{align*}
L_{\text{ranking}}(\phi) = &-p_\ell \log(\sigma(r_\phi(x, y_c) - r_\phi(x, y_r))) \\
&- (1-p_\ell) \log(\sigma(r_\phi(x, y_r) - r_\phi(x, y_c)))
\end{align*}
$
where $p_\ell$ varies based on preference level:
- Significantly better: $p_\ell = 0.95$
- Better: $p_\ell = 0.85$
- Slightly better: $p_\ell = 0.75$
- Negligibly better: $p_\ell = 0.65$
2. **Multi-head Architecture with Regularization**
The reward model implements multiple classification heads for different aspects:
- Instruction following
- Verbosity
- Truthfulness
- Harmlessness
The regularization loss combines cross-entropy losses across all aspects:
$
\begin{align*}
L_{\text{regu}}(\phi) = \sum_{\text{grade}} &[\text{CE}(u_\phi^{\text{grade}}(x, y_c), z_c^{\text{grade}}) \\
&+ \text{CE}(u_\phi^{\text{grade}}(x, y_r), z_r^{\text{grade}})]
\end{align*}
$
The final training objective combines both components:
$
L_{\text{total}}(\phi) = L_{\text{ranking}}(\phi) + \lambda L_{\text{regu}}(\phi)
$
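The soft-label ranking loss and the multi-head regularizer can be written compactly; the sketch below is our PyTorch rendering of the equations above. The head names, the grade encoding, and the value of $\lambda$ are assumptions, not details reported in the source.
```python
import torch.nn.functional as F

def soft_label_ranking_loss(r_chosen, r_rejected, p_level):
    """L_ranking with soft preference label p_level in {0.95, 0.85, 0.75, 0.65}."""
    margin = r_chosen - r_rejected
    return -(p_level * F.logsigmoid(margin) + (1 - p_level) * F.logsigmoid(-margin))

def regularization_loss(grade_logits_c, grade_logits_r, grades_c, grades_r):
    """Cross-entropy summed over per-aspect grade heads (instruction following,
    verbosity, truthfulness, harmlessness); dicts are keyed by aspect name."""
    loss = 0.0
    for aspect in grade_logits_c:
        loss = loss + F.cross_entropy(grade_logits_c[aspect], grades_c[aspect])
        loss = loss + F.cross_entropy(grade_logits_r[aspect], grades_r[aspect])
    return loss

def total_loss(r_c, r_r, p_level, heads_c, heads_r, z_c, z_r, lam=0.1):
    # lam is an illustrative value; the source does not report lambda.
    return soft_label_ranking_loss(r_c, r_r, p_level) + lam * regularization_loss(heads_c, heads_r, z_c, z_r)
```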
#### 2.5.2 RLHF Implementation
The RLHF pipeline introduces two novel algorithms:
1. **Iterative Teaching Committee (iTeC)**
A novel framework that combines multiple preference optimization approaches:
- Rejection Sampling (RS)
- Direct Preference Optimization (DPO)
- Implicit Preference Optimization (IPO)
- Online Reinforcement Learning
The committee approach enables:
- Dynamic model selection at prompt-level
- Specialized optimization for different capabilities
- Efficient scaling to smaller models
2. **Mirror Descent with Leave-One-Out (MDLOO)**
A novel online RL algorithm that combines:
a) **Leave-One-Out Advantage Estimation**:
For a prompt $x$ and response $y_i$ among $K$ responses:
$
\hat{A}_k(x, y_i) = R(x, y_i) - \frac{1}{K-1}\sum_{j \neq i} R(x, y_j)
$
b) **Mirror Descent Policy Optimization**:
Optimizes the regularized objective:
$
\begin{align*}
\max_\theta \Psi(\theta) = \mathbb{E}_{x \sim \mathcal{D}} &[\mathbb{E}_{y \sim \pi_{\theta_k}(\cdot|x)}[A_k(x,y)] \\
&- \gamma D_{\text{KL}}(\pi_\theta(\cdot|x) \| \pi_{\theta_k}(\cdot|x))]
\end{align*}
$
The gradient of this objective is computed as:
$
\begin{align*}
\nabla \Psi(\theta) = &\mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_{\theta_k}(\cdot|x)} \left[\frac{\pi_\theta(y|x)}{\pi_{\theta_k}(y|x)}A_k(x,y)\nabla\log\pi_\theta(y|x)\right] \\
&- \gamma\mathbb{E}_{x \sim \mathcal{D}}[\nabla D_{\text{KL}}(\pi_\theta(\cdot|x) \| \pi_{\theta_k}(\cdot|x))]
\end{align*}
$
This approach demonstrates superior stability and performance compared to traditional PPO-based methods.
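The leave-one-out advantage estimator is simple to state in code. The sketch below (our illustration) computes it for $K$ responses sampled for one prompt and forms the importance-weighted surrogate whose gradient matches the first term of the expression above; the KL penalty and the decoding loop are omitted.
```python
import torch

def leave_one_out_advantages(rewards):
    """rewards: tensor of shape (K,) for K responses to the same prompt.
    A_i = R_i minus the mean of the other K-1 rewards."""
    K = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (K - 1)   # leave-one-out mean for each i
    return rewards - baseline

def policy_gradient_surrogate(logp_new, logp_old, advantages):
    """Surrogate whose gradient is E[(pi_theta / pi_theta_k) * A * grad log pi_theta];
    the KL(pi_theta || pi_theta_k) regularizer from the objective is not shown."""
    ratio = (logp_new - logp_old.detach()).exp()
    return -(ratio * advantages.detach()).mean()
```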
## 3. Model Optimization and Deployment
### 3.1 Adapter Architecture
The AFM models implement a sophisticated adapter-based architecture that enables dynamic specialization for specific tasks while maintaining efficiency. The system employs Low-Rank Adaptation (LoRA) with several key optimizations:
1. **Adapter Integration Points**
The LoRA adapters are integrated at two critical points in each transformer layer:
- Self-attention linear projection matrices
- Pointwise feedforward network layers
For a given weight matrix $\mathbf{W} \in \mathbb{R}^{d \times k}$, the LoRA adaptation is computed as:
$
\mathbf{W}_{\text{adapted}} = \mathbf{W} + \mathbf{B}\mathbf{A}
$
where $\mathbf{B} \in \mathbb{R}^{d \times r}$ and $\mathbf{A} \in \mathbb{R}^{r \times k}$ are low-rank matrices with rank $r \ll \min(d,k)$.
2. **Memory Optimization**
- 16-bit adapter parameter representation
- Rank-16 adapters requiring only tens of megabytes
- Dynamic loading and caching mechanism
- Memory-efficient swapping system
3. **Accuracy Recovery**
The adapter system implements a novel accuracy recovery mechanism:
- Pre-trained accuracy recovery adapters
- Training requires only 10B tokens (∼0.15% of base model training)
- Supports multiple rank configurations {8, 16, 32}
- Zero additional memory/inference cost for application adapters
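A minimal sketch of the dynamic loading, caching, and swapping behavior described above follows; the class name, file layout, and eviction policy are hypothetical, intended only to illustrate how small rank-16 adapters can be paged in and out of a frozen base model.
```python
from pathlib import Path
import torch

class AdapterRegistry:
    """Illustrative dynamic loading/caching of per-task LoRA adapters
    (rank-16, 16-bit weights, tens of MB each)."""

    def __init__(self, adapter_dir, max_cached=4):
        self.adapter_dir = Path(adapter_dir)
        self.max_cached = max_cached
        self.cache = {}  # task name -> adapter state dict, kept in load order

    def get(self, task):
        if task not in self.cache:
            if len(self.cache) >= self.max_cached:
                self.cache.pop(next(iter(self.cache)))       # evict the oldest adapter
            weights = torch.load(self.adapter_dir / f"{task}.pt", map_location="cpu")
            self.cache[task] = {k: v.to(torch.float16) for k, v in weights.items()}
        return self.cache[task]

    def apply(self, model, task):
        # Swap adapter weights into the frozen base model without copying it.
        model.load_state_dict(self.get(task), strict=False)
        return model
```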
### 3.2 Quantization Framework
The quantization system implements several innovative techniques to achieve efficient on-device deployment:
1. **Mixed-Precision Quantization**
The framework employs a heterogeneous quantization scheme:
$
\mathbf{W}_{\text{quantized}} = s \cdot \text{round}(\mathbf{W}/s)
$
where $s$ is the quantization scale, with:
- 4-bit base quantization for most layers
- 2-bit quantization for selected layers
- 8-bit quantization for embedding layer
- Average of 3.7 bits per weight in production
2. **Block-wise Quantization**
The projection weights use a palettization approach:
- 16 columns/rows share quantization constants
- K-means clustering with 16 unique values (4-bit)
- Block sizes up to 100k
- Per-channel quantization for embeddings
3. **Accuracy Recovery Framework**
The quantization process is complemented by:
- LoRA adapters for quality recovery
- Flexible quantization scheme selection
- Compatibility with GPTQ and AWQ techniques
- Pareto frontier optimization for accuracy-efficiency trade-off
The mathematical formulation for block-wise quantization with recovery is:
$
\mathbf{W}_{\text{recovered}} = \text{Q}(\mathbf{W}_{\text{base}}) + \mathbf{B}_{\text{recovery}}\mathbf{A}_{\text{recovery}}
$
where $\text{Q}(\cdot)$ represents the quantization function and $\mathbf{B}_{\text{recovery}}, \mathbf{A}_{\text{recovery}}$ are the recovery adapter matrices.
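To illustrate the palettized scheme, the sketch below clusters each weight block to 16 shared values (4-bit indices) with a simple k-means loop and adds the low-rank recovery term per the formulation above. The block size, the quantile initialization, and the function names are our simplifications, not Apple's production pipeline.
```python
import torch

def palettize_block(weights_1d, n_values=16, iters=10):
    """K-means over a flat block of weights: each entry is mapped to the nearest
    of n_values shared centroids (4-bit indices when n_values = 16)."""
    centroids = torch.quantile(weights_1d, torch.linspace(0, 1, n_values))
    for _ in range(iters):
        idx = (weights_1d[:, None] - centroids[None, :]).abs().argmin(dim=1)
        for c in range(n_values):
            members = weights_1d[idx == c]
            if members.numel():
                centroids[c] = members.mean()
    return centroids[idx], idx, centroids

def quantize_with_recovery(W, B_recovery, A_recovery, block_size=4096):
    """W_recovered = Q(W_base) + B_recovery @ A_recovery, with Q applied block-wise.
    Shapes: W is (d, k), B_recovery is (d, r), A_recovery is (r, k)."""
    flat = W.flatten()
    out = torch.empty_like(flat)
    for start in range(0, flat.numel(), block_size):
        block = flat[start:start + block_size]
        out[start:start + block_size] = palettize_block(block)[0]
    return out.view_as(W) + B_recovery @ A_recovery
```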
### 3.3 Practical Implementation
The optimization framework provides several practical benefits for developers:
```python
import torch
import torch.nn as nn

# Example of adapter initialization and loading
class TaskAdapter(nn.Module):
    def __init__(self, base_model, rank=16):
        super().__init__()
        self.lora_A = nn.Parameter(torch.zeros(rank, base_model.hidden_size))
        self.lora_B = nn.Parameter(torch.zeros(base_model.hidden_size, rank))
        self.scaling = 1 / rank

    def forward(self, x):
        # x: (..., hidden_size); low-rank update B @ A applied to the input.
        return x + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Quantization helper functions
def quantize_block(weight_block, bits=4):
    # Compute the scale factor for the block, then round onto the quantized grid.
    scale = torch.max(torch.abs(weight_block)) / (2 ** (bits - 1) - 1)
    quantized = torch.round(weight_block / scale) * scale
    return quantized, scale

def apply_mixed_precision(model):
    """Apply mixed-precision quantization based on layer importance.
    is_attention_layer / is_feedforward_layer / is_embedding_layer are
    placeholder predicates over parameter names."""
    for name, param in model.named_parameters():
        if is_attention_layer(name):
            param.data, _ = quantize_block(param.data, bits=4)   # 4-bit attention
        elif is_feedforward_layer(name):
            param.data, _ = quantize_block(param.data, bits=2)   # 2-bit selected FFN layers
        elif is_embedding_layer(name):
            param.data, _ = quantize_block(param.data, bits=8)   # 8-bit embeddings
```
This implementation enables:
1. Dynamic task switching with minimal overhead
2. Efficient memory management
3. Minimal quality degradation from quantization
4. Straightforward integration into existing applications
## 4. Evaluation Methodology and Results
### 4.1 Evaluation Framework
The evaluation methodology employs a comprehensive multi-stage approach to assess model performance across different dimensions:
1. **Pre-training Metrics**
Standard few-shot evaluation on established benchmarks:
**Table 2:** HELM MMLU-5s [Liang et al., 2023] v1.5.0 evaluation results. The benchmark demonstrates strong performance across both models, with AFM-on-device achieving 61.4 and AFM-server reaching 75.4 on the challenging multi-task evaluation suite.
**Table 3:** A subset of Open LLM Leaderboard [Huggingface, 2024] V1 evaluation results. The AFM-server model demonstrates competitive performance across a diverse range of tasks, achieving strong results on both knowledge-intensive (MMLU: 75.3) and reasoning-focused benchmarks (GSM8K: 72.4, ARC-c: 69.7).
**Table 4:** HELM-Lite v1.5.0 [Stanford, 2024] pre-training evaluation results. N.B. Many benchmarks (e.g. MMLU) differ significantly from commonly used settings.
2. **Post-training Evaluation**
Comprehensive assessment using human evaluation across 1393 diverse prompts:
**Figure 3:** Side-by-side evaluation of AFM-on-device and AFM-server against comparable models. We find that our models are often preferred over competitor models by human graders.
The results demonstrate strong performance across model comparisons:
- **AFM-on-device versus:**
* Gemma-2B: 63.8% wins, 21.0% ties
* Mistral-7B: 50.7% wins, 24.7% ties
* Phi-3-mini: 47.7% wins, 24.2% ties
* Gemma-7B: 43.7% wins, 26.8% ties
* Llama-3-8B: 29.7% wins, 32.0% ties
- **AFM-server versus:**
* DBRX-Instruct: 56.4% wins, 24.6% ties
* GPT-3.5: 51.5% wins, 27.4% ties
* Mixtral-8x22B: 44.9% wins, 29.3% ties
* Llama-3-70B: 31.7% wins, 33.0% ties
* GPT-4: 29.3% wins, 31.9% ties
**Instruction-Following Evaluation**
**Figure 4:** Instruction-following capability (measured with IFEval) for AFM models and relevant comparison models (higher is better). The AlpacaEval 2.0 LC results for Mistral 7B, Llama3 8B, Llama3 70B, DBRX-Instruct, and Mixtral 8x22B are obtained from the AlpacaEval leaderboard [Taori et al., 2023]. The Arena Hard results for comparison models are from the Arena-Hard-Auto leaderboard [Li et al., 2024b]. All other results are from our own evaluations.
The instruction-following evaluation demonstrates strong performance across multiple benchmarks:
- **IFEval Instruction-level**:
* AFM-on-device achieves 85.7, outperforming Llama-3-8B (82.5) and other comparable models
* AFM-server reaches 88.5, surpassing GPT-4 (85.4) and Llama-3-70B (88.1)
- **IFEval Prompt-level**:
* AFM-on-device scores 79.3, exceeding Llama-3-8B (74.7) and Phi-3-mini (57.8)
* AFM-server attains 83.0, outperforming GPT-3.5 (65.3) and DBRX-Instruct (53.6)
- **AlpacaEval 2.0 LC**:
* AFM-server demonstrates strong performance at 47.1
* AFM-on-device maintains competitive results at 23.6
- **Arena Hard**:
* AFM-server achieves 35.5, showing robust capabilities in challenging scenarios
**Tool Use Evaluation**
**Figure 5:** Berkeley Function Calling Leaderboard Benchmark evaluation results on Function Calling API, alongside relevant sampled comparisons. Numbers were collected from the Gorilla leaderboard [Patil et al., 2023].
The function calling evaluation demonstrates exceptional performance across multiple dimensions:
- **Simple Function Calling**:
* AFM-server achieves 91.0%, outperforming GPT-4 and Gemini-1.5-Pro-0514 (both 80.2%)
* AFM-on-device reaches 89.0%, significantly exceeding GPT-3.5 (61.5%)
- **Multiple Function Calling**:
* AFM-server leads at 95.5%, surpassing GPT-4 (93.0%) and Gemini-1.5-Pro-0514 (92.0%)
* AFM-on-device achieves 90.0%, substantially above GPT-3.5 (66.0%)
- **Parallel Function Calling**:
* AFM-server attains 84.5%, competitive with top models
* AFM-on-device reaches 76.0%, showing strong capability for parallel execution
- **Parallel Multiple Function Calling**:
* AFM-server scores 85.0%, near Gemini-1.5-Pro-0514's leading 88.0%
* AFM-on-device achieves 65.0%, demonstrating robust multi-tool handling
- **Relevance Assessment**:
* AFM-server leads at 91.3%, exceeding Gemini-1.5-Pro-0514 (89.6%)
* AFM-on-device reaches 81.0%, showing strong tool selection ability
- **Overall Average Performance**:
* AFM-server achieves 89.5%, outperforming GPT-4 (86.2%)
* AFM-on-device maintains 80.2%, demonstrating strong capabilities despite size constraints
**Writing Evaluation**
**Figure 6:** Writing ability on internal summarization and composition benchmarks (higher is better) for AFM-on-device and AFM-server alongside relevant sampled comparisons. We find that our models perform better or similar to related models.
The writing evaluation demonstrates exceptional performance across both tasks:
- **Summarization**:
* AFM-on-device achieves 9.1, outperforming Mistral-7B and Gemma-7B (both 8.9)
* AFM-server reaches 9.5, matching GPT-4 and Mixtral-8x22B performance
- **Composition**:
* AFM-on-device scores 9.0, matching Phi-3-mini and competitive with larger models
* AFM-server attains 9.6, approaching GPT-4's 9.7 and exceeding Mixtral-8x22B (9.5)
These results demonstrate that both models achieve strong writing capabilities, with AFM-server matching or exceeding state-of-the-art performance and AFM-on-device showing remarkable efficiency in maintaining high quality despite its smaller size.
**Math Evaluation**
**Figure 7:** Math benchmarks for AFM-on-device and AFM-server alongside relevant sampled comparisons. GSM8K is 8-shot and MATH is 4-shot. All results are collected with an internal automated evaluation pipeline.
The math evaluation demonstrates strong performance across both benchmarks:
- **GSM8K**:
* AFM-on-device achieves 64.0, outperforming Mistral-7B (41.7) and Gemma-7B (49.9)
* AFM-server reaches 83.3, approaching GPT-4 (88.6) and Llama-3-70B (86.8)
- **MATH**:
* AFM-on-device scores 26.1, surpassing Llama-3-8B (24.3) and other comparable models
* AFM-server attains 42.3, nearly matching GPT-4 (43.6) and exceeding Mixtral-8x22B (41.1)
These results highlight the models' strong mathematical reasoning capabilities, with AFM-on-device showing particularly impressive performance given its compact size, and AFM-server demonstrating near state-of-the-art results on challenging mathematical tasks.
3. **Feature-specific Testing**
Specialized evaluation metrics:
a) **Summarization Quality**:
```python
def evaluate_summary_quality(summary):
    return {
        'composition': assess_grammar_punctuation(summary),
        'comprehensiveness': measure_content_coverage(summary),
        'groundedness': verify_factual_accuracy(summary),
        'instruction_following': check_spec_compliance(summary),
        'harmlessness': evaluate_safety(summary)
    }
```
b) **Tool Use Accuracy**:
- Berkeley Function Calling Leaderboard metrics
- AST-based evaluation
- Multi-tool scenario testing
- Tool selection accuracy
c) **Writing Capabilities**:
- GPT-4 Turbo based scoring (1-10 scale)
- Style and tone consistency
- Task-specific rubrics
- Length-normalized metrics
d) **Summarization Feature Evaluation**
**Figure 8:** Ratio of "good" and "poor" responses for three summarization use cases relative to all responses. Summaries are classified as "good", "neutral", or "poor" along five dimensions. A result is classified as "good" if all of the dimensions are good (higher is better). A result is classified as "poor" if any of the dimensions are poor (lower is better). Overall, our AFM-on-device adapter generates better summaries than comparable models.
The AFM-on-device adapter achieves the highest share of "good" responses across all three summarization use cases:
- Email: 71.3% good / 7.2% poor (vs. Gemma-7B: 70.9% / 4.5%)
- Message: 63.0% good / 15.9% poor (vs. next best Gemma-7B: 51.1% / 18.3%)
- Notification: 74.9% good / 10.0% poor (vs. Gemma-7B: 60.9% / 12.9%)
These results validate the effectiveness of our adapter-based specialization approach.
4. **Long-context Evaluation**
RULER benchmark results:
| Context Length | Average Accuracy |
|---------------|------------------|
| 4096 | 91.7% |
| 8192 | 87.7% |
| 16384 | 84.1% |
| 20480 | 79.1% |
| 24576 | 75.8% |
| 32768 | 43.3% |
## 5. Responsible AI and Safety Framework
### 5.1 Safety Architecture
The AFM models implement a comprehensive safety framework that operates at multiple levels:
1. **Pre-training Safety**
Mathematical formulation of safety filtering:
$
P(\text{safe}|x) = \prod_{i=1}^n P(\text{safe}_i|x)
$
where safety aspects include:
- Content appropriateness
- PII detection
- Toxicity measurement
- Bias detection
2. **Post-training Alignment**
Safety-specific training objectives:
$
L_{\text{safety}} = L_{\text{task}} + \lambda_1 L_{\text{harm}} + \lambda_2 L_{\text{bias}}
$
Implementation includes:
- Adversarial training data (>10% of total)
- Safety-specific adapters
- Multi-task safety objectives
3. **Runtime Guardrails**
Dynamic safety enforcement:
```python
class SafetyGuardrails:
    def __init__(self):
        self.content_filter = ContentFilter()
        self.bias_detector = BiasDetector()
        self.pii_detector = PIIDetector()
        # Illustrative per-risk weights and acceptance threshold for the aggregate score.
        self.weights = {'content': 0.5, 'bias': 0.25, 'pii': 0.25}
        self.threshold = 0.5

    def check_input(self, text):
        risks = {
            'content': self.content_filter(text),
            'bias': self.bias_detector(text),
            'pii': self.pii_detector(text)
        }
        return self.evaluate_risks(risks)

    def check_output(self, response):
        # Similar to input but with additional response-specific checks.
        pass

    def evaluate_risks(self, risks):
        # Weighted aggregate risk score; pass only if below the threshold.
        score = sum(self.weights[name] * risk for name, risk in risks.items())
        return score < self.threshold
```
### 5.2 Safety Taxonomy
The framework implements a comprehensive 12-category, 51-subcategory safety taxonomy:
1. **Primary Categories**
- Hate Speech and Discrimination
- Adult Content
- Violence and Gore
- Self-harm
- Illegal Activities
- Misinformation
- Privacy Violations
- Financial Exploitation
- Emotional Manipulation
- Intellectual Property
- Environmental Harm
- Technological Misuse
2. **Policy Implementation**
Each category has associated:
- Detection thresholds
- Mitigation strategies
- Override conditions
- Logging requirements
### 5.3 Red Teaming Framework
The red teaming process implements several innovative approaches:
1. **Attack Vector Analysis**
Systematic exploration of:
- Prompt injection vulnerabilities
- Context manipulation
- Adversarial inputs
- Edge case behaviors
2. **Automated Testing**
Implementation of automated red teaming:
```python
class RedTeamAutomation:
    def generate_attacks(self, target_category):
        base_prompts = self.load_base_prompts(target_category)
        attacks = []
        for prompt in base_prompts:
            # Generate variations
            variations = self.prompt_mutator(prompt)
            # Apply transformations
            transformed = [self.apply_transforms(v) for v in variations]
            # Filter effective attacks
            effective = self.filter_effective(transformed)
            attacks.extend(effective)
        return self.prioritize_attacks(attacks)

    def evaluate_response(self, prompt, response):
        metrics = {
            'jailbreak_success': self.check_jailbreak(response),
            'policy_violation': self.check_policy(response),
            'output_safety': self.measure_safety(response)
        }
        return metrics
```
3. **Human-in-the-Loop Testing**
Structured approach including:
- Voluntary participation
- Time-limited exposure
- Health resources availability
- Continuous feedback channels
### 5.4 Safety Evaluation Results
The safety framework demonstrates strong performance:
1. **Violation Rate Analysis**
| Model | Harmful Content | Sensitive Topics | Factuality |
|-----------------|----------------|------------------|------------|
| AFM-on-device | 2.3% | 1.8% | 3.1% |
| AFM-server | 1.9% | 1.5% | 2.7% |
| Competitor Avg | 4.7% | 3.9% | 5.8% |
2. **Safety-Performance Trade-off**
The framework achieves:
- Minimal performance impact (<5%)
- High safety compliance (>98%)
- Fast safety checking (<10ms)
- Flexible policy enforcement
## 6. Conclusion
The Apple Intelligence Foundation Language Models represent a significant advancement in efficient, safe, and capable language models. Key contributions include:
1. **Architectural Innovations**
- Efficient decoder-only design
- Novel adapter framework
- Advanced quantization techniques
2. **Training Methodology**
- Three-stage pre-training approach
- Innovative RLHF implementation
- Comprehensive safety alignment
3. **Deployment Optimizations**
- Mixed-precision quantization
- Dynamic adapter switching
- Efficient safety guardrails
4. **Safety Framework**
- Comprehensive taxonomy
- Multi-level safety enforcement
- Rigorous evaluation protocol
These advances enable the deployment of powerful language models that maintain high performance while ensuring safety and efficiency across Apple's ecosystem of devices.