- Input: Starts with raw text.
- Language Detection: LanguageDetectionService identifies the language (can use TYPO3 context).
- Text Analysis Core: TextAnalysisService takes the raw text and detected language to:
  - Clean the text.
  - Tokenize it into words.
  - Remove language-specific stop words (using StopWordsFactory).
  - Stem the words (using wamania/php-stemmer).
  - The output is processed text (usually a list of stemmed tokens).
- Advanced Processing (using processed text and language):
  - TextVectorizerService: Converts processed text from multiple documents into numerical vectors (TF-IDF or DTM) and calculates cosine similarity between them.
  - TextClusteringService: Uses the vectors (from TextVectorizerService) to group similar documents together using algorithms like K-Means.
  - TopicModelingService: Uses processed text or vectors to extract representative terms, topics (often via clustering), or key phrases.
- Direct Outputs: Basic results like tokens or n-grams can also be directly obtained from TextAnalysisService.
- Cross-Cutting Concerns:
  - All services are typically obtained and used via TYPO3's Dependency Injection.
  - Most calculation-intensive services (TextAnalysisService, TextVectorizerService, TextClusteringService, TopicModelingService) can leverage the TYPO3 Caching Framework to store and reuse results, improving performance.
- Output: The results (similarity scores, clusters, topics, etc.) are then available for use by the calling application or extension (like Semantic Suggestion or Page Link Insights).
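As a rough illustration of the n-grams mentioned under Direct Outputs, the sketch below builds word n-grams from a list of tokens. The function name buildNgrams is an assumption made for this example and is not taken from the TextAnalysisService API.

```php
<?php
declare(strict_types=1);

/**
 * Builds word n-grams from a token list.
 * Hypothetical helper for illustration, not the extension's own method.
 *
 * @param string[] $tokens
 * @return string[]
 */
function buildNgrams(array $tokens, int $n): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }
    $tokens = array_values($tokens); // normalize keys to 0..count-1
    $ngrams = [];
    // Slide a window of $n tokens over the list.
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

$tokens = ['semantic', 'analysis', 'for', 'typo3'];
print_r(buildNgrams($tokens, 2));
```

A token list shorter than n simply yields an empty array.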
<?php
// Example of injection and usage in another TYPO3 service
use Cywolf\NlpTools\Service\TextAnalysisService;
use Cywolf\NlpTools\Service\LanguageDetectionService;

class MyContentProcessor
{
    private TextAnalysisService $textAnalyzer;
    private LanguageDetectionService $languageDetector;

    public function __construct(
        TextAnalysisService $textAnalyzer,
        LanguageDetectionService $languageDetector
    ) {
        $this->textAnalyzer = $textAnalyzer;
        $this->languageDetector = $languageDetector;
    }

    public function analyze(string $rawText): array
    {
        $language = $this->languageDetector->detectLanguage($rawText);
        // Cleans, tokenizes, and removes stop words
        $cleanedText = $this->textAnalyzer->removeStopWords($rawText, $language);
        // Reduces words to their root (stemming)
        $stemmedWords = $this->textAnalyzer->stem($cleanedText, $language);
        // Returns the detected language and the stemmed words joined into one string
        return [
            'language' => $language,
            'processed_text' => implode(' ', $stemmedWords),
            // ... other possible analyses
        ];
    }
}
Thanks to this solid foundation and its extensive features (going beyond simple stop word removal to include vectorization, clustering, and topic modeling), NLP Tools enables other extensions like Semantic Suggestion to perform complex and relevant semantic analyses.
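To make the vectorization step more concrete, here is a minimal plain-PHP sketch of TF-IDF weighting followed by cosine similarity. It is not the TextVectorizerService implementation: the function names are hypothetical, and the IDF variant (natural log plus 1) is just one common choice assumed for this example.

```php
<?php
declare(strict_types=1);

/**
 * Builds one sparse TF-IDF vector per document.
 * Hypothetical helper; assumes IDF = ln(N / df) + 1.
 *
 * @param array<int, string[]> $documents token lists, one per document
 * @return array<int, array<string, float>>
 */
function tfidfVectors(array $documents): array
{
    $docCount = count($documents);
    $documentFrequency = [];
    foreach ($documents as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $documentFrequency[$term] = ($documentFrequency[$term] ?? 0) + 1;
        }
    }
    $vectors = [];
    foreach ($documents as $index => $tokens) {
        $vector = [];
        $length = max(count($tokens), 1);
        foreach (array_count_values($tokens) as $term => $count) {
            $idf = log($docCount / $documentFrequency[$term]) + 1.0;
            $vector[$term] = ($count / $length) * $idf;
        }
        $vectors[$index] = $vector;
    }
    return $vectors;
}

/** Cosine similarity between two sparse vectors keyed by term. */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        $dot += $weight * ($b[$term] ?? 0.0);
    }
    $normA = sqrt(array_sum(array_map(fn ($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn ($w) => $w * $w, $b)));
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an empty document is similar to nothing
    }
    return $dot / ($normA * $normB);
}

// Already stemmed tokens, as produced by the analysis step.
$documents = [
    ['typo3', 'extens', 'search'],
    ['typo3', 'search', 'solr'],
    ['cook', 'recip', 'pasta'],
];
$vectors = tfidfVectors($documents);
printf("doc 0 vs doc 1: %.3f\n", cosineSimilarity($vectors[0], $vectors[1]));
printf("doc 0 vs doc 2: %.3f\n", cosineSimilarity($vectors[0], $vectors[2]));
```

Documents that share no terms score 0, while the two TYPO3-related documents score somewhere between 0 and 1. Pairwise scores of this kind are what a "related pages" feature can rank on.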
Multilingual Support
A particularly important aspect of NLP Tools is its support for multiple languages. The extension includes stop word dictionaries and specific rules for several European languages:
- French
- English
- German
- Spanish
Automatic language detection allows for correct processing of multilingual sites without additional configuration.
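The sketch below illustrates why the detected language matters for stop word removal. The word lists are tiny placeholders invented for this example (the real dictionaries are far larger), and removeStopWordsFor is a hypothetical function, not the extension's API.

```php
<?php
declare(strict_types=1);

// Illustrative stop word lists only; assumed for this example.
const STOP_WORDS = [
    'en' => ['the', 'a', 'of', 'and', 'is'],
    'fr' => ['le', 'la', 'les', 'de', 'et', 'est'],
    'de' => ['der', 'die', 'das', 'und', 'ist'],
    'es' => ['el', 'la', 'los', 'de', 'y', 'es'],
];

/**
 * Lowercases, tokenizes, and drops the stop words of the given language.
 *
 * @return string[]
 */
function removeStopWordsFor(string $text, string $language): array
{
    // Prefer the multibyte-aware function when the mbstring extension is loaded.
    $lower = function_exists('mb_strtolower') ? mb_strtolower($text) : strtolower($text);
    // Split on anything that is not a letter or digit.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    // Unknown languages fall back to an empty list, so nothing is removed.
    $stopWords = array_flip(STOP_WORDS[$language] ?? []);
    return array_values(array_filter(
        $tokens,
        static fn (string $token): bool => !isset($stopWords[$token])
    ));
}

print_r(removeStopWordsFor('Le chat est sur la table', 'fr'));
print_r(removeStopWordsFor('Le chat est sur la table', 'en'));
```

With the French list, "le", "est", and "la" are dropped. With the English list applied to the same sentence, every token survives, which is the kind of noise that correct language detection avoids.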
Integration with Solr for Enhanced Search
In addition to optimizing internal linking, the extension suite integrates with Apache Solr to improve search results.
Weighting Search Results
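Metrics calculated by Page Link Insights, such as PageRank and centrality, are used to weight search results. The following example boosts pages by their PageRank and by their number of inbound links: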
plugin.tx_solr {
    search {
        relevance {
            multiplier {
                pagerank = 2.0
                inboundLinks = 1.5
            }
            formula = sum(mul(queryNorm(dismax(v:1)), 1.0), mul(fieldValue(pagerank_f), 2.0), mul(fieldValue(inbound_links_i), 1.5))
        }
    }
}
This configuration increases the relevance of pages that are important within the site structure during user searches.
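The arithmetic behind such a weighted sum can be worked through in a few lines of PHP. This is only a sketch of the formula's structure, using the weights 1.0, 2.0, and 1.5 from the configuration above. The function name and the sample values are invented for illustration, and Solr's actual scoring is not reproduced here.

```php
<?php
declare(strict_types=1);

/**
 * Weighted sum mirroring the structure of the formula:
 * 1.0 * text relevance + 2.0 * PageRank + 1.5 * inbound links.
 * Hypothetical helper for illustration.
 */
function weightedScore(float $textRelevance, float $pageRank, int $inboundLinks): float
{
    return 1.0 * $textRelevance + 2.0 * $pageRank + 1.5 * $inboundLinks;
}

// Two pages with the same text relevance but different link metrics (made-up values).
$hubPage  = weightedScore(0.8, 0.05, 12); // 0.8 + 0.10 + 18.0
$leafPage = weightedScore(0.8, 0.01, 2);  // 0.8 + 0.02 + 3.0

printf("hub page:  %.2f\n", $hubPage);
printf("leaf page: %.2f\n", $leafPage);
```

With these sample values the raw inbound link count dominates the sum, so it may be worth normalizing or dampening link counts before combining them with the text score.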
Practical Use Cases for the Extensions
For News Websites
On news websites, the Semantic Suggestion extension can automatically generate "Related Articles" sections by identifying articles sharing similar themes. This helps keep readers engaged longer on the site by offering them additional relevant content.
For E-commerce Sites
In an e-commerce context, the extension suite can improve product recommendations by analyzing descriptions and categories to suggest complementary or alternative products, thereby increasing cross-selling opportunities.
For Institutional Websites
For institutional sites with numerous informational pages, thematic analysis helps create coherent "See also" sections, facilitating user navigation to related information without manual intervention.
Performance and Optimization
Performance is one of the major challenges of semantic processing, especially on large sites. The extension suite uses several strategies to keep processing time under control:
- Database Storage: Similarity calculations are performed by a scheduled task and stored in the database.
- Caching: Intermediate results are cached to avoid unnecessary calculations.
- Asynchronous Processing: Intensive calculations are performed in the background.
- Algorithm Optimization: Similarity algorithms are optimized for large data volumes.
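The caching strategy can be sketched as a compute-once lookup. In the example below, ArrayCache is a hypothetical in-memory stand-in for a persistent cache, and the Jaccard word overlap is only a cheap placeholder for the real similarity calculation. Neither is taken from the extensions' code.

```php
<?php
declare(strict_types=1);

/** Hypothetical in-memory stand-in for a persistent cache backend. */
final class ArrayCache
{
    private array $entries = [];

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->entries);
    }

    public function get(string $key)
    {
        return $this->entries[$key] ?? null;
    }

    public function set(string $key, $value): void
    {
        $this->entries[$key] = $value;
    }
}

final class CachedSimilarity
{
    public int $computations = 0;
    private ArrayCache $cache;

    public function __construct(ArrayCache $cache)
    {
        $this->cache = $cache;
    }

    public function similarity(string $textA, string $textB): float
    {
        // Order-independent key: similarity(a, b) equals similarity(b, a).
        $pair = [$textA, $textB];
        sort($pair, SORT_STRING);
        $key = 'sim_' . sha1(implode("\0", $pair));

        if ($this->cache->has($key)) {
            return $this->cache->get($key);
        }
        $this->computations++;
        $value = $this->compute($textA, $textB);
        $this->cache->set($key, $value);
        return $value;
    }

    /** Jaccard overlap of word sets, used here as a placeholder metric. */
    private function compute(string $a, string $b): float
    {
        $setA = array_unique(preg_split('/\s+/', strtolower(trim($a)), -1, PREG_SPLIT_NO_EMPTY));
        $setB = array_unique(preg_split('/\s+/', strtolower(trim($b)), -1, PREG_SPLIT_NO_EMPTY));
        $union = count(array_unique(array_merge($setA, $setB)));
        return $union === 0 ? 0.0 : count(array_intersect($setA, $setB)) / $union;
    }
}

$service = new CachedSimilarity(new ArrayCache());
$first  = $service->similarity('typo3 search', 'typo3 solr search');
$second = $service->similarity('typo3 solr search', 'typo3 search'); // served from cache
printf("%.3f %.3f, computed %d time(s)\n", $first, $second, $service->computations);
```

Sorting the pair before hashing means both orderings of the same two texts share a single cache entry, which halves the work for a symmetric similarity measure.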
Future Development Perspectives
The next development steps include:
- Integration of external similarity calculations: Adding the option to delegate similarity calculations to an external module.
- Moving more functionality into nlp_tools so that semantic_suggestion and page_link_insights become lighter.