remorhaz/php-unilex
Latest stable version: v0.5.3
Composer install command:
composer require remorhaz/php-unilex
Package description
Unilex: lexical analyzer generator with Unicode support written in PHP
README
UniLex is a lexical analyzer generator (similar to lex and flex) with Unicode support.
It's written in PHP and generates code in PHP.
[WIP] Work in progress
Requirements
- PHP 8
License
The UniLex library is licensed under the MIT license.
Installation
Install it like any other Composer library:
composer require remorhaz/php-unilex
Usage
Quick start by example
Let's imagine we want to write a simple calculator and need a lexer (lexical analyzer) that provides a stream of IDs, numbers and operators. Create a new Composer project and execute the following command from the project directory:
composer require --dev remorhaz/php-unilex
The next step is to create a lexer specification in the LexerSpec.php file. We use the @lexToken tag in comments to specify the regular expression for a token:
<?php
/**
 * @var \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface $context
 * @lexTargetClass TokenMatcher
 * @lexHeader
 */

const TOKEN_ID = 1;
const TOKEN_OPERATOR = 2;
const TOKEN_NUMBER = 3;

/** @lexToken /[a-zA-Z][0-9a-zA-Z]*()/ */
$context->setNewToken(TOKEN_ID);

/** @lexToken /[+\-*\/]/ */
$context->setNewToken(TOKEN_OPERATOR);

/** @lexToken /[0-9]+/ */
$context->setNewToken(TOKEN_NUMBER);
The next step is to build a token matcher from the specification:
vendor/bin/unilex LexerSpec.php > TokenMatcher.php
Now we have a compiled token matcher in the TokenMatcher.php file. Let's use it to read all tokens from the buffer:
<?php

use Remorhaz\UniLex\Lexer\TokenFactory;
use Remorhaz\UniLex\Lexer\TokenReader;
use Remorhaz\UniLex\Unicode\CharBufferFactory;

require_once "vendor/autoload.php";
require_once "TokenMatcher.php";

$buffer = CharBufferFactory::createFromString("x+2*3");
$tokenReader = new TokenReader($buffer, new TokenMatcher, new TokenFactory(0xFF));

do {
    $token = $tokenReader->read();
    echo "Token ID: {$token->getType()}\n";
} while (!$token->isEoi());
On execution this script outputs the following (the final ID, 255, is the EOI token type 0xFF passed to TokenFactory):
Token ID: 1
Token ID: 2
Token ID: 3
Token ID: 2
Token ID: 3
Token ID: 255
Let's go a bit further and make it possible to retrieve the text representation of every token from the input buffer. We need to modify the lexer specification to attach the matched text to each non-EOI token as an attribute:
<?php
/**
 * @var \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface $context
 * @lexTargetClass TokenMatcher
 * @lexHeader
 */

const TOKEN_ID = 1;
const TOKEN_OPERATOR = 2;
const TOKEN_NUMBER = 3;

/** @lexToken /[a-zA-Z][0-9a-zA-Z]*()/ */
$context
    ->setNewToken(TOKEN_ID)
    ->setTokenAttribute('text', $context->getSymbolString());

/** @lexToken /[+\-*\/]/ */
$context
    ->setNewToken(TOKEN_OPERATOR)
    ->setTokenAttribute('text', $context->getSymbolString());

/** @lexToken /[0-9]+/ */
$context
    ->setNewToken(TOKEN_NUMBER)
    ->setTokenAttribute('text', $context->getSymbolString());
After rebuilding the token matcher with the CLI utility, we need to modify the output loop of our example program so that it prints the text along with the token IDs:
do {
    $token = $tokenReader->read();
    echo
        "Token ID: {$token->getType()}",
        $token->isEoi() ? "\n" : " / '{$token->getAttribute('text')}'\n";
} while (!$token->isEoi());
And now the program prints:
Token ID: 1 / 'x'
Token ID: 2 / '+'
Token ID: 3 / '2'
Token ID: 2 / '*'
Token ID: 3 / '3'
Token ID: 255
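Numeric token IDs are hard to read in debug output, so a small lookup table can translate them into names. The following is only a sketch in plain PHP: TOKEN_NAMES and describeToken() are hypothetical helpers that are not part of UniLex, and the IDs simply mirror the constants from the specification above plus the 0xFF EOI type passed to TokenFactory.

```php
<?php
// Hypothetical lookup table (not part of UniLex): IDs mirror the
// TOKEN_* constants from the specification; 0xFF is the EOI type
// that the example script passes to TokenFactory.
const TOKEN_NAMES = [
    1 => 'ID',
    2 => 'OPERATOR',
    3 => 'NUMBER',
    0xFF => 'EOI',
];

/**
 * Hypothetical helper: formats one token as a readable string.
 * $text is null for tokens that carry no 'text' attribute (EOI).
 */
function describeToken(int $type, ?string $text = null): string
{
    $name = TOKEN_NAMES[$type] ?? "UNKNOWN({$type})";

    return $text === null ? $name : "{$name} '{$text}'";
}
```

Inside the read loop it could be called as describeToken($token->getType(), $token->isEoi() ? null : $token->getAttribute('text')), assuming getType() returns an int as the output above suggests.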
CLI
You can use the command-line utility to build a token matcher from a specification:
vendor/bin/unilex path/to/spec/LexerSpec.php path/to/target/TokenMatcher.php --desc="My example matcher."
Specification
A specification is a PHP file that is split into parts by DocBlock comments with special tags. A special variable $context holds the context object, which implements \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface. The current implementation also uses an int variable $char that contains the current symbol (TODO: should be moved into the context object).
@lexHeader
This block can contain namespace and use statements that will be used during matcher generation.
@lexBeforeMatch
This block is executed before the matching procedure begins and can be used to initialize additional variables.
@lexOnTransition
This block is executed on each symbol matched by a token's regular expression.
@lexToken /regexp/
This block is executed when the given regular expression matches the input buffer. Most commonly it just sets up a new token in the context object.
@lexMode 'mode_name'
This tag tells the parser that the accompanying @lexToken expression matches only if the current lexical mode is mode_name. The lexical mode can be switched with the $context->setMode('mode_name') method. Lexical modes allow several "sub-grammars" in one specification (e.g. some tokens can be recognized only in comments or strings).
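To make the idea of lexical modes concrete, here is a hand-written illustration in plain PHP. It does not use UniLex and is not how the generated matcher works internally; the rule table, the mode names 'default' and 'string', and the tokenizeWithModes() function are all hypothetical. It only shows the general technique: the active mode selects which rules may match, and a rule may switch the mode.

```php
<?php
// Illustration only: NOT UniLex-generated code. All names are hypothetical.
// Each rule is [regex anchored at the current offset, token name, next mode or null].
const MODE_RULES = [
    'default' => [
        ['/\G[a-z]+/', 'WORD', null],
        ['/\G"/', 'QUOTE_OPEN', 'string'],   // entering a string literal
        ['/\G\s+/', 'SPACE', null],
    ],
    'string' => [
        ['/\G[^"]+/', 'STRING_TEXT', null],  // anything up to the closing quote
        ['/\G"/', 'QUOTE_CLOSE', 'default'], // back to the default sub-grammar
    ],
];

/**
 * Splits $input into [token name, matched text] pairs, trying only
 * the rules that belong to the currently active mode.
 */
function tokenizeWithModes(string $input): array
{
    $mode = 'default';
    $offset = 0;
    $tokens = [];
    $length = strlen($input);

    while ($offset < $length) {
        $matched = false;
        foreach (MODE_RULES[$mode] as [$pattern, $name, $nextMode]) {
            if (preg_match($pattern, $input, $m, 0, $offset) === 1) {
                $tokens[] = [$name, $m[0]];
                $offset += strlen($m[0]);
                $mode = $nextMode ?? $mode; // switch mode only if the rule asks for it
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new RuntimeException("No rule matches at offset {$offset} in mode '{$mode}'");
        }
    }

    return $tokens;
}
```

For the input say "hi there", the text between the quotes comes out as a single STRING_TEXT token, because the WORD and SPACE rules are not active while the mode is 'string'.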
@lexOnError
This block is executed if the matcher fails to match any of the tokens' regular expressions. By default it just returns false.