Original poster | Posted on 2023-4-9 03:30:29
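PHP: trying to split a paragraph into sentences while keeping the punctuation
Basically I'm taking a paragraph filled with all sorts of punctuation, such as ! ? . ; and ", and splitting it into sentences. The problem I'm facing is coming up with a way to split it into sentences with the punctuation intact, while accounting for quotations in dialogue. For example, this paragraph: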
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. "What has happened!?" he asked himself. "I... don't know." said Samsa, "Maybe this is a bad dream." He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
needs to be split like this:
[0] One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
[1] "What has happened!?" he asked himself.
[2] "I... don't know." said Samsa, "Maybe this is a bad dream."
and so on. Currently I'm just using explode:
$sentences = explode(".", $sourceWork);
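which splits only on periods, and then I append a period to the end of each piece. I know this is far from what I want, but I'm not sure where to start. If someone could at least point me in the right direction to look for ideas, that would be amazing. Thanks in advance!
3 replies
User 1: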
preg_split('/[.?!]/',$sourceWork);
This is a very simple regular expression, but I think your task is impossible.
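The pattern above discards the character it splits on. As a minimal sketch (the helper name `split_keep_punctuation` is hypothetical, not something from the thread), a lookbehind lets `preg_split` break on the whitespace after a terminator, so the punctuation stays attached to its sentence:

```php
<?php
// Hypothetical helper for illustration only.
// Assumes sentences are separated by whitespace and end in ., ? or !
function split_keep_punctuation(string $text): array
{
    // (?<=[.?!]) is a lookbehind: it requires a terminator before the
    // whitespace but does not consume it, so it is kept in the result.
    $parts = preg_split('/(?<=[.?!])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    return $parts === false ? [] : $parts;
}

$parts = split_keep_punctuation('He woke up. "What has happened!?" he asked. Then he slept!');
print_r($parts);
```

This is still not quote-aware: in the sample paragraph it would split inside `"I... don't know."` right after the ellipsis, so it only addresses the lost punctuation, not the dialogue problem.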
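User 2:
You need to walk through your string manually and do the splitting yourself. Keep a count of the quotation marks and don't break while the count is odd. Here is a simple idea: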
<?php
//$str = 'AAA. BBB. "CCC." DDD. EEE. "FFF. GGG. HHH".';
$str = 'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. "What has happened!?" he asked himself. "I... don\'t know." said Samsa, "Maybe this is a bad dream." He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.';
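$last_dot = 0;   // offset where the current sentence starts
$quotation = 0;  // number of double quotes seen so far
$explode_list = array();
$length = strlen($str);
for ($i = 0; $i < $length; $i++)
{
    $char = $str[$i]; // get the current character
    if ($char == '"') $quotation++; // track quotation marks
    if ($quotation % 2 == 1) continue; // odd count: inside a quotation, do not break
    if ($char == '.')
    {
        $explode_list[] = trim(substr($str, $last_dot, $i + 1 - $last_dot));
        $last_dot = $i + 1;
    }
}
// Keep any trailing text that does not end with a period.
$rest = trim(substr($str, $last_dot));
if ($rest !== '') $explode_list[] = $rest;
// Note: a period just before a closing quote (as in ...dream.") does not end
// a sentence here, so the following sentence is merged into the same entry.
echo "testing:<pre>";
print_r($explode_list);
echo "</pre>";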
User 3:
Here is what I have:
<?php
/**
* @param string $str String to split
* @param string $end_of_sentence_characters Characters which represent the end of the sentence. Should be a string with no spaces (".,!?")
*
* @return array
*/
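function split_sentences($str, $end_of_sentence_characters) {
    $inside_quotes = false;
    $buffer = "";
    $result = array();
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $buffer .= $str[$i];
        if ($str[$i] === '"') {
            $inside_quotes = !$inside_quotes;
        }
        // End a sentence only on a terminator that is outside quotes.
        // strpos() avoids building a regex from unescaped characters.
        if (!$inside_quotes && strpos($end_of_sentence_characters, $str[$i]) !== false) {
            $result[] = trim($buffer);
            $buffer = "";
        }
    }
    // Keep any trailing text that has no terminator.
    if (trim($buffer) !== "") {
        $result[] = trim($buffer);
    }
    return $result;
}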
$str = <<<STR
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. "What has happened!?" he asked himself. "I... don't know." said Samsa, "Maybe this is a bad dream." He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
STR;
var_dump(split_sentences($str, "."));
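Both scanning answers skip a terminator that sits just before a closing quote, so `"Maybe this is a bad dream."` runs into the next sentence. Below is a rough sketch of one way to get the split the question asks for. It is a heuristic and not something proposed in the thread: the function name `split_dialogue_sentences` is hypothetical, and it assumes straight double quotes, single spaces between sentences, and that a new sentence starts with an uppercase ASCII letter or an opening quote.

```php
<?php
// Hypothetical helper for illustration only; assumptions are listed above.
function split_dialogue_sentences(string $text, string $terminators = '.!?'): array
{
    $text = trim($text);
    $length = strlen($text);
    $sentences = [];
    $buffer = '';
    $insideQuotes = false;

    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        if ($char === '"') {
            $insideQuotes = !$insideQuotes;
        }

        // A space outside quotes is the only place a sentence may end.
        if (!$insideQuotes && $char === ' ' && $buffer !== '') {
            // Look back, past an optional closing quote, for a terminator.
            $tail = rtrim($buffer, '"');
            $endsSentence = $tail !== ''
                && strpos($terminators, substr($tail, -1)) !== false;

            // Look ahead: does the next character start a new sentence?
            $next = $text[$i + 1] ?? '';
            $startsSentence = $next === '"' || ctype_upper($next);

            if ($endsSentence && $startsSentence) {
                $sentences[] = $buffer;
                $buffer = '';
                continue; // drop the separating space
            }
        }
        $buffer .= $char;
    }

    // Keep whatever is left after the last split point.
    if ($buffer !== '') {
        $sentences[] = $buffer;
    }
    return $sentences;
}

$paragraph = <<<'TXT'
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. "What has happened!?" he asked himself. "I... don't know." said Samsa, "Maybe this is a bad dream." He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
TXT;

$sentences = split_dialogue_sentences($paragraph);
print_r($sentences);
```

The uppercase check is what keeps `"What has happened!?" he asked himself.` together, and it is also the weak point: a quotation followed by a capitalised name (`"What?" Gregor asked.`) or an abbreviation such as `Mr. Samsa` would be split in the wrong place.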