PaddleNLP Text Classification in Practice
Healthy Mind Lv3

Environment installation documentation

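I installed with Docker, which is the more convenient route: PaddleNLP 2.6 on top of the CUDA 12.2 Docker image provided by NVIDIA. At first I installed everything by hand and spent a long time fighting version mismatches, so I switched to the Docker image.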

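Once the environment is ready, the next step is to provide training data. For the data format and the training steps, refer to the documentation on GitHub.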
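
As a rough illustration of the data-preparation step, the sketch below checks a training file against a label file before training. It assumes the multi-class layout used in the PaddleNLP text classification examples: train.txt holds one sample per line as text, a tab, then the label, and label.txt holds one label per line. The helper names load_labels and check_train_file are hypothetical and not part of PaddleNLP; the parsing would need adjusting for a different layout such as multi-label data.

```python
from pathlib import Path


def load_labels(label_path):
    """Read label.txt: one label per line, blank lines ignored."""
    lines = Path(label_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def check_train_file(train_path, labels):
    """Return (line_number, problem) pairs for malformed lines in train.txt.

    Assumed format (not verified against your PaddleNLP version):
    <text>\t<label>, with every label listed in label.txt.
    """
    known = set(labels)
    problems = []
    with open(train_path, encoding="utf-8") as f:
        for lineno, raw in enumerate(f, start=1):
            line = raw.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            parts = line.split("\t")
            if len(parts) != 2:
                problems.append((lineno, "expected exactly one tab separator"))
                continue
            text, label = parts
            if not text.strip():
                problems.append((lineno, "empty text"))
            elif label.strip() not in known:
                problems.append((lineno, f"unknown label: {label.strip()!r}"))
    return problems
```

Calling check_train_file("train.txt", load_labels("label.txt")) and getting back an empty list suggests the files are at least structurally consistent; it says nothing about label quality.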

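Here label.txt and train.txt must be prepared in the correct format; after that, training is run from the command line. Once training finishes, the model still needs to be exported: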

python export_model.py --params_path ./checkpoint/ --output_path ./export

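Strictly speaking, the model can be used without exporting it, but in a real project the classifier is usually exposed as a web service API, so web serving capability is needed.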

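PaddleNLP ships its own web service, PaddleNLP SimpleServing. Its input parameter format is essentially fixed, so to use it we only need to subclass BaseTaskflowHandler (from paddlenlp.server import BaseTaskflowHandler) and implement its process method. An example follows: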

# coding:utf-8
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddlenlp.server import BaseTaskflowHandler
from paddlenlp.utils.log import logger

RESP_SUCCESS = "SUCCESS"
RESP_FAILURE = "FAILURE"


class TrsTaskflowHandler(BaseTaskflowHandler):
    def __init__(self):
        self._name = "trs_taskflow_handler"

    @classmethod
    def process(cls, predictor, data, parameters):
        # Nothing to predict on: return an empty response.
        if data is None or len(data) == 0:
            return {}
        text = data
        # The request parameters are passed straight through as the schema.
        if parameters is not None:
            schema = parameters
            predictor.set_schema(schema)
        try:
            result = predictor(text)
            return format_response(result)
        except Exception as e:
            # str(e) keeps the response JSON-serializable.
            return {"result": RESP_FAILURE, "response": str(e), "history": []}


def format_response(result):
    # Collect the unique, non-empty extracted texts across all inputs.
    rw = {}
    for r in result:
        if len(r) == 0:
            continue
        for w in r.values():
            for t in w:
                if t["text"] == "":
                    continue
                rw[t["text"]] = t
    return {"result": RESP_SUCCESS, "response": list(rw.keys()), "history": []}
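
To make the response formatting concrete, here is a small standalone sketch of the same deduplication idea, run on hand-written sample data. The sample mimics the shape of a UIE Taskflow result (a list with one dict per input text, mapping each schema key to a list of extracted spans) but is simplified and made up; real results carry extra fields such as start and end offsets. The function name collect_entity_texts is hypothetical and only used here for illustration.

```python
def collect_entity_texts(result):
    """Collect unique, non-empty 'text' values, keeping first-seen order."""
    seen = {}
    for record in result:              # one dict per input text
        for spans in record.values():  # one span list per schema key
            for span in spans:
                if span["text"]:       # skip empty extractions
                    seen[span["text"]] = span
    return list(seen)


# Made-up sample in the assumed result shape.
sample = [
    {"人物": [{"text": "张三", "probability": 0.98},
              {"text": "李四", "probability": 0.91}]},
    {},  # an input with no extractions
    {"人物": [{"text": "张三", "probability": 0.87},
              {"text": "", "probability": 0.10}]},
]

print(collect_entity_texts(sample))  # ['张三', '李四']
```

Because a dict keeps insertion order, a repeated name stays at the position where it first appeared, and only the stored span is overwritten by the later occurrence.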


The web service class, TrsUieServer.py, looks like this:

from paddlenlp import Taskflow
from trs_server import TrsSimpleServer
from trs_taskflow_handler import TrsTaskflowHandler

# Change the schema to your own schema
schema = ["人物"]
# Change the task path to the path of your best model
uie = Taskflow("information_extraction", schema=schema, task_path="../../checkpoint/model_best/")
# Register the fine-tuned UIE model as a service
trsApp = TrsSimpleServer()
trsHandler = TrsTaskflowHandler()
trsApp.register_taskflow("chat", uie, trsHandler)

To start the service, just run the following command:

paddlenlp server TrsUieServer:trsApp --host 0.0.0.0 --port 9090
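
Once the service is up, a client needs to POST JSON to it. The sketch below only builds such a request using the standard library and does not send it. Several things here are assumptions rather than facts from this post: the route /taskflow/chat follows the stock SimpleServing convention of /taskflow/<task name>, and the payload mirrors the process method shown earlier, where data is handed directly to the predictor and parameters is used directly as the schema. The custom TrsSimpleServer may expect something different, so check its code before relying on this. build_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_request(host, port, task_name, texts, schema=None):
    """Build (but do not send) a POST request for the taskflow service."""
    # Assumed route: /taskflow/<task_name>; verify against your server.
    url = f"http://{host}:{port}/taskflow/{task_name}"
    # Assumed payload: "data" is the list of texts, "parameters" the schema.
    payload = {"data": texts}
    if schema is not None:
        payload["parameters"] = schema
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("127.0.0.1", 9090, "chat", ["some input text"], schema=["人物"])
# To actually send it (requires the server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read().decode("utf-8")))
```

Keeping request construction separate from sending makes it easy to inspect the exact URL and body when the server rejects a call.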

The complete NLP server code is available on GitHub.

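Overall, PaddleNLP's documentation is fairly comprehensive and the tooling is complete. The only troublesome part is the data, which has to be labeled by hand. For this I used doccano, the annotation tool covered in Baidu's PaddleNLP documentation.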